Building a generalist agent that can interact with the world is a long-standing goal of AI, spurring research on embodied navigation, where an agent is required to navigate according to instructions or respond to queries. Despite the major progress attained, previous works primarily focus on task-specific agents and lack generalizability to unseen scenarios. Recently, LLMs have presented remarkable capabilities across various fields and provide a promising opportunity for embodied navigation. Drawing on this, we propose NaviLLM, the first generalist model for embodied navigation. It adapts LLMs to embodied navigation by introducing schema-based instruction, which flexibly casts various tasks into generation problems, thereby unifying a wide range of tasks. This approach allows us to integrate diverse data sources from various datasets into training, equipping NaviLLM with the wide range of capabilities required for embodied navigation. We conduct extensive experiments to evaluate the performance and generalizability of our model. The experimental results demonstrate that our unified model achieves state-of-the-art performance on CVDN, SOON, and ScanQA. Specifically, it surpasses the previous state-of-the-art method by a significant margin of 29% in goal progress on CVDN. Moreover, our model also demonstrates strong generalizability and presents impressive results on unseen tasks, e.g., embodied question answering and 3D captioning.
We propose schema-based instruction and design a series of schemas (e.g., descriptions of tasks, visual observations, and navigation history) based on the characteristics of embodied tasks. Benefiting from this design, we are able to train a unified model on the data collected for diverse tasks, thereby enabling our model to address a wide spectrum of tasks, ranging from vision-language navigation and object localization to 3D question answering, trajectory summarization, and embodied question answering.
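Conceptually, a schema-based instruction is a structured prompt that concatenates these schemas into a single generation query. The sketch below is a minimal, hypothetical illustration; the field names and wording are not the exact schemas used by NaviLLM.

```python
# Minimal, hypothetical sketch of assembling a schema-based instruction.
# The schema names and phrasing below are illustrative only.
def build_schema_prompt(task, history, observation, output_hint):
    return "\n".join([
        f"Task: {task}",                # task description schema
        f"History: {history}",          # navigation history schema
        f"Observation: {observation}",  # visual observation schema
        f"Output: {output_hint}",       # expected output schema
    ])

prompt = build_schema_prompt(
    task="Find the object described by the instruction.",
    history="step 1: entered the hallway; step 2: turned left.",
    observation="<candidate view features>",
    output_hint="Select the next viewpoint or answer the question.",
)
print(prompt)
```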
With only a single model, NaviLLM has achieved new state-of-the-art results simultaneously on multiple benchmarks, i.e., CVDN, SOON, and ScanQA, and demonstrated performance comparable to the latest models on R2R and REVERIE. Additionally, it won first place on the CVDN leaderboard and second place on the ScanQA leaderboard.
Install the Matterport3D simulator, and add the simulator path to your Python path:
export PYTHONPATH=Matterport3DSimulator/build:$PYTHONPATH
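To quickly verify that the simulator is on your Python path (assuming the build succeeded), you can try the following:

```python
# Sanity check: the Matterport3DSimulator build exposes a MatterSim module.
import MatterSim

sim = MatterSim.Simulator()
print("Matterport3DSimulator is available:", sim)
```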
Set up the Java Development Kit (JDK) if you want to enable METEOR when evaluating ScanQA; otherwise, please comment out the related code.
export JAVA_HOME=$jdk_path
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
Create the conda environment and install the requirements.
conda create --name navillm python=3.8.16
conda activate navillm
pip install -r requirements.txt
The data directory is structured as follows. Please download the processed data and features from OneDrive.
data
├── connectivity
├── CVDN
├── LLaVA
├── SOON
├── R2R
├── REVERIE
├── EQA
├── eva_features
│ ├── mp3d_EVA02-CLIP-L-14-336.hdf5
│ ├── scanqa_EVA02-CLIP-L-14-336.hdf5
│ └── coco_EVA02-CLIP-L-14-336.hdf5
├── obj_features
│ ├── reverie_obj_feat
│ └── soon_obj_feat
└── models
    └── Vicuna-7B
1. Original Datasets
2. Image Features
The image features are extracted with EVA-CLIP-02-Large (428M). We also provide the scripts used for extracting features from MP3D, ScanQA, and COCO at scripts/data_tools. To use EVA-CLIP-02, please install the corresponding environment following the instructions in the original repository. A minimal sketch for loading the extracted features is shown after the commands below.
cd scripts/data_tools
sh extract_features_mp3d.sh # for Matterport3D
# sh extract_features_scanqa.sh # for ScanQA
# sh extract_features_coco.sh # for COCO
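The extracted features are stored as HDF5 files; the snippet below is a minimal sketch for inspecting one with h5py (the key layout is an assumption and is ultimately defined by the extraction scripts above).

```python
import h5py

# Inspect the precomputed EVA02-CLIP features.
# NOTE: the key/dataset layout shown here is an assumption; the actual
# structure is produced by the scripts in scripts/data_tools.
with h5py.File("data/eva_features/mp3d_EVA02-CLIP-L-14-336.hdf5", "r") as f:
    keys = list(f.keys())
    print(f"{len(keys)} entries, e.g. {keys[:3]}")
    feat = f[keys[0]][...]
    print("feature shape:", feat.shape, "dtype:", feat.dtype)
```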
3. Object Features
We leverage the object features extracted from ViT-B16 by HM3DAutoVLN and put the processed features of REVERIE and SOON at data/obj_features. You can disable the object features by removing the flag --enable_og.
4. Models
The LLM is built upon Vicuna-7B-v1.1. Please download the pre-trained model and put it at data/models. Note that using Vicuna-7B-v0 leads to a certain degree of performance drop compared with the original results (see #7).
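A quick way to check that the downloaded weights load correctly (assuming a standard Hugging Face checkpoint layout under data/models/Vicuna-7B):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the Vicuna-7B-v1.1 checkpoint placed under data/models.
model_path = "data/models/Vicuna-7B"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_path)
print(model.config.num_hidden_layers, "layers,", tokenizer.vocab_size, "tokens")
```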
We release the model checkpoints and corresponding training logs as follows.
| Model | Log | Time (days) | CVDN GP | SOON SR | SOON SPL | R2R SR | R2R SPL | REVERIE SR | REVERIE SPL | ScanQA EM | ScanQA Rouge-L |
|---|---|---|---|---|---|---|---|---|---|---|---|
| model_without_pretrain | here | ~1.5 | 5.91 | 35.44 | 28.09 | 67 | 58 | 44.56 | 36.63 | 23.3 | 38.2 |
| model_with_pretrain | here | ~3 | 6.16 | 38.33 | 29.24 | 67 | 59 | 42.15 | 35.68 | 22.1 | 37.6 |
Previous works have consistently shown notable improvements after pre-training on augmented data from R2R and REVERIE. However, in our experiments, we find only a slight improvement on R2R, CVDN, and SOON after pre-training. We speculate that the quality of the data may play a more crucial role than its quantity for our method.
1. Pretraining: In the pretraining stage, the model is trained for 10,000 steps with a batch size of 64, using teacher forcing on the combined dataset from CVDN, SOON, R2R, REVERIE, ScanQA, and augmented data from R2R and REVERIE.
sh scripts/pretrain.sh
2. Multi-task Tuning with Pretraining: In the multi-task fine-tuning stage, the model is trained for 5,000 steps with a batch size of 64, alternating between teacher forcing and student forcing on the combined dataset from CVDN, SOON, R2R, REVERIE, ScanQA, and LLaVA-23k.
sh scripts/multi_w_pretrain.sh
3. Multi-task Tuning without Pretraining: Since the performance of direct multi-task fine-tuning is comparable to the two-stage training, we recommend multi-task fine-tuning without pretraining. It takes approximately 20 hours with 8 NVIDIA A100 GPUs.
sh scripts/multi_wo_pretrain.sh
4. Inference: During the testing phase, we employ a sampling strategy with a temperature of 0.01 for action generation on the SOON and REVERIE tasks to encourage more exploration; for the other tasks, we use a greedy strategy to generate actions (see the sketch after this list).
sh scripts/evaluation/eval_cvdn.sh # eval_soon.sh/eval_r2r.sh/eval_reverie.sh/eval_scanqa.sh
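The sketch below illustrates the two decoding strategies described in step 4; `logits` stands in for the model's scores over candidate actions, and the function is a simplified stand-in rather than the actual inference code.

```python
import torch

def select_action(logits, temperature=None):
    """Pick an action index from candidate scores.

    temperature=None  -> greedy argmax (R2R, CVDN, ScanQA, ...)
    small temperature -> low-temperature sampling (SOON, REVERIE)
    """
    if temperature is None:
        return logits.argmax(dim=-1)
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)

logits = torch.randn(1, 6)  # scores over 6 hypothetical candidate viewpoints
print(select_action(logits))                    # greedy decoding
print(select_action(logits, temperature=0.01))  # sampling with temperature 0.01
```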
We would like to thank Matterport3D for their contributions to the open-source platform and community. Additionally, this work benefits from DUET, HM3DAutoVLN, and VLN-SIG. Thanks for their awesome work!
If you find our NaviLLM useful for your research, please consider giving this repository a star and citing our paper as follows:
@inproceedings{zheng2024towards,
title={Towards learning a generalist model for embodied navigation},
author={Zheng, Duo and Huang, Shijia and Zhao, Lin and Zhong, Yiwu and Wang, Liwei},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={13624--13634},
year={2024}
}