AlignBot: Aligning VLM-powered Customized Task Planning with User Reminders Through Fine-Tuning for Household Robots
[Zhaxizhuoma]()1,†, [Pengan Chen]()1,2,†, [Ziniu Wu]()1,3,†, [Jiawei Sun]()1, [Dong Wang]()1, [Peng Zhou]()2, [Nieqing Cao]()4, [Yan Ding]()1,*, [Bin Zhao]()1,5, [Xuelong Li]()1,6
1Shanghai Artificial Intelligence Laboratory, 2The University of Hong Kong, 3University of Bristol, 4Xi’an Jiaotong-Liverpool University, 5Northwestern Polytechnical University, 6Institute of Artificial Intelligence, China Telecom Corp Ltd
†Equal contribution, *Corresponding author: Yan Ding [yding25 (at) binghamton.edu]
[Project page] [Paper] [Code] [Video]
This paper presents AlignBot, a novel framework designed to optimize VLM-powered customized task planning for household robots by effectively aligning with user reminders. In domestic settings, aligning task planning with user reminders poses significant challenges due to the limited quantity, diversity, and multimodal nature of the reminders themselves. To address these challenges, AlignBot employs a fine-tuned LLaVA-7B model that functions as an adapter for GPT-4o. This adapter internalizes diverse forms of user reminders, such as personalized preferences, corrective guidance, and contextual assistance, into structured cues that prompt GPT-4o to generate customized task plans. Additionally, AlignBot integrates a dynamic retrieval mechanism that selects relevant historical interactions as prompts for GPT-4o, further enhancing task planning accuracy. To validate the effectiveness of AlignBot, experiments were conducted in a real-world household environment, using a multimodal dataset of 1,500 entries derived from volunteer reminders for training and evaluation. The results demonstrate that AlignBot significantly improves customized task planning, outperforming existing LLM- and VLM-powered planners by interpreting and aligning with user reminders: it achieves an 86.8% success rate compared to 21.6% for the vanilla GPT-4o baseline, a 65-percentage-point improvement and more than four times greater effectiveness.
Create a Virtual Environment and Install Dependencies
conda create -n AlignBot python=3.11
conda activate AlignBot
pip install -r requirements.txt
If you'd like to train LLaVA, this guide will help you get started using the LLaMA Factory framework.
Install LLaMA Factory:
conda activate AlignBot
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
If you encounter package conflicts, use pip install --no-deps -e . instead.
Fine-Tuning with LLaMA Board GUI
llamafactory-cli webui
Alternatively, you can use the following three commands to run LoRA fine-tuning, inference, and merging of the model:
llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
Models can be fine-tuned either through the GUI or with the CLI commands above. The GUI exposes detailed parameter adjustments, and fine-tuned models are saved under LLaMA-Factory/saves. Before training, place the entire training dataset in LLaMA-Factory/data and add a dataset description to dataset_info.json; see config/dataset_info.json for how to fill in the description. A sketch of such an entry is shown below.
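The following is a minimal sketch of a dataset_info.json entry for a multimodal, ShareGPT-style dataset. The dataset name, file name, and column/tag mappings are illustrative assumptions; refer to config/dataset_info.json in this repository and the LLaMA Factory documentation for the exact format your data requires.

```json
{
  "alignbot_demo": {
    "file_name": "alignbot_demo.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "messages",
      "images": "images"
    },
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant"
    }
  }
}
```

Here alignbot_demo.json is a hypothetical data file placed in LLaMA-Factory/data; the columns and tags entries map the fields of that file to the message, image, and role keys that LLaMA Factory expects.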
Deploy with an OpenAI-style API and vLLM via LLaMA Factory
API_PORT=8000 llamafactory-cli api /AlignBot/config/llavaapi_config.yaml
Fill in the original (base) model path and the fine-tuned model path in llavaapi_config.yaml.
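As a rough illustration, a configuration along the following lines could work. Every path and the template name below are placeholders and assumptions; adjust them to your base model, your fine-tuned adapter (or a merged model), and the template that matches your LLaVA variant.

```yaml
# Hypothetical llavaapi_config.yaml sketch; replace all paths with your own.
model_name_or_path: /path/to/llava-1.5-7b-hf                    # original (base) model path
adapter_name_or_path: /path/to/LLaMA-Factory/saves/llava_lora   # fine-tuned LoRA adapter path
template: llava                                                  # prompt template matching the base model
finetuning_type: lora
infer_backend: vllm                                              # serve with vLLM, as in this section
```

If you exported a merged model in the previous step, you can instead point model_name_or_path at the merged checkpoint and omit adapter_name_or_path.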
For more details on training LLaVA with LLaMA Factory, please visit the official repository: https://github.com/hiyouga/LLaMA-Factory/
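Once the API is running, it can be queried like any OpenAI-compatible endpoint. The snippet below is a minimal sketch using the openai Python client; the base URL, model name, and message content are illustrative assumptions, not part of this repository.

```python
from openai import OpenAI

# Point the OpenAI client at the locally deployed LLaMA Factory API (port 8000 as above).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="llava",  # placeholder; query GET /v1/models to see the name the server reports
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "User reminder: do not put knives in the dishwasher."},
                {"type": "image_url", "image_url": {"url": "https://example.com/kitchen.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```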
Use the following command to run the model:

python main.py --mode llava --img use_url

--mode: Selects the execution mode.
  llava: Runs the model with reminders from the LLaVA API.
  with_memory: Runs the model with memory.
  no_memory: Runs the model without memory.

--img: Specifies how images are handled.
  use_url: Upload or reference images via URL.
  use_base64: Pass images as Base64-encoded strings.

This repository is released under the MIT license. See LICENSE for additional details.