
AlignBot Code Repository

AlignBot: Aligning VLM-powered Customized Task Planning with User Reminders Through Fine-Tuning for Household Robots

Zhaxizhuoma1,†, Pengan Chen1,2,†, Ziniu Wu1,3,†, Jiawei Sun1, Dong Wang1, Peng Zhou2, Nieqing Cao4, Yan Ding1,*, Bin Zhao1,5, Xuelong Li1,6

1Shanghai Artificial Intelligence Laboratory, 2The University of Hong Kong, 3University of Bristol, 4Xi’an Jiaotong-Liverpool University, 5Northwestern Polytechnical University, 6Institute of Artificial Intelligence, China Telecom Corp Ltd

†Equal contribution, *Corresponding author: Yan Ding [yding25 (at) binghamton.edu]

[Project page] [Paper] [Code] [Video]

Abstract

This paper presents AlignBot, a novel framework designed to optimize VLM-powered customized task planning for household robots by effectively aligning with user reminders. In domestic settings, aligning task planning with user reminders poses significant challenges due to the limited quantity, diversity, and multimodal nature of the reminders themselves. To address these challenges, AlignBot employs a fine-tuned LLaVA-7B model, functioning as an adapter for GPT-4o. This adapter model internalizes diverse forms of user reminders, such as personalized preferences, corrective guidance, and contextual assistance, into structured cues that prompt GPT-4o to generate customized task plans. Additionally, AlignBot integrates a dynamic retrieval mechanism that selects relevant historical interactions as prompts for GPT-4o, further enhancing task-planning accuracy. To validate the effectiveness of AlignBot, experiments were conducted in a real-world household environment. A multimodal dataset with 1,500 entries derived from volunteer reminders was used for training and evaluation. The results demonstrate that AlignBot significantly improves customized task planning, outperforming existing LLM- and VLM-powered planners by interpreting and aligning with user reminders: it achieves an 86.8% success rate compared to 21.6% for the vanilla GPT-4o baseline, a 65-percentage-point improvement and more than four times greater effectiveness.

🛠️ Installation Steps

Create a Virtual Environment and Install Dependencies

conda create -n AlignBot python=3.11
conda activate AlignBot
pip install -r requirements.txt

⚙️ LLaVA Training with LLaMA Factory

If you'd like to train LLaVA, this guide will help you get started using the LLaMA Factory framework.

  1. Install LLaMA Factory:

    conda activate AlignBot
    git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
    cd LLaMA-Factory
    pip install -e ".[torch,metrics]"

    Use pip install --no-deps -e . to resolve package conflicts.

  2. Fine-Tuning with LLaMA Board GUI

    llamafactory-cli webui

    Alternatively, you can use the following three commands to run LoRA fine-tuning, inference, and merging of the model:

    llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
    llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
    llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml

    Models can be fine-tuned either through the GUI or via the CLI commands above. The GUI exposes detailed parameter adjustments, and the fine-tuned model is saved under LLaMA-Factory/saves. Before training, place the entire training dataset in LLaMA-Factory/data and add a dataset description to dataset_info.json; see config/dataset_info.json in this repository for how to fill it in, and the sketch below for a minimal example.
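
    As an illustrative sketch only (it assumes LLaMA-Factory's multimodal "sharegpt" dataset format; the dataset name, file name, and column names below are placeholders, and config/dataset_info.json remains the authoritative reference), a dataset_info.json entry might look like:

    {
      "alignbot_demo": {
        "file_name": "alignbot_demo.json",
        "formatting": "sharegpt",
        "columns": {
          "messages": "messages",
          "images": "images"
        }
      }
    }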

  3. Deploy an OpenAI-style API with vLLM via LLaMA Factory

    API_PORT=8000 llamafactory-cli api /AlignBot/config/llavaapi_config.yaml

    Fill in the original (base) model path and the fine-tuned model path in llavaapi_config.yaml; a sketch of the relevant fields is shown below, and a sample request against the deployed endpoint follows this list.
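
    The exact schema is defined by the config file shipped with this repository; as a minimal sketch assuming LLaMA-Factory's usual inference config keys (model_name_or_path for the base model, adapter_name_or_path for the fine-tuned LoRA weights; both paths below are placeholders), the relevant fields might look like:

    model_name_or_path: llava-hf/llava-1.5-7b-hf                  # original (base) LLaVA-7B model
    adapter_name_or_path: LLaMA-Factory/saves/llava-7b/lora/sft   # fine-tuned adapter weights
    template: llava
    finetuning_type: lora
    infer_backend: vllm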

  4. For more details on training LLaVA with LLaMA Factory, please visit the official repository: https://github.com/hiyouga/LLaMA-Factory/
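
Once the API server from step 3 is running, it exposes an OpenAI-style chat completions endpoint on the configured port. The request below is only an illustrative sketch: the model name, prompt, and image URL are placeholders, and whether image inputs can be passed in this OpenAI-style image_url form depends on the LLaMA-Factory version in use.

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
            "model": "llava",
            "messages": [
              {"role": "user", "content": [
                {"type": "text", "text": "What objects are on the kitchen counter?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/kitchen.jpg"}}
              ]}
            ]
          }'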

🦾 Getting Started

Use the following command to run the model:

python main.py --mode llava --img use_url

🏷️ License

This repository is released under the MIT license. See LICENSE for additional details.