
Official implementation of the Matcha agent. Paper: https://arxiv.org/abs/2303.08268
Project page: https://matcha-agent.github.io/
Official Implementation of Matcha Agent 🍵~🤖 ![](https://img.shields.io/badge/License-Apache_2.0-green) ![](https://img.shields.io/badge/Status-Full_Release-blue) ![https://github.com/xf-zhao/Matcha-agent/releases/tag/v1.0](https://img.shields.io/badge/version-v1.0-blue) ![](https://img.shields.io/badge/Paper-Arxiv-blue) ![](https://img.shields.io/badge/Conference-IROS'23-forestgreen)

---

🔔 News

Contents

🎥 Demo Video

Matcha-agent demo

🔨 Install Dependencies

🕹 Robotics

The experimental task is built on top of RLBench, but with the default robot replaced by our own NICOL robot, a desktop-based humanoid robot.

Install RLBench and NICOL Robot

git clone git@github.com:xf-zhao/Matcha-agent.git
# option 1: manually install CoppeliaSim v4.4, then install the Python requirements
cd Matcha-agent && pip install -r NICOL/requirements.txt

# option 2: inside docker
docker build --progress=plain -t matcha-agent:latest .
docker container run -it --privileged --gpus all --net=host --entrypoint="" -v /tmp/.X11-unix:/tmp/.X11-unix -e DISPLAY=$DISPLAY matcha-agent /bin/bash

Run NICOL demo with RLBench tasks

python3 NICOL/demo.py
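Beyond the demo script, the NICOL tasks keep RLBench's usual Environment/Task interface. The snippet below is a minimal sketch of launching such a task loop; the module paths follow recent upstream RLBench releases and ReachTarget is only a placeholder task, not this repository's NICOL task.

```python
# Minimal sketch of an RLBench-style task loop (assumes the standard RLBench
# API; NICOL-specific task classes and action dimensions will differ).
from rlbench.environment import Environment
from rlbench.action_modes.action_mode import MoveArmThenGripper
from rlbench.action_modes.arm_action_modes import JointVelocity
from rlbench.action_modes.gripper_action_modes import Discrete
from rlbench.observation_config import ObservationConfig
from rlbench.tasks import ReachTarget  # placeholder task for illustration

env = Environment(
    action_mode=MoveArmThenGripper(
        arm_action_mode=JointVelocity(), gripper_action_mode=Discrete()),
    obs_config=ObservationConfig(),
    headless=True)
env.launch()

task = env.get_task(ReachTarget)
descriptions, obs = task.reset()  # language descriptions + initial observation
print(descriptions)

env.shutdown()
```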

🌇 Vision

Visual detection is done with ViLD, an open-vocabulary object detection model. Although the vision task in our demo is simple, we use ViLD for its better generalization.

Install ViLD requirements

Since the library dependencies of ViLD may conflict heavily with other installed packages, we recommend installing the ViLD model in a separate environment and launching it as an HTTP server.

conda create -n vild python=3.9
conda activate vild
pip install -r requirements.txt
# Download weights
gsutil cp -r gs://cloud-tpu-checkpoints/detection/projects/vild/colab/image_path_v2 ./

Launch Flask server for ViLD

sh launch_vild_server.sh

The ViLD server will then be available at: 0.0.0.0:8848/api/vild
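Other modules can then query the detector over plain HTTP. The client below is a minimal sketch; the request and response fields (an image path and open-vocabulary category names in, detections out) are assumptions for illustration, not documented API behaviour.

```python
# Hypothetical client for the ViLD HTTP endpoint; field names are assumptions.
import requests

resp = requests.post(
    "http://0.0.0.0:8848/api/vild",
    json={
        "image_path": "/tmp/nicol_rgb.png",          # hypothetical image location
        "categories": ["red block", "yellow cup"],   # open-vocabulary queries
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # e.g. detected categories with boxes and scores
```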

🔉 Sound

The sound module requires PyTorch, TorchAudio, and other audio-related packages that may conflict with the robotics and vision configurations. As with the vision module, we deploy it in an independent environment.

Install sound module requirements

conda create -n sound python=3.9
conda activate sound
pip install -r requirements.txt

Offline Neural Network Training for Sound Classification

We train a sound classification neural network.

python train.py

This training process includes
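The exact steps are implemented in train.py. As a rough illustration of this kind of pipeline, the sketch below trains a small CNN on log-mel features with TorchAudio; the architecture, number of classes, and hyperparameters are assumptions, not the repository's actual configuration.

```python
# Illustrative sketch of an offline sound classifier: log-mel features + small CNN.
import torch
import torch.nn as nn
import torchaudio

NUM_CLASSES = 4  # assumed number of sound categories

mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)

model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(), nn.Linear(32, NUM_CLASSES),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(waveform: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step on a batch of raw waveforms [B, T] with labels [B]."""
    feats = mel(waveform).unsqueeze(1)  # [B, 1, n_mels, frames]
    loss = criterion(model(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```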

Launch the sound module as a server

sh launch_sound_server.sh

The sound server will then be available at: 0.0.0.0:8849/api/sound
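The agent can then query it in the same way as the vision server. A minimal client sketch, with an assumed payload (a path to the recorded impact sound) and an assumed response format:

```python
# Hypothetical client for the sound-classification endpoint; field names are assumptions.
import requests

resp = requests.post(
    "http://0.0.0.0:8849/api/sound",
    json={"audio_path": "/tmp/knock.wav"},  # hypothetical recording location
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # e.g. a predicted material label with a confidence score
```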

🦙 Large Language Models (LLMs) Configuration

In the original Matcha-agent paper, we used the OpenAI API models text-davinci-003 and text-ada-001 as the backend LLMs. Nowadays, many open-source LLMs are available. In the v1.0 release, we use the Vicuna-13b model, following the FastChat documentation.

Note that the LLM is used in completions mode rather than chat-completions mode, i.e. there is no role-play, since we introduce the roles manually in the prompts.
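For reference, a minimal sketch of calling the model in plain-completions mode through FastChat's OpenAI-compatible server (assumed to run at http://localhost:8000/v1, using the pre-1.0 openai Python client); the prompt text and registered model name are illustrative only.

```python
# Completions-mode call via a FastChat OpenAI-compatible server (assumed setup).
import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "EMPTY"  # FastChat does not check the key

prompt = (
    "Human: Knock on the object in front of you and tell me what it is made of.\n"
    "Robot:"
)
completion = openai.Completion.create(
    model="vicuna-13b-v1.5",  # name under which the worker registered the model
    prompt=prompt,
    max_tokens=128,
    temperature=0.0,
    stop=["Human:"],
)
print(completion.choices[0].text)
```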

🍵~🤖 Run Matcha-agent

python main.py

Optional parameters:

🐞 Error Debugging

⭐ Acknowledgement

The 3D meshes and configuration of the NICOL robot can be found in the *.ttt file. We thank Seed Robotics for authorizing us to share the RH8D hand models and make them publicly available in this repository.

🔗 Citation

@misc{zhao2023chat,
      title={Chat with the Environment: Interactive Multimodal Perception Using Large Language Models}, 
      author={Xufeng Zhao and Mengdi Li and Cornelius Weber and Muhammad Burhan Hafez and Stefan Wermter},
      year={2023},
      eprint={2303.08268},
      archivePrefix={arXiv},
      primaryClass={cs.RO}
}