
[CVPR 2024] ModaVerse: Efficiently Transforming Modalities with LLMs


🎆🎆🎆 Visit our online demo here.

TODO

Installation

conda create -n modaverse python=3.9 -y
conda activate modaverse
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia

git clone --recursive https://github.com/xinke-wang/ModaVerse.git

cd ModaVerse
pip install -r requirements.txt
pip install -e .

# Replace ImageBind's own requirements with ModaVerse's pinned requirements
# so the editable install below does not pull conflicting versions
rm -rf ImageBind/requirements.txt
cp requirements.txt ImageBind/requirements.txt

cd ImageBind
pip install -e .
cd ..
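
To quickly confirm that the environment was set up correctly, a short check along the following lines can be run. This is a minimal sketch, not part of the repository; it assumes the two editable installs expose themselves as the modaverse and imagebind packages.

# Sanity check for the environment created above.
import torch
import torchvision

print(torch.__version__)          # expect 1.13.1
print(torchvision.__version__)    # expect 0.14.1
print(torch.cuda.is_available())  # should be True on a machine with a CUDA 11.6 GPU

# Both repositories were installed in editable mode, so these imports
# should resolve from the cloned source trees.
import modaverse   # noqa: F401
import imagebind   # noqa: F401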

Prepare Pretrained Models

mkdir .checkpoints && cd .checkpoints

Follow these instructions to obtain and apply Vicuna's 7b-v0 delta weights to the LLaMA pretrained model.

Then, download the ModaVerse pretrained model from one of the following sources:

| Model | Foundation LLM | HuggingFace | GoogleDrive | Box |
|---|---|---|---|---|
| ModaVerse-7b-v0 | Vicuna-7b-V0 | Model | Model | Model |
| ModaVerse-chat | Coming Soon | | | |

Next, manually download the ImageBind model; otherwise, it will be downloaded automatically to .checkpoints/ when the ModaVerse code is first run. Finally, place all the weights in the .checkpoints/ folder, following the structure below:

.checkpoints/
    β”œβ”€β”€ 7b_v0
    β”‚   β”œβ”€β”€ config.json
    β”‚   β”œβ”€β”€ generation_config.json
    β”‚   β”œβ”€β”€ model-00001-of-00003.safetensors
    β”‚   β”œβ”€β”€ model-00002-of-00003.safetensors
    β”‚   β”œβ”€β”€ model-00003-of-00003.safetensors
    β”‚   β”œβ”€β”€ model.safetensors.index.json
    β”‚   β”œβ”€β”€ special_tokens_map.json
    β”‚   β”œβ”€β”€ tokenizer_config.json
    β”‚   └── tokenizer.model
    β”œβ”€β”€ imagebind_huge.pth
    └── ModaVerse-7b
        β”œβ”€β”€ added_tokens.json
        β”œβ”€β”€ config.json
        β”œβ”€β”€ config.py
        β”œβ”€β”€ pytorch_model.pt
        β”œβ”€β”€ special_tokens_map.json
        β”œβ”€β”€ tokenizer_config.json
        └── tokenizer.model
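
Before launching the model, it can help to verify that this layout is in place. The snippet below is a small standalone sketch, not part of the repository; the ImageBind download URL is the one published in the ImageBind repository, so check there if the download fails.

# Sanity-check the .checkpoints/ layout and pre-fetch the ImageBind weights.
from pathlib import Path
import torch

ckpt_root = Path('.checkpoints')
expected = [
    ckpt_root / '7b_v0' / 'config.json',
    ckpt_root / '7b_v0' / 'tokenizer.model',
    ckpt_root / 'ModaVerse-7b' / 'pytorch_model.pt',
    ckpt_root / 'ModaVerse-7b' / 'config.py',
]
for path in expected:
    print('ok     ' if path.exists() else 'MISSING', path)

# Download imagebind_huge.pth if it is not already present
# (URL taken from the ImageBind repository).
imagebind_ckpt = ckpt_root / 'imagebind_huge.pth'
if not imagebind_ckpt.exists():
    torch.hub.download_url_to_file(
        'https://dl.fbaipublicfiles.com/imagebind/imagebind_huge.pth',
        str(imagebind_ckpt),
    )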

Usage

A simple example of using the model is as follows:

from modaverse.api import ModaVerseAPI

ModaVerse = ModaVerseAPI()

# Only Text Instruction
text_instruction = 'Please generate an audio that a dog is barking.'
ModaVerse(text_instruction)

# With Multi-modal Input
text_instruction = 'Please generate an audio of the sound for the animal in the image.'
ModaVerse(text_instruction, ['assets/media/image/cat.jpg'])

The output is saved in the output folder by default.
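
For instance, after the calls above you can list what was written to the default output directory (adjust the path if you configure a different one):

from pathlib import Path

# Show the media files ModaVerse produced during the calls above.
for produced in sorted(Path('output').glob('*')):
    print(produced)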

Running inference with the fully equipped generators for all three modality diffusion models may require at least 40 GB of GPU memory. If you lack sufficient memory, consider setting meta_response_only=True to receive only the model's meta response, and then customize the parser and generators to fit your needs.

ModaVerse = ModaVerseAPI(meta_response_only=True)
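
A minimal sketch of this lighter workflow is given below; whether the call returns the meta response directly, and in what form, is an assumption here, so inspect the actual return value and attach your own parser and generators accordingly.

from modaverse.api import ModaVerseAPI

# Skip the heavy diffusion generators; only the LLM's meta response is produced.
ModaVerse = ModaVerseAPI(meta_response_only=True)

# Assumption: in meta-response-only mode the call hands back the raw meta
# response (e.g. the target modality and a generation prompt) rather than
# writing media to the output folder. Feed it to a parser/generator of
# your own choosing.
meta_response = ModaVerse('Please generate an audio that a dog is barking.')
print(meta_response)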

Running the Demo

python demo.py


Citation

If you find ModaVerse useful in your research or applications, please consider citing:

@article{wang2024modaverse,
  title={ModaVerse: Efficiently Transforming Modalities with LLMs},
  author={Wang, Xinyu and Zhuang, Bohan and Wu, Qi},
  journal={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}

Acknowledgements

We would like to thank the authors of the following repositories for their valuable contributions: ImageBind, MiniGPT-4, Vicuna, Stable Diffusion, AudioLDM, NextGPT, and VideoFusion.