
[CVPR 2024] ModaVerse: Efficiently Transforming Modalities with LLMs


🎆🎆🎆 Visit our online demo here.

TODO

Installation

conda create -n modaverse python=3.9 -y
conda activate modaverse
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia

git clone --recursive https://github.com/xinke-wang/ModaVerse.git

cd ModaVerse
pip install -r requirements.txt
pip install -e .

# Replace ImageBind's own requirements with ModaVerse's pinned requirements
# so the editable install below does not pull conflicting versions
rm -rf ImageBind/requirements.txt
cp requirements.txt ImageBind/requirements.txt

cd ImageBind
pip install -e .
cd ..
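
To quickly confirm that the environment was set up correctly, a short check along the following lines can be run. This is a minimal sketch, not part of the repository; it assumes the two editable installs expose themselves as the modaverse and imagebind packages.

# Sanity check for the environment created above.
import torch
import torchvision

print(torch.__version__)          # expect 1.13.1
print(torchvision.__version__)    # expect 0.14.1
print(torch.cuda.is_available())  # should be True on a machine with a CUDA 11.6 GPU

# Both repositories were installed in editable mode, so these imports
# should resolve from the cloned source trees.
import modaverse   # noqa: F401
import imagebind   # noqa: F401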

Prepare Pretrained Models

mkdir .checkpoints && cd .checkpoints

Follow these instructions to obtain and apply Vicuna's 7b-v0 delta weights to the LLaMA pretrained model.

Then, download the ModaVerse pretrained model from one of the following sources:

| Model | Foundation LLM | HuggingFace | GoogleDrive | Box |
|---|---|---|---|---|
| ModaVerse-7b-v0 | Vicuna-7b-V0 | Model | Model | Model |
| ModaVerse-chat | Coming Soon | | | |

Next, manually download the ImageBind model; otherwise, it will be downloaded automatically to .checkpoints/ when the ModaVerse code is first run. Finally, place all the weights in the .checkpoints/ folder, following the structure below:

.checkpoints/
    β”œβ”€β”€ 7b_v0
    β”‚   β”œβ”€β”€ config.json
    β”‚   β”œβ”€β”€ generation_config.json
    β”‚   β”œβ”€β”€ model-00001-of-00003.safetensors
    β”‚   β”œβ”€β”€ model-00002-of-00003.safetensors
    β”‚   β”œβ”€β”€ model-00003-of-00003.safetensors
    β”‚   β”œβ”€β”€ model.safetensors.index.json
    β”‚   β”œβ”€β”€ special_tokens_map.json
    β”‚   β”œβ”€β”€ tokenizer_config.json
    β”‚   └── tokenizer.model
    β”œβ”€β”€ imagebind_huge.pth
    └── ModaVerse-7b
        β”œβ”€β”€ added_tokens.json
        β”œβ”€β”€ config.json
        β”œβ”€β”€ config.py
        β”œβ”€β”€ pytorch_model.pt
        β”œβ”€β”€ special_tokens_map.json
        β”œβ”€β”€ tokenizer_config.json
        └── tokenizer.model
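
Before launching the model, it can help to verify that this layout is in place. The snippet below is a small standalone sketch, not part of the repository; the ImageBind download URL is the one published in the ImageBind repository, so check there if the download fails.

# Sanity-check the .checkpoints/ layout and pre-fetch the ImageBind weights.
from pathlib import Path
import torch

ckpt_root = Path('.checkpoints')
expected = [
    ckpt_root / '7b_v0' / 'config.json',
    ckpt_root / '7b_v0' / 'tokenizer.model',
    ckpt_root / 'ModaVerse-7b' / 'pytorch_model.pt',
    ckpt_root / 'ModaVerse-7b' / 'config.py',
]
for path in expected:
    print('ok     ' if path.exists() else 'MISSING', path)

# Download imagebind_huge.pth if it is not already present
# (URL taken from the ImageBind repository).
imagebind_ckpt = ckpt_root / 'imagebind_huge.pth'
if not imagebind_ckpt.exists():
    torch.hub.download_url_to_file(
        'https://dl.fbaipublicfiles.com/imagebind/imagebind_huge.pth',
        str(imagebind_ckpt),
    )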

Usage

A simple example of using the model is as follows:

from modaverse.api import ModaVerseAPI

ModaVerse = ModaVerseAPI()

# Only Text Instruction
text_instruction = 'Please generate an audio that a dog is barking.'
ModaVerse(text_instruction)

# With Multi-modal Input
text_instruction = 'Please generate an audio of the sound for the animal in the image.'
ModaVerse(text_instruction, ['assets/media/image/cat.jpg'])

The output is saved in the output folder by default.
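
For instance, after the calls above you can list what was written to the default output directory (adjust the path if you configure a different one):

from pathlib import Path

# Show the media files ModaVerse produced during the calls above.
for produced in sorted(Path('output').glob('*')):
    print(produced)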

Running inference with the fully equipped generators for all three modality diffusion models may require at least 40 GB of GPU memory. If you lack sufficient memory, consider setting meta_response_only=True to receive only the model's meta response, and then customize the parser and generators to fit your needs.

ModaVerse = ModaVerseAPI(meta_response_only=True)
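
A minimal sketch of this lighter workflow is given below; whether the call returns the meta response directly, and in what form, is an assumption here, so inspect the actual return value and attach your own parser and generators accordingly.

from modaverse.api import ModaVerseAPI

# Skip the heavy diffusion generators; only the LLM's meta response is produced.
ModaVerse = ModaVerseAPI(meta_response_only=True)

# Assumption: in meta-response-only mode the call hands back the raw meta
# response (e.g. the target modality and a generation prompt) rather than
# writing media to the output folder. Feed it to a parser/generator of
# your own choosing.
meta_response = ModaVerse('Please generate an audio that a dog is barking.')
print(meta_response)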

Running the Demo

python demo.py


Citation

If you find ModaVerse useful in your research or applications, please consider citing:

@article{wang2024modaverse,
  title={ModaVerse: Efficiently Transforming Modalities with LLMs},
  author={Wang, Xinyu and Zhuang, Bohan and Wu, Qi},
  journal={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}

Acknowledgements

We would like to thank the authors of the following repositories for their valuable contributions: ImageBind, MiniGPT-4, Vicuna, Stable Diffusion, AudioLDM, NextGPT, and VideoFusion.