Visit our online demo here.
```shell
conda create -n modaverse python=3.9 -y
conda activate modaverse
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia

git clone --recursive https://github.com/xinke-wang/ModaVerse.git
cd ModaVerse
pip install -r requirements.txt
pip install -e .

rm -rf ImageBind/requirements.txt
cp requirements.txt ImageBind/requirements.txt
cd ImageBind
pip install -e .
cd ..

mkdir .checkpoints && cd .checkpoints
```
Follow these instructions to obtain Vicuna's 7b-v0 delta weights and apply them to the LLaMA pretrained model.
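The official Vicuna release applies its delta weights with FastChat's `apply_delta` tool; a sketch of that step is shown below. The local paths are placeholders, and the delta repository name assumes the v0 release published by lmsys on HuggingFace.

```shell
# Install FastChat, then merge the v0 delta weights into a local copy of the
# base LLaMA-7B weights, writing the merged model into .checkpoints/7b_v0.
# Replace /path/to/llama-7b with the location of your LLaMA weights.
pip install fschat
python -m fastchat.model.apply_delta \
    --base-model-path /path/to/llama-7b \
    --target-model-path .checkpoints/7b_v0 \
    --delta-path lmsys/vicuna-7b-delta-v0
```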
Then, download the ModaVerse pretrained model from one of the following sources:
| Model | Foundation LLM | HuggingFace | GoogleDrive | Box |
|---|---|---|---|---|
| ModaVerse-7b-v0 | Vicuna-7b-V0 | Model | Model | Model |
| ModaVerse-chat | Coming Soon | | | |
Next, manually download the ImageBind model, or it will be downloaded automatically to `.checkpoints/` when running the ModaVerse code. Finally, place all the weights in the `.checkpoints/` folder, following the structure below:
```
.checkpoints/
├── 7b_v0
│   ├── config.json
│   ├── generation_config.json
│   ├── model-00001-of-00003.safetensors
│   ├── model-00002-of-00003.safetensors
│   ├── model-00003-of-00003.safetensors
│   ├── model.safetensors.index.json
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json
│   └── tokenizer.model
├── imagebind_huge.pth
└── ModaVerse-7b
    ├── added_tokens.json
    ├── config.json
    ├── config.py
    ├── pytorch_model.pt
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    └── tokenizer.model
```
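A quick way to catch a misplaced weight file before launching the model is to check the folder against the expected layout. This is a minimal sketch (not part of the ModaVerse codebase); the file list is abbreviated from the tree above.

```python
from pathlib import Path

# A subset of the files expected under .checkpoints/, taken from the
# directory tree above. Extend this list to cover every file if desired.
EXPECTED = [
    "7b_v0/config.json",
    "7b_v0/tokenizer.model",
    "imagebind_huge.pth",
    "ModaVerse-7b/config.json",
    "ModaVerse-7b/pytorch_model.pt",
    "ModaVerse-7b/tokenizer.model",
]

def missing_checkpoints(root=".checkpoints"):
    """Return the expected checkpoint files that are absent under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).exists()]

if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoint files:", ", ".join(missing))
    else:
        print("All expected checkpoint files found.")
```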
A simple example of using the model is as follows:
```python
from modaverse.api import ModaVerseAPI

ModaVerse = ModaVerseAPI()

# Text-only instruction
text_instruction = 'Please generate an audio that a dog is barking.'
ModaVerse(text_instruction)

# With multi-modal input
text_instruction = 'Please generate an audio of the sound for the animal in the image.'
ModaVerse(text_instruction, ['assets/media/image/cat.jpg'])
```
The output is saved in the `output` folder by default.
Running inference with the fully equipped generators for all three modalities' diffusion models may require at least 40 GB of GPU memory. If you lack sufficient memory, consider setting `meta_response_only=True` to receive only the meta response from the model, then customize the parser and generator to fit your needs:

```python
ModaVerse = ModaVerseAPI(meta_response_only=True)
```
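The exact format of the meta response is not documented here, so any custom parser must match the real output. Purely as an illustration, assuming the model emits modality tags such as `<AUD>`, `<IMG>`, or `<VID>` followed by a generation prompt (a hypothetical format, not confirmed by the source), a parser might look like:

```python
import re

# Hypothetical meta-response format: modality tags like <AUD>, <IMG>, <VID>,
# each followed by the prompt to hand to that modality's generator.
# Adapt the pattern to ModaVerse's actual meta-response output.
TAG_PATTERN = re.compile(
    r"<(AUD|IMG|VID)>\s*(.*?)\s*(?=<(?:AUD|IMG|VID)>|$)", re.DOTALL
)

def parse_meta_response(meta_response):
    """Return (modality_tag, prompt) pairs found in the meta response."""
    return [(tag, prompt) for tag, prompt in TAG_PATTERN.findall(meta_response)]
```

Each extracted pair can then be routed to a generator of your choice (e.g. an audio or image diffusion model).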
To run the demo locally:

```shell
python demo.py
```
If you find ModaVerse useful in your research or applications, please consider citing:
```bibtex
@article{wang2024modaverse,
  title={ModaVerse: Efficiently Transforming Modalities with LLMs},
  author={Wang, Xinyu and Zhuang, Bohan and Wu, Qi},
  journal={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}
```
We would like to thank the authors of the following repositories for their valuable contributions: ImageBind, MiniGPT-4, Vicuna, Stable Diffusion, AudioLDM, NextGPT, VideoFusion