An open-source implementation of the LLaVA-NeXT series for facilitating the large multi-modal model community.
Resources: [🤗HuggingFace]
See more details in ModelZoo.md.
Name | ViT | LLM | Weights | MME | SEED | SQA | MMB | MMB-CN | TextVQA | GQA |
---|---|---|---|---|---|---|---|---|---|---|
llava-next-vicuna-7b | CLIP-L-336 | Vicuna-7B | SFT | 1519 | 70.2 | 70.1 | 67.4 | 60.6 | 64.9 | 64.2 |
open-llava-next-vicuna-7b | CLIP-L-336 | Vicuna-7B | PT, SFT | 1540 | 71.1 | 70.7 | 68.5 | 60.7 | 67.2 | 64.3 |
llava-next-llama3-8b | CLIP-L-336 | LLaMA3-8B | SFT | 1591 | 72.7 | 73.4 | 72.6 | 69.0 | 65.0 | 65.5 |
open-llava-next-llama3-8b | CLIP-L-336 | LLaMA3-8B | PT, SFT | 1552 | 74.4 | 77.3 | 74.4 | 70.4 | 69.8 | 65.9 |
Clone this repository and navigate to the Open-LLaVA-NeXT folder:

```bash
git clone https://github.com/xiaoachen98/Open-LLaVA-NeXT.git
cd Open-LLaVA-NeXT
```
Install the package:

```bash
conda create -n llava-next python=3.10 -y
conda activate llava-next
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```
Install additional packages for training:

```bash
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```
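Optionally, a quick import check can confirm the editable install and the FlashAttention build succeeded. This is a minimal sketch, assuming the package exposes the usual `llava` namespace as in upstream LLaVA:

```bash
# Optional sanity check; `llava` is the assumed package namespace (as in upstream LLaVA).
python -c "import llava; print('llava import OK')"
python -c "import flash_attn; print('flash-attn', flash_attn.__version__)"
```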
Follow the instructions in Data.md to prepare the training datasets.
Open-LLaVA-NeXT training consists of two stages: (1) feature alignment stage: use the 558K subset of the LAION-CC-SBU dataset to connect a frozen pretrained vision encoder to a frozen LLM; (2) visual instruction tuning stage: finetune the entire model with 1M completely open-source samples. Detailed data statistics are provided in Visual Instruction Tuning. We take the Vicuna-v1.5-7B variant as an example to present the training and evaluation details.
The Open-LLaVA-NeXT series is trained on A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce `per_device_train_batch_size` and increase `gradient_accumulation_steps` accordingly; using DeepSpeed ZeRO-3 can further reduce the memory requirements. Always keep the global batch size the same: `per_device_train_batch_size` x `gradient_accumulation_steps` x `num_gpus`. A minimal sketch of this bookkeeping follows.
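For example, halving the GPU count while doubling the accumulation steps leaves the global batch size unchanged. The variable names below mirror the corresponding training flags; the numbers are illustrative rather than the shipped defaults.

```bash
# Keep per_device_train_batch_size x gradient_accumulation_steps x num_gpus constant.
NUM_GPUS=8                       # e.g. dropping from 16 GPUs to 8
PER_DEVICE_TRAIN_BATCH_SIZE=8
GRADIENT_ACCUMULATION_STEPS=2    # doubled to compensate for halving the GPU count

GLOBAL_BATCH_SIZE=$((PER_DEVICE_TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS * NUM_GPUS))
echo "global batch size: ${GLOBAL_BATCH_SIZE}"   # 8 * 2 * 8 = 128, matching the finetuning setting below
```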
We use the same set of hyperparameters as LLaVA in finetuning. The hyperparameters used in pretraining and finetuning are both provided below.
Pretraining:

Hyperparameter | Global Batch Size | Projector lr | Epochs | Max length | Weight decay |
---|---|---|---|---|---|
Open-LLaVA-NeXT-7B | 256 | 1e-3 | 1 | 4096 | 0 |
Finetuning:

Hyperparameter | Global Batch Size | LLM lr | Projector lr | Vision Tower lr | Epochs | Max length | Weight decay |
---|---|---|---|---|---|---|---|
Open-LLaVA-NeXT-7B | 128 | 2e-5 | 2e-5 | 2e-6 | 1 | 4096 | 0 |
Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions here.
Pretraining takes around 5 hours for Open-LLaVA-NeXT-7B on 16 x A100 (80G).
Training script with DeepSpeed ZeRO-2: `pretrain.sh`.

Options to note (see the sketch after this list):

- `--mm_projector_type mlp2x_gelu`: the two-layer MLP vision-language connector.
- `--vision_tower openai/clip-vit-large-patch14-336`: CLIP ViT-L/14, 336px.
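For orientation, the two flags above would sit in a DeepSpeed launch command roughly as sketched below. The entry point, ZeRO config path, base checkpoint, and output directory are assumptions modeled on the upstream LLaVA script layout; `pretrain.sh` in this repository is the authoritative command.

```bash
# Hypothetical excerpt of a pretraining launch (not the shipped pretrain.sh):
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path lmsys/vicuna-7b-v1.5 \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --tune_mm_mlp_adapter True \
    --output_dir ./checkpoints/open-llava-next-vicuna-7b-pretrain
    # data paths, batch size, projector learning rate, etc. follow as in pretrain.sh
```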
Training script with DeepSpeed ZeRO-2: `finetune.sh`.
New options to note (see the sketch after this list):

- `--unfreeze_mm_vision_tower True`: finetune the vision tower.
- `--mm_vision_tower_lr 2e-6`: learning rate of the vision tower.
- `--image_aspect_ratio anyres`: process images at variable resolutions.
- `--mm_patch_merge_type spatial_unpad`: unpads the PyTorch tensor of a padded and resized image and inserts learnable newline vectors into the image tokens, so the model becomes aware of two-dimensional spatial information; this is used when processing image tokens.
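As with pretraining, these flags would appear in the finetuning launch roughly as sketched below; the entry point and surrounding arguments are again assumptions modeled on the upstream LLaVA layout, and `finetune.sh` is the authoritative command.

```bash
# Hypothetical excerpt of a finetuning launch (not the shipped finetune.sh):
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path lmsys/vicuna-7b-v1.5 \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --unfreeze_mm_vision_tower True \
    --mm_vision_tower_lr 2e-6 \
    --image_aspect_ratio anyres \
    --mm_patch_merge_type spatial_unpad \
    --output_dir ./checkpoints/open-llava-next-vicuna-7b
    # pretrained projector path, data paths, LLM learning rate, etc. follow as in finetune.sh
```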
See Evaluation.md.
If you find this project useful in your research, please consider citing:
@misc{chen2024open,
title={Open-LLaVA-NeXT: An open-source implementation of LLaVA-NeXT series for facilitating the large multi-modal model community.},
author={Chen, Lin and Xing, Long},
howpublished = {\url{https://github.com/xiaoachen98/Open-LLaVA-NeXT}},
year={2024},
doi={10.5281/zenodo.13935471}
}