Open-LLaVA-NeXT

An open-source implementation of LLaVA-NeXT series for facilitating the large multi-modal model community.

Resources: [🤗HuggingFace]

💡 Highlights

🤖 Model Zoo

See more details in ModelZoo.md.

| Name | ViT | LLM | Weights | MME | SEED | SQA | MMB | MMB-CN | TextVQA | GQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llava-next-vicuna-7b | CLIP-L-336 | Vicuna-7B | SFT | 1519 | 70.2 | 70.1 | 67.4 | 60.6 | 64.9 | 64.2 |
| open-llava-next-vicuna-7b | CLIP-L-336 | Vicuna-7B | PT, SFT | 1540 | 71.1 | 70.7 | 68.5 | 60.7 | 67.2 | 64.3 |
| llava-next-llama3-8b | CLIP-L-336 | LLaMA3-8B | SFT | 1591 | 72.7 | 73.4 | 72.6 | 69.0 | 65.0 | 65.5 |
| open-llava-next-llama3-8b | CLIP-L-336 | LLaMA3-8B | PT, SFT | 1552 | 74.4 | 77.3 | 74.4 | 70.4 | 69.8 | 65.9 |

👨‍💻 ToDo

🔧 Install

  1. Clone this repository and navigate to the Open-LLaVA-NeXT folder

    git clone https://github.com/xiaoachen98/Open-LLaVA-NeXT.git
    cd Open-LLaVA-NeXT
  2. Install the package

    conda create -n llava-next python=3.10 -y
    conda activate llava-next
    pip install --upgrade pip  # enable PEP 660 support
    pip install -e .
  3. Install additional packages for training

    pip install -e ".[train]"
    pip install flash-attn --no-build-isolation
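
To verify the installation, a quick import check can be run from the repository root. This is a minimal sketch and not part of the official setup; it only assumes that the editable install above put the llava package and torch on your path:

    # Sanity check (hypothetical, not part of the official setup): confirm torch and the llava package import cleanly
    python -c "import torch, llava; print('torch', torch.__version__, '| llava package loaded')"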

Data Preparation

Follow the instructions in Data.md to prepare and manage the training datasets.

Training Overview

Open-LLaVA-NeXT training consists of two stages: (1) feature alignment stage: use a 558K subset of the LAION-CC-SBU dataset to connect a frozen pretrained vision encoder to a frozen LLM; (2) visual instruction tuning stage: finetune the entire model with 1M fully open-source data samples. Detailed data statistics are provided in Visual Instruction Tuning. We take the Vicuna-v1.5-7B variant as an example to present the training and evaluation details.

The Open-LLaVA-NeXT series is trained on A100 GPUs with 80GB memory. To train on fewer GPUs, reduce per_device_train_batch_size and increase gradient_accumulation_steps accordingly; DeepSpeed ZeRO-3 can further reduce the memory requirements. Always keep the global batch size the same: per_device_train_batch_size x gradient_accumulation_steps x num_gpus.
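
For example, the 7B finetuning recipe below uses a global batch size of 128; the sketch that follows (with hypothetical per-device values for an 8-GPU machine) shows how to check that an alternative setup still multiplies out to the same number:

    # Hypothetical example: 8 GPUs instead of 16 while keeping the global batch size at 128
    NUM_GPUS=8
    PER_DEVICE_TRAIN_BATCH_SIZE=4
    GRADIENT_ACCUMULATION_STEPS=4
    echo "global batch size = $((PER_DEVICE_TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS * NUM_GPUS))"  # prints 128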

Hyperparameters

We use the same set of hyperparameters as LLaVA in finetuning. The hyperparameters used in both pretraining and finetuning are provided below.

  1. Pretraining

| Hyperparameter | Global Batch Size | Projector lr | Epochs | Max length | Weight decay |
| --- | --- | --- | --- | --- | --- |
| Open-LLaVA-NeXT-7B | 256 | 1e-3 | 1 | 4096 | 0 |

  2. Finetuning

| Hyperparameter | Global Batch Size | LLM lr | Projector lr | Vision Tower lr | Epochs | Max length | Weight decay |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Open-LLaVA-NeXT-7B | 128 | 2e-5 | 2e-5 | 2e-6 | 1 | 4096 | 0 |
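
For reference, the finetuning row above corresponds roughly to trainer arguments like the following. The flag names and script paths are assumptions based on the LLaVA-style trainer this repo builds on, and required arguments such as the model, data, and output paths are omitted, so treat finetune.sh as the source of truth:

    # Hedged sketch, not a complete command: finetuning hyperparameters expressed as LLaVA-style trainer flags.
    # Global batch size 128 = 8 (per device) x 1 (grad accum) x 16 (GPUs); model/data/output paths omitted.
    deepspeed llava/train/train_mem.py \
        --deepspeed ./scripts/zero2.json \
        --learning_rate 2e-5 \
        --mm_projector_lr 2e-5 \
        --mm_vision_tower_lr 2e-6 \
        --num_train_epochs 1 \
        --model_max_length 4096 \
        --weight_decay 0. \
        --per_device_train_batch_size 8 \
        --gradient_accumulation_steps 1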

Pretrain

Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions here.

Pretraining takes around 5 hours for Open-LLaVA-NeXT-7B on 16 x A100 (80G).

Training script with DeepSpeed ZeRO-2: pretrain.sh.
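
A minimal launch sketch follows; the script location is an assumption, so use whatever path pretrain.sh actually has in your checkout:

    # Run the ZeRO-2 feature-alignment pretraining script referenced above.
    # The path below is assumed; adjust it to the actual location of pretrain.sh in the repo.
    bash scripts/pretrain.sh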

Visual Instruction Tuning

  1. Prepare data: follow the instructions for data preparation in Data.md.
  2. Prepare MLP projectors: download our pretrained projectors from the Model Zoo, or specify your own MLP projector obtained after pretraining (a hedged sketch follows the training-script link below).
  3. Start training: visual instruction tuning takes around 20 hours for Open-LLaVA-NeXT-7B on 16 x A100 (80G).

Training script with DeepSpeed ZeRO-2: finetune.sh.
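
If you pretrained your own projector (step 2 above), the sketch below shows the LLaVA-style way of handing it to the trainer; the script path, the flag, and the checkpoint path are assumptions to verify against finetune.sh:

    # Launch visual instruction tuning with the ZeRO-2 script referenced above (path assumed; adjust to your checkout).
    bash scripts/finetune.sh
    # To use your own stage-1 projector, make sure the script passes it to the trainer via the
    # LLaVA-style flag below (placeholder checkpoint path):
    #   --pretrain_mm_mlp_adapter ./checkpoints/open-llava-next-vicuna-7b-pretrain/mm_projector.bin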

New options to note:

Evaluation

See Evaluation.md.

Citation

If you find this project useful in your research, please consider citing:

@misc{chen2024open,
  title={Open-LLaVA-NeXT: An open-source implementation of LLaVA-NeXT series for facilitating the large multi-modal model community.},
  author={Chen, Lin and Xing, Long},
  howpublished = {\url{https://github.com/xiaoachen98/Open-LLaVA-NeXT}},
  year={2024},
  doi={10.5281/zenodo.13935471}
}

❤️ Acknowledgments