zjysteven / lmms-finetune

A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, qwen-vl, phi3-v etc.
Apache License 2.0

Enabling the finetuning of the latest Large Multimodal Models

Maintainers: Jingyang Zhang, Yueqian Lin @ Duke CEI

We also thank the staff from 🤗huggingface, especially Raushan Turganbay, for their generous discussions and feedback on this project.

About

New large multimodal models (LMMs) are released all the time, yet finetuning them is not always straightforward. This codebase aims to provide a unified, minimal structure for LMM finetuning. Key design ideas include:

The codebase is quite flexible. Despite being at an early stage, it already supports the finetuning of various types of LMMs, including:

See supported_models.md for the full list of supported models; more are on the way. As for training strategies, 1) full finetuning, 2) LoRA, and 3) Q-LoRA are supported for the LLM component, while 1) full finetuning and 2) LoRA are supported for the vision encoder/backbone. A rough sketch of what the LoRA strategy amounts to is shown below.
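
To make the strategies above concrete, here is a minimal, hedged sketch of what the LoRA strategy boils down to, written with plain `transformers` + `peft` rather than this codebase's own wiring; the rank, dropout, and target modules are illustrative assumptions, not lmms-finetune's defaults.

```python
# Illustrative only: LoRA applied to the attention projections of the LLM
# component of a LLaVA-style model. Hyperparameters here are assumptions.
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")

lora_config = LoraConfig(
    r=16,            # low-rank dimension
    lora_alpha=32,   # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # LLM attention projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapters are trainable
```

Q-LoRA follows the same idea but first loads the base model in 4-bit (e.g., via `BitsAndBytesConfig(load_in_4bit=True)`) before attaching the adapters, while full finetuning simply updates the original weights without any adapter.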

TODOS:

:raising_hand: If you would like to have a model available, feel free to open an issue.

What's different from other training frameworks, e.g., LLaMA-Factory, xtuner, swift? These are great projects/frameworks at large scale and with a high degree of optimization. However, due to their scale and complexity, they can be less transparent and harder to get started with (e.g., I personally felt quite lost when trying to use those frameworks, with a bunch of questions like "how should I format my data?"). This codebase (lmms-finetune) is instead designed to be lightweight and simple, so it's much easier to get started quickly and, if you want, to understand almost every detail of the training process. In other words, this is a minimal, workable codebase that supports LMM finetuning while facilitating quick experiments, flexible modifications, and easy integration of new models.

News

Installation

```bash
# clone this repo
git clone https://github.com/zjysteven/lmms-finetune.git

# set up a conda environment
conda create -n lmms-finetune python=3.10 -y
conda activate lmms-finetune
## this will install the latest version of torch
## feel free to change it to a specific version
python -m pip install -r requirements.txt

## optionally install flash attention
python -m pip install --no-cache-dir --no-build-isolation flash-attn
```
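
Not part of the repo's instructions, but a quick optional sanity check that the environment is usable (assuming a CUDA machine):

```python
# Optional post-install probe; nothing here is required by lmms-finetune.
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

try:
    import flash_attn  # present only if the optional flash-attn install above succeeded
    print("flash-attn is installed")
except ImportError:
    print("flash-attn is not installed (optional)")
```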

Usage

A complete example training run (of LLaVA-NeXT-Video-7B) is showcased in this colab notebook, which is a good starting point for getting a sense of how to use this codebase. The following sections provide a more detailed guide on how to finetune a model.

0. See if the model you want to finetune is supported

   Browse [supported_models.md](docs/supported_models.md), or run `python supported_models.py`, which will print something like:

   ```
   Supported models:
     Model ID                   : HuggingFace Path
     ------------------------------------------------
     llava-1.5-7b               : llava-hf/llava-1.5-7b-hf
     llava-1.5-13b              : llava-hf/llava-1.5-13b-hf
     llava-next-video-7b        : llava-hf/LLaVA-NeXT-Video-7B-hf
     llava-next-video-7b-32k    : llava-hf/LLaVA-NeXT-Video-7B-32K-hf
     llava-next-video-34b       : llava-hf/LLaVA-NeXT-Video-34B-hf
     llava-interleave-qwen-0.5b : llava-hf/llava-interleave-qwen-0.5b-hf
     llava-interleave-qwen-7b   : llava-hf/llava-interleave-qwen-7b-hf
     qwen-vl-chat               : Qwen/Qwen-VL-Chat
   ```

   :raised_hand: Don't see the one you want? Check out this [guide](docs/add_new_model.md) for step-by-step instructions on how to add a new model.
1. Prepare your finetuning data

   Similar to LLaVA, we expect the data to be in a json file containing a list of dictionaries, where each dictionary is one sample. A sample can carry an optional `system_prompt`, an image or video path (e.g., `"video": "path/to/video1.mp4"`), and a `conversations` list whose turns are dictionaries with `"from"` and `"value"` keys; a hedged sketch of the full format is given right after this list.
2. Perform finetuning

   Modify the sample training bash script, [example_video.sh](./example_scripts/example_video.sh) or [example_image.sh](example_image.sh) (the two differ only in the model ID and dataset filepath), to specify arguments such as the target model and the data path; comments in the script explain each argument's meaning. Then simply kick off training with `bash example_scripts/example_video.sh` or `bash example_scripts/example_image.sh`. Note that to run the provided [example_video.sh](./example_scripts/example_video.sh) exactly as is, you will need to download the video clips from ShareGPT4Video; see [here](example_data/videos/ego4d/README.md) for instructions.

   :chart_with_upwards_trend: *If you prefer a graphical interface*, simply run `python webui.py` to launch the gradio interface for finetuning.
3. Inference with the finetuned model

   The key is to load the finetuned model correctly; after that, everything works the same as inference with the corresponding model from huggingface. Refer to the [inference documentation](docs/inference.md) for more details, and see [this colab](https://colab.research.google.com/drive/139XypY8_wdLgyLXYE_Zve7Hjd809fVpK?usp=sharing) for a complete example. A hedged loading sketch is also given below.
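
Since the data format from step 1 is easiest to grasp end to end, here is a hedged sketch of building such a json file in Python. The `system_prompt`, `video`, `conversations`, `from`, and `value` field names come from the original example; the `<video>` placeholder token, the `"gpt"` role name, the concrete texts, and the output filename are illustrative assumptions, so double-check them against the repo's data documentation.

```python
# A sketch of the expected LLaVA-style data layout, written out as json.
# The field names follow step 1 above; the turn contents, the <video>
# placeholder, and the "gpt" role are assumptions made for illustration.
import json

samples = [
    {
        "system_prompt": "You are a helpful assistant.",
        "video": "path/to/video1.mp4",
        "conversations": [
            {"from": "human", "value": "<video>\nWhat is happening in this clip?"},
            {"from": "gpt", "value": "A person is slicing vegetables in a kitchen."},
        ],
    },
]

with open("my_finetune_data.json", "w") as f:
    json.dump(samples, f, indent=2)
```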

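And for step 3, here is a hedged loading sketch using plain `transformers` + `peft`, assuming a LoRA run; it is not necessarily the exact procedure in docs/inference.md, and the base model ID and adapter directory are placeholders for whatever you trained.

```python
# Illustrative only: attach a LoRA adapter produced by finetuning to its base
# model, then use the model exactly as you would the stock huggingface checkpoint.
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
from peft import PeftModel

base_id = "llava-hf/llava-1.5-7b-hf"          # base model you finetuned from (placeholder)
adapter_dir = "path/to/your/training/output"  # directory written by your training run (placeholder)

processor = AutoProcessor.from_pretrained(base_id)
model = LlavaForConditionalGeneration.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_dir)  # load the LoRA weights
model = model.merge_and_unload()  # optionally merge the adapter back into the base weights

# From here on, preprocessing and generation are identical to the original model.
```
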
Acknowledgements

We want to thank the huggingface team for actively integrating the newest models into the transformers library. Also, the example finetuning scripts (e.g., this, this, and this) made by HF staff Niels Rogge and Raushan Turganbay are very helpful and lay the foundation for this codebase.

The codebase borrows from, is inspired by, or builds upon the following code, repos, and/or libraries: LLaVA, Qwen, transformers, etc.