A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, qwen-vl, qwen2-vl, phi3-v etc.
Apache License 2.0
Enabling finetuning of vision encoder and projector #14
As the title says, this PR adds support for 1) full fine-tuning and 2) LoRA for the vision encoder, and full fine-tuning for the vision projector. Since the vision projector is lightweight compared with the vision encoder and the LLM, supporting only full fine-tuning for it should be fine.
The updates also fix #11, where previously, due to partial string matching, the linear layers within CLIP's ViT were also included as LoRA target modules.
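The fix for the target-module bug can be sketched as follows. The idea is to match on the full dotted module name rather than a bare substring, so that layers under the vision tower are never picked up by accident. Names such as `vision_tower` and the helper `find_lora_target_modules` are illustrative assumptions, not the exact identifiers used in this repo:

```python
import torch.nn as nn

def find_lora_target_modules(model, exclude_prefixes=("vision_tower",)):
    # Collect full (dotted) names of nn.Linear modules, skipping any
    # module whose qualified name starts with an excluded prefix.
    # Matching on the full name avoids the partial-string bug where a
    # pattern like "proj" also matched linear layers inside the ViT.
    # NOTE: "vision_tower" is a hypothetical prefix for illustration.
    return [
        name
        for name, module in model.named_modules()
        if isinstance(module, nn.Linear) and not name.startswith(exclude_prefixes)
    ]

# Toy model mimicking a multimodal layout: one linear layer in the
# vision encoder, one in the language model.
toy = nn.ModuleDict({
    "vision_tower": nn.ModuleDict({"fc": nn.Linear(4, 4)}),
    "language_model": nn.ModuleDict({"q_proj": nn.Linear(4, 4)}),
})
targets = find_lora_target_modules(toy)  # only the LLM layer survives
```

The returned names can then be passed as `target_modules` when building the LoRA config, keeping the vision encoder's layers out of the adapter unless they are opted in explicitly.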