zjysteven / lmms-finetune

A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, qwen-vl, qwen2-vl, phi3-v etc.
Apache License 2.0
162 stars, 21 forks

Enabling finetuning of vision encoder and projector #14

Closed by zjysteven 3 months ago

zjysteven commented 3 months ago
  1. As the title says, the vision encoder now supports both 1) full finetuning and 2) LoRA, and the vision projector supports full finetuning. At the moment the vision projector is rather lightweight compared with the vision encoder and the LLM within the model, so supporting only full finetuning for the projector should be fine.
  2. The updates also fix #11, where previously, because of partial string matching, the linear layers within CLIP's ViT were also included as LoRA target modules.
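
To illustrate the kind of problem described in the second point, here is a minimal sketch in plain Python. It is not the repository's actual fix. The module names, the `language_model.` prefix, and both helper functions are hypothetical, chosen only to resemble a LLaVA-style model where the LLM and the CLIP ViT share layer names such as `q_proj`.

```python
# Hypothetical module names; a real model would expose these via named_modules().
MODULE_NAMES = [
    "vision_tower.encoder.layers.0.self_attn.q_proj",
    "vision_tower.encoder.layers.0.self_attn.v_proj",
    "multi_modal_projector.linear_1",
    "language_model.layers.0.self_attn.q_proj",
    "language_model.layers.0.self_attn.v_proj",
]


def match_by_suffix(names, targets):
    """Partial matching: keep any module whose last name component is a target."""
    return [
        n for n in names
        if any(n == t or n.endswith("." + t) for t in targets)
    ]


def match_llm_only(names, targets, llm_prefix="language_model."):
    """Stricter matching: additionally require the (assumed) LLM prefix."""
    return [n for n in match_by_suffix(names, targets) if n.startswith(llm_prefix)]


targets = ["q_proj", "v_proj"]

# Partial matching also picks up the vision tower's linear layers.
loose = match_by_suffix(MODULE_NAMES, targets)

# Restricting by prefix keeps only the LLM's layers; these full names could
# then be passed as explicit LoRA targets instead of the short suffixes.
strict = match_llm_only(MODULE_NAMES, targets)

print(loose)
print(strict)
```

With short names like `q_proj` as targets, the loose matcher selects the vision tower's attention projections as well as the LLM's. Resolving targets to full module names first makes it explicit which component receives LoRA adapters, so the vision encoder is adapted only when that is requested.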