zjysteven / lmms-finetune

A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, qwen-vl, phi3-v etc.
Apache License 2.0

Issue: Currently all vision modules are frozen for simplicity #3

Closed · matbee-eth closed this issue 1 month ago

matbee-eth commented 1 month ago

Curious why you made that decision?

zjysteven commented 1 month ago
  1. To my knowledge, the LLaVA series models keep the vision backbone frozen throughout training, and I have read in several places that for most models finetuning the vision modules leads to worse performance.
  2. The memory cost is already a bit high due to HF's implementation (https://github.com/huggingface/transformers/blob/0fdea8607d7e01eb0e38a1ebeb7feee30a22f0cf/src/transformers/models/llava/modeling_llava.py#L425); we are actively working on reducing it.
  3. As the title says, it's mostly just for simplicity at the moment, since the project is at an early stage.

That said, we will definitely try to support tuning vision modules as well.
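For reference, here is a minimal sketch of what "freezing all vision modules" means concretely. It assumes the HF `LlavaForConditionalGeneration` layout (a `vision_tower` attribute); it is not the repo's actual code.

```python
# Minimal sketch: freeze the vision backbone so only the LLM
# (and, depending on configuration, the projector) receives gradients.
# Assumes the HF LlavaForConditionalGeneration layout.
from transformers import LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Disable gradients for every vision-tower parameter.
for param in model.vision_tower.parameters():
    param.requires_grad = False
```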

fedshyvana commented 1 month ago

I think one option is to wrap the vision modules with HF PEFT to support, e.g., low-rank updates to the vision module. See the xtuner implementation: https://github.com/InternLM/xtuner/blob/main/xtuner/model/llava.py
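A minimal sketch of that idea, assuming the HF llava layout and a CLIP-style vision encoder (the `target_modules` names are assumptions matching CLIP's attention projections, not the repo's code):

```python
# Sketch: wrap only the vision tower with a PEFT LoRA adapter.
from peft import LoraConfig, get_peft_model
from transformers import LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")

vision_lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Assumed module names for a CLIP-style attention block.
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],
)
# The vision tower is itself a PreTrainedModel (CLIPVisionModel),
# so PEFT can wrap it directly; base vision weights stay frozen.
model.vision_tower = get_peft_model(model.vision_tower, vision_lora_config)
```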

zjysteven commented 1 month ago

Yes, that is the idea. Thank you for sharing your thoughts @fedshyvana

zjysteven commented 1 month ago

Supported with #14. The example scripts now include arguments such as TRAIN_VISION_ENCODER, USE_VISION_LORA, and TRAIN_VISION_PROJECTOR. Feel free to try it out.
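For anyone landing here later, a hypothetical sketch of how flags like these could gate the setup; the variable names mirror the script arguments, but the actual wiring lives in the repo's example scripts and may differ:

```python
from peft import LoraConfig, get_peft_model


def configure_vision_modules(model, train_vision_encoder, use_vision_lora,
                             train_vision_projector, vision_lora_config=None):
    """Hypothetical wiring of the three flags; assumes HF llava attribute names."""
    if use_vision_lora:
        # Low-rank updates only; base vision weights stay frozen.
        model.vision_tower = get_peft_model(model.vision_tower, vision_lora_config)
    elif train_vision_encoder:
        model.vision_tower.requires_grad_(True)   # full finetuning of the encoder
    else:
        model.vision_tower.requires_grad_(False)  # default: frozen

    # The multimodal projector can be toggled independently of the encoder.
    model.multi_modal_projector.requires_grad_(train_vision_projector)
    return model
```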