zjysteven / lmms-finetune

A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, qwen-vl, qwen2-vl, phi3-v etc.
Apache License 2.0

Is it possible to train just the projector, not the LLM and the encoder? #35

Closed · TonyJiang17 closed this 2 weeks ago

zjysteven commented 2 months ago

Currently we don't support that (we assume one always tunes the LLM). We can add support for finetuning only the encoder in the near future (this strategy seems a bit odd to me, though). For the time being you can approximate it with very small LoRA hyperparameters (r and alpha) for the LLM.
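For example, something along these lines with peft (a rough sketch, not our training script; the checkpoint id and target_modules are illustrative and depend on the model you load):

```python
# Sketch: shrink the LoRA update on the LLM so it is barely changed during
# finetuning. The checkpoint id and target_modules below are illustrative.
from peft import LoraConfig, get_peft_model
from transformers import LlavaNextVideoForConditionalGeneration

model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    "llava-hf/LLaVA-NeXT-Video-7B-hf", torch_dtype="auto"
)

lora_config = LoraConfig(
    r=2,             # very small rank -> tiny low-rank update to the LLM
    lora_alpha=2,    # small scaling keeps the update's effect minimal
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```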

TonyJiang17 commented 2 months ago

Gotcha, thanks @zjysteven! A follow-up question:

Is it possible to finetune with LoRA and then do further LoRA finetuning on top of the already finetuned model? If so, how would that be done? I tried it and got an error saying many modules were not found. I wonder if it's because the PEFT modules were not merged in during the first finetuning. Thanks in advance!

zjysteven commented 2 months ago

Yes, it is supported. See https://github.com/zjysteven/lmms-finetune/issues/29#issuecomment-2313703559 and the discussion in that issue. If those don't help, share more details of your exact case and we can go from there.

TonyJiang17 commented 2 months ago

Thanks @zjysteven. I actually tried that, doing further training on top of an already finetuned model by providing its checkpoint. The issue arises after the second training is complete: the saved final model couldn't be loaded by LlavaNextVideoForConditionalGeneration.from_pretrained.

I am in the middle of trying something: after the first LoRA finetuning is complete, I merge the LoRA weights into the base model, and then use the merged model for the second round of finetuning. The final model from this process can be loaded by LlavaNextVideoForConditionalGeneration.from_pretrained, but the model seems to have collapsed. I am going to look into whether that is a mistake on my part.

zjysteven commented 2 months ago

I see. Yeah, it is something we haven't thoroughly tested, and what you described makes sense to me. Theoretically, though, if we manually merge the LoRA weights before each new round of finetuning, it should work smoothly. I will add this merging step to the codebase in the near future.
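For reference, the manual merge would look roughly like this (a sketch only; paths and the base model id are placeholders, and non-LoRA modules such as the projector may need extra handling):

```python
# Sketch: fold the first-round LoRA weights into the base model, then start the
# second round of finetuning from the merged directory. Paths are placeholders.
from peft import PeftModel
from transformers import (
    LlavaNextVideoForConditionalGeneration,
    LlavaNextVideoProcessor,
)

base_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"
base = LlavaNextVideoForConditionalGeneration.from_pretrained(base_id, torch_dtype="auto")

model = PeftModel.from_pretrained(base, "path/to/first_round_lora_checkpoint")
merged = model.merge_and_unload()              # folds the LoRA deltas into the base weights
merged.save_pretrained("path/to/merged_model")

# Save the processor alongside the weights so from_pretrained works end to end.
LlavaNextVideoProcessor.from_pretrained(base_id).save_pretrained("path/to/merged_model")
```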

Meanwhile, what exactly do you mean by "collapsed"? I can see if I have any input to help diagnose it.

TonyJiang17 commented 1 month ago

@zjysteven I need to test again to make sure, but what I mean by "collapsed" is that the final model started to return gibberish.

The secondary finetuning I am trying to do is a classification task, but I am simply training it as an instruction instead of adding a classification head. So in the training set I give the prompt containing the options, and the gpt output is the correct class for that sample. Do you think that's okay? Any tips? Or do you think I should still use a classification head? If so, how can I add a classification head on top of llava-next-video? Thanks!

zjysteven commented 1 month ago

"Prompt containing options" should be fine, as there are training instances formatted like this in llava-1.5; not exactly sure if this is the case for llava-next-video, but I would assume so.

One tip from llava-1.5: they find that adding an instruction like "Answer the question with a single word/option." at the end of the prompt can help performance (see Sec. 3.2 of "Improved Baselines with Visual Instruction Tuning"). I would suggest trying this.
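For example, a single training sample could look roughly like this (the conversation-style format, keys, and class names below are illustrative, not necessarily the exact data schema your training script expects):

```python
# Illustrative classification sample in a LLaVA-style conversation format.
# The keys, the <video> placeholder, and the options are all made up.
sample = {
    "video": "clips/example_0001.mp4",
    "conversations": [
        {
            "from": "human",
            "value": (
                "<video>\nWhich activity is shown in the video?\n"
                "Options: (A) cooking (B) cycling (C) swimming (D) dancing\n"
                "Answer the question with a single word/option."
            ),
        },
        {"from": "gpt", "value": "B"},
    ],
}
```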

Another thought is that you probably want to be careful with the number of training epochs, the learning rate, and the LoRA hyperparameters when doing the second-round finetuning for classification. If this round is overtrained (e.g., a high learning rate or high lora_r), I could see the model overfitting to the classification task while forgetting how to follow other instructions.

TonyJiang17 commented 1 month ago

Hi @zjysteven, just a quick question, and again thanks for this amazing work.

Can we save a checkpoint every epoch, or based on a number of steps? I ran into the issue that after training for too many epochs on a relatively small dataset, it started to overfit, I think...

zjysteven commented 2 weeks ago

@TonyJiang17 Sorry I missed your last question. Yes, you can configure this with Hugging Face Trainer arguments like save_strategy and save_steps: https://huggingface.co/docs/transformers/v4.45.2/en/main_classes/trainer#transformers.TrainingArguments.save_strategy
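If you are building the TrainingArguments yourself (in the repo's scripts these typically map to command-line flags), it looks roughly like this; the values below are placeholders:

```python
# Sketch: save a checkpoint every epoch (or every N steps) and cap how many
# checkpoints are kept on disk. Values below are placeholders.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./checkpoints",
    save_strategy="epoch",      # or "steps"
    # save_steps=500,           # only used when save_strategy="steps"
    save_total_limit=2,         # keep only the two most recent checkpoints
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=1e-5,
)
```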