zjysteven / lmms-finetune

A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, qwen-vl, phi3-v, etc.
Apache License 2.0

Any plans to support Qwen2-VL? #36

Closed: KevinH48264 closed this issue 3 days ago

KevinH48264 commented 1 week ago

I believe the format should be similar to Qwen-VL, but I'm wondering if there are plans to support Qwen2-VL as the latest open-source LMM?

zjysteven commented 1 week ago

Yes it is in our plan!

KevinH48264 commented 1 week ago

Oh that's awesome, is there an ETA on it?

zjysteven commented 1 week ago

@linyueqian will be implementing it as I'm on the job market. He will have a better idea on the ETA.

linyueqian commented 6 days ago

Since Qwen2-VL's chat template in Hugging Face does not support the assistant mask yet, we have opened a PR to see if the HF staff can help merge the change. The ETA may vary.
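For reference, here is a rough sketch of what that built-in masking would look like once the template change lands. The `{% generation %}` markers are what such a PR would add to the chat template; the model ID and the text-only conversation are only illustrative.

```python
from transformers import AutoTokenizer

# Illustrative model ID; a text-only conversation keeps the sketch simple.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

conversation = [
    {"role": "user", "content": "Describe the image."},
    {"role": "assistant", "content": "A cat sitting on a windowsill."},
]

# return_assistant_tokens_mask only works if the chat template wraps
# assistant turns in {% generation %} ... {% endgeneration %}, which is
# exactly the change the PR asks for.
encoded = tokenizer.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
    return_assistant_tokens_mask=True,
)

# Build training labels: keep assistant tokens, ignore everything else.
labels = [
    tok if mask == 1 else -100
    for tok, mask in zip(encoded["input_ids"], encoded["assistant_masks"])
]
```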

KevinH48264 commented 6 days ago

Ah I see. I'm assuming it mainly supports masking just the final assistant message?

zjysteven commented 5 days ago

@KevinH48264 That's right. Being able to mask the final assistant message is the easiest way to accurately construct training labels: every non-assistant token gets the ignore index, so the loss is computed only on the assistant's response.

KevinH48264 commented 5 days ago

Does this mean that if I only cared about masking the final assistant message, I could integrate Hugging Face's Qwen2-VL into this repo right now?

zjysteven commented 5 days ago

I may not fully understand the question, but there is always the option of manually masking the assistant message (or equivalently, manually constructing the labels yourself, which can be a bit cumbersome). So yes, you could definitely integrate Qwen2-VL. What we are trying to do is use the built-in functionality of Hugging Face's chat template to achieve that.
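To make the manual route concrete, here is a hedged sketch of constructing labels by hand: render the conversation without the final assistant message, render it again in full, and mask the prompt portion with -100. It assumes the prompt rendering is a strict token prefix of the full rendering, which holds for typical chat templates; the function and variable names are illustrative, not from this repo.

```python
from transformers import AutoTokenizer

IGNORE_INDEX = -100  # ignored by PyTorch's cross-entropy loss

def build_labels(tokenizer, conversation):
    # conversation: list of {"role", "content"} dicts ending with an assistant turn.
    # Render everything up to the final assistant message, plus the
    # generation prompt that precedes the assistant's reply.
    prompt_ids = tokenizer.apply_chat_template(
        conversation[:-1], add_generation_prompt=True, tokenize=True
    )
    # Render the full conversation, final assistant message included.
    full_ids = tokenizer.apply_chat_template(
        conversation, add_generation_prompt=False, tokenize=True
    )
    # Assumes prompt_ids is a prefix of full_ids; mask it out of the loss.
    labels = list(full_ids)
    labels[: len(prompt_ids)] = [IGNORE_INDEX] * len(prompt_ids)
    return full_ids, labels

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
input_ids, labels = build_labels(tokenizer, [
    {"role": "user", "content": "What is in this image?"},
    {"role": "assistant", "content": "A red bicycle leaning against a wall."},
])
```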

linyueqian commented 3 days ago

@KevinH48264 I just updated our codebase to include Qwen2-VL. Feel free to try it out and see if it works.