zjysteven / lmms-finetune

A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, qwen-vl, qwen2-vl, phi3-v etc.
Apache License 2.0

LLaVA-Next repo demo #16

Closed: sarisel closed this issue 2 months ago

sarisel commented 2 months ago

Thanks for the great repository. I finetuned a LLaVA-NeXT-Video model and was wondering whether it is possible to run inference on it via the LLaVA-NeXT demo script. I am aware of the Colab code for inference, but I think it would be helpful to have a demo script that can run inference over a single image/video. Also, what is the difference between llava-hf/LLaVA-NeXT-Video-7B-hf and lmms-lab/LLaVA-NeXT-Video-7B?

zjysteven commented 2 months ago

Hi, thank you for trying it out. The two questions are related, so I will answer the second one first, which should make things clear.

  1. Difference between llava-hf/LLaVA-NeXT-Video-7B-hf and lmms-lab/LLaVA-NeXT-Video-7B. The latter is the official model released by the authors of LLaVA-NeXT (let's call it the official version), while the former is a reimplementation by the huggingface team (let's call it the HF version). The difference lies only in how the modules/components within the model are implemented and organized; the overall architecture and model weights are the same. The HF version is arguably more unified in many respects, including model loading, chat templates, and inference. This is why every model in this repo uses the HF version rather than the official version; otherwise we would have to painfully tailor the finetuning code to each different model.

  2. Because of what we discussed above, it is not possible to run inference on a model finetuned here using the LLaVA-NeXT demo script you linked (it is written for the official version rather than the HF version). You would instead need to refer to the HF version inference examples at https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-hf, which are actually simpler and more transparent in my opinion.
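For reference, a minimal sketch of that style of inference might look like the following (loosely following the pattern on the HF model card; the video path, frame-sampling helper, and generation settings here are illustrative assumptions, not part of this repo):

```python
import av
import numpy as np
import torch
from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration

model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = LlavaNextVideoProcessor.from_pretrained(model_id)

def sample_frames(path, num_frames=8):
    # Illustrative helper: uniformly sample frames from a video with PyAV.
    container = av.open(path)
    stream = container.streams.video[0]
    indices = set(np.linspace(0, stream.frames - 1, num_frames, dtype=int).tolist())
    frames = [
        frame.to_ndarray(format="rgb24")
        for i, frame in enumerate(container.decode(stream))
        if i in indices
    ]
    return np.stack(frames)

# Build the prompt with the HF chat template.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what happens in this video."},
            {"type": "video"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

clip = sample_frames("my_video.mp4")  # hypothetical input path
inputs = processor(text=prompt, videos=clip, return_tensors="pt").to(
    model.device, torch.float16
)

out = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(out[0], skip_special_tokens=True))
```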

Hope this helps.

sarisel commented 2 months ago

Thanks for the clarification. The HF code seems to work with the LoRA adapter (even though I have not merged the adapter back into the base model yet :)), with a minor caveat: the generations seem to cut off mid-sentence compared to the original model's generations. I have only tested this on a couple of videos so far, so it may be coincidental. Also, the base model has many configuration JSON files in the Hugging Face cache. Since the processor is loaded from the base model, I only copied config.json to the finetuned model directory, but wanted to confirm whether this is the correct way.

zjysteven commented 2 months ago

For the actual generation there could be many contributing factors (e.g., training data, training hyperparameters, and inference hyperparameters like temperature/top-p/top-k), so I am not sure I can help much here.
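That said, if the cut-off is simply the generation hitting its length limit, one thing worth checking is the generation call itself. The values below are purely illustrative (reusing the `model`/`inputs` names from the sketch above):

```python
# Illustrative generation settings; a small max_new_tokens is a common reason
# outputs stop mid-sentence.
out = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
)
```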

For loading the model, you can refer to inference.md. For LoRA training, thanks to the PEFT integration in the transformers library, you do not actually need to copy anything from the base model into your finetuned model directory. The adapter_config.json of your finetuned model has a base_model_name_or_path attribute, which automatically loads the base model (and everything else, such as the processor) for you.
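As a rough sketch of what that looks like (assuming the transformers PEFT integration handles this model class; the adapter directory path is hypothetical, and inference.md remains the authoritative reference):

```python
import json
import torch
from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration

adapter_dir = "./checkpoints/llava-next-video-7b-lora"  # hypothetical LoRA output directory

# from_pretrained detects adapter_config.json, pulls in the base model named by
# base_model_name_or_path, and attaches the LoRA adapter on top of it.
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    adapter_dir, torch_dtype=torch.float16, device_map="auto"
)

# The processor is not part of the adapter, so load it from the base model id
# recorded in adapter_config.json.
with open(f"{adapter_dir}/adapter_config.json") as f:
    base_id = json.load(f)["base_model_name_or_path"]
processor = LlavaNextVideoProcessor.from_pretrained(base_id)
```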