A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, qwen-vl, qwen2-vl, phi3-v etc.
Hi!! We are trying to LoRA-finetune LLaVA-Interleave for an autocompletion task on a dataset (DialogCC) that can contain many images (>10) per conversation.
Is it possible to reduce the number of tokens for each image?
We want the truncation side of the tokenizer to be left; will setting it cause any alignment issues with the images?
For example, say the data is this
"image": ["1.jpg", "2.jpg"],
"conversations": [
{
"from": "human",
"value": "<image> what is difference between the prev image and this <image>"
},
...
and left truncation makes it

```
...difference between the prev image and this <image>
```
I don’t think there is an easy way to reduce the number of image tokens (say, by changing configurations), as the model is pretrained with very specific preprocessing that is closely tied to how the vision encoder (SigLIP in the case of llava-interleave) was trained. There are many research works on pruning vision tokens, though, which you could look at.
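For reference, the per-image token count is fixed by the vision encoder's patch grid rather than by any finetuning option. A rough sketch of how to check it (the checkpoint name and the 384/14 geometry below are assumptions; verify against the config of the model you are actually finetuning):

```python
# Inspect how many tokens one image consumes for a llava-interleave checkpoint.
# The checkpoint name is an assumption -- substitute the one you finetune.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("llava-hf/llava-interleave-qwen-0.5b-hf")
vision = cfg.vision_config

patches_per_side = vision.image_size // vision.patch_size  # e.g. 384 // 14 = 27
tokens_per_image = patches_per_side ** 2                   # e.g. 729 tokens
print(f"each <image> expands to ~{tokens_per_image} tokens")
```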
I would imagine there will be misalignment, and this behavior is beyond the control of lmms-finetune (it is defined by the forward pass of llava-interleave as implemented in huggingface transformers). Maybe you can do a quick test by running inference on your example with left truncation.
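Something along these lines could serve as that quick test. This is an untested sketch: the checkpoint name, the `max_length` value, and the assumption that the processor forwards truncation kwargs to its tokenizer may all need adjusting.

```python
# Untested sketch: force left truncation so the first <image> placeholder is
# dropped, then check whether the number of placeholders still matches the
# number of images being passed in.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-interleave-qwen-0.5b-hf"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

processor.tokenizer.truncation_side = "left"

prompt = "<image> what is difference between the prev image and this <image>"
images = [Image.open("1.jpg"), Image.open("2.jpg")]

# max_length is an arbitrary small value, chosen only to force truncation.
inputs = processor(
    text=prompt,
    images=images,
    truncation=True,
    max_length=32,
    return_tensors="pt",
)

n_placeholders = (inputs["input_ids"] == model.config.image_token_index).sum().item()
print("image placeholders after truncation:", n_placeholders)
print("pixel_values shape:", inputs["pixel_values"].shape)

# If these two no longer agree, the forward pass will either raise an error or
# silently pair the wrong image features with the remaining placeholder.
output = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(output[0], skip_special_tokens=True))
```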
To follow up on the example above: after left truncation, will the remaining <image> correspond to 1.jpg or 2.jpg?