zjysteven / lmms-finetune

A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, qwen-vl, qwen2-vl, phi3-v etc.
Apache License 2.0

Input truncation in collator #49


ashwinpra commented 4 days ago

Hi! I have a few questions about the truncation being done in the data collator.

https://github.com/zjysteven/lmms-finetune/blob/b3a68751d4631e7de5441f1c81cde982119991a4/collators/llava_onevision.py#L133-L141

What issues were faced when truncation was set to True?

When you directly truncate the tensor, isn't it possible that an image near the end of the sequence gets cut partway through? For instance, say an image's tokens span indices 1000 to 2000 and you truncate the sequence at index 1500. Wouldn't that result in an error?

Thanks in advance!

zjysteven commented 4 days ago

What issues were faced when truncation was set to True?

We rely on the return_assistant_tokens_mask argument of apply_chat_template to automatically identify the assistant tokens, which is what we use to construct the labels. As noted in the code comment, when truncation=True is passed to apply_chat_template, the returned assistant tokens mask came back wrong when I tried it earlier. You can try it yourself to confirm.
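
For context, this is the pattern in question; a minimal sketch assuming a `tokenizer` whose chat template contains `{% generation %}` markers (the `conversation` variable and the `IGNORE_INDEX` value are illustrative, not the exact collator code):

```python
import torch

IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss

# apply_chat_template can return a mask marking which tokens belong to
# assistant turns; that mask decides which tokens get supervised.
out = tokenizer.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
    return_assistant_tokens_mask=True,
    # note: no truncation=True here -- when I tried passing it,
    # the returned assistant mask came back wrong (see above)
)
input_ids = torch.tensor(out["input_ids"])
assistant_mask = torch.tensor(out["assistant_masks"])

# Supervise only the assistant tokens; everything else is masked out of the loss.
labels = input_ids.clone()
labels[assistant_mask == 0] = IGNORE_INDEX
```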

an image near the end of the sequence gets cut partway through?

This is indeed possible, but 1) it wouldn't cause a technical error that fails training, and 2) if a sample is longer than the maximum length, having some image tokens truncated simply cannot be avoided.
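
To make the second point concrete, here is a toy illustration (the token id and lengths are made up; this is not the collator's actual code) of what slicing the already-expanded sequence does when an image run straddles the cut:

```python
import torch

IMAGE_TOKEN_ID = 32000  # hypothetical id of the expanded <image> tokens
MAX_LEN = 1500

# Pretend the image's tokens occupy positions 1000..1999 of a 2500-token sample.
input_ids = torch.arange(2500)
input_ids[1000:2000] = IMAGE_TOKEN_ID

# Plain tensor slicing simply cuts the image run at MAX_LEN.
truncated = input_ids[:MAX_LEN]
print((truncated == IMAGE_TOKEN_ID).sum().item())  # 500 of the 1000 image tokens remain

# As noted above, the slice itself doesn't crash anything; the cost is that the
# information carried by the dropped image tokens is gone. The usual workaround
# is a larger max length or shorter samples.
```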

Let me know if this makes sense.

ashwinpra commented 1 day ago

Hi, that clears up some of my doubts, but I have a related question.

Let's say my prompt is something like `<image><image>\nWhat is the difference between the two images?`, and the two images are `a.jpg` and `b.jpg`.

Now say I'm doing left truncation and the image tokens corresponding to the first `<image>` tag get truncated. Could the second `<image>` tag accidentally get replaced by the image tensor of `a.jpg` instead of `b.jpg`?

https://github.com/zjysteven/lmms-finetune/blob/86895101a7f794c47cb3acc1061d0a148bc0b1df/collators/llava_onevision.py#L225-L230 In these lines I can see that the truncated prompt (`input_ids`) and the image tensors (`vision_inputs`) are returned separately. How are they combined during training?

zjysteven commented 1 day ago

This is a very specific corner case that I'm not sure about either. You can look at the transformers source code to see how the vision embeddings are processed during training: https://github.com/huggingface/transformers/blob/f73f5e62e2383c1cb6975fca70082d6dc51ec6f2/src/transformers/models/llava_onevision/modeling_llava_onevision.py#L667-L693
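
For reference, a greatly simplified sketch of the mechanism in that code (the real implementation additionally handles image newline tokens, unpadding, and multi-resolution features; `image_token_id` and the shapes here are illustrative): the placeholder positions in the text embeddings are filled with the projected vision features, in order.

```python
import torch

def merge_vision_features(inputs_embeds, input_ids, image_features, image_token_id):
    """Simplified sketch of how LLaVA-style models inject vision features.

    inputs_embeds:  (batch, seq_len, hidden) embeddings of input_ids
    image_features: (total_image_tokens, hidden) projected vision features for
                    all images in the batch, concatenated in order of appearance
    """
    # Locate the expanded <image> placeholder tokens in the text sequence.
    special_image_mask = (input_ids == image_token_id).unsqueeze(-1)
    special_image_mask = special_image_mask.expand_as(inputs_embeds)

    # Fill the placeholder slots with vision features, strictly in order.
    # In this simplified version nothing ties a feature to a particular file,
    # so whether a truncated first image shifts features onto the second
    # placeholder (or triggers a count-mismatch error) is exactly what the
    # linked transformers code determines.
    return inputs_embeds.masked_scatter(
        special_image_mask, image_features.to(inputs_embeds.dtype)
    )
```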