A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, qwen-vl, qwen2-vl, phi3-v etc.
Hi!! We are trying to LoRA-finetune LLaVA-Interleave for an autocompletion task on a dataset (DialogCC) that can contain many images (>10) per conversation.
Is it possible to reduce the number of tokens for each image?
We want the truncation side of the tokenizer to be left; will setting it cause any alignment issues with the images?
For example, say the data is this
"image": ["1.jpg", "2.jpg"],
"conversations": [
{
"from": "human",
"value": "<image> what is difference between the prev image and this <image>"
},
...
and left truncation makes it

```
...difference between the prev image and this <image>
```
I don’t think there is an easy way to reduce the number of image tokens (say, by changing configurations), as the model is pretrained with very specific preprocessing that is closely tied to how the vision encoder (SigLIP in the case of llava-interleave) was trained. There are many research works on pruning vision tokens, though, which you could look at.
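For reference, the per-image token count is fixed by the vision encoder's patch grid rather than by any finetuning option. A rough sketch of how to check it (the checkpoint name and the 384/14 geometry below are assumptions; verify against the config of the model you are actually finetuning):

```python
# Inspect how many tokens one image consumes for a llava-interleave checkpoint.
# The checkpoint name is an assumption -- substitute the one you finetune.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("llava-hf/llava-interleave-qwen-0.5b-hf")
vision = cfg.vision_config

patches_per_side = vision.image_size // vision.patch_size  # e.g. 384 // 14 = 27
tokens_per_image = patches_per_side ** 2                   # e.g. 729 tokens
print(f"each <image> expands to ~{tokens_per_image} tokens")
```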
I would imagine there will be misalignment, and this behavior is beyond the control of lmms-finetune (it is defined by the forward pass of llava-interleave as implemented in huggingface transformers). Maybe you can do a quick test by running inference on your example with left truncation.
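Something along these lines could serve as that quick test. This is an untested sketch: the checkpoint name, the `max_length` value, and the assumption that the processor forwards truncation kwargs to its tokenizer may all need adjusting.

```python
# Untested sketch: force left truncation so the first <image> placeholder is
# dropped, then check whether the number of placeholders still matches the
# number of images being passed in.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-interleave-qwen-0.5b-hf"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

processor.tokenizer.truncation_side = "left"

prompt = "<image> what is difference between the prev image and this <image>"
images = [Image.open("1.jpg"), Image.open("2.jpg")]

# max_length is an arbitrary small value, chosen only to force truncation.
inputs = processor(
    text=prompt,
    images=images,
    truncation=True,
    max_length=32,
    return_tensors="pt",
)

n_placeholders = (inputs["input_ids"] == model.config.image_token_index).sum().item()
print("image placeholders after truncation:", n_placeholders)
print("pixel_values shape:", inputs["pixel_values"].shape)

# If these two no longer agree, the forward pass will either raise an error or
# silently pair the wrong image features with the remaining placeholder.
output = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(output[0], skip_special_tokens=True))
```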
To follow up on the example above: after left truncation, will the remaining <image> correspond to 1.jpg or 2.jpg?