yfzhang114 / SliME

✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models

Training Data Pipeline #10

Open · lixu6-alt opened this issue 3 weeks ago

lixu6-alt commented 3 weeks ago

Hi authors:

Thanks for your impressive work! I am currently working on an idea of using the text instruction to guide the fusion of visual tokens, but I am confused about how to process the multi-turn conversations in the training set (like the samples in the LLaVA-665K dataset). In the multi-turn case, there are multiple text instructions to deal with. I noticed that you already use a text-guided router in the model architecture, so I am wondering how you deal with this issue. Do I have to split each multi-turn sample into multiple single-turn samples, or is there a more efficient way to work it out?

Thanks!

yfzhang114 commented 3 weeks ago

You only need to mask all the answers and system prompts, keeping all the questions as the query text. I think splitting each multi-turn sample into multiple single-turn samples is inefficient. An intuitive example: if the user asked two questions about the image, we need to preserve the image details related to all of the questions.
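
For anyone following along, here is a minimal sketch of what "mask the answers, keep the questions" can look like on LLaVA-665K-style data. The `conversations` / `from` / `value` keys follow the LLaVA data format; joining the human turns into one query string is an assumption about how the router's query text is built, not SliME's exact code:

```python
def build_query_text(sample: dict) -> str:
    """Keep every human question; drop gpt answers (and any system prompt)."""
    questions = [
        turn["value"].replace("<image>", "").strip()
        for turn in sample["conversations"]
        if turn["from"] == "human"  # "gpt" turns (answers) are masked out
    ]
    # Join all questions into a single query so the text-guided router can
    # preserve image details relevant to every turn at once.
    return " ".join(questions)


sample = {
    "conversations": [
        {"from": "human", "value": "<image>\nWhat color is the car?"},
        {"from": "gpt", "value": "The car is red."},
        {"from": "human", "value": "Is anyone inside it?"},
        {"from": "gpt", "value": "Yes, a driver is visible."},
    ]
}
print(build_query_text(sample))
# -> "What color is the car? Is anyone inside it?"
```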

lixu6-alt commented 3 weeks ago

@yfzhang114 Thanks for your timely response!! I get your point, but I am concerned whether the text encoder can effectively extract the semantic information from multiple questions at once. For example, by checking the conversation contents in LLaVA-665K, I noticed that most of the questions are unrelated to each other even though they correspond to the same image. Currently, I have split the multi-turn samples of LLaVA-665K, so the sample count has grown from 665K to more than 3,200K, which makes training very inefficient. Still looking for a better solution :)
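
For reference, this is roughly the splitting I did (a sketch assuming the same LLaVA-style `conversations` format with strict human/gpt alternation), which is why the sample count balloons past 3,200K:

```python
IMAGE_TOKEN = "<image>"

def split_multi_turn(sample: dict) -> list[dict]:
    """Turn one multi-turn sample into one single-turn sample per QA pair."""
    turns = sample["conversations"]
    singles = []
    for i in range(0, len(turns) - 1, 2):  # assume (human, gpt) pairs
        human = dict(turns[i])
        # In the LLaVA format only the first turn carries the image token,
        # so re-insert it for later splits (an assumption about the format).
        if IMAGE_TOKEN not in human["value"]:
            human["value"] = f"{IMAGE_TOKEN}\n{human['value']}"
        singles.append({**sample, "conversations": [human, turns[i + 1]]})
    return singles
```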

yfzhang114 commented 3 weeks ago

The text encoder is expected to handle multiple questions, since each question essentially consists of keywords, and it can accommodate a variety of keywords at once. However, since this is still an experimental area, further testing and experimentation are necessary to validate these capabilities.
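
To make the "encode all questions at once" idea concrete, here is a hedged sketch using CLIP's text tower purely as a stand-in for whatever text encoder the router actually uses; note CLIP's 77-token context limit is one practical reason keyword-style compression (discussed below) may help:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# All human questions concatenated into one query, as discussed above.
query = "What color is the car? Is anyone inside it?"
inputs = tokenizer(query, truncation=True, return_tensors="pt")
with torch.no_grad():
    # Pooled embedding of the whole query, usable as the router's guidance.
    query_embedding = text_encoder(**inputs).pooler_output  # (1, hidden_dim)
```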

lixu6-alt commented 3 weeks ago

@yfzhang114 Yeah, maybe pre-processing the instructions is necessary in this case. As you said, we could use an LLM to extract the keywords from each instruction, then integrate these keywords and feed them to the text encoder just once.
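
Something like this hypothetical pre-processing pass; the model name, prompt, and `extract_keywords` helper are illustrative only, not part of SliME:

```python
from openai import OpenAI

client = OpenAI()

def extract_keywords(instruction: str) -> str:
    """Ask an LLM to condense one instruction into visual keywords."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice of extractor model
        messages=[{
            "role": "user",
            "content": "Extract the key visual concepts from this question "
                       f"as a short comma-separated list:\n{instruction}",
        }],
    )
    return response.choices[0].message.content.strip()

questions = ["What color is the car?", "Is anyone inside it?"]
merged_keywords = ", ".join(extract_keywords(q) for q in questions)
# merged_keywords is then fed to the text encoder in a single pass.
```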