Closed lucasjinreal closed 3 months ago
Thank you for your interest in my work.
I am currently conducting more experiments with various MLLMs, and I plan to include MVBench as well. I expect to update the results on arXiv in June.
My perspective on why D1 and D2 work better than some other methods is already detailed in the paper. Essentially, the visual tokens generated by the offline image LLM’s visual encoder can be directly "understood" by the LLM. Temporal averaging, however, can distort the feature of each frame's visual tokens, especially as the number of frames increases. This distortion can impair the LLM's understanding of the visual tokens.
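To make the averaging concrete, here is a minimal sketch of temporal average pooling over per-frame visual tokens. The shapes (T frames, N tokens per frame, D channels) follow the T/N/D notation used in this thread; the array contents are placeholders, not the actual encoder outputs.

```python
import numpy as np

# Hypothetical shapes: T frames, N visual tokens per frame, D channels.
T, N, D = 8, 256, 1024
frame_tokens = np.random.randn(T, N, D)  # stand-in for per-frame encoder tokens

# Temporal averaging: collapse the T copies of each token position into one,
# so the LLM sees N tokens instead of T*N. This mixing across frames is what
# can distort the per-frame features as T grows.
avg_tokens = frame_tokens.mean(axis=0)   # shape (N, D)
```

The point of the sketch is that each averaged token no longer matches any single frame's token distribution that the image LLM was trained on.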
@whwu95 thanks for the reply. I see that averaging the T×N×D tokens over T into N×D intuitively yields distorted features, but why does D2 achieve better results than D1?
Please try MVBench: since it is multiple-choice with only one correct answer, it shouldn't be affected by ChatGPT-3.5's judging. If you get any results, feel free to ping me!
Sure.
Hi, it looks like an extremely tiny trick can make an image MLLM perform well on video. I still have 2 questions I'd like to discuss.