Closed lucasjinreal closed 3 months ago
Thank you for your interest in my work.
I am currently conducting more experiments with various MLLMs, and I plan to include MVBench as well. I expect to update the results on arXiv in June.
My perspective on why D1 and D2 work better than some other methods is already detailed in the paper. Essentially, the visual tokens generated by the offline image LLM’s visual encoder can be directly "understood" by the LLM. Temporal averaging, however, can distort the feature of each frame's visual tokens, especially as the number of frames increases. This distortion can impair the LLM's understanding of the visual tokens.
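To make the averaging concrete, here is a minimal sketch of temporal average pooling over per-frame visual tokens. The shapes (T frames, N tokens per frame, D channels) follow the T/N/D notation used in this thread; the array contents are placeholders, not the actual encoder outputs.

```python
import numpy as np

# Hypothetical shapes: T frames, N visual tokens per frame, D channels.
T, N, D = 8, 256, 1024
frame_tokens = np.random.randn(T, N, D)  # stand-in for per-frame encoder tokens

# Temporal averaging: collapse the T copies of each token position into one,
# so the LLM sees N tokens instead of T*N. This mixing across frames is what
# can distort the per-frame features as T grows.
avg_tokens = frame_tokens.mean(axis=0)   # shape (N, D)
```

The point of the sketch is that each averaged token no longer matches any single frame's token distribution that the image LLM was trained on.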
@whwu95 thanks for the reply. I see that averaging the T×N×D tokens over T into N×D intuitively yields distorted features, but why does D2 achieve better results than D1?
Please try MVBench: since it is multiple-choice with only one correct answer, it shouldn't be affected by ChatGPT-3.5's judging. If you get any results, feel free to ping me!
Sure.
Hi, it looks like an extremely tiny trick can make an image MLLM perform well on video. I still have 2 questions I'd like to discuss.