May I ask what the 5 dims of the images mean, respectively, when using multiple images?
LLaVA supports multiple images by default; what if we send T,N,D into the LLM without any aggregation?
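To make the question concrete, here is a minimal sketch of the two options I mean (hypothetical shapes and PyTorch code, just for illustration):

```python
import torch

# Hypothetical shapes: T frames, N visual tokens per frame, hidden size D.
T, N, D = 8, 576, 4096
frame_feats = torch.randn(T, N, D)  # per-frame visual features from the vision tower

# Option A: temporal average pooling (sparse aggregation) -> only N tokens reach the LLM
pooled = frame_feats.mean(dim=0)            # (N, D)

# Option B: no aggregation, LLaVA-style multi-image input -> all T*N tokens reach the LLM
flattened = frame_feats.reshape(T * N, D)   # (T*N, D)
```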
OK... I have tried your trick, and it actually doesn't work for most VLMs.
Firstly, the TND D1 trick is essentially the same as what llava's multi-image input does; taking the mean over all token features is the crudest possible way, and of course it will harm the final result.
I have tried D2 and it doesn't work at all.
Using llava's multi-image inputs, the same as your D1, is the most normal and reasonable way. But this shouldn't count as some sort of new method, it's just llava's default way... D2 doesn't actually work.
In my paper, I never claimed that dense aggregation (whether D1 or D2) is a novel method. I haven't tested LLaVA's multi-image input. My goal was to explore various aggregation methods and to point out that using sparse aggregation (such as the commonly used temporal average pooling) can harm performance. As with D2, I encourage compressing spatial tokens so that more frames can be involved in zero-shot video QA. In fact, I attempted to further compress each frame's visual tokens to 1/4, thereby increasing the number of frames by 4 times; this approach led to further improvements in some scenarios.
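For example, a minimal sketch of that 1/4 compression (assumed shapes; the actual token grid depends on the vision encoder, e.g. a 24x24 grid for CLIP ViT-L/14 at 336px):

```python
import torch
import torch.nn.functional as F

# Assumed: T frames, a 24x24 token grid per frame, hidden size D.
T, H, W, D = 8, 24, 24, 4096
frame_feats = torch.randn(T, H * W, D)

# 2x2 average pooling on the spatial grid compresses each frame's tokens to 1/4,
# so 4x more frames fit in the same token budget.
grid = frame_feats.view(T, H, W, D).permute(0, 3, 1, 2)   # (T, D, H, W)
pooled = F.avg_pool2d(grid, kernel_size=2)                # (T, D, H/2, W/2)
compressed = pooled.flatten(2).transpose(1, 2)            # (T, H*W/4, D)
```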
I have also experimented with multiple other MLLMs (LLaVA-1.6, InternVL, InstructBLIP) and found that dense aggregation consistently outperforms spatial aggregation. The updated results will be published on arXiv in June.
The comparison should be fair; if you compare with LLaVA-Next-Video then it would be worth doing, otherwise it is meaningless.
As far as I have tested, neither D1 nor D2 beats vanilla llava's image input without any aggregation; simply adding 8 frames as input gets the best result. I got 49.8 on MVBench easily without training on multiple images or videos.
I'm puzzled. If you're referring to LLaVA-1.5, I believe that using 8 frames would exceed the default token limit, which should cause the results to fail (as shown in Table 1(b)). Additionally, I don't understand what you mean by "unfair", as I simply pointed out an overlooked finding. This work was completed before LLaVA-Next-Video, and I'm glad they made similar findings. Thank you for providing the MVBench results.
Yes, llava-1.5 would exceed it, but nowadays we all use a resampler; I only have 114 tokens per image. It's a good observation but actually not very general, at least in what I have tested. Also, LLaVA-1.5 can take multi-image inputs directly as long as you train a version with a 4096 context length, the same as LLaVA-Next-Video does.
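For reference, by "resampler" I mean something like the following sketch (a hypothetical Perceiver/Q-Former-style module, not the exact one in my model): a fixed set of learned queries cross-attends to the patch features, so every image is reduced to 114 tokens no matter how many patches it has.

```python
import torch
import torch.nn as nn

class SimpleResampler(nn.Module):
    # Learned queries cross-attend to per-image patch features and
    # return a fixed number of tokens per image.
    def __init__(self, dim=1024, num_queries=114, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_feats):               # (B, N_patches, dim)
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        out, _ = self.attn(q, patch_feats, patch_feats)
        return out                                # (B, 114, dim) regardless of N_patches

feats = torch.randn(2, 576, 1024)                 # e.g. 576 ViT patch tokens per image
print(SimpleResampler()(feats).shape)             # torch.Size([2, 114, 1024])
```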