whwu95 / FreeVA

FreeVA: Offline MLLM as Training-Free Video Assistant
Apache License 2.0

How does it compare with just directly sending T,N,D into the LLM? #6

Closed lucasjinreal closed 4 months ago

lucasjinreal commented 5 months ago

LLaVA supports multiple images by default; what if we send T,N,D into the LLM without any aggregation?
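For concreteness, a minimal sketch (not FreeVA's actual code) of the two options being contrasted in this thread, assuming per-frame visual features of shape (T, N, D) with T frames, N tokens per frame, and hidden size D; all sizes are illustrative:

```python
import torch

T, N, D = 8, 576, 4096                   # hypothetical LLaVA-1.5-like sizes
frame_feats = torch.randn(T, N, D)       # per-frame visual tokens

# No aggregation: flatten all frames into one long visual sequence,
# i.e. LLaVA-style multi-image input with T*N visual tokens.
no_agg = frame_feats.reshape(T * N, D)   # (4608, 4096)

# Temporal average pooling (the "sparse" aggregation discussed later in the
# thread): one set of N tokens averaged over time.
temporal_mean = frame_feats.mean(dim=0)  # (576, 4096)

print(no_agg.shape, temporal_mean.shape)
```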

lucasjinreal commented 5 months ago

OK... I have tried it; your trick actually doesn't work for most VLMs.

Firstly, the T,N,D D1 trick is essentially the same as what LLaVA's multi-image input does; calculating the mean over all token features is the stupidest way, and of course it will harm the final result.

I have tried D2; it doesn't work at all.

Using LLaVA's multi-image inputs, the same as your D1, is the most normal and reasonable way. But this shouldn't count as a new method, it's just LLaVA's default way... D2 doesn't actually work.
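As a point of reference, here is a hedged sketch of what "LLaVA's multi-image input" for a video could look like: one `<image>` placeholder per sampled frame, each later replaced by that frame's visual tokens. The helper name and frame count are illustrative, not from this repo.

```python
# Illustrative only: a multi-frame prompt in LLaVA's multi-image style, with
# one <image> placeholder per frame (replaced downstream by visual tokens).
def build_multi_frame_prompt(question: str, num_frames: int = 8) -> str:
    frame_tags = "\n".join("<image>" for _ in range(num_frames))
    return f"{frame_tags}\n{question}"

print(build_multi_frame_prompt("What is the person doing in the video?"))
```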

cocoshe commented 5 months ago

> LLaVA supports multiple images by default; what if we send T,N,D into the LLM without any aggregation?

May I ask what each of the 5 dims of the images means, respectively, when using multiple images?

whwu95 commented 5 months ago

> OK... I have tried it; your trick actually doesn't work for most VLMs.
>
> Firstly, the T,N,D D1 trick is essentially the same as what LLaVA's multi-image input does; calculating the mean over all token features is the stupidest way, and of course it will harm the final result.
>
> I have tried D2; it doesn't work at all.
>
> Using LLaVA's multi-image inputs, the same as your D1, is the most normal and reasonable way. But this shouldn't count as a new method, it's just LLaVA's default way... D2 doesn't actually work.

  1. In my paper, I never claimed that dense aggregation (whether D1 or D2) is a novel method, and I haven't tested LLaVA's multiple-image input. My goal was to explore various aggregation methods and to point out that sparse aggregation (such as the commonly used temporal average pooling) can harm performance. As D2 does, I encourage compressing spatial tokens to involve more frames for zero-shot video QA. In fact, I attempted to further compress each frame's visual tokens to 1/4, thereby increasing the number of frames by 4x; this led to further improvements in some scenarios (see the sketch after this list).

  2. I have also experimented with multiple other MLLMs (LLaVA-1.6, InternVL, InstructBLIP) and found that dense aggregation consistently outperforms spatial aggregation. The updated results will be published on arXiv in June.
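A hedged sketch of the "compress each frame's visual tokens to 1/4" idea from point 1, assuming LLaVA-1.5-style 24x24 = 576 tokens per frame; a 2x2 average pool over the token grid leaves 144 tokens per frame, so 4x as many frames fit in the same visual-token budget. The function name and sizes are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def compress_frame_tokens(frame_feats: torch.Tensor, grid: int = 24) -> torch.Tensor:
    """frame_feats: (T, N, D) with N == grid*grid. Returns (T, N//4, D)."""
    T, N, D = frame_feats.shape
    x = frame_feats.view(T, grid, grid, D).permute(0, 3, 1, 2)  # (T, D, 24, 24)
    x = F.avg_pool2d(x, kernel_size=2)                          # (T, D, 12, 12)
    return x.flatten(2).permute(0, 2, 1)                        # (T, 144, D)

feats = torch.randn(8, 576, 4096)
print(compress_frame_tokens(feats).shape)  # torch.Size([8, 144, 4096])
```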

lucasjinreal commented 5 months ago

The comparison should be fair. If you compare with LLaVA-NeXT-Video then it would be worth doing; otherwise it is meaningless.

As far as I have tested, both D1 and D2 fail to beat vanilla LLaVA's multi-image input without any aggregation; simply adding 8 frames as input gets the best result. I got 49.8 on MVBench easily, without training on multiple images or videos.

whwu95 commented 5 months ago

I'm puzzled. If you're referring to LLaVA-1.5, I believe that using 8 frames would exceed the default token limit, which should cause the results to fail (as shown in Table 1(b)). Additionally, I don't understand what you mean by "unfair," as I simply pointed out an overlooked discovery. This work was completed before LLaVA-NeXT-Video, and I'm glad they made similar findings. Thank you for providing the MVBench results.
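A quick back-of-the-envelope check of the token-limit point, assuming 576 visual tokens per frame for LLaVA-1.5 (the exact limit depends on the checkpoint's configured context length):

```python
# Rough token-budget check, assuming 576 visual tokens per frame
# (LLaVA-1.5's 24x24 grid). Exact limits depend on the checkpoint.
tokens_per_frame = 576
for frames in (1, 4, 8):
    print(f"{frames} frame(s) -> {frames * tokens_per_frame} visual tokens before any text")
# 8 frames -> 4608 visual tokens, which already exceeds a 2048- or
# 4096-token context window before the question is even appended.
```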

lucasjinreal commented 5 months ago

Yes, LLaVA-1.5 would exceed the limit, but nowadays we are all using a resampler; I only have 114 tokens per image. It's a good observation but actually not very general, at least in what I have tested. Also, LLaVA-1.5 can take multi-image input directly as long as you train a version with a 4096 context length, the same as LLaVA-NeXT-Video does.
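For reference, a rough sketch of the kind of resampler being described (a Perceiver/Q-Former-style module with a fixed number of learned queries per image); the 114-token figure is taken from the comment, and everything else here is an illustrative assumption rather than the commenter's actual model:

```python
import torch
import torch.nn as nn

class SimpleResampler(nn.Module):
    """Learned queries cross-attend to each frame's tokens, producing a fixed
    number of output tokens per frame regardless of the input token count."""

    def __init__(self, dim: int = 1024, num_queries: int = 114, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        """frame_tokens: (T, N, D) -> (T, num_queries, D)."""
        T = frame_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(T, -1, -1)
        out, _ = self.attn(q, frame_tokens, frame_tokens)
        return out

resampler = SimpleResampler()
frames = torch.randn(8, 576, 1024)        # 8 frames of 576 ViT tokens each
compressed = resampler(frames)            # (8, 114, 1024)
print(compressed.shape)                   # 8 x 114 = 912 visual tokens total
```

With 114 tokens per frame, 8 frames give 912 visual tokens, which fits comfortably within a 4096-token context.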