whwu95 / FreeVA

FreeVA: Offline MLLM as Training-Free Video Assistant
Apache License 2.0

Experiment setting of Tables 3 and 4 #2

Closed yfpeng1234 closed 5 months ago

yfpeng1234 commented 5 months ago

Dear author, your work is really interesting, but I have a question regarding Tables 3 and 4, where you compare video understanding ability before and after video SFT. For the offline model (i.e., the initialized VideoChatGPT), did you use the aggregation method from the original paper, or the dense aggregation you propose? Thanks!

whwu95 commented 5 months ago

Hi yuanfang,

Thank you for your interest in my work and for your question regarding Tables 3 and 4 in our paper.

In Table 4, I employed the aggregation method described in the original VideoChatGPT paper (i.e., temporal mean pooling + spatial mean pooling). I simply loaded the parameters from before and after SFT, both provided by the official code. Interestingly, I found that the post-SFT results did not surpass the performance observed prior to SFT.
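For reference, the VideoChatGPT-style aggregation mentioned above can be sketched as follows. This is a minimal NumPy sketch, not the repository's actual code: per-frame patch features are mean-pooled over time to get spatial tokens, mean-pooled over patches to get temporal tokens, and the two sets are concatenated.

```python
import numpy as np

def aggregate_video_features(feats):
    """Aggregate per-frame patch features into one video-level token
    sequence, in the spirit of VideoChatGPT's temporal + spatial
    mean pooling.

    feats: array of shape (T, N, D) -- T frames, N spatial patches,
           D feature dimensions per patch.
    Returns (N + T, D): N temporally-pooled spatial tokens followed
    by T spatially-pooled temporal tokens.
    """
    temporal_pooled = feats.mean(axis=0)  # (N, D): average over frames
    spatial_pooled = feats.mean(axis=1)   # (T, D): average over patches
    return np.concatenate([temporal_pooled, spatial_pooled], axis=0)

# Toy example: 8 frames, 256 patches, 1024-dim features
feats = np.random.rand(8, 256, 1024)
tokens = aggregate_video_features(feats)
print(tokens.shape)  # -> (264, 1024), i.e. 256 spatial + 8 temporal tokens
```

The resulting token sequence is what gets projected and fed to the LLM; the exact frame count and patch grid here are illustrative, not the paper's settings.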

In Table 3, I used temporal mean pooling and performed the SFT training myself.

I hope this clarifies your query. Please do not hesitate to reach out if you have further questions.

yfpeng1234 commented 5 months ago

Thanks so much for your prompt reply. Your paper is really thought-provoking. Looking forward to your future work!