yfpeng1234 closed this issue 5 months ago
Hi yuanfang,
Thank you for your interest in my work and for your question regarding Tables 3 and 4 in our paper.
In Table 4, I used the aggregation method described in the original VideoChatGPT paper (i.e., temporal mean pooling + spatial mean pooling). I simply loaded the pre-SFT and post-SFT parameters, both provided by the official code. Interestingly, the post-SFT results did not surpass the pre-SFT results.
In Table 3, I used temporal mean pooling and ran the SFT training myself.
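For anyone following along, here is a rough sketch of how the two pooling schemes mentioned above could look. It is not the code from the paper or from the official VideoChatGPT repository. It assumes per-frame features of shape (T, N, D), meaning T frames, N spatial tokens per frame, and D channels, and it assumes the two pooled outputs are concatenated along the token axis. The function names and the random input are made up for illustration.

```python
import numpy as np


def temporal_mean_pool(frame_features):
    """Average over the frame axis: (T, N, D) -> (N, D)."""
    return frame_features.mean(axis=0)


def spatial_mean_pool(frame_features):
    """Average over the spatial-token axis: (T, N, D) -> (T, D)."""
    return frame_features.mean(axis=1)


def temporal_plus_spatial_pool(frame_features):
    """Concatenate both pooled views along the token axis: (T, N, D) -> (N + T, D).

    Assumption: the two pooled outputs are joined by concatenation.
    """
    temporal = temporal_mean_pool(frame_features)
    spatial = spatial_mean_pool(frame_features)
    return np.concatenate([temporal, spatial], axis=0)


# Hypothetical input: 8 frames, 16 spatial tokens per frame, 32 channels.
rng = np.random.default_rng(0)
features = rng.standard_normal((8, 16, 32))

temporal_and_spatial = temporal_plus_spatial_pool(features)  # Table 4 style aggregation
temporal_only = temporal_mean_pool(features)                 # Table 3 style aggregation
```

Under these assumptions, the temporal-only variant yields N tokens per video, while the combined variant yields N + T tokens.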
I hope this clarifies your query. Please do not hesitate to reach out if you have further questions.
Thanks so much for your prompt reply. Your paper is really thought-provoking. I look forward to your future work!
Dear author, your work is really interesting, but I have a question about Tables 3 and 4, where you compare video understanding ability before and after video SFT. For offline (or the initialized VideoChatGPT), did you use the aggregation method from their original paper or the dense aggregation you proposed? Thanks!