Closed by zhengrongz 6 months ago
We used the Kinetics-400 label set for action recognition. To keep our experiments fair, we did not collect the actions that might occur in VQA-related benchmarks, nor did we hand-design a dedicated label set for them.
ok, got it! Thank you!
Hi! Thanks for your excellent work! As far as I know, InternVideo computes the similarity between a video feature and text features, so it needs an action label set to do this classification task. But NExT-QA and other video QA datasets don't provide a set of the actions that occur in them. So I wonder how you use InternVideo as the AR model? Looking forward to your reply!
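For readers following the thread, here is a minimal sketch of the similarity-based classification described in the question: a video feature is compared against one text feature per label, and the closest label wins. This is not InternVideo's actual API. The function names, the `temperature` value, the toy 4-dimensional features, and the three placeholder labels are all assumptions made for illustration; in practice the features would come from the model's video and text encoders, and the label list would be the full Kinetics-400 set.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def zero_shot_classify(video_feature, text_features, labels, temperature=0.01):
    """Pick the label whose text feature is most similar to the video feature.

    video_feature: shape (D,), hypothetical output of a video encoder.
    text_features: shape (N, D), one hypothetical text-encoder output per label.
    labels: list of N label strings (e.g. a fixed action label set).
    temperature: assumed softmax temperature, not taken from any real model.
    """
    v = l2_normalize(np.asarray(video_feature, dtype=np.float64))
    t = l2_normalize(np.asarray(text_features, dtype=np.float64))

    # Cosine similarity between the video and every label's text feature.
    sims = t @ v

    # Numerically stable softmax over the scaled similarities.
    logits = sims / temperature
    logits = logits - logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()

    best = int(np.argmax(probs))
    return labels[best], probs


# Toy data standing in for real encoder outputs (placeholder labels).
labels = ["playing guitar", "riding a bike", "washing dishes"]
text_features = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
video_feature = np.array([0.1, 0.9, 0.2, 0.0])

predicted, probs = zero_shot_classify(video_feature, text_features, labels)
print(predicted, probs)
```

Because the label set is just a list of strings passed to the text encoder, swapping in a different fixed set changes nothing else in the pipeline, which is why a generic set can be reused on QA benchmarks that ship no action labels of their own.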