z-x-yang / DoraemonGPT

Official repository of DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models
BSD 3-Clause "New" or "Revised" License

How do you use InternVideo as the action recognition model? #3

Closed by zhengrongz 6 months ago

zhengrongz commented 6 months ago

Hi! Thanks for your excellent work! As far as I know, InternVideo computes the similarity between a video feature and text features, so it needs an action label set to perform this classification task. But NExT-QA and other video QA datasets don't provide a set listing the actions that occur in the dataset. So I wonder how you use InternVideo as the AR (action recognition) model. Looking forward to your reply!

z-x-yang commented 6 months ago

We used the Kinetics-400 label set for action recognition. To keep our experiments fair, we neither collected the actions that may occur in VQA-related benchmarks nor hand-designed a dedicated label set for them.
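
For readers following along, here is a minimal sketch of the similarity-based classification discussed above: a video embedding is compared against the text embedding of every label in a fixed label set, and the best-matching labels are returned. It is not the DoraemonGPT or InternVideo code. The function `rank_actions`, the `temperature` value, the three example labels, and the toy 3-dimensional vectors are all assumptions made for illustration; in practice the embeddings would come from the model's video and text encoders, and the label list would hold all 400 Kinetics-400 class names.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_actions(video_feat, text_feats, labels, temperature=0.01, top_k=1):
    """Rank labels by cosine similarity between one video and each label text.

    Hypothetical helper: video_feat and text_feats stand in for encoder outputs.
    Returns the top_k (label, probability) pairs, best first.
    """
    if len(text_feats) != len(labels):
        raise ValueError("need exactly one text feature per label")
    video = l2_normalize(video_feat)
    sims = [
        sum(a * b for a, b in zip(video, l2_normalize(text)))
        for text in text_feats
    ]
    # Assumed temperature; it only sharpens the distribution, not the ranking.
    probs = softmax([s / temperature for s in sims])
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Toy stand-ins: three illustrative labels and made-up 3-d embeddings.
labels = ["riding a bike", "playing guitar", "washing dishes"]
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
video_feat = [0.1, 0.9, 0.2]

top = rank_actions(video_feat, text_feats, labels, top_k=1)
print(top[0][0])
```

Because the label set is fixed in advance and independent of the QA benchmark, the classifier can only name actions that appear in that set, which is the trade-off the reply above describes.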

zhengrongz commented 6 months ago

ok, got it! Thank you!