whwu95 / Text4Vis

【AAAI'2023 & IJCV】Transferring Vision-Language Models for Visual Recognition: A Classifier Perspective
MIT License

Zero-shot Video Performance. #4

Closed XiaoBuL closed 1 year ago

XiaoBuL commented 1 year ago

Hello,

Thanks for your work!

I used the Kinetics-400 pre-trained model (ViT-L/14 with 8 frames, downloaded from https://drive.google.com/file/d/1tGfE6HDjTGZ7-y6XM7D6UJAx1Esj-q7u/view?usp=share_link) to run a cross-dataset zero-shot evaluation on the UCF-101 dataset, and I get the following results:

-----Full-classes Evaluation------
Overall Top1 57.928% Top5 88.071%
-----Half-classes Evaluation-----
Top1: mean 69.553%, std 6.283%
Top5: mean 92.737%, std 1.957%

In your paper, the zero-shot video recognition performance is reported as:

[image: zero-shot results reported in the paper]

My results are different from, and far below, the performance your paper reports. Could you tell me which model was used for zero-shot video recognition?

Thanks,

MS

whwu95 commented 1 year ago

Hi, thank you so much for your interest in our work!

I have re-run the zero-shot evaluation using the following command:

sh scripts/run_test_zeroshot.sh configs/ucf101/ucf_zero_shot.yaml k400-vitl-14-f8.pt

The results are consistent with the paper.

[image: results of the re-run evaluation]

I have also uploaded the test log for your reference: ucf_zeroshot.log

Please let me know if you have any questions or comments. I'm happy to help in any way I can.

Wenhao

XiaoBuL commented 1 year ago

Hi,

Thanks for your reply!

I found that I hadn't actually loaded the pretrained model. After loading it, I get the expected results.

It's OK now. Thanks!

MS
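
For anyone who hits the same symptom (zero-shot numbers far below the reported ones), one quick sanity check is to compare the parameter names the model expects with the names stored in the checkpoint before evaluating. The sketch below is framework-agnostic and only illustrative: the helper names (`strip_prefix`, `compare_keys`, `assert_checkpoint_covers_model`) are hypothetical and are not part of Text4Vis, and it assumes you can obtain both sets of names as plain strings.

```python
def strip_prefix(keys, prefix="module."):
    """Drop a wrapper prefix from parameter names.

    Models saved from a DataParallel-style wrapper often carry a
    "module." prefix; removing it makes the two name sets comparable.
    """
    return [k[len(prefix):] if k.startswith(prefix) else k for k in keys]


def compare_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) parameter names, each sorted.

    missing:    names the model expects but the checkpoint lacks
    unexpected: names in the checkpoint that the model does not use
    """
    model_set = set(strip_prefix(model_keys))
    ckpt_set = set(strip_prefix(checkpoint_keys))
    missing = sorted(model_set - ckpt_set)
    unexpected = sorted(ckpt_set - model_set)
    return missing, unexpected


def assert_checkpoint_covers_model(model_keys, checkpoint_keys):
    """Raise if any model parameter would be left at its initial value."""
    missing, unexpected = compare_keys(model_keys, checkpoint_keys)
    if missing:
        raise RuntimeError(
            f"{len(missing)} model parameters not found in checkpoint, "
            f"e.g. {missing[:5]}"
        )
    # Unexpected keys are returned rather than raised on, since extra
    # entries in a checkpoint are usually harmless.
    return unexpected
```

With PyTorch, the two inputs would typically come from `model.state_dict().keys()` and from the keys of the dictionary returned by `torch.load(...)`. Whether the weights in this repository's checkpoints sit at the top level or under a nested key is not confirmed in this thread, so inspect the loaded object first.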