Question about `clean` results of K400 & SSv2

BinhuiXie commented 1 year ago

Hi @wlin-at

Excellent work!

I have a question about the results in Table 1. In your paper, `clean` refers to the performance of the model on the original validation set. For K400 and SSv2, the reported results are 75.32 and 66.36, respectively, but the VideoSwin repo reports higher results.

Could you help me out? I really appreciate any help you can provide.

wlin-at commented 1 year ago

Hi, thanks for your interest in the work! VideoSwin uses 4x3 = 12 views during inference, and the final score is the average of the scores over all views. For efficiency, we use only a single 1x1 view (one uniformly sampled clip, center crop), both for evaluation on clean data and for test-time adaptation. The implementation details are given in both papers.
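To make the difference concrete, here is a minimal sketch of the two evaluation protocols. It is not taken from the VideoSwin or ViTTA code; the function names and the toy scores are hypothetical, and it only assumes that each view yields one score per class.

```python
def average_over_views(view_scores):
    """Average class scores over all views (e.g. 4 clips x 3 crops = 12 views).

    view_scores: list with one entry per view, each a list of class scores.
    """
    num_views = len(view_scores)
    num_classes = len(view_scores[0])
    return [
        sum(view[c] for view in view_scores) / num_views
        for c in range(num_classes)
    ]


def predict(scores):
    """Return the index of the highest-scoring class."""
    return max(range(len(scores)), key=lambda c: scores[c])


# Toy data (made up for illustration): 12 views, 2 classes.
views = [[0.6, 0.4] for _ in range(12)]
views[4] = [0.1, 0.9]  # one hypothetical view that disagrees with the rest

# Multi-view protocol: average the scores of all 12 views, then predict.
multi_view_pred = predict(average_over_views(views))

# Single-view protocol: predict from one view only.
single_view_pred = predict(views[4])
```

In this toy case the averaged prediction and the single-view prediction differ, which shows how the number of views alone can change the measured accuracy of the same model.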

BinhuiXie commented 1 year ago

thanks a lot

wlin-at / ViTTA, issue #5