yeliudev / R2-Tuning

🌀 R^2-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding (ECCV 2024)
http://arxiv.org/abs/2404.00801
BSD 3-Clause "New" or "Revised" License

About feature extraction. #10

Closed EdenGabriel closed 3 months ago

EdenGabriel commented 3 months ago

Hi guys, thanks for your interesting work. I have a question about extract_feat.py.

When I use "ViT-B/32" to extract visual features, the result has 768 dimensions, but the text features have 512 dimensions. Is this normal? BTW, I remember "ViT-B/32" corresponding to 512 dimensions.

Thanks.

yeliudev commented 3 months ago

Yes, this is normal. According to the CLIP paper, the dimensions of the vision and language branches are 768 and 512, respectively.
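
A minimal sketch of where both numbers can come from, assuming the extracted features are transformer hidden states rather than the final projected embeddings (this thread does not confirm how extract_feat.py reads them out). In CLIP ViT-B/32 the vision transformer is 768 wide, the text transformer is 512 wide, and both are projected into a shared 512-dimensional space, which is likely the 512 remembered above. `ClipWidths` and `feature_dim` are hypothetical names used only for illustration; they are not part of CLIP or of this repo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipWidths:
    """Layer widths of a CLIP model (hypothetical container for illustration)."""
    vision_width: int  # hidden size of the vision transformer
    text_width: int    # hidden size of the text transformer
    embed_dim: int     # size of the shared image-text embedding space


# Widths of the ViT-B/32 variant of CLIP.
VIT_B_32 = ClipWidths(vision_width=768, text_width=512, embed_dim=512)


def feature_dim(widths: ClipWidths, branch: str, projected: bool) -> int:
    """Return the feature size for one branch (hypothetical helper).

    projected=False: hidden states taken from inside the transformer.
    projected=True: output of the final projection into the shared space.
    """
    if branch not in ("vision", "text"):
        raise ValueError(f"unknown branch: {branch!r}")
    if projected:
        return widths.embed_dim
    return widths.vision_width if branch == "vision" else widths.text_width


for branch in ("vision", "text"):
    hidden = feature_dim(VIT_B_32, branch, projected=False)
    shared = feature_dim(VIT_B_32, branch, projected=True)
    print(f"{branch}: hidden={hidden}, projected={shared}")
```

Under that assumption, 768-dimensional visual features next to 512-dimensional text features are what you would expect; the two branches only match at 512 after the final projection.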

EdenGabriel commented 3 months ago

Oops, I see. Thanks for your reply.