Hi, guys, thanks for your interesting work.
I have a question about extract_feat.py.
When I use "ViT-B/32" to extract visual features, the result has 768 dimensions, but the text features have 512 dimensions. Is this expected?
BTW, as I recall, "ViT-B/32" corresponds to 512 dimensions.
Thanks.