Open Dijkstra111111 opened 5 months ago
Hi, the NLP text encoder is also feasible. However, the text encoders in vision-language pertained models such as CLIP can generate text embeddings that are more aligned or closed to the visual information, benefiting the cross-modal learning.
Hello, I want to ask if you have reproduced the results of the article, and I want to ask about the process of data preprocessing.
Hello,I would like to ask why you use ViT-B/16 as a text encoder. Why not use NLP models as a text encoder? Thank you very much.