zhixiongz / CLIP4CMR

A Comprehensive Empirical Study of Vision-Language Pre-trained Model for Supervised Cross-Modal Retrieval

A question about extracting text features with CLIP #9

Open rainman503 opened 1 year ago

rainman503 commented 1 year ago

Hi, thank you for your great work. I have a question about the text preprocessing for CLIP. The maximum input length for CLIP is 77 tokens, but most of the texts in the dataset are longer than 77 tokens. How do you preprocess these texts before extracting features with CLIP?
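
For context, the thread does not say how the repository handles this, so the following is only a minimal sketch of one common option: truncating each text to the 77-token context window. It works on already-tokenized ids. The names `fit_to_context`, `SOT_ID`, and `EOT_ID` are hypothetical, and the two special-token ids are assumed to match the OpenAI CLIP BPE vocabulary.

```python
from typing import List

CONTEXT_LENGTH = 77   # CLIP's text context length, as stated in the question
SOT_ID = 49406        # assumed start-of-text id (OpenAI CLIP BPE vocabulary)
EOT_ID = 49407        # assumed end-of-text id (OpenAI CLIP BPE vocabulary)


def fit_to_context(token_ids: List[int], context_length: int = CONTEXT_LENGTH) -> List[int]:
    """Wrap BPE token ids with SOT/EOT, then truncate or zero-pad to context_length."""
    ids = [SOT_ID] + list(token_ids) + [EOT_ID]
    if len(ids) > context_length:
        # Keep the leading tokens and force the final slot to be EOT.
        ids = ids[:context_length]
        ids[-1] = EOT_ID
    # Zero-pad short sequences on the right.
    return ids + [0] * (context_length - len(ids))


long_ids = fit_to_context(list(range(1, 201)))   # 200 dummy token ids
short_ids = fit_to_context([10, 20, 30])
print(len(long_ids), long_ids[0], long_ids[-1])
print(len(short_ids), short_ids[:5])
```

With the OpenAI `clip` package, `clip.tokenize(texts, truncate=True)` should give the same effect directly; without `truncate=True` it raises an error for inputs longer than the context length. Truncation discards everything after the first 75 content tokens, so whether it is acceptable depends on where the informative part of each text sits.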