whwu95 / Text4Vis

【AAAI'2023 & IJCV】Transferring Vision-Language Models for Visual Recognition: A Classifier Perspective
MIT License
202 stars 15 forks

On the issue of modeling #23

Closed: qq1332427275 closed this issue 3 months ago

qq1332427275 commented 3 months ago

Hello author! I admire your work. I would like to ask whether the text encoder and the image encoder are frozen or trainable during training.

whwu95 commented 3 months ago

Thank you for your interest. We pre-extract text features for all category names offline with CLIP's text encoder and use them as the classifier, so the text encoder is not updated. We then train with the standard cross-entropy loss, so only the vision encoder requires training.
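The setup described in this reply can be illustrated with a small toy sketch. It is not the repository's code: the "vision encoder" here is a hypothetical single linear map `W`, the "text features" are random L2-normalized vectors standing in for pre-extracted CLIP text embeddings, and all names (`train_step`, `text_feats`, and so on) are assumptions made for illustration. It only shows the idea that the fixed text features act as classifier weights while the cross-entropy gradient updates the vision side alone.

```python
import math
import random


def normalize(v):
    """L2-normalize a vector."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(z):
    mx = max(z)
    e = [math.exp(a - mx) for a in z]
    s = sum(e)
    return [a / s for a in e]


def train_step(W, text_feats, x, label, lr=0.1):
    """One SGD step on the vision-side weights W only.

    text_feats is read but never modified: it plays the role of the
    frozen, pre-extracted classifier.
    """
    v = matvec(W, x)                 # visual embedding, length d
    logits = matvec(text_feats, v)   # one logit per class
    p = softmax(logits)
    loss = -math.log(p[label])       # standard cross-entropy

    # Gradient of the loss w.r.t. the logits: p - onehot(label)
    g_logits = [pc - (1.0 if c == label else 0.0) for c, pc in enumerate(p)]
    # Backpropagate through the fixed classifier: text_feats^T @ g_logits
    g_v = [sum(text_feats[c][j] * g_logits[c] for c in range(len(p)))
           for j in range(len(v))]
    # Update only W: dL/dW[j][k] = g_v[j] * x[k]
    for j in range(len(W)):
        for k in range(len(x)):
            W[j][k] -= lr * g_v[j] * x[k]
    return loss


random.seed(0)
num_classes, dim, in_dim = 3, 4, 5

# Stand-in for text features extracted once, offline (assumption: random vectors).
text_feats = [normalize([random.gauss(0, 1) for _ in range(dim)])
              for _ in range(num_classes)]
frozen_copy = [row[:] for row in text_feats]

# Toy trainable "vision encoder" and a single toy input with its label.
W = [[random.gauss(0, 0.1) for _ in range(in_dim)] for _ in range(dim)]
x = normalize([random.gauss(0, 1) for _ in range(in_dim)])
label = 1

losses = [train_step(W, text_feats, x, label) for _ in range(50)]
```

In a real PyTorch setup the same effect would typically come from computing the text features once without gradient tracking and passing only the vision encoder's parameters to the optimizer; the exact mechanism used in the repository is not stated in this thread.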