Closed qq1332427275 closed 3 months ago
Thank you for your interest. We pre-extract text features for all category names offline with CLIP's text encoder and use them as the classifier. We then train with the standard cross-entropy loss, so the text encoder stays frozen and only the vision encoder is trained.
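For readers who want to see the idea concretely, here is a minimal pure-Python sketch of using fixed text features as classifier weights under a cross-entropy loss. It is not the repository's actual code: the names `TEXT_FEATURES`, `classify_logits`, and `cross_entropy` are hypothetical, the feature values are toy numbers, and the cosine-similarity scoring with a logit scale of 100 is an assumption borrowed from common CLIP practice rather than something stated in this thread.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def classify_logits(image_feature, text_features, scale=100.0):
    """Score one image feature against every class text feature.

    Assumption: scaled cosine similarity, as is common with CLIP.
    The text features act as fixed classifier weights.
    """
    img = l2_normalize(image_feature)
    return [
        scale * sum(a * b for a, b in zip(img, l2_normalize(t)))
        for t in text_features
    ]

def cross_entropy(logits, target):
    """Standard softmax cross-entropy for a single example."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Hypothetical pre-extracted text features, one row per category name.
# In a real setup these would be computed once, offline, and never updated.
TEXT_FEATURES = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Toy output of the trainable vision encoder for one image.
image_feature = [0.9, 0.1, 0.0]

logits = classify_logits(image_feature, TEXT_FEATURES)
loss = cross_entropy(logits, target=0)
print(logits, loss)
```

In a deep learning framework the same structure would mean that gradients of the loss flow only into the vision encoder that produces `image_feature`, while the stored text features receive no updates.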
Hello author! I admire your work. I would like to ask whether the model's text encoder and image encoder are frozen or trainable during training.