winycg / CLIP-KD

[CVPR-2024] Official implementations of CLIP-KD: An Empirical Study of CLIP Model Distillation

questions about the attention head #4

Closed xushilin1 closed 5 months ago

xushilin1 commented 8 months ago

In the Feature Distillation section, you use an MSE loss to reduce the distance between the text embeddings of the student and teacher. I notice that you use different numbers of attention heads (8 for the teacher and 6 for the student).

I want to know whether it would be better to keep the number of attention heads consistent, e.g., by forcing the student's head count to 8.

winycg commented 7 months ago

Sorry for the late response. In the default setup, we use the final output text embeddings for feature distillation. If the embedding sizes of the teacher and student do not match, we apply a projection head to transform the student embedding to match the size of the teacher's embedding.
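
For reference, a minimal sketch of this feature-distillation step (not the official CLIP-KD code): the embedding sizes, the single linear projection head, and the function name `feature_distillation_loss` are assumptions for illustration only.

```python
# Minimal sketch of MSE feature distillation with a projection head.
# NOT the official CLIP-KD implementation; dimensions are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher_dim, student_dim = 768, 512  # assumed embedding sizes

# Projection head mapping the student text embedding to the teacher's size.
proj = nn.Linear(student_dim, teacher_dim)

def feature_distillation_loss(student_text_emb, teacher_text_emb):
    """MSE between projected student embeddings and detached teacher embeddings."""
    projected = proj(student_text_emb)               # [batch, teacher_dim]
    return F.mse_loss(projected, teacher_text_emb.detach())

# Usage with random tensors standing in for encoder outputs.
student_out = torch.randn(4, student_dim)
teacher_out = torch.randn(4, teacher_dim)
loss = feature_distillation_loss(student_out, teacher_out)
```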

winycg commented 5 months ago

Hi, we have released code and pretrained models. Please feel free to ask any questions if you find any problems.