winycg / CLIP-KD

[CVPR-2024] Official implementations of CLIP-KD: An Empirical Study of CLIP Model Distillation

initializing the student network with pre-trained weights #13

Open RuixiangZhao opened 1 month ago

RuixiangZhao commented 1 month ago

Hi Chuanguang,

Great work and thanks for sharing your code!

I have a question regarding the student networks in your method. From what I've seen, the student networks are all trained from scratch. Have you tried initializing the student network with pre-trained weights before starting distillation? For example, using OpenAI's pre-trained CLIP ViT-L/14 as the teacher to distill into a ViT-B/32 student that is itself initialized from OpenAI's pre-trained weights, roughly as in the sketch below.
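Concretely, I mean something like the following (written with open_clip purely for illustration; the model names and loading calls here are my own assumption, not taken from your training scripts):

```python
import open_clip

# Teacher: OpenAI pre-trained CLIP ViT-L/14, kept frozen during distillation
teacher, _, _ = open_clip.create_model_and_transforms('ViT-L-14', pretrained='openai')
teacher.eval()
for p in teacher.parameters():
    p.requires_grad = False

# Student: ViT-B/32 initialized from OpenAI pre-trained weights
# (pretrained=None would reproduce the from-scratch setting in the paper)
student, preprocess_train, _ = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')
student.train()
```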

If I wanted to use CLIP-KD for this kind of distillation task, what would you recommend? Specifically, in terms of learning-rate settings, the choice of training dataset for distillation, or any other adjustments that might be beneficial.

Thank you for your time, and I am looking forward to your reply!

Best regards, Ruixiang

Han-jiaxin commented 3 weeks ago

Hello, I am also trying to initialize the student network from a pre-trained model and then train it, but my results are not very satisfactory. Have you made any progress, or do you have any suggestions?

winycg commented 2 weeks ago

Hi, thanks for your interest in this work. I personally think initializing the student from pre-trained weights is not beneficial for distillation, because the pre-trained teacher and student have different feature distributions. As a result, the supervision from the teacher may destroy the feature representations that the pre-trained student has already learned.
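To illustrate the point, here is a rough sketch (not the actual loss code in this repository) of a feature-mimicry objective in the spirit of feature distillation: the student's image embedding is projected to the teacher's dimension and pulled toward the teacher's embedding, so a student that starts from its own pre-trained weights gets pushed toward the teacher's different feature distribution rather than refining what it already learned.

```python
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical feature-mimicry loss for illustration only.
# Dimensions assume an OpenAI ViT-B/32 student (512-d) and ViT-L/14 teacher (768-d).
class FeatureMimicLoss(nn.Module):
    def __init__(self, student_dim=512, teacher_dim=768):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)  # bridge the embedding-dim gap

    def forward(self, student_feat, teacher_feat):
        s = F.normalize(self.proj(student_feat), dim=-1)
        t = F.normalize(teacher_feat, dim=-1)
        # Pulls the student's embeddings toward the teacher's distribution,
        # regardless of what the pre-trained student had already learned.
        return F.mse_loss(s, t)

# Usage with image embeddings from the two CLIP models:
# loss = FeatureMimicLoss()(student.encode_image(x), teacher.encode_image(x).detach())
```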