yuanli2333 / Teacher-free-Knowledge-Distillation

Knowledge Distillation: CVPR2020 Oral, Revisiting Knowledge Distillation via Label Smoothing Regularization

Questions about KD loss #5

Closed: Paper99 closed this issue 4 years ago

Paper99 commented 5 years ago

Hello! I noticed that you didn't use the `batchmean` reduction when computing the KL loss in PyTorch 1.2.0. Could you tell me why you chose not to use that option?

Thanks~
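For context on what the question refers to, here is a minimal sketch of how the two reductions of `nn.KLDivLoss` differ (the tensor shapes are illustrative, not taken from the repo): the default `'mean'` averages over every element, while `'batchmean'` sums and then divides by the batch size, so the two values differ by a factor equal to the number of classes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative shapes: a batch of 4 samples, 10 classes.
student_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)

log_p = F.log_softmax(student_logits, dim=1)   # student log-probabilities
q = F.softmax(teacher_logits, dim=1)           # teacher probabilities

# Default reduction: average over all elements (batch_size * num_classes).
loss_elementwise = nn.KLDivLoss(reduction='mean')(log_p, q)

# 'batchmean': sum over all elements, divide by batch_size, which matches
# the per-sample definition of KL divergence.
loss_batchmean = nn.KLDivLoss(reduction='batchmean')(log_p, q)

# The two differ by a factor of the number of classes (10 here).
print(loss_elementwise.item() * 10, loss_batchmean.item())
```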

s7ev3n commented 4 years ago

I believe this is a bug the author does not realize! I once used nn.KLDivLoss() without `batchmean`, and the experiment failed when I set alpha=1.

yuanli2333 commented 4 years ago

Hi, it is not a bug. If you use `batchmean`, you need to search for the hyper-parameters again; they will be different from the ones we provide here. It is also normal that the experiment fails when you set alpha=1: for most machine learning models, the hyper-parameters influence the training results, and that includes our Tf-KD as well as standard KD.

s7ev3n commented 4 years ago

Well, according to the definition of the KD loss, the KL term measures the distance between the student and teacher distributions, so it should be divided by the batch size rather than averaged element-wise. The KD loss implementation in RepDistiller does exactly this: it divides by y_s.shape[0].
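A minimal sketch of the batch-averaged KD term s7ev3n describes, summing the KL divergence and dividing by `y_s.shape[0]` as in RepDistiller (the function name and the default temperature `T` below are illustrative):

```python
import torch.nn.functional as F

def kd_loss_batch_averaged(y_s, y_t, T=4.0):
    """KL(student || teacher) averaged over the batch, not over elements.

    y_s, y_t: raw logits of shape (batch_size, num_classes).
    T: distillation temperature (illustrative default).
    """
    p_s = F.log_softmax(y_s / T, dim=1)
    p_t = F.softmax(y_t / T, dim=1)
    # Sum over all elements, then divide by the batch size,
    # as in RepDistiller's implementation.
    return F.kl_div(p_s, p_t, reduction='sum') * (T ** 2) / y_s.shape[0]
```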

yuanli2333 commented 4 years ago

Hi, there are different choices for this reduction, e.g. in Teacher-Assistant KD or the PyTorch implementation. I agree with you that the loss should be averaged over the batch size; the choice mainly affects how the hyper-parameters are chosen. Tf-KD still works when averaging over the batch size, but the hyper-parameters (especially \alpha) will be different, and we can obtain them by search.
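To make the author's point about \alpha concrete, here is a sketch of the standard Hinton-style KD objective with a `batchmean` reduction. Since switching the reduction rescales the KL term by roughly the number of classes, the \alpha that balances it against the cross-entropy term shifts as well. The values of `alpha` and `T` below are illustrative placeholders, not the paper's tuned settings, and this is not necessarily the repo's exact implementation.

```python
import torch.nn.functional as F

def total_kd_loss(student_logits, teacher_logits, labels, alpha=0.9, T=4.0):
    """Generic KD objective: (1 - alpha) * CE + alpha * T^2 * KL.

    alpha and T are illustrative; their best values depend on how
    the KL term is reduced (element-wise mean vs. batch mean).
    """
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction='batchmean',   # the averaging choice discussed in this issue
    ) * (T ** 2)
    return (1 - alpha) * ce + alpha * kl
```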

Paper99 commented 4 years ago

Thank you.