Closed. Paper99 closed this issue 4 years ago.
I believe this is a bug the author does not realize! I once used nn.KLDivLoss() without the batchmean reduction, and the experiment failed when I set alpha=1.
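For context, here is a minimal sketch (with made-up batch size, class count, and temperature, not the paper's settings) of how much the two reductions differ in scale. PyTorch's default reduction='mean' averages over every element, while reduction='batchmean' averages only over the batch:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, num_classes, T = 128, 100, 20.0  # hypothetical values, not the paper's settings

log_p_student = F.log_softmax(torch.randn(batch_size, num_classes) / T, dim=1)
p_teacher = F.softmax(torch.randn(batch_size, num_classes) / T, dim=1)

# Default 'mean' divides the summed KL by batch_size * num_classes;
# 'batchmean' divides by batch_size only, which matches the mathematical KL divergence.
kl_mean = F.kl_div(log_p_student, p_teacher, reduction='mean')
kl_batchmean = F.kl_div(log_p_student, p_teacher, reduction='batchmean')

print(kl_mean.item(), kl_batchmean.item())
print(kl_batchmean.item() / kl_mean.item())  # ratio is exactly num_classes
```

With alpha=1 the objective reduces to this KL term alone, so a num_classes-fold change in its scale plausibly explains why one setting trains and the other does not.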
Hi, it is not a bug. If you use batchmean, you should search the hyper-parameters again; they will be different from the ones we provide here. It is also normal that the experiment failed when you set alpha=1: for most machine learning models, the hyper-parameters influence the training results, and this applies to both our Tf-KD and standard KD.
Well, according to the definition of the KD loss, the KL divergence measures the distance between the student and teacher distributions, so the summed KL should be divided by the batch size rather than averaged element-wise. The KD loss implementation in RepDistiller also divides by y_s.shape[0].
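For reference, a sketch in the style of RepDistiller's DistillKL (the class name and the T**2 factor follow that repository; this is an illustration, not this repo's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillKL(nn.Module):
    """KD loss that sums the KL divergence over classes and divides by the batch size."""
    def __init__(self, T):
        super().__init__()
        self.T = T

    def forward(self, y_s, y_t):
        p_s = F.log_softmax(y_s / self.T, dim=1)
        p_t = F.softmax(y_t / self.T, dim=1)
        # Summing then dividing by y_s.shape[0] is equivalent to reduction='batchmean'.
        return F.kl_div(p_s, p_t, reduction='sum') * (self.T ** 2) / y_s.shape[0]
```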
Hi, there are different choices for the reduction, for example in Teacher-Assistant KD or the PyTorch implementation. I agree with you that the loss should be averaged over the batch size, and this affects how the hyper-parameters are chosen. Tf-KD still works when averaging over the batch size, but the hyper-parameters (especially alpha) will be different and can be obtained by search.
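As a rough illustration (using the standard Hinton-style KD objective, which may differ in detail from our exact loss), switching the reduction from 'mean' to 'batchmean' multiplies the KL term by the number of classes, so the same alpha balances the two terms very differently:

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, alpha, T, reduction='batchmean'):
    """(1 - alpha) * cross-entropy + alpha * T^2 * KL(teacher || student)."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction=reduction)
    return (1.0 - alpha) * ce + alpha * (T ** 2) * kl
```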
Thank you.
Hello! I noticed that you didn't use the batchmean reduction when calculating the KL loss on PyTorch 1.2.0. Could you please tell me why you didn't use that option? Thanks~