yuanli2333 / Teacher-free-Knowledge-Distillation

Knowledge Distillation: CVPR2020 Oral, Revisiting Knowledge Distillation via Label Smoothing Regularization

Mismatch between Eq.9 in the paper and the code #19

Open MingSun-Tse opened 4 years ago

MingSun-Tse commented 4 years ago

Hello, thanks for your great work! I have a question about a possible mismatch between Eq.9 in the paper and the actual implementation in the code.

Here are the loss equations of LS and your proposed regularization: [image showing Eq.3 (LS), Eq.9 (the regularization as written in the paper), and Eq.10 (as implemented in the code)]
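
Roughly, the losses being compared are of the following forms (my paraphrase based on the discussion below; the exact notation and equation numbering should be checked against the paper):

$L_{LS} = (1-\alpha)\,H(q, p) + \alpha\,D_{KL}(u \,\|\, p)$  (Eq.3, label smoothing)
$L_{reg} = (1-\alpha)\,H(q, p) + \alpha\,D_{KL}(p^d_\tau \,\|\, p_\tau)$  (Eq.9, as written in the paper)
$L_{reg} = (1-\alpha)\,H(q, p) + \alpha\,D_{KL}(p^d_\tau \,\|\, p)$  (Eq.10, as implemented in the code)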

As seen, the temperature $\tau$ is missing for the $p$ in Eq.10 compared with Eq.9. This might be problematic: in your paper (Sec.4) and many other places (like this issue and the 2020 ICLR openreview), when you differentiate your method from Label Smoothing (LS), the existence of the temperature is an essential factor supporting your claim, while in practice it is not used. This looks like a big mismatch in terms of methodology, because for Eq.10 above, I can set $\tau$ to a very large number to make $p^d_{\tau}$ become the uniform distribution $u$ (in fact, the values you picked -- 20 or 40 in Tab.6 of your supplementary material -- are large enough to make this happen; you can print the value of F.softmax(teacher_soft/T, dim=1) in your code to verify this), then set the $\alpha$ in Eq.10 to the $\alpha$ in Eq.3. Then Eq.10 becomes exactly the same as Eq.3. This shows your implementation is truly an over-parameterized version of LS, contradicting your claim in the paper and many other places. Do you have any comments about this potential problem?
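
For concreteness, a small standalone check (the hand-crafted teacher construction below is only my illustration, not copied from your repo):

```python
import torch
import torch.nn.functional as F

# With T = 20 or 40, softmax(teacher_soft / T) is almost exactly the
# uniform distribution u over the classes.
n_class, correct_prob = 100, 0.99
teacher_soft = torch.full((1, n_class), (1 - correct_prob) / (n_class - 1))
teacher_soft[0, 0] = correct_prob  # assume class 0 is the ground-truth class

for T in (1, 20, 40):
    p_d = F.softmax(teacher_soft / T, dim=1)
    print(f"T={T:>2}  max={p_d.max().item():.4f}  min={p_d.min().item():.4f}  "
          f"uniform=1/K={1 / n_class:.4f}")
```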

I may misunderstand something; if so, please point it out. Thanks again!

yuanli2333 commented 4 years ago

Hi, your understanding of LSR is correct. Actually, the loss function (Eq.10, as numbered here) is just for CIFAR100. I checked on ImageNet before using the same equation as Eq.9 and achieved better results than LSR. But on CIFAR10/100, I searched over some hyper-parameters for \tau and \alpha and only got improvements similar to LSR when using Eq.9. I will double check the results on CIFAR100 using Eq.9, and upload the loss function for ImageNet.

MingSun-Tse commented 4 years ago

Hi @yuanli2333 ,

Many thanks for your reply! It will be great to have the updated results. If LSR is equivalent to the proposed Tf_reg loss on CIFAR10/100, that may undermine the effectiveness of your proposed method.

Second question, about the KD loss implementation, also noted in a previous issue, but I do not fully understand it: for your implemented KD loss, the total mean instead of batchmean is used as the reduction. This means that, for the KLD loss term, the real value in the code is 1/n_class of its intended value. Therefore, when the KD loss is computed as KD_loss = (1. - alpha) * loss_CE + alpha * D_KL, what really happens is KD_loss = (1. - alpha) * loss_CE + alpha/n_class * D_KL. As seen, the multipliers before loss_CE and D_KL no longer sum to 1. This may also affect your results and conclusions, especially for CIFAR100, TinyImageNet and ImageNet, where n_class = 100, 200 and 1000, which is quite large. The grid search range of alpha is roughly [0.01, 1] (based on the supplementary material), but it is now unexpectedly scaled by 1/100 to 1/1000, so the search range possibly does not cover the (nearly) optimal value at all. For ImageNet, for example, the alpha used in the paper is 0.1; based on the above, the effective factor is 0.1/1000 = 0.0001, so the KD loss becomes 0.9 * loss_CE + 0.0001 * D_KL. The weight of D_KL seems too small to really play a part. Does this potentially affect the results and conclusions?
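
A quick way to see the scaling (a minimal sketch with random logits, not your actual training code):

```python
import torch
import torch.nn.functional as F

# How the KL term shrinks when the elementwise 'mean' reduction is used
# instead of 'batchmean'.
torch.manual_seed(0)
batch, n_class = 8, 1000  # e.g. ImageNet
log_p = F.log_softmax(torch.randn(batch, n_class), dim=1)  # student
q = F.softmax(torch.randn(batch, n_class), dim=1)          # teacher

kl_mean = F.kl_div(log_p, q, reduction='mean')            # divides by batch * n_class
kl_batchmean = F.kl_div(log_p, q, reduction='batchmean')  # divides by batch only (true D_KL)

print((kl_batchmean / kl_mean).item())  # ~= n_class = 1000
# So alpha * kl_mean is effectively (alpha / n_class) * D_KL.
```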

If I misunderstand anything, please point it out. Thanks!

JosephChenHub commented 4 years ago


Hi, I would like to share my thoughts. The alpha in the CIFAR100 experiments is set to 0.95, which means the loss is 0.05 * CE + 0.95 * KL. However, the implementation does use the batch size and the number of classes to calculate the mean, and the author uses a large multiplier to amplify the loss.

Besides this issue, there is a little trick in that the ResNet18 baseline on CIFAR100 is relatively low, which may be caused by the modified ResNet. Another implementation of ResNet18 ([pytorch-cifar100](https://github.com/weiaicunzai/pytorch-cifar100)) obtains 76.39% top-1 accuracy on CIFAR100, and in my own experiments the accuracy reaches up to 78.05% without any extra augmentations. So I would cast doubt on the reported performance gain on CIFAR100. I have also conducted a self-distillation experiment based on my baseline, and it only improved the accuracy from 77.96% to 78.45%. See my issue: #20
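
Putting the pieces being discussed together, here is a hypothetical sketch of such a regularization loss (the function name, argument names, multiplier value, and teacher construction are my assumptions, not copied from the repository); it makes the interaction of alpha, the elementwise-mean KL, and the multiplier explicit:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def tf_reg_loss_sketch(outputs, labels, alpha=0.95, T=20.0,
                       correct_prob=0.99, multiplier=100.0):
    """Hypothetical sketch of the regularization loss discussed in this
    thread; names and default values are assumptions, not the repo's code."""
    n_class = outputs.size(1)
    loss_ce = F.cross_entropy(outputs, labels)

    # Hand-crafted "teacher": high probability on the label, rest uniform.
    teacher = torch.full_like(outputs, (1 - correct_prob) / (n_class - 1))
    teacher.scatter_(1, labels.unsqueeze(1), correct_prob)

    # No temperature on the student side (the Eq.9 vs Eq.10 mismatch), and the
    # elementwise 'mean' reduction divides the KL by batch * n_class, which the
    # large multiplier then has to compensate for.
    loss_kl = nn.KLDivLoss(reduction='mean')(
        F.log_softmax(outputs, dim=1),
        F.softmax(teacher / T, dim=1)) * multiplier

    return (1. - alpha) * loss_ce + alpha * loss_kl


# Example usage with random data:
outputs = torch.randn(8, 100)          # batch of 8, 100 classes
labels = torch.randint(0, 100, (8,))
print(tf_reg_loss_sketch(outputs, labels))
```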

wutaiqiang commented 2 years ago

The so-called distillation is de facto label smoothing. With a large T, F.softmax(teacher_output/T, dim=1) is nearly a uniform distribution, which is the same as label smoothing.