Thanks for your attention!
We agree that there is a relation to KD, although this was not our motivation; we also noted in Section 3.2 that our loss is similar to self-distillation.
For the theoretical analysis, we mainly focus on proving 1) the relation between our objective and the bootstrapping loss, and 2) the convergence property.
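For context (this is just the standard formulation, not an excerpt from our paper), the soft bootstrapping loss of Reed et al. (2015) that we relate our objective to has the form

$$\mathcal{L}_{\text{boot}} = -\sum_{k} \big[\beta\, t_k + (1-\beta)\, q_k\big] \log q_k,$$

where $t$ is the (one-hot) label, $q$ is the model's predicted distribution, and $\beta$ is a fixed mixing weight.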
Thanks for your idea of label reweighting. I am curious about the theoretical foundation. The designed loss contains the label loss and the pseudo-label loss; the latter seemingly plays the role of a teacher model, as in knowledge distillation, and teaches the current batch during training. I think there is a related sub-field of KD, self-teaching.
Moreover, alpha and beta are both updated during training, which is new compared with KD, where the weight is usually controlled by a constant or a temperature. A rough sketch of my reading is below.
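To make my reading concrete, here is a minimal sketch (in PyTorch; names such as `pseudo_targets`, `alpha`, and `beta` are illustrative, not taken from the paper) of the kind of objective I have in mind:

```python
import torch.nn.functional as F

def combined_loss(logits, labels, pseudo_targets, alpha, beta):
    """Label loss plus a pseudo-label ("teacher"-like) loss, mixed by alpha/beta.

    This is only an illustrative sketch of the structure discussed above,
    not the authors' actual implementation.
    """
    # Cross-entropy against the given (possibly noisy) ground-truth labels.
    label_loss = F.cross_entropy(logits, labels)
    # Cross-entropy against soft pseudo-targets, e.g. the model's own earlier
    # predictions -- the term that plays the teacher role, as in self-distillation.
    log_probs = F.log_softmax(logits, dim=1)
    pseudo_loss = -(pseudo_targets * log_probs).sum(dim=1).mean()
    return alpha * label_loss + beta * pseudo_loss
```

In a training loop, `alpha` and `beta` would then be recomputed each step or epoch by whatever update rule the paper prescribes, rather than being held at a fixed constant or a KD-style temperature; that rule is not reproduced here.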