noranart opened this issue 4 years ago
I also have the same question. Isn't that just continued training?
I guess this answers your question:
If you train the same student model for more iterations, the momentum of SGD will be accumulated. We just use the student model of the first generation to initialize the next generation (see the sketch after this list), which:
- finds a better local minimum,
- reduces the generalization gap between the teacher model and the student model.
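To make the difference from plain continued training concrete, here is a minimal sketch of that re-initialization, assuming MXNet Gluon (the network layers and the checkpoint file name are placeholders, not this repo's actual code): generation 2 loads only the generation-1 weights, while the `Trainer`, and with it the SGD momentum buffers and the learning-rate schedule, is created from scratch.

```python
import mxnet as mx
from mxnet import gluon

# Toy stand-in for the real student network (placeholder architecture).
student = gluon.nn.HybridSequential()
student.add(gluon.nn.Dense(128, activation='relu'), gluon.nn.Dense(10))

# Generation 2 starts from the generation-1 *weights only*
# ('student_gen1.params' is a placeholder checkpoint path).
student.load_parameters('student_gen1.params', ctx=mx.cpu())

# A brand-new Trainer: SGD momentum buffers are empty and the learning rate
# restarts at its initial value, unlike simply resuming the previous run.
trainer = gluon.Trainer(student.collect_params(), 'sgd',
                        {'learning_rate': 0.1, 'momentum': 0.9, 'wd': 5e-4})
```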
@Coderx7 @noranart how do you do knowledge distillation with MXNet?
You can look at this: https://github.com/TuSimple/neuron-selectivity-transfer. Good luck!
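That repo implements neuron selectivity transfer (a feature-based distillation), so as a more generic starting point, here is a rough sketch of the classic Hinton-style distillation loss in MXNet Gluon; the temperature `T`, the mixing weight `alpha`, and the name `distillation_loss` are illustrative assumptions, not code from either repo.

```python
from mxnet import nd, gluon

ce_loss = gluon.loss.SoftmaxCrossEntropyLoss()
# from_logits=True: the prediction we pass in is already log-probabilities.
kl_loss = gluon.loss.KLDivLoss(from_logits=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Hard-label term: ordinary cross entropy against the ground-truth labels.
    hard = ce_loss(student_logits, labels)
    # Soft-label term: KL divergence between the temperature-softened teacher
    # and student distributions, rescaled by T^2 as in Hinton et al.
    soft_targets = nd.softmax(teacher_logits / T)
    soft = kl_loss(nd.log_softmax(student_logits / T), soft_targets) * (T * T)
    return alpha * hard + (1.0 - alpha) * soft
```

In a training loop you would compute `teacher_logits` with the frozen teacher (for example under `autograd.pause()`) and back-propagate only through the student.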
@Coderx7 Hey, I'm wondering about the source of these comments. thx.
@Coderx7: got it. thx.
When training student generation 2, you use the student weights from generation 1. Isn't that just continued training? Are you resetting the learning rate, or resetting the weights of some part of the student (e.g., the head)? What exactly is the second-generation student?
Also, could you share the TPR@FPR of the student when trained without the teacher? Is the gain significant?