zma-c-137 / VarGFaceNet


Generation 2 of recursive knowledge distillation #11

Open noranart opened 4 years ago

noranart commented 4 years ago

When training the student in generation 2, you use the student weights from generation 1. Isn't that just continued training? Are you resetting the learning rate, or resetting the weights of some part of the student (e.g., the head)? What exactly is the second-generation student?

Also, could you share the TPR@FPR of the student when trained without a teacher? Is the gain significant?

xiexuliunian commented 4 years ago

I also have the same question. Isn't that just continued training?

Coderx7 commented 4 years ago

I guess this answers your question:

If you train the same student model for more iterations, the SGD momentum will be accumulated. We just use the student model of the first generation to initialize the next generation, which:

  1. finds a better local minimum,
  2. relieves the generalization gap between the teacher model and the student model.
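For illustration, the difference might look roughly like this in Gluon-style Python. This is only a sketch of the idea, not code from this repository; `build_student`, the checkpoint path, and the hyperparameter values are placeholders:

```python
import mxnet as mx
from mxnet import gluon

def build_student():
    # Tiny stand-in for the real student network (VarGFaceNet in the paper).
    net = gluon.nn.HybridSequential()
    net.add(gluon.nn.Dense(128, activation='relu'),
            gluon.nn.Dense(10))
    return net

def continue_training(student, trainer):
    # "Just continue training": the same Trainer is reused, so the
    # accumulated SGD momentum and the decayed learning rate carry over.
    return student, trainer

def start_generation_two(ckpt='student_gen1.params', ctx=mx.cpu()):
    # Recursive KD, generation 2: only the generation-1 *weights* are reused.
    # A fresh Trainer is created, so the momentum buffers start at zero and
    # the learning-rate schedule restarts; the student is then distilled
    # from the same teacher again.
    student = build_student()
    student.load_parameters(ckpt, ctx=ctx)
    trainer = gluon.Trainer(student.collect_params(), 'sgd',
                            {'learning_rate': 0.1, 'momentum': 0.9,
                             'wd': 5e-4})
    return student, trainer
```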
ctgushiwei commented 4 years ago

@Coderx7 @noranart How do you do knowledge distillation with MXNet?

xiexuliunian commented 4 years ago

You can look at this: https://github.com/TuSimple/neuron-selectivity-transfer. Good luck!
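For what it's worth, a single distillation step in Gluon could look roughly like the sketch below. This is generic soft-label (Hinton-style) KD, not necessarily the loss used in this repository or in the linked repo; `student`, `teacher`, `trainer`, and the temperature/weight values are assumptions:

```python
import mxnet as mx
from mxnet import gluon, autograd, nd

ce_loss = gluon.loss.SoftmaxCrossEntropyLoss()
kl_loss = gluon.loss.KLDivLoss(from_logits=False)  # softmax is applied to pred internally

T, alpha = 4.0, 0.5  # temperature and mixing weight (placeholder values)

def distill_step(student, teacher, trainer, data, label):
    # The teacher runs outside autograd.record(), so it stays frozen.
    soft_targets = nd.softmax(teacher(data) / T)
    with autograd.record():
        logits = student(data)
        hard = ce_loss(logits, label)                     # supervised term
        soft = kl_loss(logits / T, soft_targets) * T * T  # distillation term
        loss = (1 - alpha) * hard + alpha * soft
    loss.backward()
    trainer.step(data.shape[0])
    return loss.mean().asscalar()
```

Gradients only flow into the student; the teacher's outputs are treated as fixed soft targets.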

ainnn commented 4 years ago

@Coderx7 Hey, I'm wondering about the source of these comments (quoted below). Thanks.

I guess this answers your question:

If you train the same student model for more iterations, the SGD momentum will be accumulated. We just use the student model of the first generation to initialize the next generation, which:

  1. finds a better local minimum,
  2. relieves the generalization gap between the teacher model and the student model.

Coderx7 commented 4 years ago

@ainnn: Here you are: src

ainnn commented 4 years ago

@Coderx7: Got it. Thanks.