Closed YuQi9797 closed 4 years ago
Both alpha and factor are parts of the EDT strategy. During training, the code recalculates the current loss_alphas at the beginning of each epoch, using alpha and factor from config.py together with the current epoch. You can find the calculation details in the adjust_loss_alpha function in channel-distillation/utils/util.py. We only apply EDT to the CD loss, by setting its factor to 0.9. In the first 60 epochs, we train the student model with the CD loss alone, which can be viewed as initializing the student model from the teacher model.
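As a rough illustration, the schedule described above could look like the sketch below. This is hypothetical; the names is_cd_loss, init_epochs and decay_interval are my assumptions, not the repo's actual adjust_loss_alpha signature.

```python
def adjust_loss_alpha(alpha, epoch, factor=0.9, is_cd_loss=True,
                      init_epochs=60, decay_interval=30):
    """Hypothetical EDT-style weight schedule (illustrative only).

    - CD loss (factor < 1): the weight alpha decays by `factor`
      every `decay_interval` epochs.
    - Other losses (factor == 1): the weight is 0 during the first
      `init_epochs` epochs (student initialization stage), then alpha.
    """
    if is_cd_loss:
        return alpha * (factor ** (epoch // decay_interval))
    return 0.0 if epoch < init_epochs else alpha
```

The point is only the shape of the schedule: the CD weight shrinks geometrically, while the other losses switch on after the initialization stage.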
So loss_alphas[i] can be viewed as a hyperparameter to balance the losses? Because, with the exception of the CD loss, every factor = 1, so loss_alphas[i] = alpha = loss_rate. By setting loss_rate, we fine-tune the balance between losses.
And I have another question about https://github.com/zhouzaida/channel-distillation/blob/master/losses/kd_loss.py#L36
why correct_t[correct_t == 0.0] = 1.0?
Our intention is that the teacher only transfers its positive prediction distribution to the student and directly ignores the negative. That is, we only let the teacher model teach the student model on samples that the teacher predicts correctly.
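For intuition, here is a minimal pure-Python sketch of that per-sample masking idea. It is illustrative only; the actual kd_loss.py operates on batched torch tensors, and teacher_correct_mask is a name I made up.

```python
def teacher_correct_mask(teacher_preds, labels):
    """Return 1.0 for samples the teacher classifies correctly, else 0.0.

    teacher_preds: predicted class index per sample (argmax of the logits).
    labels: ground-truth class index per sample.
    """
    return [1.0 if p == y else 0.0 for p, y in zip(teacher_preds, labels)]
```

Multiplying both the student's and the teacher's per-sample distributions by this mask zeroes out the samples the teacher got wrong.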
Yeah, the mask retains the positions of correct predictions, and then:
correct_s = s.mul(mask)
correct_t = t.mul(mask)
I think correct_t now retains the correct predictions, with the incorrect positions set to 0, and so does correct_s.
But for correct_t[correct_t == 0.0] = 1.0, I don't understand why we do that, because at the locations where correct_t == 0.0, correct_s is also 0.0.
F.kl_div formula:
loss = y * (log(y) - x)
You need to understand how F.kl_div is computed in PyTorch. The input correct_s has already gone through a log operation, but the input correct_t has not. I am sure you will see why once you understand the F.kl_div formula.
I'm sorry to take up so much of your time, but the problem I'm referring to is this: since the teacher only passes the knowledge of correctly predicted samples to the student, after applying the mask, the positions the teacher predicted wrongly are 0 in the teacher's tensor, and the student's values at those positions are also 0. But if the teacher's value there is set to 1, then there is a difference between the teacher and the student at those positions.
F.kl_div(q.log(), p) = KL(p || q) = sum over i of p_i * (log p_i - log q_i)
The teacher's distribution p is in the numerator, and the student's distribution q is in the denominator.
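A minimal pure-Python check of that direction (illustrative only; PyTorch's F.kl_div applies the same pointwise formula, with the target distribution in the numerator):

```python
import math

def kl_div(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i).

    With F.kl_div(q.log(), p), p (the target / teacher) ends up in the
    numerator and q (the input / student) in the denominator.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

kl_div([0.5, 0.5], [0.5, 0.5])  # identical distributions -> 0.0
```

So a zero divergence means the student matches the teacher exactly, and any mismatch is penalized in proportion to the teacher's probability mass.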
In our code, we only need to pay attention to how the KL loss is actually computed in PyTorch. In the formula loss = y * (log(y) - x), y is correct_t and x is correct_s. In this formula, setting y to 1 has no effect on the final result, because x is 0 at those positions, so the term becomes 1 * (log(1) - 0) = 0. But if y kept the value 0, there would be an error from log(0).
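Concretely, here is a sketch of just the pointwise F.kl_div term (the formula only, not PyTorch itself; kd_term is an illustrative name):

```python
import math

def kd_term(y, x):
    """Pointwise PyTorch-style KL term: y * (log(y) - x).

    y plays the role of correct_t (a probability), x the role of
    correct_s (a log-probability, already 0 at masked positions).
    """
    return y * (math.log(y) - x)

# At a masked position, correct_s is 0; replacing correct_t's 0 with 1
# makes the term 1 * (log(1) - 0) = 0, so it contributes nothing to
# the loss. Leaving y = 0 would evaluate log(0) instead.
kd_term(1.0, 0.0)  # -> 0.0
```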
wow! I see. Thank you very very very much! :)
Thanks for your attention to our work. If the issue has been solved, please close it. Any questions about implementation details or the paper are welcome. @YuQi9797
Hello author, I ran the code and have some questions about the training process. https://github.com/zhouzaida/channel-distillation/blob/master/cifar_train.py#L207
Is loss_alphas[i] the EDT (α)? In the paper, you say you only decrease the weight of the CD loss, so I don't understand the meaning of this code, where every loss has the EDT (α) hyperparameter. Also, from the results of a run, I see that when epoch < 60 there is only the CD loss, and the other losses are zero. Why do we do this?
I'm looking forward to your reply. Your reader, Joey.