zhouzaida / channel-distillation

PyTorch implementation for Channel Distillation

training detail #3

Closed YuQi9797 closed 4 years ago

YuQi9797 commented 4 years ago

Hello author, I ran the code and had some questions about the training process. https://github.com/zhouzaida/channel-distillation/blob/master/cifar_train.py#L207

Is loss_alphas[i] the EDT(α) from the paper? In the paper you say you only decrease the weight of the CD loss, so I don't understand this code: every loss seems to have its own EDT(α) hyperparameter.

Also, from the results of a run I see that when epoch < 60, only the CD loss is non-zero and all the other losses are zero. Why is that?

I'm looking forward to your reply. Your reader, Joey.

zgcr commented 4 years ago

Both alpha and factor are parts of the EDT strategy. During training, the code recalculates the current loss_alphas at the beginning of each epoch from the alpha and factor values in config.py and the current epoch. You can find the calculation details in the adjust_loss_alpha function in channel-distillation/utils/util.py. We only apply EDT to the CD loss, by setting the CD loss factor to 0.9. In the first 60 epochs we train the student model with the CD loss alone, which can be seen as initializing the student model with the teacher model.
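
For concreteness, here is a minimal sketch of how such a per-epoch schedule could look. The function name matches adjust_loss_alpha in utils/util.py, but the parameter names (base_alphas, factors, warmup_epochs, decay_every) and the exact decay step are assumptions for illustration, not the repo's actual signature:

```python
# Hypothetical sketch of an EDT-style loss-weight schedule (illustrative only).
# base_alphas: initial weight of each loss term; factors: per-loss decay factor.
# Only the CD loss uses factor < 1 (e.g. 0.9), so only its weight decays.
def adjust_loss_alpha(base_alphas, factors, epoch, warmup_epochs=60, decay_every=30):
    loss_alphas = []
    for alpha, factor in zip(base_alphas, factors):
        if factor == 1.0 and epoch < warmup_epochs:
            # Losses other than CD are switched off during warm-up, so the student
            # is first initialized by the teacher through the CD loss alone.
            loss_alphas.append(0.0)
        else:
            # Exponentially decaying transfer: the weight shrinks as training goes on.
            loss_alphas.append(alpha * factor ** (epoch // decay_every))
    return loss_alphas

# Example: CD loss (factor 0.9) keeps a decaying weight; the other losses are 0 before epoch 60.
print(adjust_loss_alpha([1.0, 1.0, 1.0], [0.9, 1.0, 1.0], epoch=10))
```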

YuQi9797 commented 4 years ago

So loss_alphas[i] can be viewed as a hyperparameter for balancing the losses? Because, with the exception of the CD loss, all other losses have factor = 1, so for them loss_alphas[i] = alpha = loss_rate, and setting loss_rate is how the losses are balanced against each other.

I also have another question about https://github.com/zhouzaida/channel-distillation/blob/master/losses/kd_loss.py#L36 : why correct_t[correct_t == 0.0] = 1.0? The intention is that the teacher only transfers the distribution of correct predictions to the student and simply ignores the incorrect ones.

zgcr commented 4 years ago

We only let the teacher model teach the student model the knowledge of the samples that the teacher predicts correctly.

YuQi9797 commented 4 years ago

Yes, the mask keeps the positions of the correct predictions, and then: correct_s = s.mul(mask), correct_t = t.mul(mask).

So correct_t now keeps the correct predictions and the values at the incorrect positions are 0, and the same holds for correct_s.

But why correct_t[correct_t == 0.0] = 1.0? At the locations where correct_t == 0.0, correct_s is also 0.0.

zgcr commented 4 years ago


The F.kl_div formula:

loss = y * (log(y) - x)

You need to understand how F.kl_div is calculated in PyTorch: the input correct_s has already had the log applied, while the target correct_t has not. I am sure you will see why once you understand this formula.
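
For reference, a minimal sketch of a masked KD loss along these lines (the names student_logits, teacher_logits, labels and the temperature T are assumptions for illustration; see losses/kd_loss.py for the actual implementation):

```python
import torch
import torch.nn.functional as F

# Illustrative sketch of the masked KD loss discussed above (not the repo's exact code).
def masked_kd_loss(student_logits, teacher_logits, labels, T=4.0):
    s = F.log_softmax(student_logits / T, dim=1)   # x in loss = y * (log(y) - x)
    t = F.softmax(teacher_logits / T, dim=1)       # y in loss = y * (log(y) - x)

    # Keep only the samples that the teacher predicts correctly.
    mask = (teacher_logits.argmax(dim=1) == labels).float().unsqueeze(1)
    correct_s = s.mul(mask)
    correct_t = t.mul(mask)

    # correct_s is 0 at the masked-out positions, so setting correct_t to 1 there turns
    # the pointwise term into 1 * (log(1) - 0) = 0 and avoids a 0 * log(0) term.
    correct_t[correct_t == 0.0] = 1.0

    return F.kl_div(correct_s, correct_t, reduction="sum") / student_logits.size(0)
```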

YuQi9797 commented 4 years ago


I’m sorry to have delayed you too much time, but the problem I’m referring to is: because the teacher only passes the knowledge after the prediction is correct to the students, then after passing the mask, the teacher predicts the wrong position, its number is 0, and the student’s position The value of is also 0. But if the value of the teacher here is set to 1, then there is a difference between it and the students.

F.kl_div(q.log(), p) = image

The teacher's distribution is in the numerator, and the student's distribution is in the denominator

zgcr commented 4 years ago


In our code we only need to pay attention to how the KL loss is actually calculated in PyTorch. In the formula loss = y * (log(y) - x), y is correct_t and x is correct_s. Setting y to 1 at those positions has no effect on the final result, because x is also 0 there, so the term becomes 1 * (log(1) - 0) = 0. But if y kept the value 0, the formula would run into log(0).
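
A tiny standalone check of that pointwise formula, plugging in a masked-out position where both values start at 0:

```python
import torch

x = torch.tensor(0.0)   # correct_s at a masked-out position (already in log space)
y = torch.tensor(0.0)   # correct_t at the same position

# Literally evaluating loss = y * (log(y) - x) with y = 0 hits log(0):
print(y * (torch.log(y) - x))   # tensor(nan), since 0 * (-inf) is NaN

# After correct_t[correct_t == 0.0] = 1.0 the term contributes nothing:
y = torch.tensor(1.0)
print(y * (torch.log(y) - x))   # tensor(0.), since 1 * (log(1) - 0) = 0
```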

YuQi9797 commented 4 years ago

wow! I see. Thank you very very very much!  :)


zhouzaida commented 4 years ago

Thanks for your attention to our work. If the issue has been solved, please close it. Any further questions about implementation details or the paper are welcome. @YuQi9797