Open szm-R opened 7 years ago
For hyperparameters, I usually set the loss weight to 1; temperatures around 2 to 10 often bring similar results, but an infinite temperature (i.e. distilling the raw logits) behaves quite differently. A large number of experiments are needed anyway.
Teachers should be good enough, especially for difficult tasks like ImageNet. However, better performance of the teacher doesn't always lead to better distillation results. You may need to try several teachers.
Remember to freeze all the parameters of the teacher: set lr_mult and decay_mult to 0; set use_global_stats to true for BatchNorm layers; change Dropout layers to Scale layers; etc.
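For instance, a frozen BatchNorm layer in the teacher might look like this (a sketch; the layer and blob names are placeholders, not from the repo):
layer {
  name: "bn1_teacher"
  type: "BatchNorm"
  bottom: "conv1_teacher"
  top: "conv1_teacher"
  # use the stored moving averages instead of the current batch statistics
  batch_norm_param { use_global_stats: true }
  # BatchNorm keeps three blobs (mean, variance, moving-average factor); freeze all of them
  param { lr_mult: 0 decay_mult: 0 }
  param { lr_mult: 0 decay_mult: 0 }
  param { lr_mult: 0 decay_mult: 0 }
}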
Hello again and thanks for your answer. Could you perhaps tell me which teachers and students you have tried so far? (If they are well-known ones like AlexNet, GoogleNet, SqueezeNet, etc.)
I haven't tried many models myself. I advise you to take a look at section 4 of this paper. I think various ResNets are useful for validating the training methods.
@wentianli Hi, I have some trouble using this layer. Could you release an example .prototxt? Thanks a lot.
@wentianli could you please provide an example of how this can be implemented and trained?
For example, the prototxt for CIFAR10 goes like this...
First, there is a Data layer.
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    mirror: true
    crop_size: 32
    mean_value: 125.30691805
    mean_value: 122.95039414
    mean_value: 113.86538318
  }
  data_param {
    source: "/home/cifar10_pad4_train_lmdb"
    batch_size: 128
    backend: LMDB
  }
  # note: image_data_param is not read by a Data (LMDB) layer, so this shuffle setting has no effect
  image_data_param {
    shuffle: true
  }
}
Then, blob 'data' is fed into the first layer of the student network.
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  convolution_param {
    num_output: 16
    pad: 1
    kernel_size: 3
    stride: 1
    weight_filler {
      type: "msra"
    }
    bias_filler {
      type: "constant"
    }
  }
}
...
For a classification task, there is usually an InnerProduct layer which outputs score.
layer {
  name: "score"
  type: "InnerProduct"
  bottom: "pool_global"
  top: "score"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  inner_product_param {
    num_output: 10
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
In most cases, we use a SoftmaxWithLoss layer to compute the cross entropy loss between score and the ground truth label. For knowledge distillation, you can keep it and use a smaller loss_weight.
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  include {
    phase: TRAIN
  }
  bottom: "score"
  bottom: "label"
  top: "loss"
  loss_weight: 1
}
Similarly, we feed the data into the teacher network. Remember to freeze its weights.
layer {
  name: "conv1_teacher"
  type: "Convolution"
  bottom: "data"
  top: "conv1_teacher"
  param {
    lr_mult: 0
    decay_mult: 0
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  convolution_param {
    num_output: 32
    pad: 1
    kernel_size: 3
    stride: 1
  }
}
...
The teacher network also produces a score for classification. Here, we name the blob score_teacher. It corresponds to the term "soft label" or "soft target" in the reference paper.
layer {
  name: "score_teacher"
  type: "InnerProduct"
  bottom: "pool_global_teacher"
  top: "score_teacher"
  param {
    lr_mult: 0
    decay_mult: 0
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  inner_product_param {
    num_output: 10
  }
}
Finally, a KnowledgeDistillation layer computes the KL loss between score and score_teacher.
layer {
  name: "KD"
  type: "KnowledgeDistillation"
  bottom: "score"
  bottom: "score_teacher"
  top: "KL_loss"
  include { phase: TRAIN }
  knowledge_distillation_param { temperature: 4 }
  loss_weight: 1
}
Here is another way to implement it. Since the teacher network is fixed, we can compute and save the aforementioned score_teacher (in HDF5 format) beforehand. When we train the student network, we simply load score_teacher with an HDF5Data layer, and the teacher network is no longer included in the prototxt. This is more efficient. However, it behaves slightly differently if data augmentation is used, because the cached scores were computed on un-augmented inputs.
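For reference, loading the cached scores might look roughly like this (a sketch; the file path is a placeholder, and the HDF5 files are assumed to contain a dataset named score_teacher):
layer {
  name: "score_teacher"
  type: "HDF5Data"
  top: "score_teacher"
  include { phase: TRAIN }
  hdf5_data_param {
    # text file listing the .h5 files that hold the precomputed teacher scores
    source: "/home/score_teacher_list.txt"
    batch_size: 128
  }
}
Note that HDF5Data reads samples sequentially, so the cached scores must be stored in the same order in which the Data layer delivers the images.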
Thank you very much. I had the impression that we first train our teacher network on a dataset and calculate all the logits, then train our student model on the same dataset and calculate its logits, and then use these two vectors of logits for distillation. Looking at your example and explanation, it seems the teacher and student networks are trained in parallel, and pretraining them is not necessary, right?
Edit: I see your second comment, which clears everything up now. Thank you very much :)
@wentianli thanks for your training prototxt. I also see a training problem: if the teacher network is included in training, its complexity forces the batch size to be small. If we first save the outputs of the teacher network, training the student network could use a larger batch size, but this would require changing the input layers. Thanks for sharing; I will have a try.
@zhanglaplace you can use iter_size in solver.prototxt for a large batch size with limited GPU memory. Besides, training the teacher and student networks simultaneously can be called mutual learning, which is very tricky.
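A sketch of the relevant solver.prototxt fields (values are only illustrative): with batch_size: 32 in the data layer and iter_size: 4, gradients are accumulated over 4 forward/backward passes, giving an effective batch size of 128.
# solver.prototxt (illustrative values)
net: "train_val.prototxt"
iter_size: 4        # accumulate gradients over 4 passes -> effective batch = 4 x batch_size
base_lr: 0.1
momentum: 0.9
weight_decay: 0.0001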
@wentianli thanks
@wentianli you mentioned earlier that in the teacher network we should
change Dropout layer to Scale layer, etc
Why should this be changed?
@wentianli Hi, I saw that you use two loss layers, "SoftmaxWithLoss" and "KnowledgeDistillation", which both use score as a bottom. But in the code of Caffe's InnerProduct layer, it only uses the diff from one top blob, unlike the conv layer, which accumulates diffs. So the network may be trained using only one loss. Could you provide the training results using the prototxt above? Does the network work well?
@dawuchen Caffe automatically splits a blob when it is used twice. The diffs are thus accumulated.
@wentianli You are right. I was mistaken about the accumulate operation in the conv layer; it is for different kernels. Thanks.
My prototxt is like this:
input: "data"
layer {
  name: "data"
  type: "ImageData"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  transform_param {
    mirror: true
    mean_value: 127.5
    mean_value: 127.5
    mean_value: 127.5
    scale: 0.0078125
  }
  image_data_param {
    source: "cifarlist.txt"
    batch_size: 32
    new_width: 112
    new_height: 112
    is_color: true
    shuffle: true
  }
}
layer {
  bottom: "data"
  top: "conv1"
  name: "conv1"
  type: "Convolution"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  convolution_param {
    num_output: 32
    kernel_size: 3
    pad: 1
    stride: 1
    weight_filler { type: "msra" }
    bias_filler { type: "constant" value: 0 }
  }
}
...
layer {
  bottom: "pool_avg"
  top: "classifier"
  name: "classifier"
  type: "InnerProduct"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  inner_product_param {
    num_output: 10
    weight_filler { type: "msra" }
    bias_filler { type: "constant" value: 0 }
  }
}
layer {
  name: "softmax_loss1"
  type: "SoftmaxWithLoss"
  bottom: "classifier"
  bottom: "label"
  top: "softmax_loss1"
}
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "classifier"
  bottom: "label"
  top: "accuracy"
  include: { phase: TRAIN }
}
layer {
  bottom: "data"
  top: "conv1s"
  name: "conv1s"
  type: "Convolution"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  convolution_param {
    num_output: 16
    kernel_size: 3
    pad: 1
    stride: 1
    weight_filler { type: "msra" }
    bias_filler { type: "constant" value: 0 }
  }
}
...
layer {
  bottom: "pool_avgs"
  top: "classifiers"
  name: "classifiers"
  type: "InnerProduct"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  inner_product_param {
    num_output: 10
    weight_filler { type: "msra" }
    bias_filler { type: "constant" value: 0 }
  }
}
layer {
  name: "softmax_loss2"
  type: "SoftmaxWithLoss"
  bottom: "classifiers"
  bottom: "label"
  top: "softmax_loss2"
  loss_weight: 0.2
}
layer {
  name: "accuracys"
  type: "Accuracy"
  bottom: "classifiers"
  bottom: "label"
  top: "accuracys"
  include: { phase: TRAIN }
}
layer {
  name: "KL_loss"
  type: "KnowledgeDistillation"
  bottom: "classifiers" # student
  bottom: "classifier"  # teacher
  top: "KL_loss"
  include { phase: TRAIN }
  knowledge_distillation_param { temperature: 4 }
  loss_weight: 1
}
When I train it, the log warns: "KnowledgeDistillation Layer cannot backpropagate to soft label nor label inputs".
@qinxianyuzi the warning occurs because the second bottom doesn't receive any gradients. You can use propagate_down to stop backprop:
layer {
  name: "KL_loss"
  type: "KnowledgeDistillation"
  bottom: "classifiers" # student
  bottom: "classifier"  # teacher
  propagate_down: 1
  propagate_down: 0
  top: "KL_loss"
  include { phase: TRAIN }
  knowledge_distillation_param {
    temperature: 4
  }
  loss_weight: 1
}
Or freeze the teacher network, as said in #2.
@wentianli Thanks very much. Is it harder for the student to learn with a higher temperature?
@qinxianyuzi the optimal temperature is often between 2 and 10
@wentianli Thank you! Sometimes we should try different temperatures according to the training task.
hello, @wentianli when I have two teacher models (i.e., an ensemble), how should I arrange the logits of each teacher model? If I choose an averaging strategy to combine the two teachers, can I take the mean of their logits directly when training a student model?
@liangzimei Averaging logits is incorrect. The KL loss sums -p_i * log(q_i) (plus a constant) over every class i, where p_i is the probability the teacher produces for class i. When there are two teachers, this term becomes -0.5 * p1_i * log(q_i) - 0.5 * p2_i * log(q_i) (plus a constant), which means you need two KnowledgeDistillation layers.
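A sketch with two teachers could look like this (the blob names score_teacher1/score_teacher2 are placeholders; the 0.5 loss_weight on each layer plays the role of the averaging):
layer {
  name: "KD1"
  type: "KnowledgeDistillation"
  bottom: "score"          # student logits
  bottom: "score_teacher1" # first teacher's logits
  propagate_down: 1
  propagate_down: 0
  top: "KL_loss1"
  include { phase: TRAIN }
  knowledge_distillation_param { temperature: 4 }
  loss_weight: 0.5
}
layer {
  name: "KD2"
  type: "KnowledgeDistillation"
  bottom: "score"
  bottom: "score_teacher2" # second teacher's logits
  propagate_down: 1
  propagate_down: 0
  top: "KL_loss2"
  include { phase: TRAIN }
  knowledge_distillation_param { temperature: 4 }
  loss_weight: 0.5
}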
@wentianli thank you so much, I will have a try. Did you implement a softmax loss layer with temperature for training the teacher, or just use a Power layer? Thanks in advance.
@liangzimei I didn't use temperature when training a teacher. A Scale layer with fixed weights could solve that.
layer {
  name: "XXX"
  type: "Scale"
  bottom: "XXX"
  top: "XXX"
  param { lr_mult: 0 decay_mult: 0 }
  scale_param {
    filler { value: 0.5 } # here temperature = 2
    bias_term: false
  }
}
@wentianli ok, you mean that when training a teacher, temperature = 1 is used in most cases (including Hinton's paper)?
Can it be used in a regression model? For example, face alignment?
@liuqunzhong L1 or L2 loss is used for regression. The knowledge distillation layer is implemented for classification.
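(If you do want to distill a regression model anyway, one option, not specific to this repo, is an ordinary L2 loss between the student's and the frozen teacher's outputs; a sketch with placeholder blob names:)
layer {
  name: "distill_l2"
  type: "EuclideanLoss"
  bottom: "output_student"  # placeholder: student's regression output
  bottom: "output_teacher"  # placeholder: frozen teacher's regression output
  propagate_down: 1
  propagate_down: 0
  top: "l2_distill_loss"
  loss_weight: 1
}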
hello, @wentianli when I train a student model (i.e., mobilenet-v1) taught by an ensemble model (2 models, one of them mobilenet-v1), the student's accuracy always falls between the two teachers. Any suggestions? Thanks in advance...
@liangzimei You mean the student model outperforms its counterpart but underperforms the ensemble teacher? That is expected. To obtain better accuracy, you probably need to replace the mobilenet-v1 in the ensemble with a better model.
Thanks for sharing. Why is KL divergence adopted for the loss instead of cross-entropy?
@WormCoder The only difference between KL divergence and cross entropy is a constant term, which doesn't affect backprop at all. When the student and the teacher have exactly the same outputs (this is our goal for training), the KL divergence becomes zero.
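Spelled out with the standard definitions (not from this thread), with p the teacher's softened distribution and q the student's:
\[
\mathrm{KL}(p\,\|\,q) \;=\; \sum_i p_i \log \frac{p_i}{q_i}
\;=\; \underbrace{-\sum_i p_i \log q_i}_{\text{cross entropy } H(p,q)} \;-\; \underbrace{\Big(-\sum_i p_i \log p_i\Big)}_{H(p)}
\]
Since H(p) depends only on the frozen teacher, its gradient with respect to the student's parameters is zero, so minimizing either objective gives the same updates, and the KL form conveniently reaches zero when q matches p.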
The student input is 24x24 and the teacher input is 48x48, with two datasets in the same data order. Is that OK?
@WormCoder yeah, it drops fast in the beginning.
@liangzimei hello, I want to know how to freeze the teacher model. If I prefer to set propagate_down: false, do I need to set weight decay = 0 in solver.prototxt? I mean, is it enough to freeze a model with the single parameter propagate_down?
@westnight when training the student model, we should freeze the teacher: set lr_mult: 0 and decay_mult: 0 in every layer of the teacher to avoid updating its parameters. BN layers may be different; you can refer to the previous replies.
@liangzimei thank you. To freeze the teacher model, I know that setting lr and weight decay to zero works, but I wonder whether using propagate_down is another way.
@westnight according to my understanding, it is. When I train a student model, I use both propagate_down: false and lr_mult: 0 to be safe.
Thank you for your great work. To compose the prototxt for training, should I write the teacher prototxt and student prototxt into a single file? If so, how do I initialize the 'teacher' with a pretrained caffemodel? After training, how do I save only the 'student' part of the model? Could you please send me a copy of your prototxt? Thank you so much.
@iamweiweishi Into a single file? Yes. Because the 'student' is randomly initialized, it is better to name the layers of the 'teacher' identically to the pretrained caffemodel, which allows you to load the pretrained model directly. To save a partial model, a convenient way is to save the model in HDF5 format, and then you can rename or delete some blobs. For an example prototxt, please see the comment above.
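For the HDF5 snapshot, the solver supports it directly; a sketch (the prefix path is a placeholder):
# solver.prototxt (sketch)
snapshot: 10000
snapshot_prefix: "snapshots/student"  # placeholder path
snapshot_format: HDF5                 # weights are written as .caffemodel.h5, which is easy to edit offline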
Thank you. @wentianli It works now.
@wentianli I'm sorry to bother you. I still don't know how to initialize the 'teacher' with a pretrained caffemodel. I have written the teacher prototxt and student prototxt into a single file, but I don't know how to use teacher.caffemodel when I start training the student model. You say "it is better to name the layers of 'teacher' identical with the pretrained caffemodel, which allows you to directly load the pretrained model." I have already named the layers of the 'teacher' identically to the pretrained caffemodel (teacher.caffemodel), but it doesn't load the pretrained model. Should I change the name of teacher.caffemodel, and where should I put it? Should teacher.caffemodel and the final student.caffemodel go in the same place? Thank you so much.
@iamweiweishi I'm sorry to bother you. I still don't know how to initialize the 'teacher' with a pretrained caffemodel. I have written the teacher prototxt and student prototxt into a single file, but I don't know how to use teacher.caffemodel when I start training the student model. I have already named the layers of the 'teacher' identically to the pretrained caffemodel (teacher.caffemodel), but it doesn't load the pretrained model. Should I change the name of teacher.caffemodel, and where should I put it? Should teacher.caffemodel and the final student.caffemodel go in the same place? Thank you so much.
@hito0512 Sorry to bother you. Did you solve this problem in the end? I have the same trouble.
L1 or L2 loss is used for regression
What about distillation with an L1 or L2 loss?
@hito0512 Sorry for disturbing you. I need to set up training in the fashion described above. Could you please share the procedure if you have found it? Thank you.
Hi wentianli
I've been testing the knowledge distillation method for a while by playing with Caffe's available layers, and I was able to achieve reasonably good results with some simple models. A couple of days ago I came across your layer; I examined the source code and it looks like a good implementation to me. Now I'm trying to use your layer to improve the accuracy of GoogleNet (using a ResNet model as the teacher). I wanted to ask for any tips you might have about this process, about tuning hyperparameters like the loss weights, the solver type, the learning rate, etc.
I appreciate any help greatly.