Open szm-R opened 7 years ago
For hyperparameters, I usually set the loss weight to 1; temperatures around 2 to 10 often bring similar results, but an infinite temperature (i.e. distilling the raw logits) behaves quite differently. A large number of experiments are needed anyway.
Teachers should be good enough, especially for difficult tasks like ImageNet. However, better performance of the teacher doesn't always lead to better distillation results. You may need to try several teachers.
Remember to freeze all the parameters of the teacher: set lr_mult and decay_mult to 0; set use_global_stats to true for BatchNorm layers; change Dropout layers to Scale layers; etc.
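For instance, a frozen BatchNorm layer in the teacher might look like this (a sketch; the layer and blob names are placeholders, not from the repo):
layer {
  name: "bn1_teacher"
  type: "BatchNorm"
  bottom: "conv1_teacher"
  top: "conv1_teacher"
  # use the stored moving averages instead of the current batch statistics
  batch_norm_param { use_global_stats: true }
  # BatchNorm keeps three blobs (mean, variance, moving-average factor); freeze all of them
  param { lr_mult: 0 decay_mult: 0 }
  param { lr_mult: 0 decay_mult: 0 }
  param { lr_mult: 0 decay_mult: 0 }
}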
Hello again and thanks for your answer. Could you perhaps tell me which teachers and students you have tried so far? (If they are well-known ones like AlexNet, GoogleNet, SqueezeNet, etc.)
I haven't tried many models myself. I advise you to take a look at section 4 of this paper. I think various ResNets are useful for validating the training methods.
@wentianli Hi, I have some trouble using this layer. Could you release an example .prototxt? Thanks a lot.
@wentianli could you please provide an example of how this can be implemented and trained?
For example, the prototxt for CIFAR10 goes like this...
First, there is a Data layer.
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    mirror: true
    crop_size: 32
    mean_value: 125.30691805
    mean_value: 122.95039414
    mean_value: 113.86538318
  }
  data_param {
    source: "/home/cifar10_pad4_train_lmdb"
    batch_size: 128
    backend: LMDB
  }
  # note: image_data_param is not read by a Data (LMDB) layer, so this shuffle setting has no effect
  image_data_param {
    shuffle: true
  }
}
Then, blob 'data' is fed into the first layer of the student network.
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  convolution_param {
    num_output: 16
    pad: 1
    kernel_size: 3
    stride: 1
    weight_filler {
      type: "msra"
    }
    bias_filler {
      type: "constant"
    }
  }
}
...
For a classification task, there is usually an InnerProduct layer which outputs score.
layer {
  name: "score"
  type: "InnerProduct"
  bottom: "pool_global"
  top: "score"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  inner_product_param {
    num_output: 10
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
In most cases, we use a SoftmaxWithLoss layer to compute the cross entropy loss between score and the ground truth label. For knowledge distillation, you can keep it and use a smaller loss_weight.
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  include {
    phase: TRAIN
  }
  bottom: "score"
  bottom: "label"
  top: "loss"
  loss_weight: 1
}
Similarly, we feed the data into the teacher network. Remember to freeze its weights.
layer {
  name: "conv1_teacher"
  type: "Convolution"
  bottom: "data"
  top: "conv1_teacher"
  param {
    lr_mult: 0
    decay_mult: 0
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  convolution_param {
    num_output: 32
    pad: 1
    kernel_size: 3
    stride: 1
  }
}
...
The teacher network also produces a score for classification. Here, we name the blob score_teacher. It corresponds to the term "soft label" or "soft target" in the reference paper.
layer {
  name: "score_teacher"
  type: "InnerProduct"
  bottom: "pool_global_teacher"
  top: "score_teacher"
  param {
    lr_mult: 0
    decay_mult: 0
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  inner_product_param {
    num_output: 10
  }
}
Finally, a KnowledgeDistillation layer computes the KL loss between score and score_teacher.
layer {
  name: "KD"
  type: "KnowledgeDistillation"
  bottom: "score"
  bottom: "score_teacher"
  top: "KL_loss"
  include { phase: TRAIN }
  knowledge_distillation_param { temperature: 4 }
  loss_weight: 1
}
Here is another way to implement it. Since the teacher network is fixed, we can compute and save the aforementioned score_teacher (in HDF5 format) beforehand. When we train the student network, we simply load score_teacher with an HDF5Data layer, and the teacher network is no longer included in the prototxt. This is more efficient. However, it behaves slightly differently if data augmentation is used, because the cached scores were computed on un-augmented inputs.
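For reference, loading the cached scores might look roughly like this (a sketch; the file path is a placeholder, and the HDF5 files are assumed to contain a dataset named score_teacher):
layer {
  name: "score_teacher"
  type: "HDF5Data"
  top: "score_teacher"
  include { phase: TRAIN }
  hdf5_data_param {
    # text file listing the .h5 files that hold the precomputed teacher scores
    source: "/home/score_teacher_list.txt"
    batch_size: 128
  }
}
Note that HDF5Data reads samples sequentially, so the cached scores must be stored in the same order in which the Data layer delivers the images.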
Thank you very much. I had the impression that we first train our teacher network on a dataset and calculate all the logits, then train our student model on the same dataset and calculate its logits, and then use these two vectors of logits for distillation. Looking at your example and explanation, it seems the teacher and student networks are trained in parallel, and pretraining them is not necessary, right?
Edit: I see your second comment, which clears everything up now. Thank you very much :)
@wentianli thanks for your training prototxt. I also see a training problem: if the teacher network is included in training, its complexity forces the batch size to be small. If we first save the outputs of the teacher network, training the student network could use a larger batch size, but this would require changing the input layers. Thanks for sharing; I will have a try.
@zhanglaplace you can use iter_size in solver.prototxt for a large batch size with limited GPU memory. Besides, training the teacher and student networks simultaneously can be called mutual learning, which is very tricky.
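A sketch of the relevant solver.prototxt fields (values are only illustrative): with batch_size: 32 in the data layer and iter_size: 4, gradients are accumulated over 4 forward/backward passes, giving an effective batch size of 128.
# solver.prototxt (illustrative values)
net: "train_val.prototxt"
iter_size: 4        # accumulate gradients over 4 passes -> effective batch = 4 x batch_size
base_lr: 0.1
momentum: 0.9
weight_decay: 0.0001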
@wentianli thanks
@wentianli you mentioned earlier that in the teacher network we should
change Dropout layer to Scale layer, etc
Why should this be changed?
@wentianli Hi, I saw that you use two loss layers, "SoftmaxWithLoss" and "KnowledgeDistillation", which both use score as a bottom. But in the code of Caffe's InnerProduct layer, it only uses the diff from one top blob, unlike the conv layer, which accumulates diffs. So the network may be trained using only one loss. Could you provide the training results using the prototxt above? Does the network work well?
@dawuchen Caffe automatically splits a blob when it is used twice. The diffs are thus accumulated.
@wentianli You are right. I was mistaken about the accumulate operation in the conv layer; it is for different kernels. Thanks.
My prototxt is like this:
input: "data"
layer {
  name: "data"
  type: "ImageData"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  transform_param {
    mirror: true
    mean_value: 127.5
    mean_value: 127.5
    mean_value: 127.5
    scale: 0.0078125
  }
  image_data_param {
    source: "cifarlist.txt"
    batch_size: 32
    new_width: 112
    new_height: 112
    is_color: true
    shuffle: true
  }
}
layer {
  bottom: "data"
  top: "conv1"
  name: "conv1"
  type: "Convolution"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  convolution_param {
    num_output: 32
    kernel_size: 3
    pad: 1
    stride: 1
    weight_filler { type: "msra" }
    bias_filler { type: "constant" value: 0 }
  }
}
...
layer {
  bottom: "pool_avg"
  top: "classifier"
  name: "classifier"
  type: "InnerProduct"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  inner_product_param {
    num_output: 10
    weight_filler { type: "msra" }
    bias_filler { type: "constant" value: 0 }
  }
}
layer {
  name: "softmax_loss1"
  type: "SoftmaxWithLoss"
  bottom: "classifier"
  bottom: "label"
  top: "softmax_loss1"
}
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "classifier"
  bottom: "label"
  top: "accuracy"
  include: { phase: TRAIN }
}
layer {
  bottom: "data"
  top: "conv1s"
  name: "conv1s"
  type: "Convolution"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  convolution_param {
    num_output: 16
    kernel_size: 3
    pad: 1
    stride: 1
    weight_filler { type: "msra" }
    bias_filler { type: "constant" value: 0 }
  }
}
...
layer {
  bottom: "pool_avgs"
  top: "classifiers"
  name: "classifiers"
  type: "InnerProduct"
  param { lr_mult: 1 decay_mult: 1 }
  param { lr_mult: 2 decay_mult: 0 }
  inner_product_param {
    num_output: 10
    weight_filler { type: "msra" }
    bias_filler { type: "constant" value: 0 }
  }
}
layer {
  name: "softmax_loss2"
  type: "SoftmaxWithLoss"
  bottom: "classifiers"
  bottom: "label"
  top: "softmax_loss2"
  loss_weight: 0.2
}
layer {
  name: "accuracys"
  type: "Accuracy"
  bottom: "classifiers"
  bottom: "label"
  top: "accuracys"
  include: { phase: TRAIN }
}
layer {
  name: "KL_loss"
  type: "KnowledgeDistillation"
  bottom: "classifiers" # student
  bottom: "classifier"  # teacher
  top: "KL_loss"
  include { phase: TRAIN }
  knowledge_distillation_param { temperature: 4 }
  loss_weight: 1
}
When I train it, the log warns: "KnowledgeDistillation Layer cannot backpropagate to soft label nor label inputs".
@qinxianyuzi the warning occurs because the second bottom doesn't receive any gradients. You can use propagate_down to stop backprop:
layer {
  name: "KL_loss"
  type: "KnowledgeDistillation"
  bottom: "classifiers" # student
  bottom: "classifier"  # teacher
  propagate_down: 1
  propagate_down: 0
  top: "KL_loss"
  include { phase: TRAIN }
  knowledge_distillation_param {
    temperature: 4
  }
  loss_weight: 1
}
Or freeze the teacher network, as said in #2.
@wentianli Thanks very much. Is it harder for the student to learn with a higher temperature?
@qinxianyuzi the optimal temperature is often between 2 and 10
@wentianli Thank you! Sometimes we should try different temperatures according to the training task.
hello, @wentianli when I have two teacher models (i.e., an ensemble), how should I arrange the logits of each teacher model? If I choose an averaging strategy to combine the two teachers, can I take the mean of their logits directly when training a student model?
@liangzimei Averaging logits is incorrect. The KL loss sums -p_i * log(q_i) (plus a constant) over every class i, where p_i is the probability the teacher produces for class i. When there are two teachers, this term becomes -0.5 * p1_i * log(q_i) - 0.5 * p2_i * log(q_i) (plus a constant), which means you need two KnowledgeDistillation layers.
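A sketch with two teachers could look like this (the blob names score_teacher1/score_teacher2 are placeholders; the 0.5 loss_weight on each layer plays the role of the averaging):
layer {
  name: "KD1"
  type: "KnowledgeDistillation"
  bottom: "score"          # student logits
  bottom: "score_teacher1" # first teacher's logits
  propagate_down: 1
  propagate_down: 0
  top: "KL_loss1"
  include { phase: TRAIN }
  knowledge_distillation_param { temperature: 4 }
  loss_weight: 0.5
}
layer {
  name: "KD2"
  type: "KnowledgeDistillation"
  bottom: "score"
  bottom: "score_teacher2" # second teacher's logits
  propagate_down: 1
  propagate_down: 0
  top: "KL_loss2"
  include { phase: TRAIN }
  knowledge_distillation_param { temperature: 4 }
  loss_weight: 0.5
}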
@wentianli thank you so much, I will have a try. Did you implement a softmax loss layer with temperature for training the teacher, or just use a Power layer? Thanks in advance.
@liangzimei I didn't use temperature when training a teacher. A Scale layer with fixed weights could solve that.
layer {
  name: "XXX"
  type: "Scale"
  bottom: "XXX"
  top: "XXX"
  param { lr_mult: 0 decay_mult: 0 }
  scale_param {
    filler { value: 0.5 } # here temperature = 2
    bias_term: false
  }
}
@wentianli ok, you mean that when training a teacher, temperature = 1 is used in most cases (including Hinton's paper)?
Can it be used in a regression model? For example, face alignment?
@liuqunzhong L1 or L2 loss is used for regression. The knowledge distillation layer is implemented for classification.
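(If you do want to distill a regression model anyway, one option, not specific to this repo, is an ordinary L2 loss between the student's and the frozen teacher's outputs; a sketch with placeholder blob names:)
layer {
  name: "distill_l2"
  type: "EuclideanLoss"
  bottom: "output_student"  # placeholder: student's regression output
  bottom: "output_teacher"  # placeholder: frozen teacher's regression output
  propagate_down: 1
  propagate_down: 0
  top: "l2_distill_loss"
  loss_weight: 1
}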
hello, @wentianli when I train a student model (i.e., mobilenet-v1) taught by an ensemble model (2 models, one of them mobilenet-v1), the student's accuracy always falls between the two teachers. Any suggestions? Thanks in advance...
@liangzimei You mean the student model outperforms its counterpart but underperforms the ensemble teacher? That is expected. To obtain better accuracy, you probably need to replace the mobilenet-v1 in the ensemble with a better model.
Thanks for sharing. Why is KL divergence adopted for the loss instead of cross-entropy?
@WormCoder The only difference between KL divergence and cross entropy is a constant term, which doesn't affect backprop at all. When the student and the teacher have exactly the same outputs (this is our goal for training), the KL divergence becomes zero.
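Spelled out with the standard definitions (not from this thread), with p the teacher's softened distribution and q the student's:
\[
\mathrm{KL}(p\,\|\,q) \;=\; \sum_i p_i \log \frac{p_i}{q_i}
\;=\; \underbrace{-\sum_i p_i \log q_i}_{\text{cross entropy } H(p,q)} \;-\; \underbrace{\Big(-\sum_i p_i \log p_i\Big)}_{H(p)}
\]
Since H(p) depends only on the frozen teacher, its gradient with respect to the student's parameters is zero, so minimizing either objective gives the same updates, and the KL form conveniently reaches zero when q matches p.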
The student input is 24x24 and the teacher input is 48x48, with two datasets in the same data order. Is that OK?
@WormCoder yeah, it drops fast in the beginning.
@liangzimei hello, I want to know how to freeze the teacher model. If I prefer to set propagate_down: false, do I need to set weight decay = 0 in solver.prototxt? I mean, is it enough to freeze a model with the single parameter propagate_down?
@westnight when training the student model, we should freeze the teacher: set lr_mult: 0 and decay_mult: 0 in every layer of the teacher to avoid updating its parameters. BN layers may be different; you can refer to the previous replies.
@liangzimei thank you. To freeze the teacher model, I know that setting lr and weight decay to zero works, but I wonder whether using propagate_down is another way.
@westnight according to my understanding, it is. When I train a student model, I use both propagate_down: false and lr_mult: 0 to be safe.
Thank you for your great work. To compose the prototxt for training, should I write the teacher prototxt and student prototxt into a single file? If so, how do I initialize the 'teacher' with a pretrained caffemodel? After training, how do I save only the 'student' part of the model? Could you please send me a copy of your prototxt? Thank you so much.
@iamweiweishi Into a single file? Yes. Because the 'student' is randomly initialized, it is better to name the layers of the 'teacher' identically to the pretrained caffemodel, which allows you to load the pretrained model directly. To save a partial model, a convenient way is to save the model in HDF5 format, and then you can rename or delete some blobs. For an example prototxt, please see the comment above.
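For the HDF5 snapshot, the solver supports it directly; a sketch (the prefix path is a placeholder):
# solver.prototxt (sketch)
snapshot: 10000
snapshot_prefix: "snapshots/student"  # placeholder path
snapshot_format: HDF5                 # weights are written as .caffemodel.h5, which is easy to edit offline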
Thank you. @wentianli It works now.
@wentianli I'm sorry to bother you. I still don't know how to initialize the 'teacher' with a pretrained caffemodel. I have written the teacher prototxt and student prototxt into a single file, but I don't know how to use teacher.caffemodel when I start training the student model. You say "it is better to name the layers of 'teacher' identical with the pretrained caffemodel, which allows you to directly load the pretrained model." I have already named the layers of the 'teacher' identically to the pretrained caffemodel (teacher.caffemodel), but it doesn't load the pretrained model. Should I change the name of teacher.caffemodel, and where should I put it? Should teacher.caffemodel and the final student.caffemodel go in the same place? Thank you so much.
@iamweiweishi I'm sorry to bother you. I still don't know how to initialize the 'teacher' with a pretrained caffemodel. I have written the teacher prototxt and student prototxt into a single file, but I don't know how to use teacher.caffemodel when I start training the student model. I have already named the layers of the 'teacher' identically to the pretrained caffemodel (teacher.caffemodel), but it doesn't load the pretrained model. Should I change the name of teacher.caffemodel, and where should I put it? Should teacher.caffemodel and the final student.caffemodel go in the same place? Thank you so much.
@hito0512 Sorry to bother you. Did you solve this problem in the end? I have the same trouble.
L1 or L2 loss is used for regression
What about distillation with an L1 or L2 loss?
@hito0512 Sorry for disturbing you. I need to set up training in the fashion described above. Could you please share the procedure if you have found it? Thank you.
Hi wentianli
I've been testing the knowledge distillation method for a while by playing with Caffe's available layers, and I was able to achieve reasonably good results with some simple models. A couple of days ago I came across your layer; I examined the source code and it looks like a good implementation to me. Now I'm trying to use your layer to improve the accuracy of GoogleNet (using a ResNet model as the teacher). I wanted to ask for any tips you might have about this process, about tuning hyperparameters like the loss weights, the solver type, the learning rate, etc.
I appreciate any help greatly.