This is a CPU implementation of knowledge distillation in Caffe.
This code is heavily based on softmax_loss_layer.hpp and softmax_loss_layer.cpp.
Please refer to the paper: Hinton, G., Vinyals, O., and Dean, J. Distilling the Knowledge in a Neural Network. 2015.
$CAFFE denotes your Caffe root directory, and $ROOT denotes your working directory for this repository.
cd $ROOT
git clone https://github.com/wentianli/knowledge_distillation_caffe.git
cp $ROOT/knowledge_distillation_layer.hpp $CAFFE/include/caffe/layers
cp $ROOT/knowledge_distillation_layer.cpp $CAFFE/src/caffe/layers
Modify $CAFFE/src/caffe/proto/caffe.proto: add an optional KnowledgeDistillationParameter field in LayerParameter,
message LayerParameter {
...
//next available layer-specific ID
optional KnowledgeDistillationParameter knowledge_distillation_param = 147;
}
and add the message KnowledgeDistillationParameter:
message KnowledgeDistillationParameter {
optional float temperature = 1 [default = 1];
}
Then rebuild Caffe so that the new layer and parameter are compiled in.
The KnowledgeDistillation layer has one layer-specific parameter: temperature.
The layer takes 2 or 3 input blobs:
bottom[0]: the logits of the student
bottom[1]: the logits of the teacher
bottom[2] (optional): the labels
The logits are first divided by the temperature T, then mapped to probability distributions over classes using the softmax function. The layer computes the KL divergence between the teacher and student distributions instead of the cross entropy. The gradients are multiplied by T^2, as suggested in the paper.
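For reference, the following is a minimal standalone C++ sketch of this computation, not the actual layer code: the logit values, the example temperature of 4, and the helper function name are made-up illustrations. It softens both sets of logits with T, applies softmax, computes KL(teacher || student), and scales the gradient w.r.t. the student logits by T^2.

```cpp
// Sketch only: forward loss and student-logit gradient for one example.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Softmax of logits divided by temperature T (numerically stabilized).
std::vector<double> softened_softmax(const std::vector<double>& logits, double T) {
  double max_logit = *std::max_element(logits.begin(), logits.end());
  std::vector<double> p(logits.size());
  double sum = 0.0;
  for (size_t i = 0; i < logits.size(); ++i) {
    p[i] = std::exp((logits[i] - max_logit) / T);
    sum += p[i];
  }
  for (size_t i = 0; i < p.size(); ++i) p[i] /= sum;
  return p;
}

int main() {
  const double T = 4.0;                           // temperature (illustrative)
  std::vector<double> student = {1.0, 2.0, 0.5};  // made-up student logits
  std::vector<double> teacher = {0.8, 3.0, 0.2};  // made-up teacher logits

  std::vector<double> p = softened_softmax(student, T);  // student distribution
  std::vector<double> q = softened_softmax(teacher, T);  // teacher distribution

  // Loss: KL(q || p), the KL divergence from the teacher to the student.
  double kl = 0.0;
  for (size_t i = 0; i < p.size(); ++i) kl += q[i] * std::log(q[i] / p[i]);
  std::printf("KL divergence: %f\n", kl);

  // Gradient w.r.t. student logit z_i is (p_i - q_i) / T; multiplying by T^2,
  // as suggested in the paper, gives T * (p_i - q_i).
  for (size_t i = 0; i < p.size(); ++i)
    std::printf("grad[%zu] = %f\n", i, T * (p[i] - q[i]));
  return 0;
}
```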
Example prototxt (2 input blobs are given):
layer {
name: "KD"
type: "KnowledgeDistillation"
bottom: "student_logits"
bottom: "taecher_logits"
top: "KL_div"
include { phase: TRAIN }
knowledge_distillation_param { temperature: 4 } #usually larger than 1
loss_weight: 1
}
Example prototxt (3 input blobs are given; samples whose label equals ignore_label are excluded from the loss):
layer {
name: "KD"
type: "KnowledgeDistillation"
bottom: "student_logits"
bottom: "taecher_logits"
bottom: "label"
top: "KL_div"
include { phase: TRAIN }
knowledge_distillation_param { temperature: 4 }
loss_param {ignore_label: 2}
loss_weight: 1
}