Hi @kai-xie , I don't understand. If you keep the same hyper-parameters for the conv layers in the second phase, wouldn't the algorithm keep pruning those layers? By pruning the layers separately, I meant not further pruning or splicing the conv layers while pruning the fully connected layers (there are a bunch of ways to do this, but you can simply set `iter_stop` to zero or a negative number for those layers). Also, I didn't really encounter the learning rate problem you did, but the pruning rates do change during training (obviously not to 100% or 0%), and that's how the algorithm works. As for your case, I think you can first try what I said, and maybe larger `c_rate` values for the fully connected layers, to see whether the pruning still fails.
@yiwenguo Thanks for your reply! I will try again to see how it works.
It worked when training the `conv` and `ip` layers separately by controlling `iter_stop`. Thank you very much! @yiwenguo
Hi @kai-xie , I have some concerns:
Thanks.
@HaiPhan1991

1. Pruning (fine-tuning) the `conv` and `ip` layers to compress the network, as described in the README:
   - First, set `iter_stop` of the `conv` layers to `max_iter`, and that of the `ip` layers to 0 or a negative value (a zero or negative `iter_stop` means no pruning). Then start training.
   - Next, set `iter_stop` of the `conv` layers to 0 or negative, and that of the `ip` layers to `max_iter`. Then start training.

   Tips: for a proof of the pruning concept, don't set `c_rate` too large; [-1, 1] would be a good range to start with, or try it out on mnist first.

2. Check the pruning rate.

   I used python scripts to do this. The python API is not provided in this repo, but you can work around that by compiling `caffe.proto` manually, then extracting the weight/bias and mask blobs with your own python scripts (a small sketch follows below).
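For example, a minimal sketch of such a script (this assumes the compressed layers append their mask blob right after the weight and bias blobs; double-check the blob order against the layer code in your copy of the repo):

```python
# Sketch: estimate per-layer pruning rate from the mask blobs of a DNS caffemodel.
import caffe_pb2  # generated beforehand with: protoc --python_out=. caffe.proto

net = caffe_pb2.NetParameter()
with open('dns_iter_10000.caffemodel', 'rb') as f:
    net.ParseFromString(f.read())

total = kept = 0
for layer in net.layer:           # use net.layers instead for the old V1 layer format
    if len(layer.blobs) < 3:      # skip layers that carry no mask blobs
        continue
    mask = layer.blobs[2].data    # weight mask (assumed blob order)
    nonzero = sum(1 for m in mask if m != 0)
    total += len(mask)
    kept += nonzero
    print('%s: %.2f%% of weights kept' % (layer.name, 100.0 * nonzero / len(mask)))
if total:
    print('overall: %.2f%% of weights kept' % (100.0 * kept / total))
```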
I am also trying to apply DNS to a newer version of caffe, so that the python API can be used. Here is my repo. In my version, after compiling caffe and pycaffe, prepare your compressed DNS caffemodel and run the following command from your CAFFE_ROOT (make sure you have set the `CAFFE_ROOT` environment variable, which is the directory of your caffe folder):
python compression_scripts/dns_to_normal.py <dns.prototxt> <dns_model.caffemodel> <target.prototxt> <output_target.caffemodel>
The compression rate should be shown on the screen, and the `output_target.caffemodel` should have the same size as a normal caffemodel (about 1/2 of the `dns_model.caffemodel`), which can then be tested.
e.g.
python compression_scripts/dns_to_normal.py examples/mnist/dns_train_val.prototxt examples/mnist/dns_iter_10000.caffemodel examples/mnist/mnist_train_val.prototxt examples/mnist/mnist_test.caffemodel
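Conceptually, the conversion just applies the masks to the weights and drops the mask blobs, which is also why the output is roughly half the size. A much-simplified sketch of the idea using pycaffe (not the actual script; it assumes the compressed layers store [weights, bias, weight mask, bias mask] in that order, and that the target prototxt uses plain layers with the same names):

```python
import caffe

# Source: DNS definition + compressed weights; target: a normal network definition.
dns_net = caffe.Net('examples/mnist/dns_train_val.prototxt',
                    'examples/mnist/dns_iter_10000.caffemodel', caffe.TEST)
out_net = caffe.Net('examples/mnist/mnist_train_val.prototxt', caffe.TEST)

for name, blobs in dns_net.params.items():
    if name not in out_net.params:
        continue
    if len(blobs) >= 4:  # assumed order: weights, bias, weight mask, bias mask
        out_net.params[name][0].data[...] = blobs[0].data * blobs[2].data
        out_net.params[name][1].data[...] = blobs[1].data * blobs[3].data
    else:                # layers that were never compressed: copy as-is
        for i, b in enumerate(blobs):
            out_net.params[name][i].data[...] = b.data

out_net.save('examples/mnist/mnist_test.caffemodel')
```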
My repo is still under development, so it is a little messy with files and folders, but it works fine with a small pruning rate (i.e. a small `c_rate`); it can still be buggy with a large `c_rate`. I'm still working on it.
Hope this helps.
Awesome! Thank you for your detailed instructions; they're really helpful. I am working on the ImageNet dataset, and I hope it works well.
Hi @kai-xie , I have been working on DNS recently. For Problem 3 you pointed out, did the constant setting of mu and std work in the end? I update `mu` and `std` every iteration, and I find the pruning rarely changes between iterations.
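In case it helps, this is roughly what my per-iteration update looks like in Python form (only a sketch; I'm assuming the prune/splice thresholds are on the order of 0.9x and 1.1x of mu + c_rate * std, as in the released layer code, and that mu/std are taken over the magnitudes of the currently surviving weights):

```python
import numpy as np

def dns_mask_update(weights, mask, c_rate):
    # Recompute mu and std every iteration from the weights that are
    # currently "alive" (mask == 1), instead of caching them at iteration 0.
    alive = np.abs(weights[mask > 0])
    if alive.size == 0:
        return mask
    mu, std = alive.mean(), alive.std()
    thr = max(mu + c_rate * std, 0.0)
    new_mask = mask.copy()
    new_mask[np.abs(weights) < 0.9 * thr] = 0  # prune small weights
    new_mask[np.abs(weights) > 1.1 * thr] = 1  # splice large weights back in
    return new_mask
```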
Hi Yiwen,
I was trying to prune AlexNet but have made only a little progress. Would you please share the detailed training process and hyperparameters?
The training tricks you provided in #12 are very useful, but I still cannot reproduce the results in your paper. Here are some problems I've encountered during the pruning process:
Problem 1. `conv` layers are easy to prune but `ip` layers are not.

Let's name the layers in AlexNet to be pruned as `conv1`, `conv2`, `conv3`, `conv4`, `conv5`, `ip1`, `ip2`, and `ip3`. The `conv` layers and `ip` layers are divided into different pruning groups, so I trained `conv1` to `conv5` (type: "CConvolution") together in the first place, while leaving `ip1` to `ip3` as normal inner-product layers (type: "InnerProduct"). This step was successful.

But when I move on to fine-tune and prune the `ip1` layer (hyperparameters of the `CConvolution` layers are kept the same, and `ip2` and `ip3` are still of "InnerProduct" type), it does not converge, and the accuracy does not increase either before the loss blows up (the loss always blows up to 87.3365; I don't know why it is this number).
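For reference, this is how I double-check which layers are in the compressed type before each phase (a small sketch using the compiled `caffe.proto`; the prototxt file name here is just a placeholder for my AlexNet definition):

```python
from google.protobuf import text_format
import caffe_pb2  # generated from caffe.proto with protoc

net = caffe_pb2.NetParameter()
with open('dns_train_val.prototxt') as f:
    text_format.Merge(f.read(), net)

# Print each conv/ip layer and its type, so it is obvious which layers
# will actually be pruned/spliced in this phase.
for layer in net.layer:
    if 'Convolution' in layer.type or 'InnerProduct' in layer.type:
        print(layer.name, layer.type)  # e.g. "conv1 CConvolution", "ip1 InnerProduct"
```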
Problem 2. Hyperparameters.

c_rate: I used [-0.7, 0.5] for the `c_rate` of the `conv` layers (a negative `c_rate` for the first `conv` layer) and [0.5, 1.6] for the `ip` layers when pruning and fine-tuning, so that each layer's pruning rate is almost the same as circled in the following picture.

learning rate (`lr`): 10^-5 to 10^-6. A larger `lr` makes the loss blow up very quickly. Sometimes even 10^-5 only lasts for a few hundred to a few thousand iterations (batch size 1024) before the loss blows up.

With the above parameters and the training process in Problem 1, the result does not converge. I think `lr` and `c_rate` are the two most important hyperparameters during the pruning process. Is my understanding correct?

Problem 3. Pruning rate degradation.

In order to get converged results, I used a smaller `c_rate` for the `ip` layers so that around 40% of the total parameters are kept. Now the result converges and the accuracy is around 56%-57%. But I found that if I compare the caffemodel snapshots from the early stage of pruning with those from the late stage, around 40% of the total parameters are kept in the early-stage model, while 100% of the total parameters are kept in the late-stage model, which means the pruning fails.

I think this is because `mu` and `std` are only calculated in the first iteration. After tens of thousands of iterations, `mu` and `std` have changed dramatically. Do you think this could be the reason for this problem?

Thank you very much for your patience! It would be of great help if a more detailed training process could be offered.
Thanks!