yoshitomo-matsubara / torchdistill

A coding-free framework built on PyTorch for reproducible deep learning studies. 🏆25 knowledge distillation methods presented at CVPR, ICLR, ECCV, NeurIPS, ICCV, etc. are implemented so far. 🎁 Trained models, training logs, and configurations are available to ensure reproducibility and benchmarking.
https://yoshitomo-matsubara.net/torchdistill/
MIT License

Possible re-implementation of KD w/ LS #478

Closed sunshangquan closed 3 months ago

sunshangquan commented 3 months ago

Hi @yoshitomo-matsubara, this is great work! I found your implementation of KD w/ LS in Benchmarks with this yaml, but I noticed that it uses a different setting from KD. So I trained it with your code using the command:

torchrun  --nproc_per_node=3 examples/torchvision/image_classification.py     --config configs/official/ilsvrc2012/yoshitomo-matsubara/rrpr2020/kd-ls-resnet18_from_resnet34.yaml     --run_log log/ilsvrc2012/kd-ls-resnet18_from_resnet34.log     --world_size 3     -adjust_lr

where the same batch_size and lr as KD were used. The final result is "Acc@1 71.3906 Acc@5 90.3722". The log file is kd-ls-resnet18_from_resnet34.log, and the weight and yaml files are available at Dropbox. During my training, the machine was running other tasks and was therefore much slower than yours. Is the comparison in the benchmarks fair? I would appreciate it if a further check could be done.

yoshitomo-matsubara commented 3 months ago

Hi @sunshangquan ,

Thank you for sharing the result!

The key differences between my and your yaml files are:

| param | mine | yours |
|---|---|---|
| (training) batch_size | 512 | 256 (actually 768, due to DDP with 3 GPUs) |
| lr | 0.2 | 0.1 (actually 0.3, due to DDP with 3 GPUs + `-adjust_lr`) |
| temperature | 2.0 | 1.0 |
| beta | 9 | 0.5 |
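For reference, the "actually 768" and "actually 0.3" entries above follow from the usual DDP bookkeeping. A minimal sketch of that arithmetic, assuming `-adjust_lr` applies linear scaling by the number of processes (consistent with the numbers in the table):

```python
# Illustrative arithmetic only: how the per-process values in the shared YAML
# turn into the effective values under DDP with 3 GPUs.
per_gpu_batch_size = 256   # batch_size in the shared YAML
base_lr = 0.1              # lr in the shared YAML
world_size = 3             # --nproc_per_node=3 / --world_size 3

effective_batch_size = per_gpu_batch_size * world_size  # 256 * 3 = 768
adjusted_lr = base_lr * world_size                      # 0.1 * 3 = 0.3 (with -adjust_lr)

print(effective_batch_size, adjusted_lr)  # 768 0.3
```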

Could you clarify what you mean by "I notice that it uses a different setting from KD."?

The hyperparameters in my YAML file (the "mine" column above) are basically the same as those described in your paper, code, and responses, whereas the hyperparameters in your yaml file are very different from what you reported in the paper and repository.
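For reference, here is a minimal sketch of one common way temperature and beta enter a KD-with-label-smoothing objective: a label-smoothed cross-entropy on the hard labels plus a beta-weighted, temperature-scaled KL term. This is only illustrative and not necessarily the exact loss implemented in torchdistill or in the paper:

```python
import torch
import torch.nn.functional as F

def kd_ls_loss(student_logits, teacher_logits, targets,
               temperature=2.0, beta=9.0, label_smoothing=0.1):
    """Illustrative KD-with-label-smoothing objective (generic sketch)."""
    # Hard-label term with label smoothing (supported in PyTorch >= 1.10).
    ce = F.cross_entropy(student_logits, targets, label_smoothing=label_smoothing)

    # Soft-label term: KL divergence between temperature-softened distributions,
    # scaled by T^2 as in Hinton et al.'s KD formulation.
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    kld = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

    return ce + beta * kld
```

Under such a formulation, temperature 1.0 and beta 0.5 weight the soft-label term very differently from temperature 2.0 and beta 9, so the two runs are not directly comparable.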

sunshangquan commented 3 months ago

Hi @yoshitomo-matsubara, thank you for your quick reply and attention! I mean that the reported 71.23 for "KD w/ LS" in the benchmark was obtained with a batch_size of 512 and an lr of 0.2, which differs from the setting used for the 71.37 of KD in the benchmark. I thought that using different batch_size and lr values for KD and KD w/ LS might be somewhat unfair, which is why I opened this issue.

As for temperature and beta, it was my mistake not to change them to 2.0 and 9. I will reopen this issue later if I run a new experiment. I would appreciate it if a further check could be made.

yoshitomo-matsubara commented 3 months ago

Basically, the benchmark table shows results reported in the original papers and those I reproduced using the hyperparameters reported in their papers and repositories.

To some extent, I agree that using different batch size and lr values may be unfair, but unfortunately that is what happens in the community: e.g., people tune the hyperparameters of their own methods while simply copy-pasting numbers reported in previous studies, or they do not describe how they obtained the results of the baseline methods.

P.S. Next time you have new results obtained with the same hyperparameter values as those in the paper, please use Discussions instead (e.g., https://github.com/yoshitomo-matsubara/torchdistill/discussions/474), and I will add the results, config, and model weights as official ones.

sunshangquan commented 3 months ago

I am sorry; I did not know that I should have used Discussions in this case. If possible, could you delete this issue? I will then post an update in Discussions.