winycg / CIRKD

[CVPR-2022] Official implementation of CIRKD: Cross-Image Relational Knowledge Distillation for Semantic Segmentation, with implementations on Cityscapes, ADE20K, COCO-Stuff, Pascal VOC and CamVid.

Reproduce results #5

Closed. Muyun99 closed this issue 2 years ago.

Muyun99 commented 2 years ago

Hi, thanks for your excellent work and the nice code style.

I tried to reproduce your results and found that the performance is unexpectedly high.

The student baseline reaches about 76.21 mIoU with the following script:

CUDA_VISIBLE_DEVICES=0,1,2,3 \
    python -m torch.distributed.launch --nproc_per_node=4 train_baseline.py \
    --model deeplabv3 \
    --backbone resnet18 \
    --data cityscapes \
    --batch-size 8 \
    --max-iterations 80000 \
    --save-dir work_dirs/deeplabv3_res18_baseline_bs8 \
    --log-dir work_dirs/deeplabv3_res18_baseline_bs8 \
    --pretrained-base pretrain/resnet18-imagenet.pth

I also tried training with only the vanilla KD loss using the following script, and got 76.61 mIoU (a sketch of what I mean by the KD loss follows the script):

CUDA_VISIBLE_DEVICES=0,1,2,3 \
    python -m torch.distributed.launch --nproc_per_node=4 \
    train_kd_onlykdloss.py \
    --teacher-model deeplabv3 \
    --student-model deeplabv3 \
    --teacher-backbone resnet101 \
    --student-backbone resnet18 \
    --data cityscapes \
    --batch-size 8 \
    --max-iterations 80000 \
    --save-dir work_dirs/deeplabv3_res18_OnlyKD_bs8 \
    --log-dir work_dirs/deeplabv3_res18_OnlyKD_bs8 \
    --teacher-pretrained pretrain/deeplabv3_resnet101_citys_best_model.pth \
    --student-pretrained-base pretrain/resnet18-imagenet.pth
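
For clarity, by "only KD loss" I mean standard pixel-wise logit distillation: the KL divergence between softened teacher and student class distributions at every pixel. The sketch below is just how I understand that loss; the function name and temperature are illustrative and not the exact code in train_kd_onlykdloss.py:

import torch.nn.functional as F

def pixelwise_kd_loss(student_logits, teacher_logits, temperature=1.0):
    # student_logits, teacher_logits: (N, C, H, W) segmentation logits.
    T = temperature
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    # Detach the teacher so no gradients flow into it.
    p_t = F.softmax(teacher_logits.detach() / T, dim=1)
    # Per-pixel KL over the class dimension, averaged over batch and space;
    # the T^2 factor keeps the gradient scale comparable across temperatures.
    kl = F.kl_div(log_p_s, p_t, reduction="none").sum(dim=1)
    return (T ** 2) * kl.mean()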

So I am confused by this performance: the baseline is unexpectedly high, which leaves almost no gap for KD (76.61 vs. 76.21 mIoU).

If you have time, could you reproduce the baseline result? I'd like to know whether some parameters are set incorrectly.

Looking forward to your reply. Thanks!

Best Regards! Yun

winycg commented 2 years ago

Hi, I don't think these results are wrong. I notice that you train the models for 80K iterations, which is longer than our 40K-iteration schedule. More training iterations lead to better performance, and the result may become saturated, so the benefit of KD may be diminished. To investigate the efficacy of KD clearly, I suggest a training recipe without very long iterations, so that you obtain an appropriate baseline. For example, some previous segmentation KD works, such as SKD and IFVD, even use a 512x512 crop size and 40K iterations with batch size 8, which yields a relatively low baseline of 69.10 mIoU.

Muyun99 commented 2 years ago

Hi, the reason I use 80K training iterations is that I use a batch size of 8.

I only have four GPUs and cannot run with batch size 16 following the original CIRKD setting, so I chose bs8/80K to be comparable with bs16/40K (see the quick sanity check below). Is that the right way to reproduce it?
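
Just to make the comparison explicit, the two schedules see the same total number of samples; that sample-budget equivalence is my assumption, and per-GPU batch statistics (e.g., SyncBN) and the LR schedule shape still differ:

paper_setting = 16 * 40_000   # bs16/40K -> 640,000 samples
my_setting = 8 * 80_000       # bs8/80K  -> 640,000 samples
assert paper_setting == my_setting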

However, as you say, longer training may saturate the result, so I will try to establish an appropriate baseline for a better comparison among KD methods.

Thanks for your reply and your nice codebase; it's really helpful for me when comparing other KD methods.

winycg commented 2 years ago

Thanks for your attention. From your experiments, it seems that bs8/80K may be better than bs16/40K. I think it is fine to reproduce all methods under your own training setup, as long as every method is trained with the same settings.