zhanghang1989 / PyTorch-Encoding

A CV toolkit for my papers.
https://hangzhang.org/PyTorch-Encoding/

Differences in Validation results when training in Pascal Context Dataset #343

Open · alexlopezcifuentes opened this issue 3 years ago

alexlopezcifuentes commented 3 years ago

Hi Hang Zhang!

First I want to thank you for the amazing repository.

I'm trying to train DeepLabV3 with a ResNeSt-101 backbone (DeepLab_ResNeSt101_PContext) for semantic segmentation on the Pascal Context dataset. The code runs without any issue; however, my results are still below the ones from the pre-trained model you provide at https://hangzhang.org/PyTorch-Encoding/model_zoo/segmentation.html :

| Model | Pix Accuracy | mIoU   |
| ----- | ------------ | ------ |
| Mine  | 79.1 %       | 52.1 % |
| Yours | 81.9 %       | 56.5 % |

I'm using the exact same hyperparameters as you, with the following training command:

```
python train.py --dataset pcontext --model deeplab --aux --backbone resnest101
```

Is there something I'm missing to reach your results? I assume your model is trained with the Auxiliary Loss but not the Semantic Encoding Loss. Are you perhaps using some extra pre-training data?

Thanks in advance!

Alex.

zhanghang1989 commented 3 years ago

Hi Alex,

Are you using a batch size of 16? This is very important.
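The total batch size matters largely because these models train with synchronized BatchNorm, whose mean/variance are computed over the whole cross-GPU batch; with a small effective batch those statistics get noisy and accuracy drops. A minimal illustrative sketch of the conversion, shown with the standard torch.nn.SyncBatchNorm for brevity (the repo ships its own implementation in encoding.nn.SyncBatchNorm):

```python
import torch
import torch.nn as nn

# Illustrative sketch, not the repo's code: SyncBatchNorm computes its
# statistics over the combined batch from all GPUs, so the effective
# normalization batch is the *total* batch size, not the per-GPU one.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(inplace=True),
)
# Swap every BatchNorm layer for its synchronized counterpart,
# as done for multi-GPU segmentation training.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
print(model)
```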

zhanghang1989 commented 3 years ago

Did you test the pretrained model using this script?

https://hangzhang.org/PyTorch-Encoding/model_zoo/segmentation.html#test-pretrained

alexlopezcifuentes commented 3 years ago

Hi!

Unfortunately, my GPU does not have enough memory to fit a batch size of 16, so I'm trying to simulate it with gradient accumulation. I suppose that is the main problem; I was asking in case I had missed something else.
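For reference, this is roughly what I mean by gradient accumulation; a hypothetical minimal loop with illustrative names, not the repo's train.py:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: 4 micro-batches of 4 images are accumulated before
# each optimizer step, so the *gradient* matches an effective batch of 16.
model = nn.Conv2d(3, 59, kernel_size=1)          # stand-in for the real network
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
accum_steps = 4                                  # 4 x 4 = effective batch 16

optimizer.zero_grad()
for step in range(100):
    image = torch.randn(4, 3, 64, 64)            # micro-batch of 4 (dummy data)
    target = torch.randint(0, 59, (4, 64, 64))   # Pascal Context, 59-class setting
    loss = criterion(model(image), target) / accum_steps  # keep gradient scale
    loss.backward()                              # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

The caveat is that this only reproduces the gradient of a batch of 16; the BatchNorm layers still see only 4 images at a time.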

I do use your testing script (https://hangzhang.org/PyTorch-Encoding/model_zoo/segmentation.html#test-pretrained).

So I assume the only problem is the batch size, which is a problem with nearly no solution...

zhanghang1989 commented 3 years ago

You may try the PyTorch checkpoint option (torch.utils.checkpoint), which reduces memory usage.
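A minimal sketch of the idea, assuming a plain Sequential stack; this uses the standard torch.utils.checkpoint API, not a repo-specific integration:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Illustrative sketch: activations inside each checkpointed segment are
# recomputed during the backward pass instead of being stored, trading
# compute for memory so a larger batch can fit on the same GPU.
backbone = nn.Sequential(*[nn.Conv2d(8, 8, 3, padding=1) for _ in range(16)])
x = torch.randn(2, 8, 32, 32, requires_grad=True)

out = checkpoint_sequential(backbone, 4, x)      # split into 4 segments
out.sum().backward()                             # activations recomputed here
```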

alexlopezcifuentes commented 3 years ago

Thanks for the suggestion. I tried it, and although it saves GPU memory, the final model performs worse than the one trained with the lower batch size.

Can I ask which GPU you used to train the model, and how much memory it had? I want to know approximately how much memory I'll need.

zhanghang1989 commented 3 years ago

For the experiments in the paper, I used an AWS EC2 p3dn.24xlarge instance with 8× 32 GB V100 GPUs, but that may not be necessary; 16 GB per GPU should be enough for most of the experiments.