zylo117 / Yet-Another-EfficientDet-Pytorch

The PyTorch re-implementation of the official EfficientDet, with SOTA real-time performance and pretrained weights.
GNU Lesser General Public License v3.0

GPU memory during training #507

Open · eticin opened this issue 4 years ago

eticin commented 4 years ago

I am using 4 Tesla V100s with a batch size of 4 to train the D7x model, but I get a CUDA out of memory error. I also get a memory error for the D7 model. The largest model I can train with a single image per 32 GB GPU is D6. I think there may be a problem somewhere. Is this memory usage normal? Could you please share information about GPU usage during training? Note: I am trying to train the full model, not the head only.

zylo117 commented 4 years ago

What's your training command?

eticin commented 4 years ago

My training command for 2 GPUs and the D7 model is:

python train.py -c 7 -p y3_run6 -n 16 --batch_size 2 --lr 1e-6 \
    --load_weights /mnt/trains/users/botan/Yet-Another-EfficientDet-Pytorch/weights/pre/efficientdet-d7.pth \
    --num_epochs 200

The training command for D7x is similar.

Art200696 commented 4 years ago

Same here. I have the problem with D8.

python train.py -c 8 -p wtm -n 12 --batch_size 32 --lr 1e-5 --num_epochs 200 \
    --load_weights /data/wtm/efficient-det/weights/efficientdet-d8.pth \
    --data_path /data/wtm/efficient-det --optim adamw

Tesla V100-SXM2 x 4, NVIDIA driver 410.104, CUDA 10.2

EDIT: Retried with D7, D6, and D5 - same issue.

zylo117 commented 4 years ago

What about -n 0 --batch_size {num_gpu}?

rvandeghen commented 3 years ago

@yldrmBtn @Art200696 Have you found a solution? For me, the explanation is not so much the size of the network as the size of the input image. While debugging the memory consumption, I found that the backbone was responsible for about 80% of it. I think EfficientNet does not reduce the spatial dimensions fast enough, leaving many channels at large spatial resolutions.
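For reference, a rough sketch of the kind of per-block measurement I mean, using forward hooks (a hypothetical helper, not code from this repo; the model factory and the 1536 px D7-scale input below are only example assumptions):

import torch
import torch.nn as nn

def profile_forward_memory(model: nn.Module, x: torch.Tensor):
    # Run one forward pass with grad enabled and record how much CUDA memory
    # each top-level child (backbone, BiFPN, heads, ...) adds while it runs.
    model.cuda().train()
    x = x.cuda()
    report = []
    prev = torch.cuda.memory_allocated()

    def make_hook(name):
        def hook(module, inputs, output):
            nonlocal prev
            now = torch.cuda.memory_allocated()
            report.append((name, (now - prev) / 1024 ** 2))  # MB added by this block
            prev = now
        return hook

    handles = [m.register_forward_hook(make_hook(n)) for n, m in model.named_children()]
    model(x)  # forward only; the activations kept for backward dominate the usage
    for h in handles:
        h.remove()

    for name, mb in sorted(report, key=lambda r: r[1], reverse=True):
        print(f"{name:20s} {mb:8.1f} MB")

# Assumed usage for this repo (class name and arguments may need adjusting):
# from backbone import EfficientDetBackbone
# model = EfficientDetBackbone(num_classes=90, compound_coef=7)
# profile_forward_memory(model, torch.randn(1, 3, 1536, 1536))

The deltas are only approximate because the caching allocator rounds block sizes, but they are enough to see which part of the network holds most of the activations.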

zylo117 commented 3 years ago

@rvandeghen Actually, PyTorch is also to blame, because the cache of every op stays around forever. If you add torch.cuda.empty_cache() right after certain ops, the memory usage drops soon after.

For example, https://github.com/zylo117/Yet-Another-EfficientDet-Pytorch/blob/c533bc2de65135a6fe1d25ca437765c630943afb/efficientdet/model.py#L44
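Roughly, the idea looks like this in a generic depthwise-separable conv block (illustrative sketch only, not the repo's actual SeparableConvBlock):

import torch
import torch.nn as nn

class SeparableConvWithCacheClear(nn.Module):
    # Illustrative only: a depthwise-separable conv that hands the allocator's
    # cached-but-unused CUDA blocks back to the driver after each op.
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   padding=1, groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        x = self.depthwise(x)
        torch.cuda.empty_cache()  # releases cached blocks, but stalls the GPU
        x = self.pointwise(x)
        torch.cuda.empty_cache()
        return x

It lowers the reported memory, but every empty_cache() call forces a device synchronization, so training slows down noticeably, which is part of why it is not recommended.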

The solution is not to use PyTorch, lol. This issue does not happen with a static-graph framework like TF 1.x.

rvandeghen commented 3 years ago

@zylo117 I have seen you mention this trick in another issue, but PyTorch does not recommend using it. I have also discovered a small issue in the repo, though it might be intentional. How can I reach you to discuss it without opening a PR?

eticin commented 3 years ago

Actually, I did not look into the problem much further. I switched to some other, smaller models instead of solving it.

zylo117 commented 3 years ago

@rvandeghen Yes, this trick is not nice or elegant at all and I've never used it. I mentioned it just to show what that memory is used for. And we can discuss anything about the repo here.

fangyixiao18 commented 3 years ago

> @rvandeghen Yes, this trick is not nice or elegant at all and I've never used it. I mentioned it just to show what that memory is used for. And we can discuss anything about the repo here.

I also ran into the same problem: CUDA OOM while training D7x. I use the 32 GB V100. The command is:

python train.py -c 8 -p nuimages --batch_size 8 --lr 0.0005 --num_epochs 24 \
    --data_path /cache/ --load_weights ./pretrained_models/efficientdet-d8.pth

rvandeghen commented 3 years ago

@fangyixiao18 I think a batch size of 8 with EfficientDet-D7x is too big even for one V100. I suggest you debug the memory consumption starting from BS=1 and increase it for as long as it fits.
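Something like this is enough to probe it (untested sketch; build_model is a placeholder factory for whatever model you are training, and the 1536 px input size is only an example):

import torch

def try_batch_size(build_model, bs, img_size):
    # One full forward + backward at this batch size; returns peak GiB.
    model = build_model().cuda().train()
    x = torch.randn(bs, 3, img_size, img_size, device="cuda")
    out = model(x)
    outs = out if isinstance(out, (tuple, list)) else (out,)
    # Fake scalar loss so backward allocates gradients like real training would.
    loss = sum(o.float().sum() for o in outs if torch.is_tensor(o))
    loss.backward()
    return torch.cuda.max_memory_allocated() / 2 ** 30

def find_max_batch_size(build_model, img_size=1536, max_bs=8):
    last_ok = 0
    for bs in range(1, max_bs + 1):
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()
        try:
            peak = try_batch_size(build_model, bs, img_size)
            print(f"batch size {bs}: OK, peak {peak:.1f} GiB")
            last_ok = bs
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise
            print(f"batch size {bs}: CUDA OOM")
            break
    return last_ok

If BS=1 already OOMs, the input size or the compound coefficient has to come down.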

fangyixiao18 commented 3 years ago

> @fangyixiao18 I think a batch size of 8 with EfficientDet-D7x is too big even for one V100. I suggest you debug the memory consumption starting from BS=1 and increase it for as long as it fits.

Actually, I use 8 V100 GPUs, each with BS=1.