weiaicunzai / pytorch-cifar100

Practice on cifar100 (ResNet, DenseNet, VGG, GoogleNet, InceptionV3, InceptionV4, Inception-ResNetv2, Xception, Resnet In Resnet, ResNext, ShuffleNet, ShuffleNetv2, MobileNet, MobileNetv2, SqueezeNet, NasNet, Residual Attention Network, SENet, WideResNet)

CUDA out of memory problem #31

Open · monkeyDemon opened this issue 4 years ago

monkeyDemon commented 4 years ago

It seems some of the nets defined in models have hidden bugs. For example, when I use SENet I get a CUDA out of memory error, even though my batch_size is only 64 and my GPU has 11 GB of memory.

But when I use the model file at https://github.com/moskomule/senet.pytorch/tree/master/senet it only occupies 7 GB of memory with batch_size=90.

I find that senet.py, resnext.py, and inceptionv4.py all have a similar problem; there may be more models affected.
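
A quick way to compare peak GPU memory between two implementations is PyTorch's built-in memory counters. The sketch below is a hypothetical measurement script (it assumes this repo's models/senet.py exposes a seresnet152 constructor and that a CUDA device is available), not part of the repository:

    import torch
    from models.senet import seresnet152  # assumed constructor name in this repo's models/

    net = seresnet152().cuda()
    images = torch.randn(64, 3, 32, 32).cuda()   # one CIFAR-100-sized batch of 64

    torch.cuda.reset_peak_memory_stats()
    loss = net(images).sum()      # dummy scalar loss, just to run a backward pass
    loss.backward()
    print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1024**2:.0f} MiB")

Running the same script against the other SENet implementation gives a like-for-like comparison of peak activation memory.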

bokveizen commented 4 years ago

True, so I am only using resnet and VGG

weiaicunzai commented 4 years ago

I've just updated my code and fixed this bug.

I've tested the updated code on Google Colab with Python 3.6, PyTorch 1.6, and a K80 GPU:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   45C    P8    31W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Here is the output during training (seresnet152, batch_size=64). We can see that the GPU memory consumption is 7832 MB; you could try it yourself. If you get a different result, please let me know, thanks. @monkeyDemon @bokveizen

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  772034 KB |    6276 MB |   37813 GB |   37812 GB |
|       from large pool |  438784 KB |    5926 MB |   37354 GB |   37354 GB |
|       from small pool |  333250 KB |     480 MB |     458 GB |     458 GB |
|---------------------------------------------------------------------------|
| Active memory         |  772034 KB |    6276 MB |   37813 GB |   37812 GB |
|       from large pool |  438784 KB |    5926 MB |   37354 GB |   37354 GB |
|       from small pool |  333250 KB |     480 MB |     458 GB |     458 GB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    7832 MB |    7832 MB |    7832 MB |       0 B  |
|       from large pool |    7350 MB |    7350 MB |    7350 MB |       0 B  |
|       from small pool |     482 MB |     482 MB |     482 MB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |  354366 KB |    1425 MB |   19691 GB |   19690 GB |
|       from large pool |  351744 KB |    1423 MB |   19197 GB |   19197 GB |
|       from small pool |    2622 KB |      33 MB |     493 GB |     493 GB |
|---------------------------------------------------------------------------|
| Allocations           |    2940    |    3808    |    6708 K  |    6705 K  |
|       from large pool |     141    |     549    |    2429 K  |    2429 K  |
|       from small pool |    2799    |    3414    |    4278 K  |    4275 K  |
|---------------------------------------------------------------------------|
| Active allocs         |    2940    |    3808    |    6708 K  |    6705 K  |
|       from large pool |     141    |     549    |    2429 K  |    2429 K  |
|       from small pool |    2799    |    3414    |    4278 K  |    4275 K  |
|---------------------------------------------------------------------------|
| GPU reserved segments |     499    |     499    |     499    |       0    |
|       from large pool |     258    |     258    |     258    |       0    |
|       from small pool |     241    |     241    |     241    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |     106    |     121    |    3384 K  |    3384 K  |
|       from large pool |      44    |      83    |    1058 K  |    1058 K  |
|       from small pool |      62    |      77    |    2325 K  |    2325 K  |
|===========================================================================|
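
For reference, a summary like the one above can be printed at any point with PyTorch's torch.cuda.memory_summary(); a minimal sketch (assuming a CUDA device and some batches already run), not the exact training script:

    import torch

    # ... inside or right after the training loop, once some batches have run ...
    print(torch.cuda.memory_summary(device=0, abbreviated=False))

Note that "GPU reserved memory" is what PyTorch's caching allocator has claimed from the driver, while "Allocated memory" is what live tensors actually occupy, which is why the reserved figure (7832 MB) is higher than the peak allocated figure (6276 MB).
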
ShaoZeng commented 3 years ago

I found that mobilenet.py has a similar problem; it occupies more GPU memory than expected. Can you check it? Thanks!

Vickeyhw commented 3 years ago

I found that googlenet.py also occupies so much GPU memory that when I train it on the ImageNet dataset, even 4 GPUs with 20 GB each are not enough.

weiaicunzai commented 3 years ago

Could you please tell me what your input image size and batch size are during training?

Vickeyhw commented 3 years ago

@weiaicunzai My input image size is 224x224. I tried setting the batch size to 128, 256, and 64, but none of them work.

weiaicunzai commented 3 years ago

Thanks, I will try to reproduce the bug you mentioned. My GPU server is currently down due to hardware problems and has been sent out for repair, so it might take a while, sorry.

weiaicunzai commented 3 years ago

I use 3 downsampling stages in my GoogLeNet implementation, which results in larger feature maps during training; that is why memory consumption is so high. Fewer downsampling stages are beneficial for small inputs like 32x32. I added one more downsampling layer to my GoogLeNet implementation, and the GPU memory usage dropped from 14 GB to 7 GB during training on cifar100, but accuracy also dropped by about 2 percent. If you are going to train on large input images (224x224), you could use 5 downsampling stages, just as in the original paper, to further reduce memory usage without losing much network performance.
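
To illustrate the effect of one extra downsampling stage, here is a hypothetical stem (not the repository's actual googlenet.py): each additional stride-2 layer halves the height and width of every subsequent feature map, roughly quartering the activation memory they consume.

    import torch
    import torch.nn as nn

    stem_cifar = nn.Sequential(              # stem tuned for small 32x32 inputs
        nn.Conv2d(3, 64, kernel_size=3, padding=1),
        nn.BatchNorm2d(64),
        nn.ReLU(inplace=True),
    )
    stem_downsampled = nn.Sequential(        # same stem with one extra 2x downsampling
        nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(64),
        nn.ReLU(inplace=True),
    )

    x = torch.randn(64, 3, 224, 224)
    print(stem_cifar(x).shape)        # torch.Size([64, 64, 224, 224])
    print(stem_downsampled(x).shape)  # torch.Size([64, 64, 112, 112])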