zhang-haojie / wesam

[CVPR 2024] Code for "Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation"

torch.cuda.OutOfMemoryError: CUDA out of memory #22

zhangnanyue closed this issue 2 months ago

zhangnanyue commented 3 months ago

Hello, I am training the huge model on 8x RTX 3090 GPUs with batch_size=1, using my own dataset (image size 960x768), and I run into the following error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 3; 23.70 GiB total capacity; 21.89 GiB already allocated; 116.56 MiB free; 22.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I set os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128' at the beginning of the program, but it did not solve the problem.
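For reference, a minimal sketch (not from this repo) of how this allocator option is usually applied. The caching allocator only reads PYTORCH_CUDA_ALLOC_CONF when it is first initialized, so the variable has to be set before the first CUDA allocation, or exported in the shell before launching the script:

```python
# Sketch: set the allocator option before any CUDA memory is allocated.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # import after setting the variable to be safe

x = torch.zeros(1, device="cuda")   # the caching allocator is initialized on this first allocation
print(torch.cuda.memory_reserved()) # reserved memory now reflects the configured allocator
```

Equivalently, from the shell: `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python train.py` (train.py is a placeholder for the actual entry point). Note that max_split_size_mb only mitigates fragmentation; in the traceback above, reserved memory (22.27 GiB) is close to allocated memory (21.89 GiB), so the card is genuinely full rather than fragmented.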

In a previous issue, I saw your reply saying that training is possible with 4x RTX 3090.

Is something wrong with my settings or parameters? How can I solve this problem?

hiyyg commented 3 months ago

Buy a bigger GPU.

zhangnanyue commented 3 months ago

Buy a bigger GPU.

My understanding is that the other GPUs still have free memory, so a single card should not run out. Why does it insist on allocating memory on one card? I don't quite understand this; could you explain why?

zhang-haojie commented 3 months ago

In a previous quick validation, training the vit_h model required over 40 GB of memory even with a batch size of 1. This is because the batch size set in the config is the per-GPU batch size; the other GPUs will not help.
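To illustrate the point, here is a minimal data-parallel sketch (not the repo's actual training loop; function and variable names are illustrative). With DistributedDataParallel, every rank holds a full copy of the model and consumes its own batch of the configured size, so adding GPUs spreads the data but does not reduce the memory needed on each card:

```python
# Minimal DDP sketch: each rank gets a full model replica and its own per-GPU batch.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(rank, world_size, dataset, build_model, batch_size=1):
    # assumes MASTER_ADDR / MASTER_PORT are set in the environment
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = build_model().cuda(rank)        # the full ViT-H replica lives on this one GPU
    model = DDP(model, device_ids=[rank])   # DDP synchronizes gradients; it does not shard weights

    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)  # batch_size is per rank

    for images, targets in loader:
        # The forward/backward pass needs memory for the whole model, its activations,
        # gradients, and optimizer state on this single card, regardless of world_size.
        ...
```

Sharded approaches such as FSDP change this trade-off, but plain data parallelism does not.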

zhangnanyue commented 3 months ago

In a previous quick validation, training the vit_h model required over 40 GB of memory even with a batch size of 1. This is because the batch size set in the config is the per-GPU batch size; the other GPUs will not help.

Thank you for your reply. If I use two 3090 GPUs connected via NVLink, can I train the 'vit_h' model with a batch size of 1?

zhang-haojie commented 3 months ago

Sorry, I haven't used NVLink. If the available GPU memory can reach 40 GB, that should work; alternatively, you can consider reducing max_nums in the config or changing the number of LoRA layers.
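As a purely hypothetical illustration of those two knobs (the real key names, defaults, and what max_nums controls are defined in this repo's config and may differ from the guesses below): a smaller max_nums bounds the per-image workload, and a smaller LoRA rank or fewer adapted layers shrinks the adapter parameters and their activations.

```python
# Hypothetical sketch only; key names other than max_nums are illustrative.
from types import SimpleNamespace

cfg = SimpleNamespace(
    max_nums=32,       # lower than the default to bound per-image memory (assumption)
    lora=SimpleNamespace(
        rank=2,        # smaller low-rank dimension
        num_layers=2,  # adapt fewer transformer blocks
    ),
)
```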

zhangnanyue commented 3 months ago

Sorry, I haven't used NVLink. If the available GPU memory can reach 40 GB, that should work; alternatively, you can consider reducing max_nums in the config or changing the number of LoRA layers.

Thank you for your suggestion.