zhang-haojie / wesam

[CVPR 2024] Code for "Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation"

torch.cuda.OutOfMemoryError: CUDA out of memory #22

zhangnanyue closed this issue 2 months ago

zhangnanyue commented 3 months ago

Hello, I am training the huge model on 8x RTX 3090 GPUs with batch_size=1, using my own dataset (image size 960x768), and I run into the following error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 3; 23.70 GiB total capacity; 21.89 GiB already allocated; 116.56 MiB free; 22.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I set os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128' at the beginning of the program, but it did not solve the problem.
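For reference, a minimal sketch (not from this repo) of how this allocator option is usually applied. The caching allocator only reads PYTORCH_CUDA_ALLOC_CONF when it is first initialized, so the variable has to be set before the first CUDA allocation, or exported in the shell before launching the script:

```python
# Sketch: set the allocator option before any CUDA memory is allocated.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # import after setting the variable to be safe

x = torch.zeros(1, device="cuda")   # the caching allocator is initialized on this first allocation
print(torch.cuda.memory_reserved()) # reserved memory now reflects the configured allocator
```

Equivalently, from the shell: `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python train.py` (train.py is a placeholder for the actual entry point). Note that max_split_size_mb only mitigates fragmentation; in the traceback above, reserved memory (22.27 GiB) is close to allocated memory (21.89 GiB), so the card is genuinely full rather than fragmented.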

In a previous issue, I saw your reply saying that training is possible with 4x RTX 3090.

Is something wrong with my settings or parameters? How can I solve this problem?

hiyyg commented 3 months ago

Buy a bigger GPU.

zhangnanyue commented 3 months ago

Buy a bigger GPU.

My understanding is that the other GPUs still have free memory, so a single card should not run out. Why does it insist on allocating memory on one card? I don't quite understand this; could you explain why?

zhang-haojie commented 3 months ago

In a previous quick validation, training the vit_h model required over 40 GB of memory even with a batch size of 1. This is because the batch size set in the config is the per-GPU batch size; the other GPUs will not help.
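To illustrate the point, here is a minimal data-parallel sketch (not the repo's actual training loop; function and variable names are illustrative). With DistributedDataParallel, every rank holds a full copy of the model and consumes its own batch of the configured size, so adding GPUs spreads the data but does not reduce the memory needed on each card:

```python
# Minimal DDP sketch: each rank gets a full model replica and its own per-GPU batch.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(rank, world_size, dataset, build_model, batch_size=1):
    # assumes MASTER_ADDR / MASTER_PORT are set in the environment
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = build_model().cuda(rank)        # the full ViT-H replica lives on this one GPU
    model = DDP(model, device_ids=[rank])   # DDP synchronizes gradients; it does not shard weights

    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)  # batch_size is per rank

    for images, targets in loader:
        # The forward/backward pass needs memory for the whole model, its activations,
        # gradients, and optimizer state on this single card, regardless of world_size.
        ...
```

Sharded approaches such as FSDP change this trade-off, but plain data parallelism does not.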

zhangnanyue commented 3 months ago

In a previous quick validation, training the vit_h model required over 40 GB of memory even with a batch size of 1. This is because the batch size set in the config is the per-GPU batch size; the other GPUs will not help.

Thank you for your reply. If I use two 3090 GPUs connected via NVLink, can I train the 'vit_h' model with a batch size of 1?

zhang-haojie commented 3 months ago

Sorry, I haven't used NVLink. If the available GPU memory can reach 40 GB, that should work; alternatively, you can consider reducing max_nums in the config or changing the number of LoRA layers.
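As a purely hypothetical illustration of those two knobs (the real key names, defaults, and what max_nums controls are defined in this repo's config and may differ from the guesses below): a smaller max_nums bounds the per-image workload, and a smaller LoRA rank or fewer adapted layers shrinks the adapter parameters and their activations.

```python
# Hypothetical sketch only; key names other than max_nums are illustrative.
from types import SimpleNamespace

cfg = SimpleNamespace(
    max_nums=32,       # lower than the default to bound per-image memory (assumption)
    lora=SimpleNamespace(
        rank=2,        # smaller low-rank dimension
        num_layers=2,  # adapt fewer transformer blocks
    ),
)
```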

zhangnanyue commented 3 months ago

Sorry, I haven't used NVLink. If the available GPU memory can reach 40 GB, that should work; alternatively, you can consider reducing max_nums in the config or changing the number of LoRA layers.

Thank you for your suggestion.