zhangnanyue closed this issue 2 months ago
Buy a GPU with more memory.
As I understand it, the other GPUs still have free memory, so a single card shouldn't run out of memory. Why does it insist on allocating memory on one card? I don't quite understand this; could you explain why?
In a previous quick validation, training the vit_h model required over 40 GB of memory even with a batch size of 1. This is because the batch size set in the config is the batch size per GPU; the other GPUs will not help.
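The per-GPU point above can be sketched as follows. This is a minimal illustration, not code from this repo; `global_batch_size` is a hypothetical helper:

```python
# Hedged sketch: in DDP-style multi-GPU training each rank holds its own
# replica of the model and its own batch, so the config's batch size is
# per GPU. Adding GPUs grows the global batch but does not reduce the
# memory needed on any single card.
def global_batch_size(per_gpu_batch: int, num_gpus: int) -> int:
    """Effective batch size seen by the optimizer under data parallelism."""
    return per_gpu_batch * num_gpus

# Eight cards with batch size 1 each: the optimizer sees a batch of 8,
# but every card still pays the full ~40 GB cost of one vit_h sample.
print(global_batch_size(1, 8))  # prints 8
```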
Thank you for your reply. If I use two 3090 GPUs connected via NVLink, can I train the 'vit_h' model with a batch size of 1?
Sorry, I haven't used NVLink. If the GPU memory can reach 40 GB, you can also consider reducing max_nums in the config or changing the number of LoRA layers.
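The two knobs mentioned above can be sketched as follows. This is a hypothetical illustration: the actual config keys and default values in this repo may differ, and `shrink_for_memory` is a made-up helper, not project code:

```python
# Hypothetical training config; real key names in the repo may differ.
config = {
    "batch_size": 1,     # per-GPU batch size
    "max_nums": 64,      # max prompts/instances sampled per image
    "lora_layers": 4,    # number of transformer blocks wrapped with LoRA
}

def shrink_for_memory(cfg: dict) -> dict:
    """Reduce the memory-hungry knobs: halve max_nums, drop LoRA layers.

    Fewer sampled instances shrink prompt-dependent activations; fewer
    LoRA-adapted layers shrink trainable parameters, gradients, and
    optimizer state.
    """
    out = dict(cfg)
    out["max_nums"] = max(1, cfg["max_nums"] // 2)
    out["lora_layers"] = max(1, cfg["lora_layers"] - 2)
    return out

smaller = shrink_for_memory(config)
print(smaller["max_nums"], smaller["lora_layers"])  # prints: 32 2
```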
Thank you for your suggestion
Hi author, I am training the huge model on 8x 3090 GPUs with batch_size=1 on my own dataset (image size 960x768), and I hit the following problem:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 3; 23.70 GiB total capacity; 21.89 GiB already allocated; 116.56 MiB free; 22.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I set os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128' at the start of the program, but it did not solve the problem.
In a previous issue I saw your reply that training works on 4x 3090s.
Is there something wrong with my settings or parameters? How can I solve this?
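One common pitfall with PYTORCH_CUDA_ALLOC_CONF (an assumption about what may be going wrong here, not a confirmed diagnosis): the variable is read when PyTorch initializes the CUDA caching allocator, so it must be in the environment before `import torch` runs, or be exported in the shell before launching the script. A minimal sketch:

```python
import os

# Set the allocator config first: it is read when PyTorch initializes the
# CUDA caching allocator, so assigning it after `import torch` (or after
# importing any module that itself imports torch) may have no effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

# Only now import torch and build the model (training code omitted).
```

Alternatively, set it in the shell when launching (`train.py` is a placeholder name): `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python train.py`. Note that even with reduced fragmentation, extra GPUs will not help if a single sample already exceeds the 24 GB of one 3090, since the batch size is per GPU.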