yvquanli / GLAM

Code for "An adaptive graph learning method for automated molecular interactions and properties predictions".
https://www.nature.com/articles/s42256-022-00501-8
MIT License

CUDA out of memory #10

Open StefanIsSmart opened 3 months ago

StefanIsSmart commented 3 months ago

While reproducing your work, I keep running into CUDA out-of-memory errors.

like this:

Traceback (most recent call last):
  File "run.py", line 62, in <module>
    trainer.train_and_test()
  File "/export/disk7/why/workbench/MERGE/v0/GLAM_repeat/GLAM/src_1gp_EmbeddingCompare/trainer.py", line 101, in train_and_test
    self.train()
  File "/export/disk7/why/workbench/MERGE/v0/GLAM_repeat/GLAM/src_1gp_EmbeddingCompare/trainer.py", line 83, in train
    trn_loss = self.train_iterations()
  File "/export/disk7/why/workbench/MERGE/v0/GLAM_repeat/GLAM/src_1gp_EmbeddingCompare/trainer.py", line 295, in train_iterations
    output = self.model(mol_batch).view(-1)
  File "/export/disk3/why/software/Miniforge3/envs/PyG252/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/export/disk7/why/workbench/MERGE/v0/GLAM_repeat/GLAM/src_1gp_EmbeddingCompare/model.py", line 54, in forward
    xm, hm = self.mol_conv(xm, data_mol.edge_index, data_mol.edge_attr, h=hm, batch=data_mol.batch)
  File "/export/disk3/why/software/Miniforge3/envs/PyG252/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/export/disk7/why/workbench/MERGE/v0/GLAM_repeat/GLAM/src_1gp_EmbeddingCompare/layer.py", line 259, in forward
    x = self.conv(x, edge_index, edge_attr)
  File "/export/disk3/why/software/Miniforge3/envs/PyG252/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/export/disk7/why/workbench/MERGE/v0/GLAM_repeat/GLAM/src_1gp_EmbeddingCompare/layer.py", line 131, in forward
    return self.conv(x, edge_index, edge_attr)
  File "/export/disk3/why/software/Miniforge3/envs/PyG252/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/export/disk7/why/workbench/MERGE/v0/GLAM_repeat/GLAM/src_1gp_EmbeddingCompare/layer.py", line 40, in forward
    return self.propagate(edge_index, x=x, edge_attr=edge_attr, size=size)
  File "/tmp/layer_TripletMessage_propagate_6_8b4kbw.py", line 194, in propagate
    out = self.message(
  File "/export/disk7/why/workbench/MERGE/v0/GLAM_repeat/GLAM/src_1gp_EmbeddingCompare/layer.py", line 49, in message
    alpha = (triplet * self.weight_triplet_att).sum(dim=-1)  # time consuming 12.14s
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 866.00 MiB (GPU 4; 23.64 GiB total capacity; 10.45 GiB already allocated; 31.56 MiB free; 10.81 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

How should I deal with this?
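For reference, the allocator message at the end of the traceback suggests setting max_split_size_mb. A minimal sketch of doing that from Python; the value 128 is only illustrative, and it only reduces fragmentation rather than the model's actual memory demand:

    import os

    # Must be set before the first CUDA allocation (safest: before importing torch).
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

    import torch  # torch now picks up the allocator setting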

StefanIsSmart commented 3 months ago

I modified the dimensions of your data so that I could train my own task with your framework. Nothing else was changed.

The way the code manages GPUs does not seem able to avoid this problem, so for the low-precision training stage I changed the scheduling: instead of the different seeds of one configuration sharing the same GPU, every seed now selects a GPU of its own, and the selection rule now skips to the next card whenever its free memory is below 24570 MiB (my card has 24576 MiB in total). In principle this is already a very strict requirement, essentially the ideal case of one task per card, yet it still runs out of memory.
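A minimal sketch of the kind of per-seed GPU check described above, querying free memory through nvidia-smi; the 24570 MiB threshold comes from the comment, and the helper name is only illustrative, not code from this repository:

    import subprocess

    def pick_free_gpu(min_free_mib=24570):
        """Return the index of the first GPU with at least min_free_mib MiB free,
        or None if no card qualifies (illustrative helper)."""
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
            text=True,
        )
        for idx, line in enumerate(out.strip().splitlines()):
            if int(line) >= min_free_mib:
                return idx
        return None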

In your work, when a sampled hyperparameter configuration runs into a GPU overflow, do you simply discard that sample by default? Or does your framework require probing the relationship between model size and GPU capacity in advance, so that it cannot be transferred to a new setup in a fully automated way?

StefanIsSmart commented 3 months ago

Also, in some training runs NaN appears (it seems the NaN shows up while computing metrics_fn, so the metric cannot be computed). Do you regard such runs as normal and simply discard them?
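For context, a common way to keep a metric computable when NaN values appear is to mask them out before scoring; a minimal sketch with illustrative names, not the repository's metrics_fn:

    import numpy as np

    def nan_safe_metric(y_true, y_pred, metric_fn):
        """Drop NaN targets/predictions before computing a metric (sketch)."""
        y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
        mask = ~(np.isnan(y_true) | np.isnan(y_pred))
        if mask.sum() == 0:
            return float("nan")  # nothing left to score
        return metric_fn(y_true[mask], y_pred[mask])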

StefanIsSmart commented 3 months ago

Sometimes, when the sampled batch size is 768, it runs out of memory very easily.

yvquanli commented 2 months ago

Yes, configurations with a batch size that large will certainly overflow, and they are usually just discarded. I suggest switching to an A100 or a similar GPU with larger memory.
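A minimal sketch of the discard-on-overflow behaviour described here, catching the torch.cuda.OutOfMemoryError shown in the traceback above; run_one_config is a hypothetical stand-in for a single training run, not a function from this repository:

    import torch

    def try_config(run_one_config, config):
        """Run one sampled configuration; skip it if the GPU overflows (sketch)."""
        try:
            return run_one_config(config)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # release cached blocks before the next sample
            print(f"OOM, discarding sampled config: {config}")
            return None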
