StefanIsSmart opened 3 months ago
I changed the dimensions of your data, hoping to train my own task with your framework. Nothing else was modified.
The code's GPU management does not seem able to avoid this problem, so I changed the low-precision training setup from "different seeds of one configuration share a single GPU" to "each seed must get its own GPU", and I changed the GPU-selection condition so that any card with less than 24570 MiB free is skipped (my card has 24576 MiB total). In principle this is already very strict, essentially one job per card in the ideal case, yet I still get out-of-memory errors.
In your work, when a sampled configuration causes a GPU memory overflow, do you simply discard that sample by default? Or does your framework require probing the relationship between model size and GPU capacity in advance, so that it cannot be migrated fully automatically?
Also, in some training runs NaN appears (probably a NaN arising inside metrics_fn that makes the metric uncomputable). Do you treat that as normal and just discard the run?
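For illustration, the "discard a trial whose metrics come back NaN" behavior asked about above might look like the following minimal sketch. The `evaluate_trial` helper and the lambda `metrics_fn`s are hypothetical stand-ins for this sketch, not GLAM's actual code:

```python
import math

def evaluate_trial(metrics_fn, predictions, targets):
    """Run metrics_fn and decide whether to keep the trial.

    Returns (metrics, keep): keep is False when any metric is NaN,
    so the caller can simply discard that sampled configuration.
    """
    metrics = metrics_fn(predictions, targets)
    keep = all(not math.isnan(v) for v in metrics.values())
    return metrics, keep

# Hypothetical usage: a metrics_fn that yields NaN (e.g. AUC computed on a
# single-class batch) marks the trial for discarding; a finite one is kept.
_, keep_nan = evaluate_trial(lambda p, t: {"auc": float("nan")}, [1], [0])
_, keep_ok = evaluate_trial(lambda p, t: {"auc": 0.9}, [1], [0])
```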
Sometimes when the sampler picks a batch size of 768, it very easily runs out of memory.
Yes, configurations with a batch size that large will definitely overflow; we usually just discard them directly. I'd suggest switching to an A100 or a similar GPU with more memory.
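The discard-on-overflow behavior described above can be sketched as follows. In a real PyTorch run the exception caught would be `torch.cuda.OutOfMemoryError` (a subclass of `RuntimeError`); a plain stand-in class is used here so the sketch runs without a GPU, and `search`/`fake_trainer` are hypothetical names, not the framework's API:

```python
class OutOfMemoryError(RuntimeError):
    """Stand-in for torch.cuda.OutOfMemoryError in this sketch."""

def search(configs, train_one_config):
    """Try each sampled configuration; discard those that overflow GPU memory."""
    results, discarded = [], []
    for cfg in configs:
        try:
            results.append((cfg, train_one_config(cfg)))
        except OutOfMemoryError:
            # Large batch sizes (e.g. 768) tend to overflow a 24 GiB card;
            # the run is simply dropped and the search moves on.
            discarded.append(cfg)
    return results, discarded

def fake_trainer(cfg):
    if cfg["batch_size"] >= 768:
        raise OutOfMemoryError("CUDA out of memory")
    return 0.5  # pretend validation score

results, discarded = search(
    [{"batch_size": 128}, {"batch_size": 768}], fake_trainer
)
```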
When reproducing your work, I constantly run into CUDA out of memory, like this:
```
Traceback (most recent call last):
  File "run.py", line 62, in <module>
    trainer.train_and_test()
  File "/export/disk7/why/workbench/MERGE/v0/GLAM_repeat/GLAM/src_1gp_EmbeddingCompare/trainer.py", line 101, in train_and_test
    self.train()
  File "/export/disk7/why/workbench/MERGE/v0/GLAM_repeat/GLAM/src_1gp_EmbeddingCompare/trainer.py", line 83, in train
    trn_loss = self.train_iterations()
  File "/export/disk7/why/workbench/MERGE/v0/GLAM_repeat/GLAM/src_1gp_EmbeddingCompare/trainer.py", line 295, in train_iterations
    output = self.model(mol_batch).view(-1)
  File "/export/disk3/why/software/Miniforge3/envs/PyG252/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/export/disk7/why/workbench/MERGE/v0/GLAM_repeat/GLAM/src_1gp_EmbeddingCompare/model.py", line 54, in forward
    xm, hm = self.mol_conv(xm, data_mol.edge_index, data_mol.edge_attr, h=hm, batch=data_mol.batch)
  File "/export/disk3/why/software/Miniforge3/envs/PyG252/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/export/disk7/why/workbench/MERGE/v0/GLAM_repeat/GLAM/src_1gp_EmbeddingCompare/layer.py", line 259, in forward
    x = self.conv(x, edge_index, edge_attr)
  File "/export/disk3/why/software/Miniforge3/envs/PyG252/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/export/disk7/why/workbench/MERGE/v0/GLAM_repeat/GLAM/src_1gp_EmbeddingCompare/layer.py", line 131, in forward
    return self.conv(x, edge_index, edge_attr)
  File "/export/disk3/why/software/Miniforge3/envs/PyG252/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/export/disk7/why/workbench/MERGE/v0/GLAM_repeat/GLAM/src_1gp_EmbeddingCompare/layer.py", line 40, in forward
    return self.propagate(edge_index, x=x, edge_attr=edge_attr, size=size)
  File "/tmp/layer_TripletMessage_propagate_6_8b4kbw.py", line 194, in propagate
    out = self.message(
  File "/export/disk7/why/workbench/MERGE/v0/GLAM_repeat/GLAM/src_1gp_EmbeddingCompare/layer.py", line 49, in message
    alpha = (triplet * self.weight_triplet_att).sum(dim=-1)  # time consuming 12.14s
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 866.00 MiB (GPU 4; 23.64 GiB total capacity; 10.45 GiB already allocated; 31.56 MiB free; 10.81 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
How to deal with that?
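One mitigation the error message itself points to is tuning the caching allocator via `PYTORCH_CUDA_ALLOC_CONF`. A sketch of how that might be set before launching the run (the value `128` is an illustrative choice, not a recommendation from this thread):

```shell
# Limit the size of cached blocks PyTorch's allocator will split,
# which can reduce fragmentation when reserved memory >> allocated memory.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
python run.py
```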