wenbin-lin / RelightableAvatar

Relightable and Animatable Neural Avatars from Videos (AAAI 2024)
https://wenbin-lin.github.io/RelightableAvatar-page/
Apache License 2.0

Distributed training on 2 GPUs #2

Open JiatengLiu opened 4 months ago

JiatengLiu commented 4 months ago

Sorry to bother you again. I was training on a single GeForce RTX 3090, but I ran into a problem in the first stage of training:

  1. When the training epoch reached 73, the program reported a "CUDA out of memory" error. Is this inconsistent with what you mentioned in your paper? The error information is below (a possible allocator workaround is sketched after this list).
    exp: geometry_zju_377  eta: 0:04:31  epoch: 71  step: 35670  offset_loss: 0.0013  grad_loss: 0.0116  ograd_loss: 0.0078  mask_loss: 1.8135  img_loss: 0.0161  loss: 1.8298  data: 0.1377  batch: 0.8588  lr: 0.000425  max_mem: 20721
    ...
    RuntimeError: CUDA out of memory. Tried to allocate 134.00 MiB (GPU 0; 23.69 GiB total capacity; 20.34 GiB already allocated; 85.94 MiB free; 21.54 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  2. So I tried distributed training: I set gpus in the geometry_zju_377.yaml file to [2,3] and explicitly set distributed to True, but the error below was reported (see the RANK sketch after this list). How can I solve it?
    Traceback (most recent call last):
    File "train_geometry.py", line 114, in <module>
    main()
    File "train_geometry.py", line 91, in main
    cfg.local_rank = int(os.environ['RANK']) % torch.cuda.device_count()
    File "/opt/conda/envs/RAvatar/lib/python3.8/os.py", line 675, in __getitem__
    raise KeyError(key) from None
    KeyError: 'RANK'

    By the way, I tried to work around the error by explicitly setting some os.environ values, but that did not work; maybe I set them incorrectly.
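
For the out-of-memory error, one option is the workaround suggested in the error message itself: configuring the CUDA caching allocator's max_split_size_mb to reduce fragmentation. A minimal sketch (not code from this repo, and it only helps with fragmentation, not with a model that genuinely needs more than 24 GB):

```python
# Hypothetical workaround based on the hint in the error message, not code
# from this repo: configure the CUDA caching allocator before CUDA is
# initialized so large free blocks are not split into unusable pieces.
import os
os.environ.setdefault('PYTORCH_CUDA_ALLOC_CONF', 'max_split_size_mb:128')

import torch  # any CUDA work must happen after the variable is set
```

If the allocation still fails, reducing the number of rays or samples per batch in the training config (whatever the equivalent option is in this codebase) is the usual fallback.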
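
For the KeyError, the traceback shows that train_geometry.py reads RANK from os.environ. That variable is normally exported by a distributed launcher such as torchrun (or the older torch.distributed.launch), so a plain `python train_geometry.py` invocation leaves it unset. A hypothetical, more defensive version of that lookup (not the repo's actual code) would be:

```python
# Hypothetical sketch, not the repo's code: fall back to single-process
# defaults when the script is not started by a launcher that exports
# RANK / WORLD_SIZE (torchrun sets these automatically).
import os
import torch

rank = int(os.environ.get('RANK', 0))
world_size = int(os.environ.get('WORLD_SIZE', 1))
local_rank = rank % torch.cuda.device_count()
```

Alternatively, launching the script through torchrun with two processes would populate RANK without any code change.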

wenbin-lin commented 4 months ago

This problem seems to be caused by a bug in the loading of the ZJU-Mocap dataset; I will fix it soon. Besides, training with multiple GPUs has not been tested yet.

JiatengLiu commented 4 months ago

So all tasks can still be done on a single GeForce RTX 3090? By the way, I am trying to solve the problems I encountered with multi-GPU parallel training; I will contact you when I have results. Thanks :)

wenbin-lin commented 4 months ago

Yes, we used only one GPU for all tasks. The bug in loading the ZJU-Mocap dataset has been fixed; please try the new code.

JiatengLiu commented 4 months ago

I will retry later. Thanks :)


JiatengLiu commented 3 months ago

I found another question about ZJU-Mocap: all the mask images in the mask folder of each character in the dataset look like black images (all values appear to be zero). Do you know the reason for this?

wenbin-lin commented 3 months ago

There are small but non-zero values in the mask images, so the images look very dark.
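
A quick way to confirm this (an illustrative snippet with an example path, not the repo's actual layout) is to load one mask and inspect its values:

```python
# Illustrative check with an example path, not the repo's actual layout:
# a "black-looking" mask typically stores labels like 0/1 instead of 0/255.
import cv2
import numpy as np

mask = cv2.imread('mask/Camera_B1/000000.png', cv2.IMREAD_GRAYSCALE)
print(np.unique(mask))                                           # e.g. [0 1]
cv2.imwrite('mask_vis.png', (mask > 0).astype(np.uint8) * 255)   # visible mask
```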

JiatengLiu commented 3 months ago

Sorry, I understand now :)
