creating unet model and diffusion...
32,16,8 32,16,8
[16, 32, 64] [] [16, 32, 64]
Let's use 2 GPUs!
Unet model size: 2521.918MB
creating data loader...
Viz data len: 784
training...
Number of samples: 784
Diffusion loss: 1.0043
Traceback (most recent call last):
  File "scripts/image_syn/train.py", line 140, in <module>
    main()
  File "scripts/image_syn/train.py", line 67, in main
    CycleTrainLoop(
  File "/sda/hyunbin20240205/Nudiff-main/nudiff/image_syn/src/run_desc.py", line 118, in run_loop
    self.run_step(batch, cond).to("cuda:6")
  File "/sda/hyunbin20240205/Nudiff-main/nudiff/image_syn/src/run_desc.py", line 103, in run_step
    took_step = self.mp_trainer.optimize(self.opt)
  File "/sda/hyunbin20240205/Nudiff-main/nudiff/image_syn/utils/fp16_util.py", line 190, in optimize
    return self._optimize_fp16(opt)
  File "/sda/hyunbin20240205/Nudiff-main/nudiff/image_syn/utils/fp16_util.py", line 208, in _optimize_fp16
    opt.step()
  File "/home/edsr/anaconda3/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/edsr/anaconda3/lib/python3.8/site-packages/torch/optim/optimizer.py", line 140, in wrapper
    out = func(*args, **kwargs)
  File "/home/edsr/anaconda3/lib/python3.8/site-packages/torch/optim/optimizer.py", line 23, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/home/edsr/anaconda3/lib/python3.8/site-packages/torch/optim/adam.py", line 234, in step
    adam(params_with_grad,
  File "/home/edsr/anaconda3/lib/python3.8/site-packages/torch/optim/adam.py", line 300, in adam
    func(params,
  File "/home/edsr/anaconda3/lib/python3.8/site-packages/torch/optim/adam.py", line 410, in _single_tensor_adam
    denom = (exp_avg_sq.sqrt() / bias_correction2_sqrt).add_(eps)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.46 GiB (GPU 0; 23.68 GiB total capacity; 19.95 GiB already allocated; 1.62 GiB free; 20.62 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
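For what it's worth, the allocator setting that the error message itself suggests (`max_split_size_mb`) can be set from the environment before torch initializes CUDA. This is a minimal sketch; the value 128 is illustrative, not a tuned number:

```python
import os

# The CUDA caching allocator reads this variable when CUDA is first
# initialized, so it must be set before the training script imports torch
# (or be exported in the shell before launching train.py).
# 128 MB is an illustrative split size, not a tuned value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

This only mitigates fragmentation (the "reserved >> allocated" case); it does not shrink the model's actual memory footprint.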
I am trying to make this work on my server, which has 6 GPUs. I have tried a few changes to the dist_util file, but none of them worked. Can someone help me out, please?
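One detail worth noting from the traceback: with 6 GPUs, valid CUDA device indices run from 0 to 5, so the `.to("cuda:6")` call in run_desc.py targets a device one past the last GPU. A minimal sketch of that range check, where `num_gpus` stands in for `torch.cuda.device_count()`:

```python
# Valid CUDA device strings on an N-GPU machine are "cuda:0" .. "cuda:N-1".
def valid_device(index: int, num_gpus: int = 6) -> bool:
    """Return True if "cuda:<index>" exists on a machine with num_gpus GPUs."""
    return 0 <= index < num_gpus

print(valid_device(6))  # "cuda:6" is out of range on a 6-GPU machine
print(valid_device(5))  # "cuda:5" is the last valid index
```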
Problem raised:

Module: OpenFabrics (openib)
Host: gpusystem

Another transport will be used instead, although this may result in lower performance.
NOTE: You can disable this warning by setting the MCA parameter btl_base_warn_component_unused to 0.
Logging to logs/monuseg/prop1.0_pretrain

Program executed via:

scripts/image_syn/train.py \
  --data_dir monuseg/allpatch256x256_128/train \
  --viz_data_dir monuseg/allpatch256x256_128/train \
  --lr 1e-4 \
  --batch_size 1 \
  --attention_resolutions 32,16,8 \
  --diffusion_steps 1000 \
  --image_size 512 \
  --learn_sigma True \
  --noise_schedule linear \
  --num_channels 256 \
  --num_head_channels 64 \
  --num_res_blocks 2 \
  --resblock_updown True \
  --use_fp16 True \
  --use_scale_shift_norm True \
  --use_checkpoint False \
  --num_classes 2 \
  --class_cond True \
  --no_instance False \
  --max_iterations 15 \
  --save_interval 10000 \
  --viz_interval 10000 \
  --viz_batch_size 2
Thank you