Closed hyunjunChhoi closed 1 year ago
Hi @hyunjunChhoi
This DDP issue usually happens when no gradients are found in backpropagation after a forward pass. In our case, it is likely caused by the limited batch size and input resolution, so no anchor embedding is captured for the contrastive loss.
Could you please simply change this line to
torch.sum(anchor_embeds) * .0
and see whether the issue is solved or not.
Cheers, Yuyuan
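As a minimal sketch of why the suggested `torch.sum(anchor_embeds) * .0` can help: a zero built from a model output stays attached to the autograd graph, so the backward pass still produces (all-zero) gradients for the parameters, whereas a freshly created constant tensor carries no graph at all. The `proj_head` layer and the tensor shapes below are hypothetical stand-ins, not the repository's actual model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the projection head (not the repository's model).
proj_head = nn.Linear(4, 2)
embeds = proj_head(torch.randn(3, 4))

# Simulate an iteration in which no anchor embedding was selected.
anchor_embeds = embeds[:0]  # empty tensor, shape (0, 2), still part of the graph

# A fresh constant has no autograd history, so nothing flows back to proj_head.
detached_zero = torch.tensor([0.0])

# A zero derived from the model output keeps the graph connection.
connected_zero = torch.sum(anchor_embeds) * 0.0

print(detached_zero.requires_grad)   # False
print(connected_zero.requires_grad)  # True

connected_zero.backward()
# The gradient now exists and is all zeros, instead of being None.
print(proj_head.weight.grad)
```

Under DDP this distinction matters because gradient reduction expects a gradient from the parameters that took part in the forward pass; whether this alone is sufficient in the 4-GPU setup discussed here is not established by the sketch.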
Thanks for your quick reply
However, the issue is not resolved. The following error occurs when I use 4 GPUs with a batch size of 2 each:
epoch=0, iter=[3/360] | entropy: 0.050 energy: 0.246, contrastive: 0.000
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
Traceback (most recent call last):
File "rpl_corocl.code/main.py", line 196, in
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/numb7315/anaconda3/envs/balanced/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/numb7315/RPL-main/rpl_corocl.code/main.py", line 149, in main
trainer.train(model=model, epoch=curr_epoch, train_sampler=train_sampler, train_loader=train_loader,
File "/home/numb7315/RPL-main/rpl_corocl.code/engine/trainer.py", line 123, in train
del ood_proj, city_proj
UnboundLocalError: local variable 'city_proj' referenced before assignment
I printed global_ood_num. When the error occurs (with a contrastive loss of 0), it looks as follows:
[tensor([2], device='cuda:1'), tensor([2], device='cuda:1'), tensor([2], device='cuda:1'), tensor([1], device='cuda:1')]
[tensor([2], device='cuda:3'), tensor([2], device='cuda:3'), tensor([2], device='cuda:3'), tensor([1], device='cuda:3')]
[tensor([2], device='cuda:0'), tensor([2], device='cuda:0'), tensor([2], device='cuda:0'), tensor([1], device='cuda:0')]
[tensor([2], device='cuda:2'), tensor([2], device='cuda:2'), tensor([2], device='cuda:2'), tensor([1], device='cuda:2')]
Hi @hyunjunChhoi
Could you comment out this line and see what's going on?
Cheers, Yuyuan
epoch=0, iter=[2/360] | entropy: 0.134 energy: 0.299, contrastive: 6.706
[tensor([2], device='cuda:3'), tensor([2], device='cuda:3'), tensor([2], device='cuda:3'), tensor([1], device='cuda:3')]
[tensor([2], device='cuda:1'), tensor([2], device='cuda:1'), tensor([2], device='cuda:1'), tensor([1], device='cuda:1')]
[tensor([2], device='cuda:0'), tensor([2], device='cuda:0'), tensor([2], device='cuda:0'), tensor([1], device='cuda:0')]
[tensor([2], device='cuda:2'), tensor([2], device='cuda:2'), tensor([2], device='cuda:2'), tensor([1], device='cuda:2')]
epoch=0, iter=[3/360] | entropy: 0.050 energy: 0.246, contrastive: 0.000
[tensor([2], device='cuda:1'), tensor([2], device='cuda:1'), tensor([2], device='cuda:1'), tensor([2], device='cuda:1')]
[tensor([2], device='cuda:3'), tensor([2], device='cuda:3'), tensor([2], device='cuda:3'), tensor([2], device='cuda:3')]
[tensor([2], device='cuda:0'), tensor([2], device='cuda:0'), tensor([2], device='cuda:0'), tensor([2], device='cuda:0')]
[tensor([2], device='cuda:2'), tensor([2], device='cuda:2'), tensor([2], device='cuda:2'), tensor([2], device='cuda:2')]
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
Traceback (most recent call last):
File "rpl_corocl.code/main.py", line 196, in
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/numb7315/anaconda3/envs/balanced/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/numb7315/RPL-main/rpl_corocl.code/main.py", line 149, in main
trainer.train(model=model, epoch=curr_epoch, train_sampler=train_sampler, train_loader=train_loader,
File "/home/numb7315/RPL-main/rpl_corocl.code/engine/trainer.py", line 54, in train
non_residual_logits, residual_logits, projects = model(input_data)
File "/home/numb7315/anaconda3/envs/balanced/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/numb7315/anaconda3/envs/balanced/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 692, in forward
if self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
The same error occurred, and I don't know why.
Hi @hyunjunChhoi
That's weird. I've tested the code on 2 * V100 (32 GB), and everything works fine...
Could you double check the embedding numbers for both city and ood cases in this line?
Also, could you change the line here to
loss = self.info_nce(anchors_=anchor_embeds, a_labels_=anchor_labels.unsqueeze(1), contras_=contrs_embeds,
                     c_labels_=contrs_labels.unsqueeze(1)) if anchor_embeds.nelement() > 0 else \
    torch.sum(city_proj) * .0 + torch.sum(ood_proj) * .0
and see what's going on?
Cheers, Yuyuan
Thanks. However, I think the error occurs in the iteration right after the one in which the contrastive loss is zero. I modified my code like this:
for batch_idx in tbar:
    curr_idx = epoch * loader_len + batch_idx
    city_imgs, city_targets, city_mix_imgs, city_mix_targets, ood_imgs, ood_targets = next(train_loader)
    city_mix_imgs, city_mix_targets = city_mix_imgs.cuda(non_blocking=True), city_mix_targets.cuda(non_blocking=True)
    ood_imgs, ood_targets = ood_imgs.cuda(non_blocking=True), ood_targets.cuda(non_blocking=True)
    optimizer.zero_grad()
    self.engine.update_iteration(epoch, curr_idx)
    ood_indices = [254 in i for i in city_mix_targets]
    global_ood_num = self.fetch_global_ood_num(local_ood_num=torch.tensor([sum(ood_indices)]).cuda())
    catch_anomaly += sum(global_ood_num).item()
    print(global_ood_num)
    if all([i >= 2 for i in global_ood_num]):
        input_data = torch.cat([city_mix_imgs, ood_imgs], dim=0)
        half_batch_size = int(input_data.shape[0] / 2)
        #print(half_batch_size)
        non_residual_logits, residual_logits, projects = model(input_data)
        #print(projects.size())
        city_vanilla_logits, city_mix_logits, city_proj = \
            non_residual_logits[:half_batch_size], residual_logits[:half_batch_size], projects[:half_batch_size]
        ood_logits, ood_proj = residual_logits[half_batch_size:], projects[half_batch_size:]
        contras_loss = self.contras_loss(city_proj=city_proj[ood_indices],
                                         city_gt=city_mix_targets[ood_indices],
                                         city_pred=city_mix_logits[ood_indices],
                                         ood_pred=ood_logits[ood_indices],
                                         ood_proj=ood_proj[ood_indices], ood_gt=ood_targets[ood_indices])
    else:
        ood_logits, ood_proj = None, None
        print("error1")
        city_vanilla_logits, city_mix_logits, _ = model(city_mix_imgs)
        print("error2")
        contras_loss = torch.tensor([.0], device=city_mix_logits.device)
Then it looks like this:
epoch=0, iter=[2/360] | entropy: 0.134 energy: 0.299, contrastive: 6.706
[tensor([2], device='cuda:2'), tensor([2], device='cuda:2'), tensor([2], device='cuda:2'), tensor([1], device='cuda:2')]
[tensor([2], device='cuda:1'), tensor([2], device='cuda:1'), tensor([2], device='cuda:1'), tensor([1], device='cuda:1')]
error1
error1
[tensor([2], device='cuda:0'), tensor([2], device='cuda:0'), tensor([2], device='cuda:0'), tensor([1], device='cuda:0')]
[tensor([2], device='cuda:3'), tensor([2], device='cuda:3'), tensor([2], device='cuda:3'), tensor([1], device='cuda:3')]
error1
error1
error2
error2
error2
error2
epoch=0, iter=[3/360] | entropy: 0.050 energy: 0.246, contrastive: 0.000
[tensor([2], device='cuda:2'), tensor([2], device='cuda:2'), tensor([2], device='cuda:2'), tensor([2], device='cuda:2')]
[tensor([2], device='cuda:1'), tensor([2], device='cuda:1'), tensor([2], device='cuda:1'), tensor([2], device='cuda:1')]
[tensor([2], device='cuda:3'), tensor([2], device='cuda:3'), tensor([2], device='cuda:3'), tensor([2], device='cuda:3')]
[tensor([2], device='cuda:0'), tensor([2], device='cuda:0'), tensor([2], device='cuda:0'), tensor([2], device='cuda:0')]
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
Traceback (most recent call last):
File "rpl_corocl.code/main.py", line 196, in
-- Process 2 terminated with the following error:
Traceback (most recent call last):
File "/home/numb7315/anaconda3/envs/balanced/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/numb7315/RPL-main/rpl_corocl.code/main.py", line 149, in main
trainer.train(model=model, epoch=curr_epoch, train_sampler=train_sampler, train_loader=train_loader,
File "/home/numb7315/RPL-main/rpl_corocl.code/engine/trainer.py", line 54, in train
non_residual_logits, residual_logits, projects = model(input_data)
File "/home/numb7315/anaconda3/envs/balanced/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/numb7315/anaconda3/envs/balanced/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 692, in forward
if self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Maybe it occurs when, for some reason, the cut-mix scene contains no OOD label. Also, could you try testing with a batch size of 2? The error then occurs frequently.
Maybe there is a problem with the statement
if all([i >= 2 for i in global_ood_num]):
Could you check?
I think the key difference between rpl and rpl_con is the projection layer.
When the contrastive loss is zero, the projection layer is not used, and the error may occur.
Thanks in advance
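The printed counts can be checked against this gating condition directly. The sketch below uses plain integers in place of the one-element CUDA tensors from the logs (an assumption made for illustration), and `use_contrastive_branch` is a hypothetical helper name, not a function from the repository.

```python
def use_contrastive_branch(global_ood_num, min_per_gpu=2):
    """Mirror of `all([i >= 2 for i in global_ood_num])` on plain integers."""
    return all(n >= min_per_gpu for n in global_ood_num)


# Per-GPU counts copied from the logs above.
iteration_with_error1 = [2, 2, 2, 1]
iteration_that_crashed = [2, 2, 2, 2]

print(use_contrastive_branch(iteration_with_error1))   # False
print(use_contrastive_branch(iteration_that_crashed))  # True
```

Read this way, a single GPU reporting fewer than two OOD images sends every GPU down the else branch, and the RuntimeError in the logs is raised by the forward pass of the following iteration, which takes the contrastive branch again.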
Hi @hyunjunChhoi
I tried it again on 2 * V100 (32 GB) yesterday, and everything still works fine (even with batch_size=2). Could you try running your scripts on 2 GPUs and see what's going on? Currently, I don't have 4 GPUs available, and I'll work on this issue (i.e., DDP training with 3 or more GPUs) if other users meet the same problem in the future.
Cheers, Yuyuan
Hi @hyunjunChhoi
I've finally got a chance to access 4 GPUs to test the code. I'll let you know if there is any bug in the program.
I have tested the code on 4 * 3090 GPUs with a small image resolution (i.e., 300x300) and a limited batch size (i.e., batch_size=2), and I could not reproduce any bug during training. Here is the log: test_4gpus.txt.
@hyunjunChhoi Please double check our deployment information and thanks for your interest.
Hi,
I actually got the same error too:
del ood_proj, city_proj
UnboundLocalError: local variable 'city_proj' referenced before assignment
To my understanding, this is because in this line the code checks whether there are at least 2 ood pixels per image (I think); if this condition is not satisfied, the contrastive loss is not used for that batch. However, in that case the variable city_proj is left undefined. I think that modifying this line to
ood_logits, ood_proj, city_proj = None, None, None
should solve the problem.
Hope that can help future users of the code! A clarification from the authors as to why this condition is necessary would also be nice.
Regards, Pau
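The failure mode described here can be reproduced in isolation. This is a simplified sketch with hypothetical function names and placeholder values, not the repository's training loop; it only shows why `del` fails when one branch never binds `city_proj`, and how the proposed `None` initialisation avoids that.

```python
def step_buggy(use_contrastive):
    if use_contrastive:
        city_proj, ood_proj = "city", "ood"  # placeholders for projection tensors
    else:
        ood_logits, ood_proj = None, None    # city_proj is never bound here
    del ood_proj, city_proj                  # fails in the else case


def step_fixed(use_contrastive):
    if use_contrastive:
        city_proj, ood_proj = "city", "ood"
    else:
        ood_logits, ood_proj, city_proj = None, None, None
    del ood_proj, city_proj
    return True


try:
    step_buggy(use_contrastive=False)
    buggy_raised = False
except UnboundLocalError:
    buggy_raised = True

print(buggy_raised)                       # True
print(step_fixed(use_contrastive=False))  # True
```

Note that this sketch concerns only the UnboundLocalError; it says nothing about the separate DDP reduction error reported earlier in the thread.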
epoch=11, iter=[127/360] | entropy: 0.041 energy: 0.012, contrastive: 0.000
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
Traceback (most recent call last):
File "rpl_corocl.code/main.py", line 196, in
torch.multiprocessing.spawn(main, nprocs=args.gpus, args=(args.gpus, config, args))
File "/home/numb7315/.conda/envs/balanced/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/numb7315/.conda/envs/balanced/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/home/numb7315/.conda/envs/balanced/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/numb7315/.conda/envs/balanced/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/numb7315/RPL-main/rpl_corocl.code/main.py", line 149, in main
trainer.train(model=model, epoch=curr_epoch, train_sampler=train_sampler, train_loader=train_loader,
File "/home/numb7315/RPL-main/rpl_corocl.code/engine/trainer.py", line 51, in train
non_residual_logits, residual_logits, projects = model(input_data)
File "/home/numb7315/.conda/envs/balanced/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/numb7315/.conda/envs/balanced/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 692, in forward
if self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Thanks for your great work. Sometimes the contrastive loss goes to 0. When the contrastive loss is 0 during training, this error occurs. find_unused_parameters is already set to True, but the error still occurs.
Could you help? In my setting, I cannot set the batch size to 8 on a single GPU because of GPU memory limits.
Thanks in advance