yyliu01 / RPL

[ICCV'23] Residual Pattern Learning for Pixel-wise Out-of-Distribution Detection in Semantic Segmentation
https://arxiv.org/pdf/2211.14512.pdf
MIT License

Multi-GPU experiment for contrastive learning raises an error #2

Closed hyunjunChhoi closed 1 year ago

hyunjunChhoi commented 1 year ago

    epoch=11, iter=[127/360] | entropy: 0.041 energy: 0.012, contrastive: 0.000
    wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
    Traceback (most recent call last):
      File "rpl_corocl.code/main.py", line 196, in <module>
        torch.multiprocessing.spawn(main, nprocs=args.gpus, args=(args.gpus, config, args))
      File "/home/numb7315/.conda/envs/balanced/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
        return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
      File "/home/numb7315/.conda/envs/balanced/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
        while not context.join():
      File "/home/numb7315/.conda/envs/balanced/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
        raise ProcessRaisedException(msg, error_index, failed_process.pid)
    torch.multiprocessing.spawn.ProcessRaisedException:

    -- Process 1 terminated with the following error:
    Traceback (most recent call last):
      File "/home/numb7315/.conda/envs/balanced/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
        fn(i, *args)
      File "/home/numb7315/RPL-main/rpl_corocl.code/main.py", line 149, in main
        trainer.train(model=model, epoch=curr_epoch, train_sampler=train_sampler, train_loader=train_loader,
      File "/home/numb7315/RPL-main/rpl_corocl.code/engine/trainer.py", line 51, in train
        non_residual_logits, residual_logits, projects = model(input_data)
      File "/home/numb7315/.conda/envs/balanced/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/numb7315/.conda/envs/balanced/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 692, in forward
        if self.reducer._rebuild_buckets():
    RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).

Thanks for your great work. Sometimes the contrastive loss goes to 0, and when it is 0 during training the error above occurs. find_unused_parameters is already set to True, but the error still occurs.
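(For reference, this is the kind of DDP wrapping being referred to; a minimal sketch only, assuming the usual torch.multiprocessing.spawn entry point shown in the traceback, and not necessarily identical to how main.py builds the model in this repo:)

    # Minimal sketch: wrap the model for DDP with unused-parameter detection.
    # The process group is assumed to be initialised already, e.g. inside the
    # worker function launched by torch.multiprocessing.spawn (see traceback).
    # find_unused_parameters=True asks DDP to traverse the autograd graph from
    # the forward outputs and mark parameters that will receive no gradient;
    # as this thread shows, it is not always sufficient when ranks take
    # different code paths.
    import torch
    from torch.nn.parallel import DistributedDataParallel as DDP

    def wrap_for_ddp(model: torch.nn.Module, local_rank: int) -> DDP:
        model = model.cuda(local_rank)
        return DDP(model,
                   device_ids=[local_rank],
                   output_device=local_rank,
                   find_unused_parameters=True)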

Could you help? In my setup I cannot use a batch size of 8 on a single GPU because of GPU memory limits.

Thanks in advance

yyliu01 commented 1 year ago

Hi @hyunjunChhoi

This DDP issue usually happens when no gradients are found in backpropagation after a forward pass. In our case, it is likely caused by the limited batch size and input resolution, so no anchor embedding is captured for the contrastive loss.

Could you please simply change this line to

torch.sum(anchor_embeds) * .0

and see whether the issue is solved or not.
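(To make the intent of that change explicit, here is a minimal sketch of the zero-weighted fallback pattern; the keyword names follow the info_nce call quoted later in this thread, while the helper itself is illustrative rather than the repo's actual code:)

    import torch

    # Sketch: when a rank finds no anchor embeddings, return a loss that still
    # depends on the embedding tensor, so every parameter that produced it gets
    # a (zero) gradient and the DDP reducer sees no "unused" parameters.
    def contrastive_or_zero(info_nce, anchor_embeds, anchor_labels,
                            contrs_embeds, contrs_labels):
        if anchor_embeds.nelement() > 0:
            return info_nce(anchors_=anchor_embeds,
                            a_labels_=anchor_labels.unsqueeze(1),
                            contras_=contrs_embeds,
                            c_labels_=contrs_labels.unsqueeze(1))
        # no anchors on this rank: contribute exactly 0 but keep the graph alive
        return torch.sum(anchor_embeds) * .0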

Cheers, Yuyuan

hyunjunChhoi commented 1 year ago

> Hi @hyunjunChhoi
>
> This DDP issue usually happens when no gradients are found in backpropagation after a forward pass. In our case, it is likely caused by the limited batch size and input resolution, so no anchor embedding is captured for the contrastive loss.
>
> Could you please simply change this line to
>
> torch.sum(anchor_embeds) * .0
>
> and see whether the issue is solved or not.
>
> Cheers, Yuyuan

Thanks for your quick reply

However, the issue is not resolved. The following error occurs when I use 4 GPUs with a batch size of 2 per GPU:

    epoch=0, iter=[3/360] | entropy: 0.050 energy: 0.246, contrastive: 0.000
    wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
    Traceback (most recent call last):
      File "rpl_corocl.code/main.py", line 196, in <module>
        torch.multiprocessing.spawn(main, nprocs=args.gpus, args=(args.gpus, config, args))
      File "/home/numb7315/anaconda3/envs/balanced/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
        return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
      File "/home/numb7315/anaconda3/envs/balanced/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
        while not context.join():
      File "/home/numb7315/anaconda3/envs/balanced/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
        raise ProcessRaisedException(msg, error_index, failed_process.pid)
    torch.multiprocessing.spawn.ProcessRaisedException:

    -- Process 1 terminated with the following error:
    Traceback (most recent call last):
      File "/home/numb7315/anaconda3/envs/balanced/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
        fn(i, *args)
      File "/home/numb7315/RPL-main/rpl_corocl.code/main.py", line 149, in main
        trainer.train(model=model, epoch=curr_epoch, train_sampler=train_sampler, train_loader=train_loader,
      File "/home/numb7315/RPL-main/rpl_corocl.code/engine/trainer.py", line 123, in train
        del ood_proj, city_proj
    UnboundLocalError: local variable 'city_proj' referenced before assignment

I printed global_ood_num, which looks as follows when the error occurs and the contrastive loss is 0:

    [tensor([2], device='cuda:1'), tensor([2], device='cuda:1'), tensor([2], device='cuda:1'), tensor([1], device='cuda:1')]
    [tensor([2], device='cuda:3'), tensor([2], device='cuda:3'), tensor([2], device='cuda:3'), tensor([1], device='cuda:3')]
    [tensor([2], device='cuda:0'), tensor([2], device='cuda:0'), tensor([2], device='cuda:0'), tensor([1], device='cuda:0')]
    [tensor([2], device='cuda:2'), tensor([2], device='cuda:2'), tensor([2], device='cuda:2'), tensor([1], device='cuda:2')]

yyliu01 commented 1 year ago

Hi @hyunjunChhoi

Could you comment out this line and see what's going on?

Cheers, Yuyuan

hyunjunChhoi commented 1 year ago

> Hi @hyunjunChhoi
>
> Could you comment out this line and see what's going on?
>
> Cheers, Yuyuan

    epoch=0, iter=[2/360] | entropy: 0.134 energy: 0.299, contrastive: 6.706
    [tensor([2], device='cuda:3'), tensor([2], device='cuda:3'), tensor([2], device='cuda:3'), tensor([1], device='cuda:3')]
    [tensor([2], device='cuda:1'), tensor([2], device='cuda:1'), tensor([2], device='cuda:1'), tensor([1], device='cuda:1')]
    [tensor([2], device='cuda:0'), tensor([2], device='cuda:0'), tensor([2], device='cuda:0'), tensor([1], device='cuda:0')]
    [tensor([2], device='cuda:2'), tensor([2], device='cuda:2'), tensor([2], device='cuda:2'), tensor([1], device='cuda:2')]
    epoch=0, iter=[3/360] | entropy: 0.050 energy: 0.246, contrastive: 0.000
    [tensor([2], device='cuda:1'), tensor([2], device='cuda:1'), tensor([2], device='cuda:1'), tensor([2], device='cuda:1')]
    [tensor([2], device='cuda:3'), tensor([2], device='cuda:3'), tensor([2], device='cuda:3'), tensor([2], device='cuda:3')]
    [tensor([2], device='cuda:0'), tensor([2], device='cuda:0'), tensor([2], device='cuda:0'), tensor([2], device='cuda:0')]
    [tensor([2], device='cuda:2'), tensor([2], device='cuda:2'), tensor([2], device='cuda:2'), tensor([2], device='cuda:2')]
    wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
    Traceback (most recent call last):
      File "rpl_corocl.code/main.py", line 196, in <module>
        torch.multiprocessing.spawn(main, nprocs=args.gpus, args=(args.gpus, config, args))
      File "/home/numb7315/anaconda3/envs/balanced/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
        return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
      File "/home/numb7315/anaconda3/envs/balanced/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
        while not context.join():
      File "/home/numb7315/anaconda3/envs/balanced/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
        raise ProcessRaisedException(msg, error_index, failed_process.pid)
    torch.multiprocessing.spawn.ProcessRaisedException:

    -- Process 1 terminated with the following error:
    Traceback (most recent call last):
      File "/home/numb7315/anaconda3/envs/balanced/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
        fn(i, *args)
      File "/home/numb7315/RPL-main/rpl_corocl.code/main.py", line 149, in main
        trainer.train(model=model, epoch=curr_epoch, train_sampler=train_sampler, train_loader=train_loader,
      File "/home/numb7315/RPL-main/rpl_corocl.code/engine/trainer.py", line 54, in train
        non_residual_logits, residual_logits, projects = model(input_data)
      File "/home/numb7315/anaconda3/envs/balanced/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/numb7315/anaconda3/envs/balanced/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 692, in forward
        if self.reducer._rebuild_buckets():
    RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).

The same error occurred; I don't know why.

yyliu01 commented 1 year ago

Hi @hyunjunChhoi

That's weird. I've tested the code on 2 * V100 (32 GB), but everything works fine...

Could you double check the embedding numbers for both city and ood cases in this line?
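(If it helps, a possible way to print those numbers; a rough sketch only, where the tensor names are borrowed from the training code pasted later in this thread and might not match the exact line being referred to:)

    import torch.distributed as dist

    # Hypothetical debug print: how many projection elements each rank has.
    print(f"rank={dist.get_rank()} "
          f"city_proj={tuple(city_proj.shape)} ood_proj={tuple(ood_proj.shape)}")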

Also, could you change the line here to

        loss = self.info_nce(anchors_=anchor_embeds, a_labels_=anchor_labels.unsqueeze(1), contras_=contrs_embeds,
                             c_labels_=contrs_labels.unsqueeze(1)) if anchor_embeds.nelement() > 0 else \
            torch.sum(city_proj) * .0 + torch.sum(ood_proj) * .0 

and see what's going on?

Cheers, Yuyuan

hyunjunChhoi commented 1 year ago

> Hi @hyunjunChhoi
>
> That's weird. I've tested the code on 2 * V100 (32 GB), but everything works fine...
>
> Could you double check the embedding numbers for both city and ood cases in this line?
>
> Also, could you change the line here to
>
>         loss = self.info_nce(anchors_=anchor_embeds, a_labels_=anchor_labels.unsqueeze(1), contras_=contrs_embeds,
>                              c_labels_=contrs_labels.unsqueeze(1)) if anchor_embeds.nelement() > 0 else \
>             torch.sum(city_proj) * .0 + torch.sum(ood_proj) * .0
>
> and see what's going on?
>
> Cheers, Yuyuan

Thanks, but I think the error occurs right after an iteration in which the contrastive loss is zero. I modified my code like this:

    for batch_idx in tbar:
        curr_idx = epoch * loader_len + batch_idx
        city_imgs, city_targets, city_mix_imgs, city_mix_targets, ood_imgs, ood_targets = next(train_loader)
        city_mix_imgs, city_mix_targets = city_mix_imgs.cuda(non_blocking=True), city_mix_targets.cuda(non_blocking=True)
        ood_imgs, ood_targets = ood_imgs.cuda(non_blocking=True), ood_targets.cuda(non_blocking=True)
        optimizer.zero_grad()
        self.engine.update_iteration(epoch, curr_idx)
        ood_indices = [254 in i for i in city_mix_targets]
        global_ood_num = self.fetch_global_ood_num(local_ood_num=torch.tensor([sum(ood_indices)]).cuda())
        catch_anomaly += sum(global_ood_num).item()
        print(global_ood_num)

        if all([i >= 2 for i in global_ood_num]):
            input_data = torch.cat([city_mix_imgs, ood_imgs], dim=0)
            half_batch_size = int(input_data.shape[0] / 2)
            #print(half_batch_size)

            non_residual_logits, residual_logits, projects = model(input_data)
            #print(projects.size())
            city_vanilla_logits, city_mix_logits, city_proj = \
                non_residual_logits[:half_batch_size], residual_logits[:half_batch_size], projects[:half_batch_size]

            ood_logits, ood_proj = residual_logits[half_batch_size:], projects[half_batch_size:]
            contras_loss = self.contras_loss(city_proj=city_proj[ood_indices],
                                             city_gt=city_mix_targets[ood_indices],
                                             city_pred=city_mix_logits[ood_indices],
                                             ood_pred=ood_logits[ood_indices],
                                             ood_proj=ood_proj[ood_indices], ood_gt=ood_targets[ood_indices])
        else:
            ood_logits, ood_proj = None, None
            print("error1")
            city_vanilla_logits, city_mix_logits, _ = model(city_mix_imgs)
            print("error2")
            contras_loss = torch.tensor([.0], device=city_mix_logits.device)

Then it looks like this:

    epoch=0, iter=[2/360] | entropy: 0.134 energy: 0.299, contrastive: 6.706
    [tensor([2], device='cuda:2'), tensor([2], device='cuda:2'), tensor([2], device='cuda:2'), tensor([1], device='cuda:2')]
    [tensor([2], device='cuda:1'), tensor([2], device='cuda:1'), tensor([2], device='cuda:1'), tensor([1], device='cuda:1')]
    error1
    error1
    [tensor([2], device='cuda:0'), tensor([2], device='cuda:0'), tensor([2], device='cuda:0'), tensor([1], device='cuda:0')]
    [tensor([2], device='cuda:3'), tensor([2], device='cuda:3'), tensor([2], device='cuda:3'), tensor([1], device='cuda:3')]
    error1
    error1
    error2
    error2
    error2
    error2
    epoch=0, iter=[3/360] | entropy: 0.050 energy: 0.246, contrastive: 0.000
    [tensor([2], device='cuda:2'), tensor([2], device='cuda:2'), tensor([2], device='cuda:2'), tensor([2], device='cuda:2')]
    [tensor([2], device='cuda:1'), tensor([2], device='cuda:1'), tensor([2], device='cuda:1'), tensor([2], device='cuda:1')]
    [tensor([2], device='cuda:3'), tensor([2], device='cuda:3'), tensor([2], device='cuda:3'), tensor([2], device='cuda:3')]
    [tensor([2], device='cuda:0'), tensor([2], device='cuda:0'), tensor([2], device='cuda:0'), tensor([2], device='cuda:0')]
    wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
    Traceback (most recent call last):
      File "rpl_corocl.code/main.py", line 196, in <module>
        torch.multiprocessing.spawn(main, nprocs=args.gpus, args=(args.gpus, config, args))
      File "/home/numb7315/anaconda3/envs/balanced/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
        return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
      File "/home/numb7315/anaconda3/envs/balanced/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
        while not context.join():
      File "/home/numb7315/anaconda3/envs/balanced/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
        raise ProcessRaisedException(msg, error_index, failed_process.pid)
    torch.multiprocessing.spawn.ProcessRaisedException:

    -- Process 2 terminated with the following error:
    Traceback (most recent call last):
      File "/home/numb7315/anaconda3/envs/balanced/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
        fn(i, *args)
      File "/home/numb7315/RPL-main/rpl_corocl.code/main.py", line 149, in main
        trainer.train(model=model, epoch=curr_epoch, train_sampler=train_sampler, train_loader=train_loader,
      File "/home/numb7315/RPL-main/rpl_corocl.code/engine/trainer.py", line 54, in train
        non_residual_logits, residual_logits, projects = model(input_data)
      File "/home/numb7315/anaconda3/envs/balanced/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/numb7315/anaconda3/envs/balanced/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 692, in forward
        if self.reducer._rebuild_buckets():
    RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).

Maybe it occurs when the cut-mix scene has no OoD label for some reason. Also, could you try running with batch size 2? This error then occurs frequently.

Maybe there is some problem with this statement:

    if all([i >= 2 for i in global_ood_num]):

Could you check?

I think the key difference between rpl and rpl_con is the projection layer.

In the case of zero contrastive loss, the projection layer is not used, and that may be why the error occurs.

Thanks in advance

yyliu01 commented 1 year ago

Hi @hyunjunChhoi

I tried it again on 2 * V100 (32 GB) yesterday, and everything still works fine (even with batch_size=2). Could you try running your scripts on 2 GPUs and see what's going on? Currently, I don't have 4 GPUs available, and I'll work on this issue (i.e., DDP training with 3 or more GPUs) if other users meet the same problem in the future.

Cheers, Yuyuan

yyliu01 commented 1 year ago

Hi @hyunjunChhoi

I've finally got a chance to access 4 GPUs to test the code. I'll let you know if there is any bug in the program.

yyliu01 commented 1 year ago

I have tested the code with 4 * 3090 GPUs, a small image resolution (i.e., 300x300), and a limited batch size (i.e., batch_size=2), and I could not reproduce any bugs during training. Here is the log: test_4gpus.txt.

@hyunjunChhoi Please double-check our deployment information, and thanks for your interest.

pdejorge commented 11 months ago

Hi,

I actually got the same error too:

    del ood_proj, city_proj
    UnboundLocalError: local variable 'city_proj' referenced before assignment

To my understanding, this is because in this line the code checks that there are at least 2 OoD pixels per image (I think), and if that condition is not satisfied the contrastive loss is not used for that batch. However, in that case the variable city_proj is left undefined. I think that modifying this line to ood_logits, ood_proj, city_proj = None, None, None should solve the problem.
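Concretely, applied to the else branch of the trainer code pasted earlier in this thread, the change would look roughly like this (a sketch against that pasted snippet, not a tested patch):

        else:
            # define all three names so the later `del ood_proj, city_proj`
            # cannot raise UnboundLocalError when the contrastive branch is
            # skipped for this batch
            ood_logits, ood_proj, city_proj = None, None, None
            city_vanilla_logits, city_mix_logits, _ = model(city_mix_imgs)
            contras_loss = torch.tensor([.0], device=city_mix_logits.device)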

Hope that can help future users of the code! A clarification from the authors on why this condition is necessary would also be nice.

Regards, Pau

yyliu01 commented 11 months ago

Hi @pdejorge

If I remember correctly, the condition is necessary for DDP training: the projector module will hang when one GPU has an OoD object and the others do not. It is related to this issue.
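For future readers, the synchronisation behind that condition looks roughly like this (a sketch of what a fetch_global_ood_num-style helper could do; the actual implementation in engine/trainer.py may differ):

    import torch
    import torch.distributed as dist

    def fetch_global_ood_num(local_ood_num: torch.Tensor):
        # Gather every rank's local OoD count so all ranks see the same list
        # and therefore take the same branch: either every rank runs the
        # projector / contrastive loss, or none of them does.
        world_size = dist.get_world_size()
        global_ood_num = [torch.zeros_like(local_ood_num) for _ in range(world_size)]
        dist.all_gather(global_ood_num, local_ood_num)
        return global_ood_num

    # Usage, mirroring the training loop pasted above:
    #   global_ood_num = fetch_global_ood_num(torch.tensor([sum(ood_indices)]).cuda())
    #   if all(i >= 2 for i in global_ood_num):
    #       ...run the contrastive branch on every rank...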

Thanks for the report.

Cheers, Yuyuan