zhang-tao-whu / e2ec

E2EC: An End-to-End Contour-based Method for High-Quality High-Speed Instance Segmentation

Multi-GPU training code gets stuck after a few iterations #13

Closed SSSHZ closed 2 years ago

SSSHZ commented 2 years ago

Hi, I tried the multi-GPU training code, but the program always got stuck after a few iterations.

Environment:

Reproduce the bug: CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 train_net_ddp.py --config_file coco --gpus 4

Output:

```
  File "/home/a/anaconda3/envs/e2ec/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/aa/anaconda3/envs/e2ec/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/a/anaconda3/envs/e2ec/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/a/anaconda3/envs/e2ec/lib/python3.7/site-packages/torch/distributed/launch.py", line 253, in main
    process.wait()
  File "/home/a/anaconda3/envs/e2ec/lib/python3.7/subprocess.py", line 1019, in wait
    return self._wait(timeout=timeout)
  File "/home/a/anaconda3/envs/e2ec/lib/python3.7/subprocess.py", line 1653, in _wait
    (pid, sts) = self._try_wait(0)
  File "/home/a/anaconda3/envs/e2ec/lib/python3.7/subprocess.py", line 1611, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt
```
zhang-tao-whu commented 2 years ago

Hello, I think this bug is caused by allocating too many workers for the dataloader. When training with multiple GPUs, --bs is the batch size of a single GPU, so the actual batch size of your command above is 24*4=96. For convenience, I directly set the number of workers equal to the batch size, and 96 dataloader workers is probably too many.

You can try a smaller batch size, such as --bs 6 when using 4 GPUs, or you can reduce num_workers in the function make_ddp_train_loader in dataset/data_loader.py, roughly as sketched below.
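A minimal sketch of what decoupling the worker count from the batch size could look like, assuming make_ddp_train_loader builds a standard torch.utils.data.DataLoader; the real function in dataset/data_loader.py may take different arguments, and the max_workers cap here is only an illustrative value:

```python
# Hypothetical sketch: cap dataloader workers instead of tying them to batch size.
# Only the DataLoader / DistributedSampler calls are standard PyTorch; the
# function signature is an assumption, not the repo's exact code.
import torch.utils.data


def make_ddp_train_loader(dataset, batch_size, max_workers=4):
    # Each DDP process gets a disjoint shard of the dataset.
    sampler = torch.utils.data.distributed.DistributedSampler(dataset)
    # Previously: num_workers = batch_size (96 in total across 4 GPUs).
    num_workers = min(batch_size, max_workers)
    return torch.utils.data.DataLoader(
        dataset,
        batch_size=batch_size,
        sampler=sampler,
        num_workers=num_workers,
        pin_memory=True,
    )
```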

SSSHZ commented 2 years ago

After setting train.batch_size = 6 in configs/coco.py, I tried num_workers=4, 2, and 0 for make_ddp_train_loader in dataset/data_loader.py, and the same issue still happened.

Perhaps this bug is not caused by the dataloader workers.

SSSHZ commented 2 years ago

The problem might be caused by the combination of PyTorch 1.7.1, CUDA 10.2, and NCCL 2.7.8. The easiest solution for me was to switch to PyTorch 1.7.0.
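For anyone hitting the same hang, a quick way to confirm which versions are actually in play is the standard PyTorch introspection API (not part of this repo); the comments show example outputs, not guaranteed values:

```python
# Print the PyTorch / CUDA / NCCL versions before launching DDP training.
import torch

print("PyTorch:", torch.__version__)           # e.g. 1.7.1
print("CUDA:", torch.version.cuda)             # toolkit PyTorch was built with, e.g. 10.2
print("NCCL:", torch.cuda.nccl.version())      # bundled NCCL, e.g. 2708 for 2.7.8
print("GPUs visible:", torch.cuda.device_count())
```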