xg-chu / CrowdDet

[CVPR 2020] Detection in Crowded Scenes: One Proposal, Multiple Predictions

NCCL error when running train.py #86

Closed: yrhyrhyrh closed this issue 8 months ago

yrhyrhyrh commented 8 months ago

I followed the instructions in the README, and at step 3, when trying to run train.py, I get an NCCL error:

Init multi-processing training...
d13186ffee3a:57:57 [0] NCCL INFO Bootstrap : Using [0]eth0:172.17.0.2<0>
d13186ffee3a:57:57 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

d13186ffee3a:57:57 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
d13186ffee3a:57:57 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
NCCL version 2.4.8+cuda10.1

d13186ffee3a:57:75 [0] misc/topo.cc:22 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:02/../../0000:02:00.0
d13186ffee3a:57:75 [0] NCCL INFO init.cc:876 -> 2
d13186ffee3a:57:75 [0] NCCL INFO init.cc:909 -> 2
d13186ffee3a:57:75 [0] NCCL INFO init.cc:947 -> 2
d13186ffee3a:57:75 [0] NCCL INFO misc/group.cc:69 -> 2 [Async thread]
Traceback (most recent call last):
  File "train.py", line 174, in <module>
    run_train()
  File "train.py", line 171, in run_train
    multi_train(args, config, Network)
  File "train.py", line 155, in multi_train
    torch.multiprocessing.spawn(train_worker, nprocs=num_gpus, args=(train_config, network, config))
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/crowddet/tools/train.py", line 100, in train_worker
    net = torch.nn.parallel.DistributedDataParallel(net, device_ids=[rank], broadcast_buffers=False)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 285, in __init__
    self.broadcast_bucket_size)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 483, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1587428398394/work/torch/lib/c10d/ProcessGroupNCCL.cpp:514, unhandled system error, NCCL version 2.4.8
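
For reference, the failing step appears to boil down to the DistributedDataParallel wrap in train.py. Below is a minimal standalone sketch that should hit the same code path, not the repo's actual code; the MASTER_ADDR/MASTER_PORT values and the use of all visible GPUs are my own assumptions:

# Minimal NCCL / DistributedDataParallel sanity check (sketch, not the repo's code).
# MASTER_ADDR/MASTER_PORT and the world size are assumptions for illustration.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # Wrapping any small module in DDP triggers the same buffer broadcast that fails above.
    net = nn.Linear(8, 8).cuda(rank)
    net = nn.parallel.DistributedDataParallel(net, device_ids=[rank], broadcast_buffers=False)
    print("rank {}: DDP init OK".format(rank))
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, nprocs=world_size, args=(world_size,))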

The only changes I made were adding

COPY ./cuda-keyring_1.0-1_all.deb cuda-keyring_1.0-1_all.deb
RUN rm /etc/apt/sources.list.d/cuda.list \
    && rm /etc/apt/sources.list.d/nvidia-ml.list \
    && dpkg -i cuda-keyring_1.0-1_all.deb

to the Dockerfile before the RUN apt-get update line, and adding the resnet50_fbaug.pth weights file.
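
For completeness, the CUDA and NCCL versions that PyTorch sees inside the container can be checked with a short snippet like this (just a sanity-check sketch using standard torch APIs; I have not included its output here):

# Quick check of the CUDA / NCCL setup inside the container (sketch).
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
print("CUDA (build):", torch.version.cuda)
print("NCCL:", torch.cuda.nccl.version())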

Has anyone faced the same issue, or does anyone have a potential solution for this?