I followed the instructions in the README, and on step 3, when trying to run train.py, I get an NCCL error:
Init multi-processing training...
d13186ffee3a:57:57 [0] NCCL INFO Bootstrap : Using [0]eth0:172.17.0.2<0>
d13186ffee3a:57:57 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
d13186ffee3a:57:57 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
d13186ffee3a:57:57 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
NCCL version 2.4.8+cuda10.1
d13186ffee3a:57:75 [0] misc/topo.cc:22 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:02/../../0000:02:00.0
d13186ffee3a:57:75 [0] NCCL INFO init.cc:876 -> 2
d13186ffee3a:57:75 [0] NCCL INFO init.cc:909 -> 2
d13186ffee3a:57:75 [0] NCCL INFO init.cc:947 -> 2
d13186ffee3a:57:75 [0] NCCL INFO misc/group.cc:69 -> 2 [Async thread]
Traceback (most recent call last):
File "train.py", line 174, in <module>
run_train()
File "train.py", line 171, in run_train
multi_train(args, config, Network)
File "train.py", line 155, in multi_train
torch.multiprocessing.spawn(train_worker, nprocs=num_gpus, args=(train_config, network, config))
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/crowddet/tools/train.py", line 100, in train_worker
net = torch.nn.parallel.DistributedDataParallel(net, device_ids=[rank], broadcast_buffers=False)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 285, in __init__
self.broadcast_bucket_size)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 483, in _distributed_broadcast_coalesced
dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1587428398394/work/torch/lib/c10d/ProcessGroupNCCL.cpp:514, unhandled system error, NCCL version 2.4.8
The only changes I made were: adding

to the Dockerfile before
RUN apt-get update
and adding in resnet50_fbaug.pth. Has anyone faced the same issue, or does anyone have a potential solution?
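In case it helps with diagnosis: since the log above shows NCCL failing to open libibverbs.so and falling back to sockets, one thing worth trying before the training launch is NCCL's own debug/fallback environment variables. These are documented NCCL settings, not anything from this repo's README, so this is just a sketch of how one might gather more information:

```shell
# Standard NCCL debugging knobs (documented NCCL env vars, not repo-specific):
export NCCL_DEBUG=INFO          # print verbose NCCL initialization logs
export NCCL_IB_DISABLE=1        # skip InfiniBand entirely (libibverbs.so was missing above)
export NCCL_SOCKET_IFNAME=eth0  # pin NCCL to the socket interface it already detected

# Confirm the variables are set before re-running train.py:
echo "NCCL_DEBUG=$NCCL_DEBUG NCCL_IB_DISABLE=$NCCL_IB_DISABLE NCCL_SOCKET_IFNAME=$NCCL_SOCKET_IFNAME"
```

With NCCL_DEBUG=INFO set, the NCCL log lines should point more precisely at which step of init.cc is failing.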