EDGSCOUT closed this issue 3 years ago.
Hi, could you check your PyTorch and CUDA versions? I suspect the problem is caused by a library issue.
Yes, I have solved it. It was a version problem, so we can close this issue.
Traceback (most recent call last):
  File "main.py", line 107, in
    main()
  File "main.py", line 27, in main
    torch.distributed.init_process_group(backend='nccl', init_method='env://')
  File "/home/ps/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 423, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/ps/anaconda3/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 179, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
Torch Version: 1.7.0
Torch Version: 1.7.0
Traceback (most recent call last):
  File "main.py", line 107, in
    main()
  File "main.py", line 27, in main
    torch.distributed.init_process_group(backend='nccl', init_method='env://')
  File "/home/ps/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 423, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/ps/anaconda3/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 179, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
KeyboardInterrupt
Traceback (most recent call last):
  File "/home/ps/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "main", mod_spec)
  File "/home/ps/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ps/anaconda3/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in
    main()
  File "/home/ps/anaconda3/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ps/anaconda3/bin/python', '-u', 'main.py', '--local_rank=3', '--config=config/MGMatting-DIM.toml']' returned non-zero exit status 1.
(base) ps@ps:~/Downloads/MGMatting-main/code-base$
Traceback (most recent call last):
  File "main.py", line 107, in
    main()
  File "main.py", line 27, in main
    torch.distributed.init_process_group(backend='nccl', init_method='env://')
  File "/home/ps/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
    barrier()
  File "/home/ps/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
    work = _default_pg.barrier()
RuntimeError: Broken pipe
Traceback (most recent call last):
  File "main.py", line 107, in
    main()
  File "main.py", line 27, in main
    torch.distributed.init_process_group(backend='nccl', init_method='env://')
  File "/home/ps/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
    barrier()
  File "/home/ps/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
    work = _default_pg.barrier()
RuntimeError: Broken pipe
Traceback (most recent call last):
  File "main.py", line 107, in
    main()
  File "main.py", line 27, in main
    torch.distributed.init_process_group(backend='nccl', init_method='env://')
  File "/home/ps/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
    barrier()
  File "/home/ps/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
    work = _default_pg.barrier()
RuntimeError: Broken pipe
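Reading the log, the first failure is `RuntimeError: Address already in use`, raised while `TCPStore` was being created. With `init_method='env://'`, the rank-0 process hosts a TCP store on `MASTER_ADDR:MASTER_PORT`, so this error usually means something else (often an earlier run that was not fully killed) already holds that port. The `Broken pipe` errors from the other ranks are then plausibly a knock-on effect: their `barrier()` inside `init_process_group` loses its connection to that store. The sketch below is a hypothetical pre-flight check, not part of MGMatting or PyTorch: `preflight`, `port_is_free` and `find_free_port` are illustrative names. It verifies the four variables that `env://` reads (`MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `RANK`) and, on rank 0, approximates whether the port can be bound.

```python
import os
import socket

# Variables read by torch.distributed's env:// initialization.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def port_is_free(port, host=""):
    """Return True if a TCP socket can bind to (host, port) right now.

    Conservative: no SO_REUSEADDR, so a port in TIME_WAIT also counts as busy.
    The answer can go stale immediately if another process grabs the port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def find_free_port(host=""):
    """Ask the OS for an unused ephemeral TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the kernel choose
        return sock.getsockname()[1]


def preflight(env=None):
    """Return a list of problems that would break env:// rendezvous (empty if none)."""
    env = os.environ if env is None else env
    problems = [f"{name} is not set" for name in REQUIRED_VARS if name not in env]
    if problems:
        return problems

    try:
        port = int(env["MASTER_PORT"])
        world_size = int(env["WORLD_SIZE"])
        rank = int(env["RANK"])
    except ValueError as exc:
        return [f"non-integer rendezvous variable: {exc}"]

    if not 0 <= rank < world_size:
        problems.append(f"RANK={rank} is outside [0, WORLD_SIZE={world_size})")
    if not 0 < port < 65536:
        problems.append(f"MASTER_PORT={port} is not a valid TCP port")
    # Only rank 0 binds MASTER_PORT (it hosts the store), so only rank 0 checks it.
    elif rank == 0 and not port_is_free(port):
        problems.append(
            f"MASTER_PORT={port} is already in use; try e.g. {find_free_port()}"
        )
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("preflight:", problem)
```

If rank 0 reports a busy port, checking that no earlier training processes are still alive and relaunching with `torch.distributed.launch --master_port=<free port>` (the launcher defaults to 29500) is usually the first thing to try.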