Open adeschemps opened 4 years ago
This code uses 'torch.distributed'. Unfortunately, 'torch.distributed' does not support Windows: https://github.com/pytorch/pytorch/issues/37068
I made some changes so it can be used on Windows: https://github.com/kei97103/CC-FPSE
It works well in my environment, but I'm not sure it works in other environments.
Thanks for your answer. I'll give it a try when I get around to it and post an update on whether it works on my end.
I am using the NVIDIA Docker container for pytorch-1912. I can clone the GitHub repository without any problem, but when I try to run CC-FPSE on my own data (on a 4-GPU instance):
python train.py --name condconv --netG condconv --netD fpse --lambda_feat 20 --dataset_mode custom --label_dir mydata/train_label --image_dir mydata/train_img --label_nc 6 --no_instance --batchSize 1 --niter 100 --niter_decay 100 --use_vae --ngpus_per_node 4
I get the following error:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/uge_mnt/home/adeschem/CC-FPSE/train.py", line 37, in main_worker
    dist.init_process_group(backend='nccl', init_method=opt.dist_url, world_size=world_size, rank=rank)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 397, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 109, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon)
RuntimeError: Network is unreachable
This seems to be related to the torch distributed communication package, even though I am not using the --mpdist option to enable distributed multiprocessing.
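Since the traceback ends in the TCP rendezvous (TCPStore is given the host and port parsed from opt.dist_url), one way to narrow this down might be to check from inside the container whether that host is reachable at all. Below is a minimal standard-library sketch, not part of CC-FPSE. The helper names parse_dist_url and probe are hypothetical, and the example URL and port are made up, since the actual value of opt.dist_url is not shown here.

```python
import socket
from urllib.parse import urlparse


def parse_dist_url(dist_url):
    """Split a tcp://host:port rendezvous URL into (host, port)."""
    parsed = urlparse(dist_url)
    if parsed.scheme != "tcp" or parsed.hostname is None or parsed.port is None:
        raise ValueError(f"expected tcp://host:port, got {dist_url!r}")
    return parsed.hostname, parsed.port


def probe(dist_url, timeout=2.0):
    """Try a plain TCP connection to the rendezvous address.

    Returns "open" if something is listening, "refused" if the host is
    reachable but nothing listens on the port, and "unreachable: ..." for
    any other OS-level failure (no route, DNS failure, timeout).
    """
    host, port = parse_dist_url(dist_url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except OSError as exc:
        return f"unreachable: {exc}"


if __name__ == "__main__":
    # Hypothetical URL; replace with the value train.py passes as init_method.
    print(probe("tcp://127.0.0.1:23456"))
```

If this reports "unreachable" for the address the script uses, the problem is probably in the container's network setup rather than in the training code. A "refused" result before training starts is expected, because it only means that nothing is listening on the port yet.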