zju3dv / mlp_maps

Code for "Representing Volumetric Videos as Dynamic MLP Maps" CVPR 2023

Distributed training #21

Closed ch1998 closed 1 year ago

ch1998 commented 1 year ago

I ran the distributed-training command you provided, `python -m torch.distributed.launch --nproc_per_node=4 train_net.py --config configs/nhr/sport1.py`, but it fails with the following error:

```
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
data/trained_model/nhr/sport1
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 68315 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 68316) of binary: /mnt/data/local-disk2/software/anaconda3/envs/mlp_maps/bin/python
```

Single-GPU training works fine.

Are there any other parameters that need to be set?
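For context, a minimal sketch (not code from this repo; `usable_ordinals` is a hypothetical helper) of why "invalid device ordinal" typically appears with `--nproc_per_node=4`: the launcher spawns one process per rank, each process targets GPU number `local_rank`, and any rank whose ordinal exceeds the number of visible GPUs fails with exactly this error.

```python
# Sketch: torch.distributed.launch spawns nproc_per_node processes, and
# each one addresses GPU number local_rank. If CUDA_VISIBLE_DEVICES
# exposes fewer GPUs than nproc_per_node, the extra ranks request an
# ordinal that does not exist -> "invalid device ordinal".

def usable_ordinals(cuda_visible_devices, physical_gpus):
    """Logical GPU ordinals a process can address.

    cuda_visible_devices: value of the CUDA_VISIBLE_DEVICES env var, or
        None if it is unset.
    physical_gpus: number of GPUs actually installed on the node.
    """
    if cuda_visible_devices is None:
        return list(range(physical_gpus))
    # CUDA re-indexes the listed devices as 0..k-1
    listed = [d for d in cuda_visible_devices.split(",") if d.strip()]
    return list(range(len(listed)))

nproc_per_node = 4
ordinals = usable_ordinals("0,1", physical_gpus=8)  # only 2 GPUs visible
for local_rank in range(nproc_per_node):
    if local_rank not in ordinals:
        print(f"rank {local_rank}: invalid device ordinal")  # ranks 2, 3 crash
```

So the first things to check are how many GPUs the node exposes and whether `CUDA_VISIBLE_DEVICES` is restricting the run to fewer devices than `--nproc_per_node` requests.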

guoyj4 commented 8 months ago

@ch1998 Hi, have you figured out how to solve this problem? I am facing the same error during distributed training. Thanks!