zju3dv / mlp_maps

Code for "Representing Volumetric Videos as Dynamic MLP Maps" CVPR 2023

Distributed training #21

Closed ch1998 closed 1 year ago

ch1998 commented 1 year ago

I ran the distributed-training command you provided, `python -m torch.distributed.launch --nproc_per_node=4 train_net.py --config configs/nhr/sport1.py`, but it fails with the following error:

```
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
data/trained_model/nhr/sport1
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 68315 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 68316) of binary: /mnt/data/local-disk2/software/anaconda3/envs/mlp_maps/bin/python
```

Single-GPU training works fine.

Are there any other parameters that need to be set?
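For context, a minimal sketch (not code from this repo; `usable_ordinals` is a hypothetical helper) of why "invalid device ordinal" typically appears with `--nproc_per_node=4`: the launcher spawns one process per rank, each process targets GPU number `local_rank`, and any rank whose ordinal exceeds the number of visible GPUs fails with exactly this error.

```python
# Sketch: torch.distributed.launch spawns nproc_per_node processes, and
# each one addresses GPU number local_rank. If CUDA_VISIBLE_DEVICES
# exposes fewer GPUs than nproc_per_node, the extra ranks request an
# ordinal that does not exist -> "invalid device ordinal".

def usable_ordinals(cuda_visible_devices, physical_gpus):
    """Logical GPU ordinals a process can address.

    cuda_visible_devices: value of the CUDA_VISIBLE_DEVICES env var, or
        None if it is unset.
    physical_gpus: number of GPUs actually installed on the node.
    """
    if cuda_visible_devices is None:
        return list(range(physical_gpus))
    # CUDA re-indexes the listed devices as 0..k-1
    listed = [d for d in cuda_visible_devices.split(",") if d.strip()]
    return list(range(len(listed)))

nproc_per_node = 4
ordinals = usable_ordinals("0,1", physical_gpus=8)  # only 2 GPUs visible
for local_rank in range(nproc_per_node):
    if local_rank not in ordinals:
        print(f"rank {local_rank}: invalid device ordinal")  # ranks 2, 3 crash
```

So the first things to check are how many GPUs the node exposes and whether `CUDA_VISIBLE_DEVICES` is restricting the run to fewer devices than `--nproc_per_node` requests.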

guoyj4 commented 8 months ago

@ch1998 Hi, have you figured out how to solve this problem? I am facing the same error during distributed training. Thanks!