Hi, I have only done limited experiments with multi-GPU training on a single machine. I would say there is a high chance it stops because one of the processes has crashed for some reason. Could you look at the outputs of the other processes?
@ylabbe, I started a training in debug mode with the following command:
runjob --ngpus=4 --queue=local python -m cosypose.scripts.run_pose_training --config tless-coarse --debug
Then these are the processes that were created.
I checked the output of each one with
tail -f /proc/<pid>/fd/1
but they are all the same, as shown below.
By the way, everything seems fine when I use 1 GPU.
The problem with this multi-GPU configuration seems to be solved by setting the number of workers to zero.
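For clarity, by "number of workers" I mean the PyTorch DataLoader worker processes; I don't remember the exact config attribute in run_pose_training, but the change corresponds to something like this minimal sketch (the dataset here is just a placeholder):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; in the real script this is the cosypose pose training dataset.
dataset = TensorDataset(torch.zeros(8, 3))

# num_workers=0 makes the DataLoader load batches in the main process instead of
# forking worker subprocesses, which is what avoided the multi-GPU hang for me.
loader = DataLoader(dataset, batch_size=2, num_workers=0)

for (batch,) in loader:
    pass  # training step would go here
```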
I started the training of the refiner with the following command:
runjob --ngpus=4 --queue=local python -m cosypose.scripts.run_pose_training --config tless-refiner
However, the process gets stuck at some point, as can be seen in the screenshot below.
What could be the reason that the training does not finish?
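Could it be a synchronization issue between the ranks at the end of training? Just as a guess (I haven't checked whether cosypose already does this), I would expect a standard torch.distributed run to need a teardown along these lines before the processes can exit:

```python
import torch.distributed as dist

def finish_training():
    # All ranks must reach this point before any of them exits; if one rank
    # returns early (e.g. it processed fewer batches), the others block in the
    # next collective call, which looks exactly like a run that never ends.
    dist.barrier()
    # Tear down the NCCL/Gloo process group so every process can exit cleanly.
    dist.destroy_process_group()
```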