ylabbe / cosypose

Code for "CosyPose: Consistent multi-view multi-object 6D pose estimation", ECCV 2020.
MIT License

Training of tless-refiner #32

Closed. azad96 closed this issue 3 years ago.

azad96 commented 3 years ago

I started training the refiner with the following command:

runjob --ngpus=4 --queue=local python -m cosypose.scripts.run_pose_training --config tless-refiner

However, the process gets stuck at some point, as can be seen in the screenshot below.

[Screenshot from 2021-01-04 12-54-58]

What could be the reason that the training never finishes?

ylabbe commented 3 years ago

Hi, I have only done limited experiments with multi-GPU training on a single machine. I would say there is a high chance that it stops because one of the processes has crashed for some reason. Could you look at the outputs of the other processes?
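To illustrate the suggestion above, here is a minimal sketch of checking whether one process in a group has died. It does not use runjob or CosyPose; `run_ranks` and `crashed_ranks` are hypothetical helpers, and each "training process" is a stand-in Python subprocess that exits with code 3 when it is the designated failing rank.

```python
import subprocess
import sys


def run_ranks(n_procs, failing_rank):
    """Start n_procs dummy processes and return {rank: exit_code}.

    Hypothetical stand-in for per-GPU training processes: the process
    whose rank equals failing_rank exits with code 3, the others with 0.
    """
    procs = []
    for rank in range(n_procs):
        code = f"import sys; sys.exit(3 if {rank} == {failing_rank} else 0)"
        procs.append(subprocess.Popen([sys.executable, "-c", code]))
    # wait() blocks until the process ends and returns its exit code.
    return {rank: proc.wait() for rank, proc in enumerate(procs)}


def crashed_ranks(exit_codes):
    """Return the ranks whose exit code is non-zero."""
    return [rank for rank, code in exit_codes.items() if code != 0]


if __name__ == "__main__":
    codes = run_ranks(4, failing_rank=1)
    print("crashed ranks:", crashed_ranks(codes))
```

In a real multi-GPU job the surviving processes typically block while waiting for the dead one, which is why the run looks stuck rather than failing outright.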

azad96 commented 3 years ago

@ylabbe, I started a training run in debug mode with the following command:

runjob --ngpus=4 --queue=local python -m cosypose.scripts.run_pose_training --config tless-coarse --debug

These are the processes that were created:

[Screenshot from 2021-01-05 10-05-58]

I checked the output of each one with tail -f /proc/<pid>/fd/1, but they all show the same output, which is below:

[Screenshot from 2021-01-05 10-06-10]

By the way, everything seems okay when I use 1 GPU.

azad96 commented 3 years ago

The problem with this multi-GPU configuration seems to be solved by setting the number of workers to zero.
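For anyone hitting the same hang: "number of workers" here presumably means the data-loading worker processes, which in PyTorch is the `num_workers` argument of `torch.utils.data.DataLoader`. The thread does not show where CosyPose sets this value, so the sketch below is generic PyTorch usage, and `dataloader_kwargs` and `make_loader` are hypothetical helper names, not part of the repository.

```python
def dataloader_kwargs(batch_size, n_workers=0):
    """Build keyword arguments for torch.utils.data.DataLoader.

    Hypothetical helper. With num_workers=0, batches are loaded in the
    main process and no worker subprocesses are started.
    """
    return {"batch_size": batch_size, "num_workers": n_workers}


def make_loader(dataset, batch_size, n_workers=0):
    """Create a DataLoader that defaults to loading in the main process."""
    # Imported lazily so dataloader_kwargs can be used without PyTorch.
    from torch.utils.data import DataLoader

    return DataLoader(dataset, **dataloader_kwargs(batch_size, n_workers))
```

Loading in the main process is usually slower, so this is better treated as a workaround or a debugging step than as a tuned setting.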