weiyithu / SurroundDepth

[CoRL 2022] SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation
MIT License

A question about distributed training on DDAD dataset #11

Open myc634 opened 1 year ago

myc634 commented 1 year ago

Hello! I am following your work and trying to reproduce it, but I ran into the errors below when using the command python -m torch.distributed.launch --nproc_per_node 8 run.py --model_name ddad --config configs/ddad.txt for distributed training on the DDAD dataset.

[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1806986 milliseconds before timing out.

[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

After training for a while, the process is automatically shut down because of this timeout. Are there any details or training settings that I have missed? Or does the torch version matter? Thanks!
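For reference, the 1800000 ms in the log is the default 30-minute NCCL collective timeout. Below is a minimal sketch of how that timeout could be raised, assuming the training script calls torch.distributed.init_process_group itself; the code is illustrative and not taken from this repo:

```python
import datetime
import os

import torch
import torch.distributed as dist

# Hypothetical init code; SurroundDepth's run.py may differ.
# The default collective timeout is 30 minutes (1800000 ms, the value
# in the error above); a larger timedelta gives slow all_gather
# operations more headroom before the watchdog kills the process.
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),
)

# torch.distributed.launch provides the local rank via the environment
# (or a --local_rank argument on older torch versions).
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)
```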

weiyithu commented 1 year ago

Is the GPU out of memory?

myc634 commented 1 year ago

Maybe not, I guess. We are currently training on A10 GPUs, and yesterday the training ran fine without changing any settings. By the way, are you using 6 RTX 3090s for training? I found the line self.opt.batch_size = self.opt.batch_size // 6 and am wondering what it is for.

weiyithu commented 1 year ago

Sorry, it is a little bit confusing. This line reflects the fact that we reshape one batch of data from (6, 3, H, W) to (1, 6, 3, H, W) in the training step, since one frame has 6 surrounding views.
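A minimal sketch of that reshape, with hypothetical tensor names and an assumed input resolution (not taken from the repo):

```python
import torch

num_cams = 6        # DDAD provides 6 surrounding cameras per frame
frames_per_gpu = 1  # batch_size // 6 when batch_size = 6
H, W = 384, 640     # assumed resolution, for illustration only

# The dataloader stacks the 6 views along the batch dimension:
images = torch.randn(frames_per_gpu * num_cams, 3, H, W)   # (6, 3, H, W)

# In the training step the views are regrouped per frame, so the
# cross-view modules can entangle the surrounding cameras:
images = images.view(frames_per_gpu, num_cams, 3, H, W)    # (1, 6, 3, H, W)
print(images.shape)  # torch.Size([1, 6, 3, 384, 640])
```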

myc634 commented 1 year ago

Thanks for the explanation! This is great, solid work! One last question: how long did training take on your machine for the scale-aware model with SfM pretraining? The program estimates it will take 63 hours.

weiyithu commented 1 year ago

I remember that for DDAD, scale-aware training takes about 1.5 days on 8 RTX 3090s.

myc634 commented 1 year ago

Thank you very much!

weiyithu commented 1 year ago

You're welcome.

myc634 commented 1 year ago

I noticed that the paper mentions your FSM results differ from those in the original FSM paper. Have you tried to reproduce the original FSM results yourself?