weiyithu / SurroundDepth

[CoRL 2022] SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation
MIT License

A question about distributed training on DDAD dataset #11

Open myc634 opened 1 year ago

myc634 commented 1 year ago

Hello! I am following your work and trying to reproduce it, but I ran into the errors below when using the command python -m torch.distributed.launch --nproc_per_node 8 run.py --model_name ddad --config configs/ddad.txt for distributed training on the DDAD dataset.

[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1806986 milliseconds before timing out.

[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

After training for a while, the process is automatically shut down because of this timeout. Are there any details or training settings that I have missed? Or does the torch version matter? Thanks!
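For reference, the 1800000 ms in the log is the default 30-minute NCCL collective timeout. Below is a minimal sketch of how that timeout could be raised, assuming the training script calls torch.distributed.init_process_group itself; the code is illustrative and not taken from this repo:

```python
import datetime
import os

import torch
import torch.distributed as dist

# Hypothetical init code; SurroundDepth's run.py may differ.
# The default collective timeout is 30 minutes (1800000 ms, the value
# in the error above); a larger timedelta gives slow all_gather
# operations more headroom before the watchdog kills the process.
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),
)

# torch.distributed.launch provides the local rank via the environment
# (or a --local_rank argument on older torch versions).
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)
```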

weiyithu commented 1 year ago

Is the GPU out of memory?

myc634 commented 1 year ago

Maybe not, I guess. We are currently training on A10 GPUs, and yesterday the training ran fine without changing any settings. By the way, are you using 6 RTX 3090s for training? I found the line self.opt.batch_size = self.opt.batch_size // 6 and am wondering what it is for.

weiyithu commented 1 year ago

Sorry, it is a little bit confusing. This line reflects the fact that we reshape one batch of data from (6, 3, H, W) to (1, 6, 3, H, W) in the training step, since one frame has 6 surrounding views.
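A minimal sketch of that reshape, with hypothetical tensor names and an assumed input resolution (not taken from the repo):

```python
import torch

num_cams = 6        # DDAD provides 6 surrounding cameras per frame
frames_per_gpu = 1  # batch_size // 6 when batch_size = 6
H, W = 384, 640     # assumed resolution, for illustration only

# The dataloader stacks the 6 views along the batch dimension:
images = torch.randn(frames_per_gpu * num_cams, 3, H, W)   # (6, 3, H, W)

# In the training step the views are regrouped per frame, so the
# cross-view modules can entangle the surrounding cameras:
images = images.view(frames_per_gpu, num_cams, 3, H, W)    # (1, 6, 3, H, W)
print(images.shape)  # torch.Size([1, 6, 3, 384, 640])
```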

myc634 commented 1 year ago

Thanks for the explanation! This is great, solid work! One last question: how long did training take on your machine for the scale-aware model with SfM pretraining? The program estimates it will take 63 hours.

weiyithu commented 1 year ago

I remember that for DDAD, scale-aware training takes about 1.5 days on 8 RTX 3090s.

myc634 commented 1 year ago

Thank you very much!

weiyithu commented 1 year ago

You're welcome.

myc634 commented 1 year ago

I noticed that the paper mentions your FSM results differ from those in the original FSM paper. Have you tried to reproduce the original FSM results yourself?