myc634 opened 1 year ago
Is the GPU out of memory?
Maybe not, I guess. Currently, we are training on A10 GPUs. Yesterday the training process went on well without changing any settings. BTW, are you using 6× RTX 3090 for training? I found the code `self.opt.batch_size = self.opt.batch_size // 6` and am wondering what this code is for.
I'm sorry that it is a little bit confusing. This code means that we reshape one batch of data from (6, 3, H, W) to (1, 6, 3, H, W) in the training step, since one frame has 6 surrounding views.
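The reshape described above can be sketched as follows (a minimal illustration only; `num_views`, `H`, and `W` are assumed example values, not taken from the repo):

```python
import torch

# The dataloader yields the 6 surrounding-view images of one frame
# as a batch of shape (6, 3, H, W); the model then groups them into
# a single sample of shape (1, 6, 3, H, W).
num_views = 6
H, W = 384, 640
batch = torch.randn(num_views, 3, H, W)    # 6 camera views of one frame
frame = batch.view(1, num_views, 3, H, W)  # group views into one sample
print(frame.shape)  # torch.Size([1, 6, 3, 384, 640])
```

This is why `batch_size` is divided by 6: the configured batch size counts individual images, while the model consumes whole 6-view frames.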
Thanks for your explanation! You did great and solid work! I have one last question: how long did training the scale-aware model (with SfM pretraining) take on your machine? The program estimates that training will take 63 hours.
I remember that for DDAD, scale-aware training takes about 1.5 days on 8 RTX 3090 GPUs.
Thank you very much!
You're welcome.
I noticed that you mentioned in this paper that the results of FSM differ from those in the original paper. Have you tried to reproduce the results of FSM before?
Hello! I am following your work and doing a reproduction, but I ran into the problem below while using the command
python -m torch.distributed.launch --nproc_per_node 8 run.py --model_name ddad --config configs/ddad.txt
for distributed training on the DDAD dataset:

[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1806986 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
After training for a while, the process is automatically shut down for running overtime. Are there any details or training settings that I have missed? Or does the torch version matter? Thanks!
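The `Timeout(ms)=1800000` in the log is the default 30-minute NCCL collective timeout. One possible workaround (an assumption on my side, not an official fix from the repo) is to pass a larger `timeout` to `torch.distributed.init_process_group`:

```python
import datetime
import os
import torch.distributed as dist

# Single-process setup so the sketch runs anywhere; in the real launch
# these are provided by torch.distributed.launch / torchrun.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")

dist.init_process_group(
    backend="gloo",  # use "nccl" on the GPU machine; gloo shown so this runs on CPU
    rank=0,
    world_size=1,
    # Raise the collective timeout above the 30-minute default so the
    # watchdog does not kill a slow all_gather.
    timeout=datetime.timedelta(hours=2),
)
print(dist.is_initialized())  # True
dist.destroy_process_group()
```

If a rank is genuinely hanging (e.g. uneven data sharding causing one rank to skip a collective), raising the timeout only delays the failure, so it is also worth checking that all ranks execute the same number of steps.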