zhoubenjia / MotionRGBD-PAMI

MIT License

Got NCCL error while training #1

Closed: Chzzi closed this issue 1 year ago

Chzzi commented 1 year ago

I'm trying to train an RGB model on the IsoGD dataset with the following command:

python -m torch.distributed.launch --nproc_per_node=4 --master_port=1234 --use_env train.py --config config/IsoGD.yml --data ~/dataset/IsoGD_imgs --splits data/dataset_splits/IsoGD/rgb --save ./train_IsoGD_rgb/ --batch-size 8 --sample-duration 32 --smprob 0.2 --mixup 0.8 --shufflemix 0.3 --epochs 100 --distill 0.2 --type M --intar-fatcer 2

But I got an NCCL error at the end of the first training epoch, with the following log:

RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=145864, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1805538 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:414] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 225747 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 225748 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 225749 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 225746) of binary: /home/21031110062/.conda/envs/videomae/bin/python
Traceback (most recent call last):
  File "/home/21031110062/.conda/envs/videomae/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/21031110062/.conda/envs/videomae/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/21031110062/.conda/envs/videomae/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/21031110062/.conda/envs/videomae/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/21031110062/.conda/envs/videomae/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/21031110062/.conda/envs/videomae/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/21031110062/.conda/envs/videomae/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/21031110062/.conda/envs/videomae/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
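For context, this kind of watchdog timeout usually means one rank never reached a collective operation that the other ranks were waiting on. A minimal, hypothetical sketch (not this repository's code) of the failure mode:

```python
# Hypothetical reproduction of the hang behind the watchdog timeout: if one
# rank has no data for the last iteration and skips a collective, the other
# ranks block in all_gather until NCCL's timeout (30 min by default) aborts them.
import torch
import torch.distributed as dist

def last_iteration(rank: int, world_size: int, has_batch: bool):
    if not has_batch:
        # This rank received no batch (e.g. drop_last=True edge case) and
        # returns without entering the collective the other ranks are in.
        return
    t = torch.zeros(1, device=f"cuda:{rank}")
    gathered = [torch.zeros_like(t) for _ in range(world_size)]
    dist.all_gather(gathered, t)  # hangs if any rank skipped this call
```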

zhoubenjia commented 1 year ago

This problem occurs because, in DDP mode, setting drop_last=True with a per-GPU batch size of 8 can leave one GPU with no data in the last iteration while the other GPUs still have a batch, so the processes cannot synchronize. You can work around it by setting drop_last=False; we will fix this bug as soon as possible. It is also highly recommended to set "--finetune ./Checkpoints/NTU-RGBD-32-DTNV2-TSM/model_best.pth.tar", which will help improve performance.
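A minimal sketch of what the suggested fix looks like in a DDP data loader; the names (build_loader, num_workers value) are illustrative and not the repository's exact code:

```python
# Sketch: build the per-rank DataLoader so every rank sees the same number of
# batches, keeping the final (possibly smaller) batch instead of dropping it.
from torch.utils.data import DataLoader, DistributedSampler

def build_loader(dataset, batch_size=8):
    # DistributedSampler splits the dataset evenly across ranks (requires an
    # initialized process group, as in normal DDP training).
    sampler = DistributedSampler(dataset, shuffle=True, drop_last=False)
    return DataLoader(
        dataset,
        batch_size=batch_size,
        sampler=sampler,
        num_workers=4,
        pin_memory=True,
        drop_last=False,  # the fix suggested above: keep the last partial batch
    )
```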

Chzzi commented 1 year ago

Thanks for your answer; the training process seems to be working now. Thanks a lot!