wenet-e2e / wenet

Production First and Production Ready End-to-End Speech Recognition Toolkit
https://wenet-e2e.github.io/wenet/
Apache License 2.0

Watchdog caught collective operation timeout #2511

Closed. srdfjy closed this issue 1 month ago.

srdfjy commented 2 months ago

Hi

Training with 4 machines (8 GPUs in total, 2 per machine) works without problems, as does training with fewer GPUs. However, when I train with 4 machines (16 GPUs in total, 4 per machine), the following error occurs.

Version and model: v3.0.1, Conformer U2++

2024/04/30 23:29:35 job-2046325-339461780-0:305:305 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
2024/04/30 23:29:35 job-2046325-339461780-0:305:305 [0] NCCL INFO Bootstrap : Using eth0:10.241.101.227<0>
2024/04/30 23:29:35 job-2046325-339461780-0:306:306 [1] NCCL INFO cudaDriverVersion 11070
2024/04/30 23:29:35 job-2046325-339461780-0:305:305 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
2024/04/30 23:29:35 job-2046325-339461780-0:306:306 [1] NCCL INFO Bootstrap : Using eth0:10.241.101.227<0>
2024/04/30 23:29:35 job-2046325-339461780-0:306:306 [1] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
2024/04/30 23:29:35 job-2046325-339461780-0:306:306 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO Failed to open libibverbs.so[.1]
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO Using network Socket
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO NET/Socket : Using [0]eth0:10.241.101.227<0>
2024/04/30 23:29:35 job-2046325-339461780-0:305:305 [0] NCCL INFO cudaDriverVersion 11070
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Failed to open libibverbs.so[.1]
2024/04/30 23:29:35 NCCL version 2.18.6+cuda11.8
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO NET/Socket : Using [0]eth0:10.241.101.227<0>
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Using network Socket
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO comm 0x885c980 rank 1 nranks 16 cudaDev 1 nvmlDev 1 busId b1000 commId 0x108ef55e18c5d28a - Init START
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO comm 0x6359eb40 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 3b000 commId 0x108ef55e18c5d28a - Init START
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO Setting affinity for GPU 1 to aaaa,aaaaaaaa
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Channel 00/02 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO P2P Chunksize set to 131072
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Channel 01/02 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Channel 00/0 : 15[1] -> 0[0] [receive] via NET/Socket/0
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Channel 01/0 : 15[1] -> 0[0] [receive] via NET/Socket/0
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Setting affinity for GPU 0 to 5555,55555555
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Trees [0] 1/8/-1->0->-1 [1] 1/-1/-1->0->2
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO P2P Chunksize set to 131072
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[0] [send] via NET/Socket/0
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[0] [send] via NET/Socket/0
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO Connected all rings
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Channel 01/0 : 0[0] -> 2[0] [send] via NET/Socket/0
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Connected all rings
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Channel 00/0 : 8[0] -> 0[0] [receive] via NET/Socket/0
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Channel 01/0 : 2[0] -> 0[0] [receive] via NET/Socket/0
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Channel 00/0 : 0[0] -> 8[0] [send] via NET/Socket/0
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO Connected all trees
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO Connected all trees
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
2024/04/30 23:29:35 job-2046325-339461780-0:305:325 [0] NCCL INFO comm 0x6359eb40 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 3b000 commId 0x108ef55e18c5d28a - Init COMPLETE
2024/04/30 23:29:35 job-2046325-339461780-0:306:324 [1] NCCL INFO comm 0x885c980 rank 1 nranks 16 cudaDev 1 nvmlDev 1 busId b1000 commId 0x108ef55e18c5d28a - Init COMPLETE
2024/04/30 23:29:36 2024-04-30 23:29:36,090 INFO Checkpoint: save to checkpoint /data/exp/init.pt
2024/04/30 23:29:36 2024-04-30 23:29:36,107 INFO Epoch 0 TRAIN info lr 8.333333333333334e-09 rank 1
2024/04/30 23:29:38 2024-04-30 23:29:38,370 INFO Epoch 0 TRAIN info lr 8.333333333333334e-09 rank 0
2024/04/30 23:29:38 2024-04-30 23:29:38,422 INFO using accumulate grad, new batch size is 16 times larger than before
2024/04/30 23:29:38 2024-04-30 23:29:38,422 INFO using accumulate grad, new batch size is 16 times larger than before
2024/05/01 00:00:56 [E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800257 milliseconds before timing out.
2024/05/01 00:00:57 job-2046325-339461780-0:306:327 [1] NCCL INFO [Service thread] Connection closed by localRank 1
2024/05/01 00:00:57 job-2046325-339461780-0:306:318 [0] NCCL INFO comm 0x885c980 rank 1 nranks 16 cudaDev 1 busId b1000 - Abort COMPLETE
2024/05/01 00:00:57 [E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
2024/05/01 00:00:57 [E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800257 milliseconds before timing out.
2024/05/01 00:00:57 [E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
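
Editor's note: the Timeout(ms)=1800000 in the log is the 30-minute collective timeout of the NCCL process group, which the ProcessGroupNCCL watchdog enforces. A minimal, hypothetical sketch of where that value is configured when initializing torch.distributed (illustrative only, not wenet's actual training entry point; the timeout_minutes helper argument is made up for this example):

    # Illustrative sketch only, not wenet's training code.
    import datetime

    import torch.distributed as dist

    def init_distributed(timeout_minutes: int = 30) -> None:
        # A collective (e.g. the ALLREDUCE in the log above) that does not
        # finish within `timeout` trips the watchdog and aborts the job.
        dist.init_process_group(
            backend="nccl",
            timeout=datetime.timedelta(minutes=timeout_minutes),
        )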

srdfjy commented 1 month ago

@xingchensong

I have temporarily worked around this issue by changing num_workers=4 and prefetch=250 to num_workers=2 and prefetch=125. However, I'm not sure why a higher num_workers leads to this error. It seems the value of num_workers needs to be matched to the number of GPUs.
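
Editor's note: a minimal sketch of how these two knobs typically behave, assuming num_workers and prefetch are forwarded to a standard PyTorch DataLoader as num_workers and prefetch_factor (an assumption; the ToyDataset, batch size, and build_loader helper below are hypothetical, for illustration only). Each DDP rank builds its own DataLoader, so a node with 4 GPUs and num_workers=4 runs 16 loader worker processes, each buffering prefetch_factor batches in host memory.

    # A minimal sketch, not wenet's actual dataloader code.
    from torch.utils.data import DataLoader, Dataset

    class ToyDataset(Dataset):
        """Stand-in dataset, for illustration only."""

        def __len__(self) -> int:
            return 1000

        def __getitem__(self, idx: int) -> int:
            return idx

    def build_loader(num_workers: int, prefetch: int) -> DataLoader:
        # num_workers=2, prefetch=125 is the reduced setting that trained
        # successfully above; num_workers=4, prefetch=250 buffers 4x as many
        # batches per rank (4 x 250 vs 2 x 125).
        return DataLoader(
            ToyDataset(),
            batch_size=16,             # hypothetical batch size
            num_workers=num_workers,
            prefetch_factor=prefetch,  # requires num_workers > 0
            pin_memory=True,
        )

    loader = build_loader(num_workers=2, prefetch=125)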

Mddct commented 1 month ago

Maybe some OOM occurs during training. Make sure that:

 num_workers * gpus <= cpu cores

srdfjy commented 1 month ago

Maybe some OOM occurs during training. Make sure that:

 num_workers * gpus <= cpu cores

I am using 4 machines in total, each equipped with 4 V100 GPUs (16 GB memory each) and 100 dedicated CPU cores.
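
Editor's note: plugging the reported numbers into the rule above, as a quick arithmetic check (the helper function below is hypothetical and only spells out the inequality):

    # Quick check of "num_workers * gpus <= cpu cores" per machine.
    def worker_budget_ok(num_workers: int, gpus_per_node: int, cpu_cores: int) -> bool:
        return num_workers * gpus_per_node <= cpu_cores

    # Original setting: 4 dataloader workers x 4 GPUs = 16 worker processes,
    # against 100 dedicated CPU cores per machine.
    print(worker_budget_ok(num_workers=4, gpus_per_node=4, cpu_cores=100))  # True
    # Reduced setting: 2 x 4 = 8 worker processes.
    print(worker_budget_ok(num_workers=2, gpus_per_node=4, cpu_cores=100))  # True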