wenet-e2e / wenet

Production First and Production Ready End-to-End Speech Recognition Toolkit
https://wenet-e2e.github.io/wenet/
Apache License 2.0

Rank 1 failed to pass monitoredBarrier in 1200000 ms #2552

Closed: srdfjy closed this issue 3 weeks ago

srdfjy commented 3 weeks ago

Hi

When training with DeepSpeed, an error occurred: "Rank 1 failed to pass monitoredBarrier in 30000 ms." After I raised the timeout to 1200 s (the 1200000 ms in the title), the same timeout error was reported. What could be causing this?

I am using NCCL, but it's reporting a GLOO timeout here.

err:

```
2024/06/05 12:08:28 [/opt/conda/conda-bld/pytorch_1702400366987/work/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 30000ms for recv operation to complete
```
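(For context: `torch.distributed.monitored_barrier` is only implemented for the gloo backend, so a job that uses NCCL for gradient communication still needs a separate gloo group for this barrier. That is why a GLOO timeout shows up in an NCCL run. A minimal sketch of the pattern, not wenet's verbatim code:)

```python
# Minimal sketch (not wenet's exact code): monitored_barrier only
# supports the gloo backend, so NCCL jobs create a side gloo group.
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(backend="nccl")  # gradients go over NCCL

# The join barrier needs its own gloo group, with its own timeout.
group_join = dist.new_group(backend="gloo",
                            timeout=timedelta(seconds=30))

# Raises RuntimeError on the waiting rank if another rank never arrives
# within the timeout: the "failed to pass monitoredBarrier" error above.
dist.monitored_barrier(group=group_join, wait_all_ranks=True)
```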

cfg:

num_workers=2, prefetch=100, data in shard format, read over HTTP

xingchensong commented 3 weeks ago

HTTP might not be working correctly; you should first debug the HTTP connection.
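(One step beyond a plain wget is to stream the shard and untar it in-process, the way a shard-style dataloader would. A hedged sketch; the URL is a placeholder and `check_shard` is a hypothetical helper, not part of wenet:)

```python
# Hypothetical helper to verify a tar shard can be streamed over HTTP
# and unpacked, mimicking what a shard-based dataloader does.
import io
import tarfile
import urllib.request

def check_shard(url: str) -> None:
    with urllib.request.urlopen(url, timeout=30) as resp:
        data = resp.read()
    with tarfile.open(fileobj=io.BytesIO(data), mode="r") as tar:
        members = tar.getmembers()
    print(f"{url}: {len(data)} bytes, {len(members)} entries")

# Placeholder URL; substitute a real shard address.
check_shard("http://example.com/shards_000000000.tar")
```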

srdfjy commented 3 weeks ago

I used two machines (each with a single V100) for training and executed `wget shards_000000000.tar` on both machines; the HTTP download was normal.

download logs: (screenshot in original issue)

train logs: (screenshot in original issue)

xingchensong commented 3 weeks ago

Could you try torch_ddp?

srdfjy commented 3 weeks ago

Training with torch_ddp is normal.

This error has triggered a RuntimeError in the wenet_join function.

logs:

```
2024/06/06 12:28:05 2024-06-06 12:28:05,934 INFO Detected uneven workload distribution: [Rank 0]: Rank 1 failed to pass monitoredBarrier in 30000 ms
```

srdfjy commented 3 weeks ago

hi @xingchensong

Based on your response in the issue linked below, this should be expected behavior in my case, but I don't understand why the error occurs.

https://github.com/wenet-e2e/wenet/issues/2266#issuecomment-1886436064

xingchensong commented 3 weeks ago

How many tars are in your dataset? For a small dataset, say only 2 tars, it is normal: the timeout means you have reached the end of the epoch.
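(To make the arithmetic concrete, with illustrative numbers matching this thread's setup: with 2 ranks and num_workers=2, the tar list is divided among 4 consumers, so 34 tars cannot split evenly; on top of that, tars hold different numbers of utterances, so one rank drains first and the other times out at the barrier:)

```python
# Illustrative only: 34 tars split round-robin across
# 2 ranks x 2 dataloader workers = 4 consumers.
tars = [f"shards_{i:09d}.tar" for i in range(34)]
consumers = 4  # world_size (2) * num_workers (2)
splits = [tars[i::consumers] for i in range(consumers)]
print([len(s) for s in splits])  # [9, 9, 8, 8] -> uneven by one tar
```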

srdfjy commented 3 weeks ago

> How many tars are in your dataset? For a small dataset, say only 2 tars, it is normal: the timeout means you have reached the end of the epoch.

I used 34 tars for fine-tuning, and the full training run currently proceeds normally. If this is expected, I will ignore this log for now.

xingchensong commented 3 weeks ago

Your dataset is small; this is normal.

srdfjy commented 3 weeks ago

> Your dataset is small; this is normal.

THX!