Closed srdfjy closed 3 weeks ago
HTTP might not be working correctly; you should debug the HTTP connection first.
I used two machines (each with a single V100) for training and executed wget shards_000000000.tar on both machines; the HTTP download was normal.
Could you try torch_ddp?
The torch_ddp training is normal.
This error has triggered a RuntimeError in the wenet_join function.
"2024/06/06 12:28:05 2024-06-06 12:28:05,934 INFO Detected uneven workload distribution: [Rank 0]: Rank 1 failed to pass monitoredBarrier in 30000 ms",
hi @xingchensong
Based on your response here, this should be normal for me, but I don't know why this error would occur.
https://github.com/wenet-e2e/wenet/issues/2266#issuecomment-1886436064
How many tars are in your dataset? For a small dataset, say only 2 tars, this is normal; the timeout means you have reached the end of the epoch.
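Why the end of an epoch looks like a barrier timeout can be sketched with a small, hypothetical shard-splitting example (round-robin assignment is an assumption for illustration, not necessarily wenet's exact splitting code): when the shard count does not divide evenly across ranks, some ranks exhaust their data first, stop entering the join barrier, and the remaining ranks report a monitoredBarrier timeout.

```python
# Hedged sketch (illustrative, not wenet's actual code): round-robin
# assignment of tar shards to ranks. Ranks that own fewer shards run
# out of data first and stop calling the join barrier, which the
# surviving ranks then report as a monitoredBarrier timeout.
def shards_per_rank(num_shards, world_size):
    """Number of shards each rank consumes under round-robin splitting."""
    return [len(range(rank, num_shards, world_size))
            for rank in range(world_size)]

print(shards_per_rank(34, 2))  # -> [17, 17]: even split, ranks finish together
print(shards_per_rank(3, 2))   # -> [2, 1]: rank 1 runs dry one shard early
```

Even with an even shard count, shards rarely contain identical numbers of samples, so one rank can still finish an epoch slightly earlier; that is why the message is informational rather than fatal.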
I used 34 tars for fine-tuning, and currently the entire training proceeds normally. If this is normal, I will ignore this log for now.
Your dataset is small, so this is normal.
THX!
Hi
When training with DeepSpeed, an error occurred: "Rank 1 failed to pass monitoredBarrier in 30000 ms." When I increased the timeout to 1200 s, the same timeout error was reported. What could be causing this?
I am using NCCL, but it's reporting a GLOO timeout here.
Error log:
2024/06/05 12:08:28 [/opt/conda/conda-bld/pytorch_1702400366987/work/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 30000ms for recv operation to complete
Config:
num_workers=2, prefetch=100, shard data, HTTP read
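A possible explanation for seeing a Gloo timeout under NCCL training: `torch.distributed.monitored_barrier()` is only implemented for the gloo backend, so a join-style liveness check has to run on a separate gloo process group even when the training backend is NCCL. The sketch below illustrates that pattern under that assumption; the function name `join_check` is hypothetical and not wenet's exact code.

```python
# Hedged sketch: a join-style check on a dedicated gloo group,
# assuming training itself uses the NCCL backend.
# monitored_barrier is gloo-only, which is why a gloo TCP timeout
# can appear in the logs of an NCCL training run.
from datetime import timedelta
import torch.distributed as dist

def join_check(gloo_group, timeout_s=30):
    """Return True if all ranks reached the barrier within timeout_s.

    A RuntimeError here usually means some rank exhausted its data
    shards (end of epoch) and stopped entering the barrier, which is
    expected for small or unevenly split datasets.
    """
    try:
        dist.monitored_barrier(group=gloo_group,
                               timeout=timedelta(seconds=timeout_s))
        return True
    except RuntimeError as e:
        print(f"Detected uneven workload distribution: {e}")
        return False
```

Typical setup would be something like `dist.init_process_group("nccl")` for training plus `gloo_group = dist.new_group(backend="gloo")` for the check, so raising only the NCCL timeout would not affect this barrier's 30 s limit.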