RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Broken pipe

wenet-e2e / wespeaker

Research and Production Oriented Speaker Verification, Recognition and Diarization Toolkit

Apache License 2.0


RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Broken pipe #221

Closed WhXmURandom closed 11 months ago

WhXmURandom commented 11 months ago

After reinstalling the torch build that matches CUDA 11.1, a new error appeared.

cdliang11 commented 11 months ago

Delete the experiment directory, or try commenting out the following code: https://github.com/wenet-e2e/wespeaker/blob/6550a2ae9b431662b78df393af4440be23a787df/wespeaker/bin/train.py#L60-L61

WhXmURandom commented 11 months ago

I deleted the modeldir, and now it has been stuck here for more than ten minutes. Is that normal?

WhXmURandom commented 11 months ago

It still fails with an error.

WhXmURandom commented 11 months ago

It runs with a single GPU, but not with multiple GPUs.

cdliang11 commented 11 months ago

This looks like an NCCL problem.

cdliang11 commented 11 months ago

https://github.com/wenet-e2e/wespeaker/blob/6550a2ae9b431662b78df393af4440be23a787df/wespeaker/bin/train.py#L52 Try switching this to gloo.
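
For readers following along, switching the backend comes down to the `backend` argument of `torch.distributed.init_process_group`. The following is only a minimal sketch, assuming the job is launched with torchrun; `choose_backend` and `init_distributed` are hypothetical helper names and are not taken from wespeaker's train.py.

```python
def choose_backend(use_nccl: bool, cuda_available: bool) -> str:
    """Return "nccl" only when it is requested and a GPU is present, else "gloo"."""
    return "nccl" if (use_nccl and cuda_available) else "gloo"


def init_distributed(use_nccl: bool = True) -> str:
    """Create the default process group; assumes a torchrun launch (hypothetical helper)."""
    # Imported lazily so the selection logic above can be exercised without torch.
    import torch
    import torch.distributed as dist

    backend = choose_backend(use_nccl, torch.cuda.is_available())
    # With no init_method given, the default "env://" rendezvous reads
    # MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE, which torchrun exports.
    dist.init_process_group(backend=backend)
    return backend
```

Calling `init_distributed(use_nccl=False)` inside each worker would then exercise the gloo path, which helps separate NCCL-specific failures from general rendezvous problems.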

WhXmURandom commented 11 months ago

Switching to gloo doesn't seem to work either.

WhXmURandom commented 11 months ago

It is probably stuck at dist.barrier(device_ids=[gpu]).
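
When a run hangs at a collective like this, NCCL's own log output can show how far communicator setup got on each rank. A small sketch, assuming it is called in every worker before the process group is created; `enable_nccl_logging` is a hypothetical helper name, while `NCCL_DEBUG` is a standard NCCL environment variable.

```python
import os


def enable_nccl_logging(environ=os.environ) -> None:
    """Turn on NCCL's INFO-level logging unless a level is already set."""
    # NCCL reads NCCL_DEBUG when it initializes, so this has to run before
    # the first communicator is created (i.e. before the first collective).
    environ.setdefault("NCCL_DEBUG", "INFO")
```

Because `setdefault` is used, a level already chosen on the command line is left untouched.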

WhXmURandom commented 11 months ago

Adding NCCL_P2P_DISABLE=1 in front of the script makes multi-GPU training work.
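
The variable only takes effect if it is in the environment of the launcher process, so that the workers inherit it. As an illustration, here is a sketch that builds the same kind of torchrun invocation from Python; `build_launch` and `launch` are hypothetical helpers, and `train_args` is a placeholder for whatever training script and options the recipe actually passes.

```python
import os
import subprocess


def build_launch(num_gpus, train_args, base_env=None):
    """Return (command, environment) for a single-node torchrun launch."""
    env = dict(os.environ if base_env is None else base_env)
    # Disable GPU peer-to-peer transport in NCCL for the launcher and its workers.
    env["NCCL_P2P_DISABLE"] = "1"
    cmd = [
        "torchrun",
        "--standalone",
        "--nnodes=1",
        f"--nproc_per_node={num_gpus}",
        *train_args,  # placeholder: training script and its options
    ]
    return cmd, env


def launch(num_gpus, train_args):
    """Run the command and return its exit code (requires torchrun on PATH)."""
    cmd, env = build_launch(num_gpus, train_args)
    return subprocess.run(cmd, env=env, check=False).returncode
```

In a shell script the equivalent is to put the assignment on the same line as the torchrun command, or to `export` it beforehand; a bare assignment on its own line is not passed on to child processes.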

wcqy-ye commented 8 months ago

Which script should this be added to? I tried adding this line to run.sh in wespeaker/examples/cnceleb/v2 and then ran ./run.sh, but it still doesn't work.

WhXmURandom commented 8 months ago

NCCL_P2P_DISABLE=1 torchrun --standalone --nnodes=1 --nproc_per_node=$num_gpus \

wcqy-ye commented 8 months ago

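It still doesn't seem to work. I also tried using only one GPU and it still fails, and strangely it seems to fail right after starting. Do you have any ideas, or do you know what I should do?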

WhXmURandom commented 8 months ago

Delete exp_dir and run again.
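
Clearing the directory left over from an earlier failed run can also be scripted before relaunching. A minimal sketch, assuming the caller passes the recipe's experiment directory; `reset_exp_dir` is a hypothetical helper and is not part of wespeaker.

```python
import shutil
from pathlib import Path


def reset_exp_dir(exp_dir) -> bool:
    """Delete exp_dir and everything under it; return True if it existed."""
    path = Path(exp_dir)
    if not path.is_dir():
        return False  # nothing to clean up
    # Irreversible: checkpoints and logs from earlier runs are removed too.
    shutil.rmtree(path)
    return True
```

Since the deletion is irreversible, any checkpoints worth keeping should be copied elsewhere first.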

wcqy-ye commented 8 months ago

OK, thank you so much, that solved it.