NCCL Error when replica=2

Describe the bug

源码安装 8卡4090 运行模型, replica=2 第一个正常第二个报错

修改replica=1, 可以运行第一个模型, 再次运行时还是报错

To Reproduce

To help us to reproduce this bug, please provide information below:

Your Python version. py311
The version of xinference you use. source install at 2024年04月29日 10:26:10
Versions of crucial packages.

Full stack of the error.

model_uid = client.launch_model(model_name="qwen1.5-chat",
                            model_uid="qwen1.5-110b-4090",
                            model_format="gptq",
                            quantization="Int4",
                            n_gpu=4,
                            replica=2,
                            gpu_memory_utilization=0.9,
                            max_model_len=12000,
                            model_size_in_billions=110)

2024-04-29 10:15:14,374 xinference.core.supervisor 3464397 INFO     Xinference supervisor 0.0.0.0:63091 started
2024-04-29 10:15:14,401 xinference.core.worker 3464397 INFO     Starting metrics export server at 0.0.0.0:None
2024-04-29 10:15:14,403 xinference.core.worker 3464397 INFO     Checking metrics export server...
2024-04-29 10:15:15,018 xinference.core.worker 3464397 INFO     Metrics server is started at: http://0.0.0.0:33115
2024-04-29 10:15:15,019 xinference.core.worker 3464397 INFO     Xinference worker 0.0.0.0:63091 started
2024-04-29 10:15:15,020 xinference.core.worker 3464397 INFO     Purge cache directory: /root/.xinference/cache
2024-04-29 10:15:20,193 xinference.api.restful_api 3463847 INFO     Starting Xinference at endpoint: http://0.0.0.0:9997
2024-04-29 10:15:32,993 xinference.model.llm.llm_family 3464397 INFO     Caching from Modelscope: qwen/Qwen1.5-110B-Chat-GPTQ-Int4
2024-04-29 10:15:33,035 - modelscope - INFO - PyTorch version 2.1.2 Found.
2024-04-29 10:15:33,036 - modelscope - INFO - Loading ast index from /data/modelscope/ast_indexer
2024-04-29 10:15:33,063 - modelscope - INFO - Loading done! Current index file version is 1.13.3, with md5 41f3ae728da9e54b3e8399d262082e47 and a total number of 972 components indexed
2024-04-29 10:15:33,661 xinference.model.llm.vllm.core 3465297 INFO     Loading qwen1.5-110b-4090 with following model config: {'gpu_memory_utilization': 0.9, 'max_model_len': 12000, 'tokenizer_mode': 'auto', 'trust_remote_code': True, 'tensor_parallel_size': 4, 'block_size': 16, 'swap_space': 4, 'max_num_seqs': 256, 'quantization': None}
WARNING 04-29 10:15:33 config.py:211] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-04-29 10:15:35,648 INFO worker.py:1752 -- Started a local Ray instance.
INFO 04-29 10:15:36 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model='/root/.xinference/cache/qwen1.5-chat-gptq-110b-Int4', tokenizer='/root/.xinference/cache/qwen1.5-chat-gptq-110b-Int4', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=12000, download_dir=None, load_format=auto, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-29 10:15:44 selector.py:51] Cannot use FlashAttention because the package is not found. Please install it for better performance.
INFO 04-29 10:15:44 selector.py:25] Using XFormers backend.
[W socket.cpp:436] [c10d] The server socket cannot be initialized on [::]:55069 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to ?UNKNOWN? (errno: 97 - Address family not supported by protocol).
(RayWorkerVllm pid=3473023) INFO 04-29 10:15:46 selector.py:51] Cannot use FlashAttention because the package is not found. Please install it for better performance.
(RayWorkerVllm pid=3473023) INFO 04-29 10:15:46 selector.py:25] Using XFormers backend.
INFO 04-29 10:15:46 pynccl_utils.py:45] vLLM is using nccl==2.18.1
(RayWorkerVllm pid=3473023) [W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to ?UNKNOWN? (errno: 97 - Address family not supported by protocol).
(RayWorkerVllm pid=3473023) INFO 04-29 10:15:46 pynccl_utils.py:45] vLLM is using nccl==2.18.1
WARNING 04-29 10:15:47 custom_all_reduce.py:45] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerVllm pid=3473023) WARNING 04-29 10:15:47 custom_all_reduce.py:45] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerVllm pid=3473464) INFO 04-29 10:15:57 model_runner.py:104] Loading model weights took 14.3951 GB
(RayWorkerVllm pid=3473464) INFO 04-29 10:15:46 selector.py:51] Cannot use FlashAttention because the package is not found. Please install it for better performance. [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(RayWorkerVllm pid=3473464) INFO 04-29 10:15:46 selector.py:25] Using XFormers backend. [repeated 2x across cluster]
(RayWorkerVllm pid=3473464) INFO 04-29 10:15:46 pynccl_utils.py:45] vLLM is using nccl==2.18.1 [repeated 2x across cluster]
(RayWorkerVllm pid=3473464) WARNING 04-29 10:15:47 custom_all_reduce.py:45] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly. [repeated 2x across cluster]
INFO 04-29 10:16:05 model_runner.py:104] Loading model weights took 14.3951 GB
INFO 04-29 10:16:19 ray_gpu_executor.py:240] # GPU blocks: 3350, # CPU blocks: 3276
INFO 04-29 10:16:22 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-29 10:16:22 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RayWorkerVllm pid=3473023) INFO 04-29 10:16:22 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(RayWorkerVllm pid=3473023) INFO 04-29 10:16:22 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RayWorkerVllm pid=3473233) INFO 04-29 10:15:58 model_runner.py:104] Loading model weights took 14.3951 GB [repeated 2x across cluster]
(RayWorkerVllm pid=3473023) INFO 04-29 10:16:42 model_runner.py:867] Graph capturing finished in 20 secs.
(RayWorkerVllm pid=3473464) INFO 04-29 10:16:22 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. [repeated 2x across cluster]
(RayWorkerVllm pid=3473464) INFO 04-29 10:16:22 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage. [repeated 2x across cluster]
INFO 04-29 10:16:42 model_runner.py:867] Graph capturing finished in 20 secs.
2024-04-29 10:16:47,083 xinference.model.llm.llm_family 3464397 INFO     Caching from Modelscope: qwen/Qwen1.5-110B-Chat-GPTQ-Int4
2024-04-29 10:16:47,089 xinference.model.llm.vllm.core 3476426 INFO     Loading qwen1.5-110b-4090 with following model config: {'gpu_memory_utilization': 0.9, 'max_model_len': 12000, 'tokenizer_mode': 'auto', 'trust_remote_code': True, 'tensor_parallel_size': 4, 'block_size': 16, 'swap_space': 4, 'max_num_seqs': 256, 'quantization': None}
WARNING 04-29 10:16:47 config.py:211] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-04-29 10:16:49,070 INFO worker.py:1752 -- Started a local Ray instance.
INFO 04-29 10:16:50 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model='/root/.xinference/cache/qwen1.5-chat-gptq-110b-Int4', tokenizer='/root/.xinference/cache/qwen1.5-chat-gptq-110b-Int4', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=12000, download_dir=None, load_format=auto, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-29 10:16:58 selector.py:51] Cannot use FlashAttention because the package is not found. Please install it for better performance.
INFO 04-29 10:16:58 selector.py:25] Using XFormers backend.
[W socket.cpp:436] [c10d] The server socket cannot be initialized on [::]:40465 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to ?UNKNOWN? (errno: 97 - Address family not supported by protocol).
(RayWorkerVllm pid=3484365) INFO 04-29 10:16:59 selector.py:51] Cannot use FlashAttention because the package is not found. Please install it for better performance.
(RayWorkerVllm pid=3484365) INFO 04-29 10:16:59 selector.py:25] Using XFormers backend.
INFO 04-29 10:17:00 pynccl_utils.py:45] vLLM is using nccl==2.18.1
(RayWorkerVllm pid=3484365) [W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to ?UNKNOWN? (errno: 97 - Address family not supported by protocol).
2024-04-29 10:17:00,896 xinference.core.worker 3464397 ERROR    Failed to load model qwen1.5-110b-4090-2-1
Traceback (most recent call last):
  File "/data/inference/xinference/core/worker.py", line 707, in launch_builtin_model
    await model_ref.load()
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/xoscar/backends/context.py", line 227, in send
    return self._process_result_message(result)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/xoscar/backends/pool.py", line 659, in send
    result = await self._run_coro(message.message_id, coro)
    ^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 524, in xoscar.core._BaseActor.__on_receive__
    result = func(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/data/inference/xinference/core/model.py", line 239, in load
    self._model.load()
    ^^^^^^^^^^^^^^^^^
  File "/data/inference/xinference/model/llm/vllm/core.py", line 179, in load
    self._engine = AsyncLLMEngine.from_engine_args(engine_args)
    ^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 348, in from_engine_args
    engine = cls(
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 311, in __init__
    self.engine = self._init_engine(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 422, in _init_engine
    return engine_class(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 110, in __init__
    self.model_executor = executor_class(model_config, cache_config,
    ^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 62, in __init__
    self._init_workers_ray(placement_group)
    ^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 191, in _init_workers_ray
    self._run_workers("init_device")
    ^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 324, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
    ^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/vllm/worker/worker.py", line 100, in init_device
    init_distributed_environment(self.parallel_config, self.rank,
    ^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/vllm/worker/worker.py", line 287, in init_distributed_environment
    pynccl_utils.init_process_group(
    ^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/vllm/model_executor/parallel_utils/pynccl_utils.py", line 46, in init_process_group
    comm = NCCLCommunicator(init_method=init_method,
      ^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/vllm/model_executor/parallel_utils/pynccl.py", line 236, in __init__
    dist.broadcast(tensor, src=0)
    ^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1906, in broadcast
    work = default_pg.broadcast([tensor], opts)
    ^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: [address=0.0.0.0:33663, pid=3476426] NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
Last error:
Error while attaching to shared memory segment /dev/shm/nccl- (size -234881024)
2024-04-29 10:17:02,575 xinference.api.restful_api 3463847 ERROR    [address=0.0.0.0:33663, pid=3476426] NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
Last error:
Error while attaching to shared memory segment /dev/shm/nccl- (size -234881024)
Traceback (most recent call last):
  File "/data/inference/xinference/api/restful_api.py", line 741, in launch_model
    model_uid = await (await self._get_supervisor_ref()).launch_builtin_model(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/xoscar/backends/context.py", line 227, in send
    return self._process_result_message(result)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/xoscar/backends/pool.py", line 659, in send
    result = await self._run_coro(message.message_id, coro)
    ^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
    ^^^^^^^^^^^^^^^^^
  File "/data/inference/xinference/core/supervisor.py", line 892, in launch_builtin_model
    await _launch_model()
    ^^^^^^^^^^^^^^^^^
  File "/data/inference/xinference/core/supervisor.py", line 856, in _launch_model
    await _launch_one_model(rep_model_uid)
    ^^^^^^^^^^^^^^^^^
  File "/data/inference/xinference/core/supervisor.py", line 838, in _launch_one_model
    await worker_ref.launch_builtin_model(
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 284, in __pyx_actor_method_wrapper
    async with lock:
  File "xoscar/core.pyx", line 287, in xoscar.core.__pyx_actor_method_wrapper
    result = await result
    ^^^^^^^^^^^^^^^^^
  File "/data/inference/xinference/core/utils.py", line 45, in wrapped
    ret = await func(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/data/inference/xinference/core/worker.py", line 707, in launch_builtin_model
    await model_ref.load()
    ^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/xoscar/backends/context.py", line 227, in send
    return self._process_result_message(result)
    ^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
    ^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/xoscar/backends/pool.py", line 659, in send
    result = await self._run_coro(message.message_id, coro)
    ^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 524, in xoscar.core._BaseActor.__on_receive__
    result = func(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/data/inference/xinference/core/model.py", line 239, in load
    self._model.load()
    ^^^^^^^^^^^^^^^^^
  File "/data/inference/xinference/model/llm/vllm/core.py", line 179, in load
    self._engine = AsyncLLMEngine.from_engine_args(engine_args)
    ^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 348, in from_engine_args
    engine = cls(
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 311, in __init__
    self.engine = self._init_engine(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 422, in _init_engine
    return engine_class(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 110, in __init__
    self.model_executor = executor_class(model_config, cache_config,
    ^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 62, in __init__
    self._init_workers_ray(placement_group)
    ^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 191, in _init_workers_ray
    self._run_workers("init_device")
    ^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/vllm/executor/ray_gpu_executor.py", line 324, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
    ^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/vllm/worker/worker.py", line 100, in init_device
    init_distributed_environment(self.parallel_config, self.rank,
    ^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/vllm/worker/worker.py", line 287, in init_distributed_environment
    pynccl_utils.init_process_group(
    ^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/vllm/model_executor/parallel_utils/pynccl_utils.py", line 46, in init_process_group
    comm = NCCLCommunicator(init_method=init_method,
      ^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/vllm/model_executor/parallel_utils/pynccl.py", line 236, in __init__
    dist.broadcast(tensor, src=0)
    ^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1906, in broadcast
    work = default_pg.broadcast([tensor], opts)
    ^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: [address=0.0.0.0:33663, pid=3476426] NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
Last error:
Error while attaching to shared memory segment /dev/shm/nccl- (size -234881024)
2024-04-29 10:17:04,352 xinference.core.worker 3464397 ERROR    Report status got error.
Traceback (most recent call last):
  File "/data/inference/xinference/core/worker.py", line 800, in report_status
    status = await asyncio.to_thread(gather_node_info)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/py311/lib/python3.11/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/inference/xinference/core/worker.py", line 799, in report_status
    async with timeout(2):
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/async_timeout/__init__.py", line 141, in __aexit__
    self._do_exit(exc_type)
  File "/data/miniconda3/envs/py311/lib/python3.11/site-packages/async_timeout/__init__.py", line 228, in _do_exit
    raise asyncio.TimeoutError
TimeoutError

Minimized code to reproduce the error.

Expected behavior

A clear and concise description of what you expected to happen.

Additional context

Add any other context about the problem here.

xorbitsai / inference

NCCL Error when replica=2 #1404

Describe the bug

To Reproduce

Expected behavior

Additional context