xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

Model launch fails when running on multiple GPUs; single-GPU launch works #1668

Open JinCheng666 opened 3 months ago

JinCheng666 commented 3 months ago

Describe the bug

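The following problem appeared after rebooting the machine; before the reboot, multi-GPU inference worked normally. Problem: launching a model across multiple GPUs fails with an error, while launching it on a single GPU works.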

To Reproduce

To help us to reproduce this bug, please provide information below:

  1. Your Python version. 3.10

    >>> import torch
    >>> torch.cuda.is_available()
    True
    >>> torch.cuda.device_count()
    2

  2. The version of xinference you use. 0.12.1

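The environment check in step 1 only confirms that both GPUs are visible. A minimal sketch of a slightly broader check follows; it also asks PyTorch whether each GPU can access the other's memory directly. The `gpu_report` helper is hypothetical (it was not run by the reporter), and the sketch assumes `torch` may or may not be importable, returning a short dictionary in either case.

```python
"""Sketch: report what PyTorch sees on a multi-GPU host."""
from itertools import permutations


def gpu_report():
    """Return a dict describing CUDA visibility and pairwise peer access.

    Hypothetical helper for illustration; not part of xinference.
    """
    try:
        import torch
    except ImportError:
        # torch is not installed in this environment; nothing to probe.
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
        "peer_access": {},
    }
    if report["cuda_available"]:
        for src, dst in permutations(range(report["device_count"]), 2):
            # True when device `src` can directly access memory on `dst`.
            report["peer_access"][(src, dst)] = torch.cuda.can_device_access_peer(src, dst)
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

Running this once in the single-GPU case and once with both GPUs visible gives a small, comparable record to attach to the report.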

2024-06-18 19:00:58,283 xinference.core.supervisor 6285 INFO     Xinference supervisor 0.0.0.0:45235 started
2024-06-18 19:00:59,844 xinference.core.worker 6285 INFO     Starting metrics export server at 0.0.0.0:None
2024-06-18 19:00:59,846 xinference.core.worker 6285 INFO     Checking metrics export server...
2024-06-18 19:01:00,950 xinference.core.worker 6285 INFO     Metrics server is started at: http://0.0.0.0:46841
2024-06-18 19:01:00,951 xinference.core.supervisor 6285 DEBUG    Enter add_worker, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f150fba02c0>, '0.0.0.0:45235'), kwargs: {}
2024-06-18 19:01:00,951 xinference.core.supervisor 6285 DEBUG    Worker 0.0.0.0:45235 has been added successfully
2024-06-18 19:01:00,951 xinference.core.supervisor 6285 DEBUG    Leave add_worker, elapsed time: 0 s
2024-06-18 19:01:00,951 xinference.core.worker 6285 INFO     Xinference worker 0.0.0.0:45235 started
2024-06-18 19:01:00,952 xinference.core.worker 6285 INFO     Purge cache directory: /home/gx01/.xinference/cache
2024-06-18 19:01:00,961 xinference.core.supervisor 6285 DEBUG    Worker 0.0.0.0:45235 resources: {'cpu': ResourceStatus(usage=0.0, total=32, memory_used=3421847552, memory_available=130487431168, memory_total=135059939328), 'gpu-0': GPUStatus(mem_total=51527024640, mem_free=51032358912, mem_used=494665728), 'gpu-1': GPUStatus(mem_total=51527024640, mem_free=51032358912, mem_used=494665728)}
2024-06-18 19:01:03,284 xinference.core.supervisor 6285 DEBUG    Enter get_status, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f150fba02c0>,), kwargs: {}
2024-06-18 19:01:03,284 xinference.core.supervisor 6285 DEBUG    Leave get_status, elapsed time: 0 s
2024-06-18 19:01:04,223 xinference.api.restful_api 6217 INFO     Starting Xinference at endpoint: http://0.0.0.0:9997
2024-06-18 19:01:06,744 xinference.core.supervisor 6285 DEBUG    Enter list_models, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f150fba02c0>,), kwargs: {}
2024-06-18 19:01:06,744 xinference.core.worker 6285 DEBUG    Enter list_models, args: (<xinference.core.worker.WorkerActor object at 0x7f150fc05e40>,), kwargs: {}
2024-06-18 19:01:06,744 xinference.core.worker 6285 DEBUG    Leave list_models, elapsed time: 0 s
2024-06-18 19:01:06,744 xinference.core.supervisor 6285 DEBUG    Leave list_models, elapsed time: 0 s
2024-06-18 19:01:08,018 xinference.core.supervisor 6285 DEBUG    Enter list_models, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f150fba02c0>,), kwargs: {}
2024-06-18 19:01:08,019 xinference.core.worker 6285 DEBUG    Enter list_models, args: (<xinference.core.worker.WorkerActor object at 0x7f150fc05e40>,), kwargs: {}
2024-06-18 19:01:08,019 xinference.core.worker 6285 DEBUG    Leave list_models, elapsed time: 0 s
2024-06-18 19:01:08,019 xinference.core.supervisor 6285 DEBUG    Leave list_models, elapsed time: 0 s
2024-06-18 19:01:11,846 xinference.core.supervisor 6285 DEBUG    Enter list_model_registrations, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f150fba02c0>, 'LLM'), kwargs: {'detailed': True}
2024-06-18 19:01:11,915 xinference.core.supervisor 6285 DEBUG    Leave list_model_registrations, elapsed time: 0 s
2024-06-18 19:01:12,584 xinference.core.supervisor 6285 DEBUG    Enter list_model_registrations, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f150fba02c0>, 'LLM'), kwargs: {'detailed': False}
2024-06-18 19:01:12,585 xinference.core.supervisor 6285 DEBUG    Leave list_model_registrations, elapsed time: 0 s
2024-06-18 19:01:12,595 xinference.core.supervisor 6285 DEBUG    Enter get_model_registration, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f150fba02c0>, 'LLM', 'llama3:70b'), kwargs: {}
2024-06-18 19:01:12,595 xinference.core.supervisor 6285 DEBUG    Leave get_model_registration, elapsed time: 0 s
2024-06-18 19:01:12,596 xinference.core.supervisor 6285 DEBUG    Enter get_model_registration, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f150fba02c0>, 'LLM', 'qwen:110b'), kwargs: {}
2024-06-18 19:01:12,596 xinference.core.supervisor 6285 DEBUG    Leave get_model_registration, elapsed time: 0 s
2024-06-18 19:01:12,597 xinference.core.supervisor 6285 DEBUG    Enter get_model_registration, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f150fba02c0>, 'LLM', 'qwen:72b'), kwargs: {}
2024-06-18 19:01:12,597 xinference.core.supervisor 6285 DEBUG    Leave get_model_registration, elapsed time: 0 s
2024-06-18 19:01:13,904 xinference.core.supervisor 6285 DEBUG    Enter query_engines_by_model_name, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f150fba02c0>, 'qwen:72b'), kwargs: {}
2024-06-18 19:01:13,904 xinference.core.supervisor 6285 DEBUG    Leave query_engines_by_model_name, elapsed time: 0 s
2024-06-18 19:01:19,763 xinference.core.supervisor 6285 DEBUG    Enter launch_builtin_model, model_uid: qwen:72b, model_name: qwen:72b, model_size: 72, model_format: gptq, quantization: Int4, replica: 1
2024-06-18 19:01:19,764 xinference.core.worker 6285 DEBUG    Enter get_model_count, args: (<xinference.core.worker.WorkerActor object at 0x7f150fc05e40>,), kwargs: {}
2024-06-18 19:01:19,764 xinference.core.worker 6285 DEBUG    Leave get_model_count, elapsed time: 0 s
2024-06-18 19:01:19,764 xinference.core.worker 6285 DEBUG    Enter launch_builtin_model, args: (<xinference.core.worker.WorkerActor object at 0x7f150fc05e40>,), kwargs: {'model_uid': 'qwen:72b-1-0', 'model_name': 'qwen:72b', 'model_size_in_billions': 72, 'model_format': 'gptq', 'quantization': 'Int4', 'model_engine': 'vLLM', 'model_type': 'LLM', 'n_gpu': 2, 'request_limits': None, 'peft_model_config': None, 'gpu_idx': None}
2024-06-18 19:01:19,764 xinference.core.worker 6285 DEBUG    GPU selected: [0, 1] for model qwen:72b-1-0
2024-06-18 19:01:22,954 xinference.model.llm.core 6285 DEBUG    Launching qwen:72b-1-0 with VLLMChatModel
2024-06-18 19:01:22,954 xinference.model.llm.llm_family 6285 INFO     Caching from URI: /home/gx01/models/Qwen2-72B-Instruct-GPTQ-Int4
2024-06-18 19:01:22,954 xinference.model.llm.llm_family 6285 INFO     Cache /home/gx01/models/Qwen2-72B-Instruct-GPTQ-Int4 exists
2024-06-18 19:01:23,029 xinference.model.llm.vllm.core 6305 INFO     Loading qwen:72b with following model config: {'tokenizer_mode': 'auto', 'trust_remote_code': True, 'tensor_parallel_size': 2, 'block_size': 16, 'swap_space': 4, 'gpu_memory_utilization': 0.9, 'max_num_seqs': 256, 'quantization': None, 'max_model_len': 4096}Enable lora: False. Lora count: 0.
2024-06-18 19:01:42,850 xinference.core.worker 6285 ERROR    Failed to load model qwen:72b-1-0
Traceback (most recent call last):
  File "/home/gx01/miniconda3/envs/inference0.12.1/lib/python3.10/site-packages/xinference/core/worker.py", line 665, in launch_builtin_model
    await model_ref.load()
  File "/home/gx01/miniconda3/envs/inference0.12.1/lib/python3.10/site-packages/xoscar/backends/context.py", line 226, in send
    result = await self._wait(future, actor_ref.address, send_message)  # type: ignore
  File "/home/gx01/miniconda3/envs/inference0.12.1/lib/python3.10/site-packages/xoscar/backends/context.py", line 115, in _wait
    return await future
  File "/home/gx01/miniconda3/envs/inference0.12.1/lib/python3.10/site-packages/xoscar/backends/core.py", line 84, in _listen
    raise ServerClosed(
xoscar.errors.ServerClosed: Remote server unixsocket:///823787520 closed
2024-06-18 19:01:42,933 xinference.core.supervisor 6285 DEBUG    Enter terminate_model, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f150fba02c0>, 'qwen:72b'), kwargs: {'suppress_exception': True}
2024-06-18 19:01:42,934 xinference.core.supervisor 6285 DEBUG    Leave terminate_model, elapsed time: 0 s
2024-06-18 19:01:42,936 xinference.api.restful_api 6217 ERROR    [address=0.0.0.0:45235, pid=6285] Remote server unixsocket:///823787520 closed
Traceback (most recent call last):
  File "/home/gx01/miniconda3/envs/inference0.12.1/lib/python3.10/site-packages/xinference/api/restful_api.py", line 770, in launch_model
    model_uid = await (await self._get_supervisor_ref()).launch_builtin_model(
  File "/home/gx01/miniconda3/envs/inference0.12.1/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
    return self._process_result_message(result)
  File "/home/gx01/miniconda3/envs/inference0.12.1/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/gx01/miniconda3/envs/inference0.12.1/lib/python3.10/site-packages/xoscar/backends/pool.py", line 659, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/gx01/miniconda3/envs/inference0.12.1/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/home/gx01/miniconda3/envs/inference0.12.1/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/home/gx01/miniconda3/envs/inference0.12.1/lib/python3.10/site-packages/xinference/core/supervisor.py", line 837, in launch_builtin_model
    await _launch_model()
  File "/home/gx01/miniconda3/envs/inference0.12.1/lib/python3.10/site-packages/xinference/core/supervisor.py", line 801, in _launch_model
    await _launch_one_model(rep_model_uid)
  File "/home/gx01/miniconda3/envs/inference0.12.1/lib/python3.10/site-packages/xinference/core/supervisor.py", line 782, in _launch_one_model
    await worker_ref.launch_builtin_model(
  File "xoscar/core.pyx", line 284, in __pyx_actor_method_wrapper
    async with lock:
  File "xoscar/core.pyx", line 287, in xoscar.core.__pyx_actor_method_wrapper
    result = await result
  File "/home/gx01/miniconda3/envs/inference0.12.1/lib/python3.10/site-packages/xinference/core/utils.py", line 45, in wrapped
    ret = await func(*args, **kwargs)
  File "/home/gx01/miniconda3/envs/inference0.12.1/lib/python3.10/site-packages/xinference/core/worker.py", line 665, in launch_builtin_model
    await model_ref.load()
  File "/home/gx01/miniconda3/envs/inference0.12.1/lib/python3.10/site-packages/xoscar/backends/context.py", line 226, in send
    result = await self._wait(future, actor_ref.address, send_message)  # type: ignore
  File "/home/gx01/miniconda3/envs/inference0.12.1/lib/python3.10/site-packages/xoscar/backends/context.py", line 115, in _wait
    return await future
  File "/home/gx01/miniconda3/envs/inference0.12.1/lib/python3.10/site-packages/xoscar/backends/core.py", line 84, in _listen
    raise ServerClosed(
xoscar.errors.ServerClosed: [address=0.0.0.0:45235, pid=6285] Remote server unixsocket:///823787520 closed
2024-06-18 19:19:51,805 xinference.core.supervisor 6285 DEBUG    Enter get_model, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f150fba02c0>, 'qwen:72b'), kwargs: {}
2024-06-18 19:19:51,807 xinference.api.restful_api 6217 ERROR    [address=0.0.0.0:45235, pid=6285] Model not found in the model list, uid: qwen:72b
Traceback (most recent call last):
  File "/home/gx01/miniconda3/envs/inference0.12.1/lib/python3.10/site-packages/xinference/api/restful_api.py", line 1400, in create_chat_completion
    model = await (await self._get_supervisor_ref()).get_model(model_uid)
  File "/home/gx01/miniconda3/envs/inference0.12.1/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
    return self._process_result_message(result)
  File "/home/gx01/miniconda3/envs/inference0.12.1/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/gx01/miniconda3/envs/inference0.12.1/lib/python3.10/site-packages/xoscar/backends/pool.py", line 659, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/gx01/miniconda3/envs/inference0.12.1/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/home/gx01/miniconda3/envs/inference0.12.1/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/home/gx01/miniconda3/envs/inference0.12.1/lib/python3.10/site-packages/xinference/core/utils.py", line 45, in wrapped
    ret = await func(*args, **kwargs)
  File "/home/gx01/miniconda3/envs/inference0.12.1/lib/python3.10/site-packages/xinference/core/supervisor.py", line 934, in get_model
    raise ValueError(f"Model not found in the model list, uid: {model_uid}")
ValueError: [address=0.0.0.0:45235, pid=6285] Model not found in the model list, uid: qwen:72b
qinxuye commented 3 months ago

Has this been resolved since?

JinCheng666 commented 3 months ago

Has this been resolved since?

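Not yet. The machine is a virtual machine. Before the reboot, multi-GPU inference of a single model worked normally; after the reboot, a single model can no longer be served across multiple GPUs. What other information should I collect? I can gather it on my side. @qinxuye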
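One piece of information that can be pulled from the existing logs is what the model subprocess last reported before the worker logged an error. The sketch below is a hypothetical triage helper, not an xinference tool. It assumes the record layout seen in the logs in this issue (`<timestamp> <logger> <pid> <LEVEL> <message>`) and uses a simple heuristic: for each `ERROR` record, find the most recent earlier record written by a different pid.

```python
import re
from typing import NamedTuple

# Layout inferred from the log lines in this issue:
# "<date> <time>,<ms> <logger> <pid> <LEVEL> <message>"
RECORD = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<logger>\S+)\s+(?P<pid>\d+) (?P<level>[A-Z]+)\s+(?P<msg>.*)$"
)


class Record(NamedTuple):
    ts: str
    logger: str
    pid: str
    level: str
    msg: str


def last_activity_before_errors(lines):
    """Pair each ERROR record with the latest earlier record from another pid."""
    seen = []
    pairs = []
    for line in lines:
        match = RECORD.match(line)
        if match is None:
            continue  # traceback lines and wrapped output are skipped
        rec = Record(**match.groupdict())
        if rec.level == "ERROR":
            other = next((r for r in reversed(seen) if r.pid != rec.pid), None)
            pairs.append((rec, other))
        seen.append(rec)
    return pairs


# Lines copied from the 0.12.1 log above.
SAMPLE = """\
2024-06-18 19:01:19,764 xinference.core.worker 6285 DEBUG    GPU selected: [0, 1] for model qwen:72b-1-0
2024-06-18 19:01:23,029 xinference.model.llm.vllm.core 6305 INFO     Loading qwen:72b with following model config: {'tokenizer_mode': 'auto', 'trust_remote_code': True, 'tensor_parallel_size': 2, 'block_size': 16, 'swap_space': 4, 'gpu_memory_utilization': 0.9, 'max_num_seqs': 256, 'quantization': None, 'max_model_len': 4096}Enable lora: False. Lora count: 0.
2024-06-18 19:01:42,850 xinference.core.worker 6285 ERROR    Failed to load model qwen:72b-1-0
Traceback (most recent call last):
"""

if __name__ == "__main__":
    for error, previous in last_activity_before_errors(SAMPLE.splitlines()):
        print("error:", error.ts, error.msg)
        if previous is not None:
            print("  last other-pid record:", previous.ts, previous.pid, previous.logger)
```

The pid comparison is only a heuristic: it assumes the model runs in a separate process from the worker, which matches the differing pids in the logs here but is not guaranteed in general.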

qinxuye commented 3 months ago

Does the problem still occur with the latest version?

JinCheng666 commented 3 months ago

Does the problem still occur with the latest version?

@qinxuye I updated to 0.12.3 and the same problem still occurs.

github-actions[bot] commented 2 months ago

This issue is stale because it has been open for 7 days with no activity.

JinCheng666 commented 1 month ago

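I updated to 0.14.3. Single-GPU launch works; multi-GPU launch still fails, but the error has changed, as shown below. @qinxuye could you please take a look? The hardware environment has not changed. Launch script: nohup xinference-local --host 0.0.0.0 --port 9997 --log-level DEBUG &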
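For a follow-up run, the same launch command could be started with extra NCCL diagnostics in the environment. The sketch below only builds the command and environment; it does not start anything. `NCCL_DEBUG` and `NCCL_P2P_DISABLE` are standard NCCL environment variables, but using them here is an assumption: nothing in this thread confirms that NCCL or peer-to-peer transfers are the cause. `build_debug_launch` is a hypothetical helper.

```python
import os
import shlex

# Command taken from the launch script quoted above (without nohup / &).
BASE_COMMAND = [
    "xinference-local",
    "--host", "0.0.0.0",
    "--port", "9997",
    "--log-level", "DEBUG",
]


def build_debug_launch(disable_p2p=False, base_env=None):
    """Return (command, env) for a diagnostic run; nothing is executed here.

    Hypothetical helper: the extra variables are an experiment, not a fix.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"  # ask NCCL to print initialization details
    if disable_p2p:
        env["NCCL_P2P_DISABLE"] = "1"  # experiment: turn off NCCL peer-to-peer transport
    return list(BASE_COMMAND), env


if __name__ == "__main__":
    command, env = build_debug_launch(disable_p2p=True)
    overrides = {k: env[k] for k in ("NCCL_DEBUG", "NCCL_P2P_DISABLE") if k in env}
    print(" ".join(f"{k}={v}" for k, v in overrides.items()), shlex.join(command))
    # To actually start the server: subprocess.run(command, env=env, check=True)
```

Comparing one run with `disable_p2p=False` and one with `disable_p2p=True` would show whether the setting changes the outcome; either result is worth reporting back in the issue.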


2024-08-26 15:37:47,330 xinference.core.supervisor 1397843 INFO     Xinference supervisor 0.0.0.0:46142 started
2024-08-26 15:37:48,794 xinference.core.worker 1397843 INFO     Starting metrics export server at 0.0.0.0:None
2024-08-26 15:37:48,796 xinference.core.worker 1397843 INFO     Checking metrics export server...
2024-08-26 15:37:50,101 xinference.core.worker 1397843 INFO     Metrics server is started at: http://0.0.0.0:41521
2024-08-26 15:37:50,102 xinference.core.worker 1397843 INFO     Purge cache directory: /home/gx01/.xinference/cache
2024-08-26 15:37:50,103 xinference.core.supervisor 1397843 DEBUG    Enter add_worker, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f91351ad800>, '0.0.0.0:46142'), kwargs: {}
2024-08-26 15:37:50,104 xinference.core.supervisor 1397843 DEBUG    Worker 0.0.0.0:46142 has been added successfully
2024-08-26 15:37:50,104 xinference.core.supervisor 1397843 DEBUG    Leave add_worker, elapsed time: 0 s
2024-08-26 15:37:50,104 xinference.core.worker 1397843 INFO     Connected to supervisor as a fresh worker
2024-08-26 15:37:50,114 xinference.core.worker 1397843 INFO     Xinference worker 0.0.0.0:46142 started
2024-08-26 15:37:50,116 xinference.core.supervisor 1397843 DEBUG    Worker 0.0.0.0:46142 resources: {'cpu': ResourceStatus(usage=0.0, total=32, memory_used=2375675904, memory_available=131400032256, memory_total=135059939328), 'gpu-0': GPUStatus(mem_total=51527024640, mem_free=51032358912, mem_used=494665728), 'gpu-1': GPUStatus(mem_total=51527024640, mem_free=51032358912, mem_used=494665728)}
2024-08-26 15:37:52,322 xinference.core.supervisor 1397843 DEBUG    Enter get_status, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f91351ad800>,), kwargs: {}
2024-08-26 15:37:52,322 xinference.core.supervisor 1397843 DEBUG    Leave get_status, elapsed time: 0 s
2024-08-26 15:37:53,186 xinference.api.restful_api 1397771 INFO     Starting Xinference at endpoint: http://0.0.0.0:9997
2024-08-26 15:37:53,318 uvicorn.error 1397771 INFO     Started server process [1397771]
2024-08-26 15:37:53,318 uvicorn.error 1397771 INFO     Waiting for application startup.
2024-08-26 15:37:53,318 uvicorn.error 1397771 INFO     Application startup complete.
2024-08-26 15:37:53,319 uvicorn.error 1397771 INFO     Uvicorn running on http://0.0.0.0:9997 (Press CTRL+C to quit)
2024-08-26 15:48:26,826 xinference.core.supervisor 1397843 DEBUG    Enter get_model, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f91351ad800>, 'qwen:72b'), kwargs: {}
2024-08-26 15:48:26,828 xinference.api.restful_api 1397771 ERROR    [address=0.0.0.0:46142, pid=1397843] Model not found in the model list, uid: qwen:72b
Traceback (most recent call last):
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xinference/api/restful_api.py", line 1660, in create_chat_completion
    model = await (await self._get_supervisor_ref()).get_model(model_uid)
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/pool.py", line 656, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
    return await coro
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xinference/core/utils.py", line 45, in wrapped
    ret = await func(*args, **kwargs)
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xinference/core/supervisor.py", line 1124, in get_model
    raise ValueError(f"Model not found in the model list, uid: {model_uid}")
ValueError: [address=0.0.0.0:46142, pid=1397843] Model not found in the model list, uid: qwen:72b
2024-08-26 15:48:26,831 uvicorn.access 1397771 INFO     10.4.134.11:64957 - "POST /v1/chat/completions HTTP/1.1" 400
2024-08-26 15:48:50,350 uvicorn.access 1397771 INFO     10.4.134.25:11074 - "GET / HTTP/1.1" 307
2024-08-26 15:48:50,637 uvicorn.access 1397771 INFO     10.4.134.25:11074 - "GET /v1/cluster/auth HTTP/1.1" 200
2024-08-26 15:48:50,670 xinference.core.supervisor 1397843 DEBUG    Enter list_model_registrations, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f91351ad800>, 'LLM'), kwargs: {'detailed': True}
2024-08-26 15:48:50,670 uvicorn.access 1397771 INFO     10.4.134.25:11074 - "GET /v1/cluster/devices HTTP/1.1" 200
2024-08-26 15:48:50,815 xinference.core.supervisor 1397843 DEBUG    Leave list_model_registrations, elapsed time: 0 s
2024-08-26 15:48:50,826 uvicorn.access 1397771 INFO     10.4.134.25:11075 - "GET /v1/model_registrations/LLM?detailed=true HTTP/1.1" 200
2024-08-26 15:48:57,861 xinference.core.supervisor 1397843 DEBUG    Enter list_model_registrations, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f91351ad800>, 'LLM'), kwargs: {'detailed': False}
2024-08-26 15:48:57,862 xinference.core.supervisor 1397843 DEBUG    Leave list_model_registrations, elapsed time: 0 s
2024-08-26 15:48:57,863 uvicorn.access 1397771 INFO     10.4.134.25:11076 - "GET /v1/model_registrations/LLM HTTP/1.1" 200
2024-08-26 15:48:57,871 xinference.core.supervisor 1397843 DEBUG    Enter get_model_registration, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f91351ad800>, 'LLM', 'glm-4-9b'), kwargs: {}
2024-08-26 15:48:57,871 xinference.core.supervisor 1397843 DEBUG    Leave get_model_registration, elapsed time: 0 s
2024-08-26 15:48:57,872 xinference.core.supervisor 1397843 DEBUG    Enter get_model_registration, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f91351ad800>, 'LLM', 'glm-4v-9b'), kwargs: {}
2024-08-26 15:48:57,872 xinference.core.supervisor 1397843 DEBUG    Leave get_model_registration, elapsed time: 0 s
2024-08-26 15:48:57,872 uvicorn.access 1397771 INFO     10.4.134.25:11076 - "GET /v1/model_registrations/LLM/glm-4-9b HTTP/1.1" 200
2024-08-26 15:48:57,874 xinference.core.supervisor 1397843 DEBUG    Enter get_model_registration, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f91351ad800>, 'LLM', 'llama3:70b'), kwargs: {}
2024-08-26 15:48:57,874 xinference.core.supervisor 1397843 DEBUG    Leave get_model_registration, elapsed time: 0 s
2024-08-26 15:48:57,874 xinference.core.supervisor 1397843 DEBUG    Enter get_model_registration, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f91351ad800>, 'LLM', 'qwen1.5:14b'), kwargs: {}
2024-08-26 15:48:57,875 xinference.core.supervisor 1397843 DEBUG    Leave get_model_registration, elapsed time: 0 s
2024-08-26 15:48:57,875 xinference.core.supervisor 1397843 DEBUG    Enter get_model_registration, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f91351ad800>, 'LLM', 'qwen1.5:72b'), kwargs: {}
2024-08-26 15:48:57,875 xinference.core.supervisor 1397843 DEBUG    Leave get_model_registration, elapsed time: 0 s
2024-08-26 15:48:57,875 uvicorn.access 1397771 INFO     10.4.134.25:11077 - "GET /v1/model_registrations/LLM/glm-4v-9b HTTP/1.1" 200
2024-08-26 15:48:57,876 xinference.core.supervisor 1397843 DEBUG    Enter get_model_registration, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f91351ad800>, 'LLM', 'qwen2:7b'), kwargs: {}
2024-08-26 15:48:57,876 xinference.core.supervisor 1397843 DEBUG    Leave get_model_registration, elapsed time: 0 s
2024-08-26 15:48:57,876 xinference.core.supervisor 1397843 DEBUG    Enter get_model_registration, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f91351ad800>, 'LLM', 'qwen:110b'), kwargs: {}
2024-08-26 15:48:57,876 uvicorn.access 1397771 INFO     10.4.134.25:11078 - "GET /v1/model_registrations/LLM/llama3%3A70b HTTP/1.1" 200
2024-08-26 15:48:57,876 xinference.core.supervisor 1397843 DEBUG    Leave get_model_registration, elapsed time: 0 s
2024-08-26 15:48:57,877 uvicorn.access 1397771 INFO     10.4.134.25:11079 - "GET /v1/model_registrations/LLM/qwen1.5%3A14b HTTP/1.1" 200
2024-08-26 15:48:57,877 uvicorn.access 1397771 INFO     10.4.134.25:11080 - "GET /v1/model_registrations/LLM/qwen1.5%3A72b HTTP/1.1" 200
2024-08-26 15:48:57,878 xinference.core.supervisor 1397843 DEBUG    Enter get_model_registration, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f91351ad800>, 'LLM', 'qwen:72b'), kwargs: {}
2024-08-26 15:48:57,878 uvicorn.access 1397771 INFO     10.4.134.25:11081 - "GET /v1/model_registrations/LLM/qwen2%3A7b HTTP/1.1" 200
2024-08-26 15:48:57,878 xinference.core.supervisor 1397843 DEBUG    Leave get_model_registration, elapsed time: 0 s
2024-08-26 15:48:57,878 uvicorn.access 1397771 INFO     10.4.134.25:11076 - "GET /v1/model_registrations/LLM/qwen%3A110b HTTP/1.1" 200
2024-08-26 15:48:57,879 uvicorn.access 1397771 INFO     10.4.134.25:11077 - "GET /v1/model_registrations/LLM/qwen%3A72b HTTP/1.1" 200
2024-08-26 15:49:05,752 xinference.core.supervisor 1397843 DEBUG    Enter query_engines_by_model_name, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f91351ad800>, 'qwen:72b'), kwargs: {}
2024-08-26 15:49:05,753 xinference.core.worker 1397843 DEBUG    Enter query_engines_by_model_name, args: (<xinference.core.worker.WorkerActor object at 0x7f90541af8d0>, 'qwen:72b'), kwargs: {}
2024-08-26 15:49:05,753 xinference.core.worker 1397843 DEBUG    Leave query_engines_by_model_name, elapsed time: 0 s
2024-08-26 15:49:05,753 xinference.core.supervisor 1397843 DEBUG    Leave query_engines_by_model_name, elapsed time: 0 s
2024-08-26 15:49:05,753 uvicorn.access 1397771 INFO     10.4.134.25:11085 - "GET /v1/engines/qwen%3A72b HTTP/1.1" 200
2024-08-26 15:49:19,139 xinference.core.supervisor 1397843 DEBUG    Enter launch_builtin_model, model_uid: qwen:72b, model_name: qwen:72b, model_size: 72, model_format: gptq, quantization: Int4, replica: 1, kwargs: {}
2024-08-26 15:49:19,140 xinference.core.worker 1397843 DEBUG    Enter get_model_count, args: (<xinference.core.worker.WorkerActor object at 0x7f90541af8d0>,), kwargs: {}
2024-08-26 15:49:19,140 xinference.core.worker 1397843 DEBUG    Leave get_model_count, elapsed time: 0 s
2024-08-26 15:49:19,140 xinference.core.worker 1397843 DEBUG    Enter launch_builtin_model, args: (<xinference.core.worker.WorkerActor object at 0x7f90541af8d0>,), kwargs: {'model_uid': 'qwen:72b-1-0', 'model_name': 'qwen:72b', 'model_size_in_billions': 72, 'model_format': 'gptq', 'quantization': 'Int4', 'model_engine': 'vLLM', 'model_type': 'LLM', 'n_gpu': 2, 'request_limits': None, 'peft_model_config': None, 'gpu_idx': None, 'download_hub': None, 'model_path': None}
2024-08-26 15:49:19,141 xinference.core.worker 1397843 DEBUG    GPU selected: [0, 1] for model qwen:72b-1-0
2024-08-26 15:49:25,934 xinference.model.llm.core 1397843 DEBUG    Launching qwen:72b-1-0 with VLLMChatModel
2024-08-26 15:49:25,934 xinference.model.llm.llm_family 1397843 INFO     Caching from URI: /home/gx01/models/Qwen2-72B-Instruct-GPTQ-Int4
2024-08-26 15:49:25,935 xinference.model.llm.llm_family 1397843 INFO     Cache /home/gx01/models/Qwen2-72B-Instruct-GPTQ-Int4 exists
2024-08-26 15:49:25,952 xinference.model.llm.vllm.core 1399591 INFO     Loading qwen:72b with following model config: {'tokenizer_mode': 'auto', 'trust_remote_code': True, 'tensor_parallel_size': 2, 'block_size': 16, 'swap_space': 4, 'gpu_memory_utilization': 0.9, 'max_num_seqs': 256, 'quantization': None, 'max_model_len': 4096}Enable lora: False. Lora count: 0.
2024-08-26 15:49:25,954 transformers.configuration_utils 1399591 INFO     loading configuration file /home/gx01/models/Qwen2-72B-Instruct-GPTQ-Int4/config.json
2024-08-26 15:49:25,955 transformers.configuration_utils 1399591 INFO     Model config Qwen2Config {
  "_name_or_path": "/home/gx01/models/Qwen2-72B-Instruct-GPTQ-Int4",
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 29696,
  "max_position_embeddings": 32768,
  "max_window_layers": 70,
  "model_type": "qwen2",
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "quantization_config": {
    "batch_size": 1,
    "bits": 4,
    "block_name_to_quantize": null,
    "cache_block_outputs": true,
    "damp_percent": 0.01,
    "dataset": null,
    "desc_act": false,
    "exllama_config": {
      "version": 1
    },
    "group_size": 128,
    "max_input_length": null,
    "model_seqlen": null,
    "module_name_preceding_first_block": null,
    "modules_in_block_to_quantize": null,
    "pad_token_id": null,
    "quant_method": "gptq",
    "sym": true,
    "tokenizer": null,
    "true_sequential": true,
    "use_cuda_fp16": false,
    "use_exllama": true
  },
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.43.4",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 152064
}

2024-08-26 15:49:25,955 transformers.models.auto.image_processing_auto 1399591 INFO     Could not locate the image processor configuration file, will try to use the model config instead.
2024-08-26 15:49:25,964 vllm.model_executor.layers.quantization.gptq_marlin 1399591 INFO     The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
2024-08-26 15:49:25,976 vllm.config  1399591 INFO     Defaulting to use mp for distributed inference
2024-08-26 15:49:25,979 vllm.engine.llm_engine 1399591 INFO     Initializing an LLM engine (v0.5.5) with config: model='/home/gx01/models/Qwen2-72B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='/home/gx01/models/Qwen2-72B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/gx01/models/Qwen2-72B-Instruct-GPTQ-Int4, use_v2_block_manager=False, enable_prefix_caching=False)
2024-08-26 15:49:25,989 transformers.tokenization_utils_base 1399591 INFO     loading file vocab.json
2024-08-26 15:49:25,989 transformers.tokenization_utils_base 1399591 INFO     loading file merges.txt
2024-08-26 15:49:25,989 transformers.tokenization_utils_base 1399591 INFO     loading file tokenizer.json
2024-08-26 15:49:25,989 transformers.tokenization_utils_base 1399591 INFO     loading file added_tokens.json
2024-08-26 15:49:25,989 transformers.tokenization_utils_base 1399591 INFO     loading file special_tokens_map.json
2024-08-26 15:49:25,989 transformers.tokenization_utils_base 1399591 INFO     loading file tokenizer_config.json
2024-08-26 15:49:26,226 transformers.tokenization_utils_base 1399591 INFO     Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-08-26 15:49:26,245 transformers.generation.configuration_utils 1399591 INFO     loading configuration file /home/gx01/models/Qwen2-72B-Instruct-GPTQ-Int4/generation_config.json
2024-08-26 15:49:26,246 transformers.generation.configuration_utils 1399591 INFO     Generate config GenerationConfig {
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "pad_token_id": 151643,
  "repetition_penalty": 1.05,
  "temperature": 0.7,
  "top_k": 20,
  "top_p": 0.8
}

2024-08-26 15:49:26,246 vllm.executor.multiproc_gpu_executor 1399591 WARNING  Reducing Torch parallelism from 32 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
2024-08-26 15:49:26,263 vllm.triton_utils.custom_cache_manager 1399591 INFO     Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
2024-08-26 15:49:26,513 vllm.executor.multiproc_worker_utils 1399694 INFO     Worker ready; awaiting tasks
2024-08-26 15:49:26,913 vllm.distributed.parallel_state 1399591 DEBUG    world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:54733 backend=nccl
2024-08-26 15:49:26,959 vllm.distributed.parallel_state 1399694 DEBUG    world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:54733 backend=nccl
2024-08-26 15:49:26,985 vllm.utils   1399591 INFO     Found nccl from library libnccl.so.2
2024-08-26 15:49:26,985 vllm.utils   1399694 INFO     Found nccl from library libnccl.so.2
2024-08-26 15:49:26,985 vllm.distributed.device_communicators.pynccl 1399591 INFO     vLLM is using nccl==2.20.5
2024-08-26 15:49:26,985 vllm.distributed.device_communicators.pynccl 1399694 INFO     vLLM is using nccl==2.20.5
2024-08-26 15:49:27,237 vllm.distributed.device_communicators.custom_all_reduce_utils 1399591 INFO     generating GPU P2P access cache in /home/gx01/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
2024-08-26 15:49:42,218 xinference.core.worker 1397843 ERROR    Failed to load model qwen:72b-1-0
Traceback (most recent call last):
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xinference/core/worker.py", line 888, in launch_builtin_model
    await model_ref.load()
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/pool.py", line 656, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
    return await coro
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xinference/core/model.py", line 303, in load
    self._model.load()
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xinference/model/llm/vllm/core.py", line 239, in load
    self._engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 740, in from_engine_args
    engine = cls(
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 636, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 840, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 272, in __init__
    super().__init__(*args, **kwargs)
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 270, in __init__
    self.model_executor = executor_class(
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 215, in __init__
    super().__init__(*args, **kwargs)
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
    super().__init__(*args, **kwargs)
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 46, in __init__
    self._init_executor()
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 137, in _init_executor
    self._run_workers("init_device")
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers
    driver_worker_output = driver_worker_method(*args, **kwargs)
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/worker/worker.py", line 175, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/worker/worker.py", line 450, in init_worker_distributed_environment
    ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 965, in ensure_model_parallel_initialized
    initialize_model_parallel(tensor_model_parallel_size,
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 931, in initialize_model_parallel
    _TP = init_model_parallel_group(group_ranks,
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 773, in init_model_parallel_group
    return GroupCoordinator(
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 164, in __init__
    self.ca_comm = CustomAllreduce(
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 130, in __init__
    if not _can_p2p(rank, world_size):
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 31, in _can_p2p
    if not gpu_p2p_access_check(rank, i):
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/distributed/device_communicators/custom_all_reduce_utils.py", line 227, in gpu_p2p_access_check
    result = pickle.loads(returned.stdout)
_pickle.UnpicklingError: [address=0.0.0.0:46431, pid=1399591] invalid load key, 'W'.
2024-08-26 15:49:42,311 xinference.core.supervisor 1397843 DEBUG    Enter terminate_model, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f91351ad800>, 'qwen:72b'), kwargs: {'suppress_exception': True}
2024-08-26 15:49:42,311 xinference.core.supervisor 1397843 DEBUG    Leave terminate_model, elapsed time: 0 s
2024-08-26 15:49:42,317 xinference.api.restful_api 1397771 ERROR    [address=0.0.0.0:46431, pid=1399591] invalid load key, 'W'.
Traceback (most recent call last):
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xinference/api/restful_api.py", line 878, in launch_model
    model_uid = await (await self._get_supervisor_ref()).launch_builtin_model(
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/pool.py", line 656, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
    return await coro
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xinference/core/supervisor.py", line 1027, in launch_builtin_model
    await _launch_model()
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xinference/core/supervisor.py", line 991, in _launch_model
    await _launch_one_model(rep_model_uid)
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xinference/core/supervisor.py", line 970, in _launch_one_model
    await worker_ref.launch_builtin_model(
  File "xoscar/core.pyx", line 284, in __pyx_actor_method_wrapper
    async with lock:
  File "xoscar/core.pyx", line 287, in xoscar.core.__pyx_actor_method_wrapper
    result = await result
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xinference/core/utils.py", line 45, in wrapped
    ret = await func(*args, **kwargs)
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xinference/core/worker.py", line 888, in launch_builtin_model
    await model_ref.load()
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/pool.py", line 656, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
    return await coro
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xinference/core/model.py", line 303, in load
    self._model.load()
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/xinference/model/llm/vllm/core.py", line 239, in load
    self._engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 740, in from_engine_args
    engine = cls(
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 636, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 840, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 272, in __init__
    super().__init__(*args, **kwargs)
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 270, in __init__
    self.model_executor = executor_class(
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 215, in __init__
    super().__init__(*args, **kwargs)
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
    super().__init__(*args, **kwargs)
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 46, in __init__
    self._init_executor()
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 137, in _init_executor
    self._run_workers("init_device")
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers
    driver_worker_output = driver_worker_method(*args, **kwargs)
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/worker/worker.py", line 175, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/worker/worker.py", line 450, in init_worker_distributed_environment
    ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 965, in ensure_model_parallel_initialized
    initialize_model_parallel(tensor_model_parallel_size,
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 931, in initialize_model_parallel
    _TP = init_model_parallel_group(group_ranks,
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 773, in init_model_parallel_group
    return GroupCoordinator(
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 164, in __init__
    self.ca_comm = CustomAllreduce(
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 130, in __init__
    if not _can_p2p(rank, world_size):
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 31, in _can_p2p
    if not gpu_p2p_access_check(rank, i):
  File "/home/gx01/miniconda3/envs/inference0.14.3/lib/python3.10/site-packages/vllm/distributed/device_communicators/custom_all_reduce_utils.py", line 227, in gpu_p2p_access_check
    result = pickle.loads(returned.stdout)
_pickle.UnpicklingError: [address=0.0.0.0:46431, pid=1399591] invalid load key, 'W'.
2024-08-26 15:49:42,319 uvicorn.access 1397771 INFO     10.4.134.25:11092 - "POST /v1/models HTTP/1.1" 500
lordk911 commented 1 month ago

Same error (invalid load key, 'W'.). Strangely, it works on one machine, but another machine reports this error.

lordk911 commented 1 month ago

https://github.com/vllm-project/vllm/issues/7846

I also get this warning: WARNING 08-27 14:33:56 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead. See https://pypi.org/project/pynvml for more information.
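A possible connection between this warning and the traceback, offered as an inference rather than something confirmed in this thread: the failing call is `pickle.loads(returned.stdout)`, and the reported load key `'W'` matches the first character of `WARNING`. The sketch below assumes the warning text ends up in the same stdout stream as the pickled result; the helper name `try_unpickle` and the sample payload are hypothetical.

```python
import pickle


def try_unpickle(stdout_bytes: bytes) -> str:
    """Attempt to unpickle captured stdout and describe the outcome."""
    try:
        return f"ok: {pickle.loads(stdout_bytes)!r}"
    except pickle.UnpicklingError as exc:
        return f"UnpicklingError: {exc}"


# A clean payload, standing in for the result a subprocess would write.
clean = pickle.dumps(True)

# The same payload preceded by warning text (assumed to share stdout).
polluted = (
    b"WARNING 08-27 14:33:56 cuda.py:22] You are using a deprecated "
    b"pynvml package.\n" + clean
)

print(try_unpickle(clean))
print(try_unpickle(polluted))
```

If this is what happens, anything that stops the warning from being printed (for example removing the deprecated package) would let the unpickling succeed again.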

TragedyN commented 1 month ago

I tried `pip uninstall pynvml` and it worked.
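To confirm which of the two distributions is present in the active environment before and after uninstalling, a small check like the following may help. It is a sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package names are taken from the warning quoted earlier in this thread.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Run this inside the same environment that launches xinference.
print(installed_versions(["pynvml", "nvidia-ml-py", "vllm"]))
```

A `None` entry for `pynvml` after the uninstall indicates the deprecated distribution is gone from that environment.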

JinCheng666 commented 1 month ago

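Same error (invalid load key, 'W'.). Strangely, it works on one machine, but another machine reports this error.

Same here, it is quite strange. Before the reboot, multi-GPU runs worked; after the reboot they no longer do.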