Model Engine 使用 vLLm 和 Transformers 启动 qwen2.5-32b-instruct 均出错

System Info / 系統信息

Python Version: Python 3.10.6

[root@gpu-server ~]# nvidia-smi
Fri Oct 25 15:24:01 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A800 80G...  Off  | 00000000:4B:00.0 Off |                    0 |
| N/A   75C    P0   118W / 300W |   3840MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A800 80G...  Off  | 00000000:B1:00.0 Off |                    0 |
| N/A   45C    P0    48W / 300W |      2MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     11474      C   java                             3838MiB |
+-----------------------------------------------------------------------------+

Running Xinference with Docker? / 是否使用 Docker 运行 Xinfernece？

[ ] docker / docker
[X] pip install / 通过 pip install 安装
[ ] installation from source / 从源码安装

Version info / 版本信息

(xinference) [root@gpu-server ~]# pip list
Package                           Version
--------------------------------- ------------
accelerate                        0.29.1
addict                            2.4.0
aiobotocore                       2.7.0
aiofiles                          23.2.1
aiohttp                           3.9.1
aioitertools                      0.11.0
aioprometheus                     23.12.0
aiosignal                         1.3.1
aliyun-python-sdk-core            2.14.0
aliyun-python-sdk-kms             2.16.2
altair                            5.2.0
annotated-types                   0.7.0
anyio                             3.7.1
asttokens                         2.4.1
async-timeout                     4.0.3
attrs                             23.1.0
auto-gptq                         0.6.0
av                                12.3.0
bcrypt                            4.1.2
bitsandbytes                      0.41.3.post2
botocore                          1.31.64
certifi                           2023.11.17
cffi                              1.16.0
charset-normalizer                3.3.2
chatglm-cpp                       0.3.0
click                             8.1.7
cloudpickle                       3.0.0
cmake                             3.28.1
colorama                          0.4.6
coloredlogs                       15.0.1
comm                              0.2.0
contourpy                         1.2.0
crcmod                            1.7
cryptography                      41.0.7
ctransformers                     0.2.27
cycler                            0.12.1
datasets                          2.15.0
debugpy                           1.8.0
decorator                         5.1.1
dill                              0.3.7
diskcache                         5.6.3
distro                            1.8.0
ecdsa                             0.18.0
einops                            0.7.0
exceptiongroup                    1.2.0
executing                         2.0.1
fastapi                           0.110.3
ffmpy                             0.3.1
filelock                          3.13.1
fonttools                         4.47.0
frozenlist                        1.4.1
fsspec                            2023.10.0
gast                              0.5.4
gekko                             1.0.6
gradio                            4.26.0
gradio_client                     0.15.1
h11                               0.14.0
httpcore                          1.0.2
httptools                         0.6.1
httpx                             0.25.2
huggingface-hub                   0.24.6
humanfriendly                     10.0
idna                              3.6
importlib-metadata                7.0.0
importlib-resources               6.1.1
interegular                       0.3.3
ipykernel                         6.26.0
ipython                           8.17.2
jedi                              0.19.1
Jinja2                            3.1.2
jiter                             0.6.1
jmespath                          0.10.0
joblib                            1.3.2
jsonschema                        4.20.0
jsonschema-specifications         2023.11.2
jupyter_client                    8.6.0
jupyter_core                      5.5.0
kiwisolver                        1.4.5
lark                              1.1.9
linkify-it-py                     2.0.2
llama_cpp_python                  0.2.25
llvmlite                          0.42.0
lm-format-enforcer                0.9.8
markdown-it-py                    2.2.0
MarkupSafe                        2.1.3
matplotlib                        3.8.2
matplotlib-inline                 0.1.6
mdit-py-plugins                   0.3.3
mdurl                             0.1.2
modelscope                        1.10.0
mpmath                            1.3.0
msgpack                           1.0.7
multidict                         6.0.4
multiprocess                      0.70.15
nest-asyncio                      1.5.8
networkx                          3.2.1
ninja                             1.11.1.1
nltk                              3.8.1
numba                             0.59.1
numpy                             1.26.2
nvidia-cublas-cu12                12.1.3.1
nvidia-cuda-cupti-cu12            12.1.105
nvidia-cuda-nvrtc-cu12            12.1.105
nvidia-cuda-runtime-cu12          12.1.105
nvidia-cudnn-cu12                 8.9.2.26
nvidia-cufft-cu12                 11.0.2.54
nvidia-curand-cu12                10.3.2.106
nvidia-cusolver-cu12              11.4.5.107
nvidia-cusparse-cu12              12.1.0.106
nvidia-ml-py                      12.550.52
nvidia-nccl-cu12                  2.20.5
nvidia-nvjitlink-cu12             12.3.101
nvidia-nvtx-cu12                  12.1.105
openai                            1.50.1
opencv-contrib-python             4.10.0.82
opencv-python                     4.9.0.80
optimum                           1.16.1
orjson                            3.9.10
oss2                              2.18.3
outlines                          0.0.34
packaging                         23.2
pandas                            2.1.4
parso                             0.8.3
passlib                           1.7.4
peft                              0.7.1
pexpect                           4.8.0
Pillow                            10.1.0
pip                               23.3
platformdirs                      4.1.0
prometheus_client                 0.20.0
prometheus-fastapi-instrumentator 7.0.0
prompt-toolkit                    3.0.40
protobuf                          4.25.1
psutil                            5.9.7
ptyprocess                        0.7.0
pure-eval                         0.2.2
py-cpuinfo                        9.0.0
pyarrow                           14.0.2
pyarrow-hotfix                    0.6
pyasn1                            0.5.1
pycparser                         2.21
pycryptodome                      3.19.0
pydantic                          2.6.4
pydantic_core                     2.16.3
pydub                             0.25.1
Pygments                          2.16.1
pynvml                            11.5.0
pyparsing                         3.1.1
python-dateutil                   2.8.2
python-dotenv                     1.0.0
python-jose                       3.3.0
python-multipart                  0.0.9
pytz                              2023.3.post1
PyYAML                            6.0.1
pyzmq                             25.1.1
quantile-python                   1.1
qwen-vl-utils                     0.0.8
ray                               2.9.3
referencing                       0.32.0
regex                             2023.10.3
requests                          2.31.0
rich                              13.7.1
rouge                             1.0.1
rpds-py                           0.15.2
rsa                               4.9
ruff                              0.4.6
s3fs                              2023.10.0
safetensors                       0.4.1
scikit-learn                      1.3.2
scipy                             1.11.4
semantic-version                  2.10.0
sentence-transformers             2.7.0
sentencepiece                     0.1.99
setuptools                        69.0.2
shellingham                       1.5.4
simplejson                        3.19.2
six                               1.16.0
sniffio                           1.3.0
sortedcontainers                  2.4.0
sse-starlette                     1.8.2
stack-data                        0.6.3
starlette                         0.37.2
sympy                             1.12
tabulate                          0.9.0
tblib                             3.0.0
threadpoolctl                     3.2.0
tiktoken                          0.6.0
timm                              0.9.16
tokenizers                        0.20.1
tomli                             2.0.1
tomlkit                           0.12.0
toolz                             0.12.0
torch                             2.3.0+cu121
torchaudio                        2.3.0+cu121
torchvision                       0.18.0+cu121
tornado                           6.3.3
tqdm                              4.66.1
traitlets                         5.13.0
transformers                      4.45.1
transformers-stream-generator     0.0.4
triton                            2.3.0
typer                             0.11.1
typing_extensions                 4.12.2
tzdata                            2023.3
uc-micro-py                       1.0.2
urllib3                           2.0.7
uvicorn                           0.24.0.post1
uvloop                            0.19.0
vllm                              0.4.2
vllm-nccl-cu12                    2.18.1.0.4.0
watchfiles                        0.21.0
wcwidth                           0.2.9
websockets                        11.0.3
wheel                             0.41.2
wrapt                             1.16.0
xformers                          0.0.26.post1
xinference                        0.16.0
xinference-client                 0.16.0
xoscar                            0.3.0
xxhash                            3.4.1
yapf                              0.40.2
yarl                              1.9.4
zipp                              3.17.0

The command used to start Xinference / 用以启动 xinference 的命令

nohup xinference-local -H 172.22.149.188 -p 59997 &

Reproduction / 复现过程

使用 Transformers 启动 qwen2.5-32b-instruct 正常，对话时报错
- 截图

错误日志

Loading checkpoint shards: 100%|███████████████| 17/17 [00:23<00:00,  1.38s/it]
2024-10-25 15:28:56,618 xinference.core.worker 38651 INFO     [request 4db91eb8-9307-11ef-ad67-80615f20f615] Leave launch_builtin_model, elapsed time: 31 s
2024-10-25 15:29:44,187 transformers.models.qwen2.modeling_qwen2 63636 WARNING  Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
2024-10-25 15:29:44,197 xinference.model.llm.transformers.utils 63636 ERROR    Internal error for batch inference: probability tensor contains either `inf`, `nan` or element < 0.
Traceback (most recent call last):
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/model/llm/transformers/utils.py", line 483, in batch_inference_one_step
_batch_inference_one_step_internal(
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/model/llm/transformers/utils.py", line 286, in _batch_inference_one_step_internal
token = _get_token_from_logits(
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/model/llm/transformers/utils.py", line 111, in _get_token_from_logits
indices = torch.multinomial(probs, num_samples=2)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
2024-10-25 15:29:44,281 xinference.api.restful_api 37502 ERROR    Chat completion stream got an error: [address=172.22.149.188:44411, pid=63636] probability tensor contains either `inf`, `nan` or element < 0
Traceback (most recent call last):
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/api/restful_api.py", line 1926, in stream_results
async for item in iterator:
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/api.py", line 340, in __anext__
return await self._actor_ref.__xoscar_next__(self._uid)
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
return self._process_result_message(result)
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 659, in send
result = await self._run_coro(message.message_id, coro)
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
return await coro
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
return await super().__on_receive__(message)  # type: ignore
File "xoscar/core.pyx", line 558, in __on_receive__
raise ex
File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
async with self._lock:
File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
with debug_async_timeout('actor_lock_timeout',
File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
result = await result
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/api.py", line 431, in __xoscar_next__
raise e
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/api.py", line 419, in __xoscar_next__
r = await asyncio.create_task(_async_wrapper(gen))
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/api.py", line 409, in _async_wrapper
return await _gen.__anext__()  # noqa: F821
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/model.py", line 440, in _to_async_gen
async for v in gen:
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/model.py", line 568, in _queue_consumer
raise RuntimeError(res[len(XINFERENCE_STREAMING_ERROR_FLAG) :])
RuntimeError: [address=172.22.149.188:44411, pid=63636] probability tensor contains either `inf`, `nan` or element < 0
Traceback (most recent call last):
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/gradio/queueing.py", line 527, in process_events
response = await route_utils.call_process_api(
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/gradio/route_utils.py", line 261, in call_process_api
output = await app.get_blocks().process_api(
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/gradio/blocks.py", line 1786, in process_api
result = await self.call_function(
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/gradio/blocks.py", line 1350, in call_function
prediction = await utils.async_iteration(iterator)
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/gradio/utils.py", line 583, in async_iteration
return await iterator.__anext__()
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/gradio/utils.py", line 709, in asyncgen_wrapper
response = await iterator.__anext__()
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/gradio/chat_interface.py", line 545, in _stream_fn
first_response = await async_iteration(generator)
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/gradio/utils.py", line 583, in async_iteration
return await iterator.__anext__()
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/gradio/utils.py", line 576, in __anext__
return await anyio.to_thread.run_sync(
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
result = context.run(func, *args)
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/gradio/utils.py", line 559, in run_sync_iterator_async
return next(iterator)
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/chat_interface.py", line 122, in generate_wrapper
for chunk in model.chat(
File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/client/common.py", line 51, in streaming_response_iterator
raise Exception(str(error))
Exception: [address=172.22.149.188:44411, pid=63636] probability tensor contains either `inf`, `nan` or element < 0

使用 vLLm 启动 qwen2.5-32b-instruct 超时报错

截图

错误日志

INFO 10-25 12:11:07 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='/home/models/Qwen25-32B-Instruct', speculative_config=None, tokenizer='/home/models/Qwen25-32B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/home/models/Qwen25-32B-Instruct)
INFO 10-25 12:11:12 utils.py:660] Found nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
(RayWorkerWrapper pid=49552) INFO 10-25 12:11:12 utils.py:660] Found nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1
INFO 10-25 12:11:13 selector.py:81] Cannot use FlashAttention-2 backend because the flash_attn package is not found. Please install it for better performance.
INFO 10-25 12:11:13 selector.py:32] Using XFormers backend.
(RayWorkerWrapper pid=49552) INFO 10-25 12:11:13 selector.py:81] Cannot use FlashAttention-2 backend because the flash_attn package is not found. Please install it for better performance.
(RayWorkerWrapper pid=49552) INFO 10-25 12:11:13 selector.py:32] Using XFormers backend.
(RayWorkerWrapper pid=49552) ERROR 10-25 12:11:13 worker_base.py:145] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=49552) ERROR 10-25 12:11:13 worker_base.py:145] Traceback (most recent call last):
(RayWorkerWrapper pid=49552) ERROR 10-25 12:11:13 worker_base.py:145]   File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 137, in execute_method
(RayWorkerWrapper pid=49552) ERROR 10-25 12:11:13 worker_base.py:145]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=49552) ERROR 10-25 12:11:13 worker_base.py:145]   File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/worker/worker.py", line 104, in init_device
(RayWorkerWrapper pid=49552) ERROR 10-25 12:11:13 worker_base.py:145]     _check_if_gpu_supports_dtype(self.model_config.dtype)
(RayWorkerWrapper pid=49552) ERROR 10-25 12:11:13 worker_base.py:145]   File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/worker/worker.py", line 321, in _check_if_gpu_supports_dtype
(RayWorkerWrapper pid=49552) ERROR 10-25 12:11:13 worker_base.py:145]     compute_capability = torch.cuda.get_device_capability()
(RayWorkerWrapper pid=49552) ERROR 10-25 12:11:13 worker_base.py:145]   File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/torch/cuda/__init__.py", line 430, in get_device_capability
(RayWorkerWrapper pid=49552) ERROR 10-25 12:11:13 worker_base.py:145]     prop = get_device_properties(device)
(RayWorkerWrapper pid=49552) ERROR 10-25 12:11:13 worker_base.py:145]   File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/torch/cuda/__init__.py", line 447, in get_device_properties
(RayWorkerWrapper pid=49552) ERROR 10-25 12:11:13 worker_base.py:145]     raise AssertionError("Invalid device id")
(RayWorkerWrapper pid=49552) ERROR 10-25 12:11:13 worker_base.py:145] AssertionError: Invalid device id
ERROR 10-25 12:21:14 worker_base.py:145] Error executing method init_device. This might cause deadlock in distributed execution.
ERROR 10-25 12:21:14 worker_base.py:145] Traceback (most recent call last):
ERROR 10-25 12:21:14 worker_base.py:145]   File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 137, in execute_method
ERROR 10-25 12:21:14 worker_base.py:145]     return executor(*args, **kwargs)
ERROR 10-25 12:21:14 worker_base.py:145]   File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/worker/worker.py", line 111, in init_device
ERROR 10-25 12:21:14 worker_base.py:145]     init_worker_distributed_environment(self.parallel_config, self.rank,
ERROR 10-25 12:21:14 worker_base.py:145]   File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/worker/worker.py", line 288, in init_worker_distributed_environment
ERROR 10-25 12:21:14 worker_base.py:145]     init_distributed_environment(parallel_config.world_size, rank,
ERROR 10-25 12:21:14 worker_base.py:145]   File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 70, in init_distributed_environment
ERROR 10-25 12:21:14 worker_base.py:145]     torch.distributed.init_process_group(
ERROR 10-25 12:21:14 worker_base.py:145]   File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
ERROR 10-25 12:21:14 worker_base.py:145]     return func(*args, **kwargs)
ERROR 10-25 12:21:14 worker_base.py:145]   File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
ERROR 10-25 12:21:14 worker_base.py:145]     func_return = func(*args, **kwargs)
ERROR 10-25 12:21:14 worker_base.py:145]   File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1305, in init_process_group
ERROR 10-25 12:21:14 worker_base.py:145]     store, rank, world_size = next(rendezvous_iterator)
ERROR 10-25 12:21:14 worker_base.py:145]   File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 199, in _tcp_rendezvous_handler
ERROR 10-25 12:21:14 worker_base.py:145]     store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout, use_libuv)
ERROR 10-25 12:21:14 worker_base.py:145]   File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 174, in _create_c10d_store
ERROR 10-25 12:21:14 worker_base.py:145]     return TCPStore(
ERROR 10-25 12:21:14 worker_base.py:145] torch.distributed.DistStoreError: Timed out after 601 seconds waiting for clients. 1/2 clients joined.
2024-10-25 12:21:14,825 xinference.core.worker 38651 ERROR    Failed to load model custom-qwen25-32b-instruct-1-0
Traceback (most recent call last):
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/worker.py", line 894, in launch_builtin_model
    await model_ref.load()
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
    return self._process_result_message(result)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 659, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/model.py", line 375, in load
    self._model.load()
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/model/llm/vllm/core.py", line 261, in load
    self._engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 366, in from_engine_args
    engine = cls(
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 324, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 442, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 160, in __init__
    self.model_executor = executor_class(
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 300, in __init__
    super().__init__(*args, **kwargs)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 41, in __init__
    self._init_executor()
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 43, in _init_executor
    self._init_workers_ray(placement_group)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 164, in _init_workers_ray
    self._run_workers("init_device")
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 234, in _run_workers
    driver_worker_output = self.driver_worker.execute_method(
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 146, in execute_method
    raise e
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 137, in execute_method
    return executor(*args, **kwargs)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/worker/worker.py", line 111, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/worker/worker.py", line 288, in init_worker_distributed_environment
    init_distributed_environment(parallel_config.world_size, rank,
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 70, in init_distributed_environment
    torch.distributed.init_process_group(
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
    func_return = func(*args, **kwargs)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1305, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 199, in _tcp_rendezvous_handler
    store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout, use_libuv)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 174, in _create_c10d_store
    return TCPStore(
torch.distributed.DistStoreError: [address=172.22.149.188:45585, pid=43525] Timed out after 601 seconds waiting for clients. 1/2 clients joined.
2024-10-25 12:21:14,970 xinference.core.worker 38651 ERROR    [request b9f07de0-92eb-11ef-ad67-80615f20f615] Leave launch_builtin_model, error: [address=172.22.149.188:45585, pid=43525] Timed out after 601 seconds waiting for clients. 1/2 clients joined., elapsed time: 614 s
Traceback (most recent call last):
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/utils.py", line 78, in wrapped
    ret = await func(*args, **kwargs)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/worker.py", line 894, in launch_builtin_model
    await model_ref.load()
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
    return self._process_result_message(result)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 659, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/model.py", line 375, in load
    self._model.load()
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/model/llm/vllm/core.py", line 261, in load
    self._engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 366, in from_engine_args
    engine = cls(
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 324, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 442, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 160, in __init__
    self.model_executor = executor_class(
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 300, in __init__
    super().__init__(*args, **kwargs)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 41, in __init__
    self._init_executor()
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 43, in _init_executor
    self._init_workers_ray(placement_group)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 164, in _init_workers_ray
    self._run_workers("init_device")
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 234, in _run_workers
    driver_worker_output = self.driver_worker.execute_method(
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 146, in execute_method
    raise e
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 137, in execute_method
    return executor(*args, **kwargs)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/worker/worker.py", line 111, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/worker/worker.py", line 288, in init_worker_distributed_environment
    init_distributed_environment(parallel_config.world_size, rank,
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 70, in init_distributed_environment
    torch.distributed.init_process_group(
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
    func_return = func(*args, **kwargs)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1305, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 199, in _tcp_rendezvous_handler
    store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout, use_libuv)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 174, in _create_c10d_store
    return TCPStore(
torch.distributed.DistStoreError: [address=172.22.149.188:45585, pid=43525] Timed out after 601 seconds waiting for clients. 1/2 clients joined.
2024-10-25 12:21:14,974 xinference.api.restful_api 37502 ERROR    [address=172.22.149.188:45585, pid=43525] Timed out after 601 seconds waiting for clients. 1/2 clients joined.
Traceback (most recent call last):
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/api/restful_api.py", line 977, in launch_model
    model_uid = await (await self._get_supervisor_ref()).launch_builtin_model(
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
    return self._process_result_message(result)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 659, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/supervisor.py", line 1040, in launch_builtin_model
    await _launch_model()
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/supervisor.py", line 1004, in _launch_model
    await _launch_one_model(rep_model_uid)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/supervisor.py", line 983, in _launch_one_model
    await worker_ref.launch_builtin_model(
  File "xoscar/core.pyx", line 284, in __pyx_actor_method_wrapper
    async with lock:
  File "xoscar/core.pyx", line 287, in xoscar.core.__pyx_actor_method_wrapper
    result = await result
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/utils.py", line 78, in wrapped
    ret = await func(*args, **kwargs)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/worker.py", line 894, in launch_builtin_model
    await model_ref.load()
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
    return self._process_result_message(result)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 659, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/model.py", line 375, in load
    self._model.load()
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/model/llm/vllm/core.py", line 261, in load
    self._engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 366, in from_engine_args
    engine = cls(
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 324, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 442, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 160, in __init__
    self.model_executor = executor_class(
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 300, in __init__
    super().__init__(*args, **kwargs)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 41, in __init__
    self._init_executor()
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 43, in _init_executor
    self._init_workers_ray(placement_group)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 164, in _init_workers_ray
    self._run_workers("init_device")
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 234, in _run_workers
    driver_worker_output = self.driver_worker.execute_method(
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 146, in execute_method
    raise e
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 137, in execute_method
    return executor(*args, **kwargs)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/worker/worker.py", line 111, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/worker/worker.py", line 288, in init_worker_distributed_environment
    init_distributed_environment(parallel_config.world_size, rank,
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 70, in init_distributed_environment
    torch.distributed.init_process_group(
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
    func_return = func(*args, **kwargs)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1305, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 199, in _tcp_rendezvous_handler
    store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout, use_libuv)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 174, in _create_c10d_store
    return TCPStore(
torch.distributed.DistStoreError: [address=172.22.149.188:45585, pid=43525] Timed out after 601 seconds waiting for clients. 1/2 clients joined.
2024-10-25 12:22:12,510 xinference.core.supervisor 38651 ERROR    [request 4a696e76-92ed-11ef-ad67-80615f20f615] Leave get_model, error: Model not found in the model list, uid: custom-qwen2-vl-7b-instruct, elapsed time: 0 s
Traceback (most recent call last):
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/utils.py", line 78, in wrapped
    ret = await func(*args, **kwargs)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/supervisor.py", line 1137, in get_model
    raise ValueError(f"Model not found in the model list, uid: {model_uid}")
ValueError: Model not found in the model list, uid: custom-qwen2-vl-7b-instruct
2024-10-25 12:22:12,512 xinference.api.restful_api 37502 ERROR    [address=172.22.149.188:61160, pid=38651] Model not found in the model list, uid: custom-qwen2-vl-7b-instruct
Traceback (most recent call last):
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/api/restful_api.py", line 1856, in create_chat_completion
    model = await (await self._get_supervisor_ref()).get_model(model_uid)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
    return self._process_result_message(result)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 659, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/utils.py", line 78, in wrapped
    ret = await func(*args, **kwargs)
  File "/home/miniconda3/envs/xinference/lib/python3.10/site-packages/xinference/core/supervisor.py", line 1137, in get_model
    raise ValueError(f"Model not found in the model list, uid: {model_uid}")
ValueError: [address=172.22.149.188:61160, pid=38651] Model not found in the model list, uid: custom-qwen2-vl-7b-instruct

Expected behavior / 期待表现

解决反馈问题

xorbitsai / inference

Model Engine 使用 vLLm 和 Transformers 启动 qwen2.5-32b-instruct 均出错 #2486

System Info / 系統信息

Running Xinference with Docker? / 是否使用 Docker 运行 Xinfernece？

Version info / 版本信息

The command used to start Xinference / 用以启动 xinference 的命令

Reproduction / 复现过程

Expected behavior / 期待表现