xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

0.15.3 is incompatible with qwen2_5-instruct-gptq-72b-Int8 #2389

Open · monk-after-90s opened this issue 1 week ago

monk-after-90s commented 1 week ago

System Info

CUDA: 12.3
OS: Ubuntu 22.04
Python: 3.11.9
pip list: vllm 0.6.2, vllm-flash-attn 2.6.1, xinference 0.15.2

Running Xinference with Docker?

Version info

0.15.3

The command used to start Xinference

xinference-local --port 9997

Reproduction

Deploying qwen2_5-instruct-gptq-72b-Int8 on version 0.15.3 fails; the key log is:

Oct 01 18:01:30 uni0 xinference-local[3167241]: 2024-10-01 18:01:30,054 xinference.core.worker 3167241 ERROR    Failed to load model qwen2.5-instruct-1-0
Oct 01 18:01:30 uni0 xinference-local[3167241]: Traceback (most recent call last):
Oct 01 18:01:30 uni0 xinference-local[3167241]:   File "/home/anxu/.conda/envs/xinference/lib/python3.11/site-packages/xinference/core/worker.py", line 893, in launch_builtin_model
Oct 01 18:01:30 uni0 xinference-local[3167241]:     await model_ref.load()
Oct 01 18:01:30 uni0 xinference-local[3167241]:   File "/home/anxu/.conda/envs/xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 231, in send
Oct 01 18:01:30 uni0 xinference-local[3167241]:     return self._process_result_message(result)
Oct 01 18:01:30 uni0 xinference-local[3167241]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Oct 01 18:01:30 uni0 xinference-local[3167241]:   File "/home/anxu/.conda/envs/xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
Oct 01 18:01:30 uni0 xinference-local[3167241]:     raise message.as_instanceof_cause()
Oct 01 18:01:30 uni0 xinference-local[3167241]:   File "/home/anxu/.conda/envs/xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 656, in send
Oct 01 18:01:30 uni0 xinference-local[3167241]:     result = await self._run_coro(message.message_id, coro)
Oct 01 18:01:30 uni0 xinference-local[3167241]:     ^^^^^^^^^^^^^^^^^
Oct 01 18:01:30 uni0 xinference-local[3167241]:   File "/home/anxu/.conda/envs/xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
Oct 01 18:01:30 uni0 xinference-local[3167241]:     return await coro
Oct 01 18:01:30 uni0 xinference-local[3167241]:   File "/home/anxu/.conda/envs/xinference/lib/python3.11/site-packages/xoscar/api.py", line 384, in __on_receive__
Oct 01 18:01:30 uni0 xinference-local[3167241]:     return await super().__on_receive__(message)  # type: ignore
Oct 01 18:01:30 uni0 xinference-local[3167241]:     ^^^^^^^^^^^^^^^^^
Oct 01 18:01:30 uni0 xinference-local[3167241]:   File "xoscar/core.pyx", line 558, in __on_receive__
Oct 01 18:01:30 uni0 xinference-local[3167241]:     raise ex
Oct 01 18:01:30 uni0 xinference-local[3167241]:   File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
Oct 01 18:01:30 uni0 xinference-local[3167241]:     async with self._lock:
Oct 01 18:01:30 uni0 xinference-local[3167241]:     ^^^^^^^^^^^^^^^^^
Oct 01 18:01:30 uni0 xinference-local[3167241]:   File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
Oct 01 18:01:30 uni0 xinference-local[3167241]:     with debug_async_timeout('actor_lock_timeout',
Oct 01 18:01:30 uni0 xinference-local[3167241]:     ^^^^^^^^^^^^^^^^^
Oct 01 18:01:30 uni0 xinference-local[3167241]:   File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
Oct 01 18:01:30 uni0 xinference-local[3167241]:     result = await result
Oct 01 18:01:30 uni0 xinference-local[3167241]:     ^^^^^^^^^^^^^^^^^
Oct 01 18:01:30 uni0 xinference-local[3167241]:   File "/home/anxu/.conda/envs/xinference/lib/python3.11/site-packages/xinference/core/model.py", line 309, in load
Oct 01 18:01:30 uni0 xinference-local[3167241]:     self._model.load()
Oct 01 18:01:30 uni0 xinference-local[3167241]:     ^^^^^^^^^^^^^^^^^
Oct 01 18:01:30 uni0 xinference-local[3167241]:   File "/home/anxu/.conda/envs/xinference/lib/python3.11/site-packages/xinference/model/llm/vllm/core.py", line 257, in load
Oct 01 18:01:30 uni0 xinference-local[3167241]:     self._engine = AsyncLLMEngine.from_engine_args(engine_args)
Oct 01 18:01:30 uni0 xinference-local[3167241]:     ^^^^^^^^^^^^^^^^^
Oct 01 18:01:30 uni0 xinference-local[3167241]:   File "/home/anxu/.conda/envs/xinference/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 576, in from_engine_args
Oct 01 18:01:30 uni0 xinference-local[3167241]:     engine = cls(
Oct 01 18:01:30 uni0 xinference-local[3167241]:   File "/home/anxu/.conda/envs/xinference/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 471, in __init__
Oct 01 18:01:30 uni0 xinference-local[3167241]:     self.engine = self._engine_class(*args, **kwargs)
Oct 01 18:01:30 uni0 xinference-local[3167241]:     ^^^^^^^^^^^^^^^^^
Oct 01 18:01:30 uni0 xinference-local[3167241]:   File "/home/anxu/.conda/envs/xinference/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 260, in __init__
Oct 01 18:01:30 uni0 xinference-local[3167241]:     super().__init__(*args, **kwargs)
Oct 01 18:01:30 uni0 xinference-local[3167241]:     ^^^^^^^^^^^^^^^^^
Oct 01 18:01:30 uni0 xinference-local[3167241]:   File "/home/anxu/.conda/envs/xinference/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 339, in __init__
Oct 01 18:01:30 uni0 xinference-local[3167241]:     self._initialize_kv_caches()
Oct 01 18:01:30 uni0 xinference-local[3167241]:     ^^^^^^^^^^^^^^^^^
Oct 01 18:01:30 uni0 xinference-local[3167241]:   File "/home/anxu/.conda/envs/xinference/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 487, in _initialize_kv_caches
Oct 01 18:01:30 uni0 xinference-local[3167241]:     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
Oct 01 18:01:30 uni0 xinference-local[3167241]:     ^^^^^^^^^^^^^^^^^
Oct 01 18:01:30 uni0 xinference-local[3167241]:   File "/home/anxu/.conda/envs/xinference/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 63, in initialize_cache
Oct 01 18:01:30 uni0 xinference-local[3167241]:     self._run_workers("initialize_cache",
Oct 01 18:01:30 uni0 xinference-local[3167241]:     ^^^^^^^^^^^^^^^^^
Oct 01 18:01:30 uni0 xinference-local[3167241]:   File "/home/anxu/.conda/envs/xinference/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 185, in _run_workers
Oct 01 18:01:30 uni0 xinference-local[3167241]:     driver_worker_output = driver_worker_method(*args, **kwargs)
Oct 01 18:01:30 uni0 xinference-local[3167241]:     ^^^^^^^^^^^^^^^^^
Oct 01 18:01:30 uni0 xinference-local[3167241]:   File "/home/anxu/.conda/envs/xinference/lib/python3.11/site-packages/vllm/worker/worker.py", line 258, in initialize_cache
Oct 01 18:01:30 uni0 xinference-local[3167241]:     raise_if_cache_size_invalid(num_gpu_blocks,
Oct 01 18:01:30 uni0 xinference-local[3167241]:     ^^^^^^^^^^^^^^^^^
Oct 01 18:01:30 uni0 xinference-local[3167241]:   File "/home/anxu/.conda/envs/xinference/lib/python3.11/site-packages/vllm/worker/worker.py", line 483, in raise_if_cache_size_invalid
Oct 01 18:01:30 uni0 xinference-local[3167241]:     raise ValueError(
Oct 01 18:01:30 uni0 xinference-local[3167241]: ValueError: [address=127.0.0.1:43411, pid=3167626] The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (18176). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
Oct 01 18:01:30 uni0 xinference-local[3167626]: 2024-10-01 18:01:30,332 vllm.executor.multiproc_worker_utils 3167626 ERROR    Worker VllmWorkerProcess pid 3167883 died, exit code: -15
Oct 01 18:01:30 uni0 xinference-local[3167626]: 2024-10-01 18:01:30,332 vllm.executor.multiproc_worker_utils 3167626 INFO     Killing local vLLM worker processes
Oct 01 18:01:30 uni0 xinference-local[3167241]: 2024-10-01 18:01:30,904 xinference.core.worker 3167241 ERROR    [request 1497d888-7fdc-11ef-8045-207bd2601f8f] Leave launch_builtin_model, error: [address=127.0.0.1:43411, pid=3167626] The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (18176). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine., elapsed time: 22 s
Oct 01 18:01:30 uni0 xinference-local[3167241]: [identical traceback repeated via xinference/core/utils.py wrapped() → launch_builtin_model, with the same frames and the same final ValueError as above]
Oct 01 18:01:30 uni0 xinference-local[3167053]: 2024-10-01 18:01:30,914 xinference.api.restful_api 3167053 ERROR    [address=127.0.0.1:43411, pid=3167626] The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (18176). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
Oct 01 18:01:30 uni0 xinference-local[3167053]: Traceback (most recent call last):
Oct 01 18:01:30 uni0 xinference-local[3167053]:   File "/home/anxu/.conda/envs/xinference/lib/python3.11/site-packages/xinference/api/restful_api.py", line 967, in launch_model
Oct 01 18:01:30 uni0 xinference-local[3167053]:     model_uid = await (await self._get_supervisor_ref()).launch_builtin_model(
Oct 01 18:01:30 uni0 xinference-local[3167053]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Oct 01 18:01:30 uni0 xinference-local[3167053]:   File "/home/anxu/.conda/envs/xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 231, in send
Oct 01 18:01:30 uni0 xinference-local[3167053]:     return self._process_result_message(result)
Oct 01 18:01:30 uni0 xinference-local[3167053]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Oct 01 18:01:30 uni0 xinference-local[3167053]:   File "/home/anxu/.conda/envs/xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
Oct 01 18:01:30 uni0 xinference-local[3167053]:     raise message.as_instanceof_cause()
Oct 01 18:01:30 uni0 xinference-local[3167053]:   File "/home/anxu/.conda/envs/xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 656, in send
Oct 01 18:01:30 uni0 xinference-local[3167053]:     result = await self._run_coro(message.message_id, coro)
Oct 01 18:01:30 uni0 xinference-local[3167053]:     ^^^^^^^^^^^^^^^^^
Oct 01 18:01:30 uni0 xinference-local[3167053]:   File "/home/anxu/.conda/envs/xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
Oct 01 18:01:30 uni0 xinference-local[3167053]:     return await coro
Oct 01 18:01:30 uni0 xinference-local[3167053]:   File "/home/anxu/.conda/envs/xinference/lib/python3.11/site-packages/xoscar/api.py", line 384, in __on_receive__
Oct 01 18:01:30 uni0 xinference-local[3167053]:     return await super().__on_receive__(message)  # type: ignore
Oct 01 18:01:30 uni0 xinference-local[3167053]:     ^^^^^^^^^^^^^^^^^
Oct 01 18:01:30 uni0 xinference-local[3167053]:   File "xoscar/core.pyx", line 558, in __on_receive__
Oct 01 18:01:30 uni0 xinference-local[3167053]:     raise ex
Oct 01 18:01:30 uni0 xinference-local[3167053]:   File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
Oct 01 18:01:30 uni0 xinference-local[3167053]:     async with self._lock:
Oct 01 18:01:30 uni0 xinference-local[3167053]:     ^^^^^^^^^^^^^^^^^
Oct 01 18:01:30 uni0 xinference-local[3167053]:   File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
Oct 01 18:01:30 uni0 xinference-local[3167053]:     with debug_async_timeout('actor_lock_timeout',
Oct 01 18:01:30 uni0 xinference-local[3167053]:     ^^^^^^^^^^^^^^^^^
Oct 01 18:01:30 uni0 xinference-local[3167053]:   File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
Oct 01 18:01:30 uni0 xinference-local[3167053]:     result = await result
Oct 01 18:01:30 uni0 xinference-local[3167053]:     ^^^^^^^^^^^^^^^^^
Oct 01 18:01:30 uni0 xinference-local[3167053]:   File "/home/anxu/.conda/envs/xinference/lib/python3.11/site-packages/xinference/core/supervisor.py", line 1032, in launch_builtin_model
Oct 01 18:01:30 uni0 xinference-local[3167053]:     await _launch_model()
Oct 01 18:01:30 uni0 xinference-local[3167053]:     ^^^^^^^^^^^^^^^^^
Oct 01 18:01:30 uni0 xinference-local[3167053]:   File "/home/anxu/.conda/envs/xinference/lib/python3.11/site-packages/xinference/core/supervisor.py", line 996, in _launch_model
Oct 01 18:01:30 uni0 xinference-local[3167053]:     await _launch_one_model(rep_model_uid)
Oct 01 18:01:30 uni0 xinference-local[3167053]:     ^^^^^^^^^^^^^^^^^
Oct 01 18:01:30 uni0 xinference-local[3167053]:   File "/home/anxu/.conda/envs/xinference/lib/python3.11/site-packages/xinference/core/supervisor.py", line 975, in _launch_one_model
Oct 01 18:01:30 uni0 xinference-local[3167053]:     await worker_ref.launch_builtin_model(
Oct 01 18:01:30 uni0 xinference-local[3167053]:     ^^^^^^^^^^^^^^^^^
Oct 01 18:01:30 uni0 xinference-local[3167053]:   File "xoscar/core.pyx", line 284, in __pyx_actor_method_wrapper
Oct 01 18:01:30 uni0 xinference-local[3167053]:     async with lock:
Oct 01 18:01:30 uni0 xinference-local[3167053]:   File "xoscar/core.pyx", line 287, in xoscar.core.__pyx_actor_method_wrapper
Oct 01 18:01:30 uni0 xinference-local[3167053]:     result = await result
Oct 01 18:01:30 uni0 xinference-local[3167053]:     ^^^^^^^^^^^^^^^^^
Oct 01 18:01:30 uni0 xinference-local[3167053]:   File "/home/anxu/.conda/envs/xinference/lib/python3.11/site-packages/xinference/core/utils.py", line 69, in wrapped
Oct 01 18:01:30 uni0 xinference-local[3167053]:     ret = await func(*args, **kwargs)
Oct 01 18:01:30 uni0 xinference-local[3167053]:     ^^^^^^^^^^^^^^^^^
Oct 01 18:01:30 uni0 xinference-local[3167053]:   File "/home/anxu/.conda/envs/xinference/lib/python3.11/site-packages/xinference/core/worker.py", line 893, in launch_builtin_model
Oct 01 18:01:30 uni0 xinference-local[3167053]:     await model_ref.load()
Oct 01 18:01:30 uni0 xinference-local[3167053]:     ^^^^^^^^^^^^^^^^^
Oct 01 18:01:30 uni0 xinference-local[3167053]: [same xoscar and vLLM frames as in the first traceback]
Oct 01 18:01:30 uni0 xinference-local[3167053]: ValueError: [address=127.0.0.1:43411, pid=3167626] The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (18176). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
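For context, the numbers in the ValueError check out arithmetically: with yarn rope scaling (factor 4.0 over 32768 original positions, per the model config in the 0.15.2 log below), vLLM derives a max seq len of 131072 tokens, while at gpu_memory_utilization=0.9 the free VRAM only fits a KV cache of 18176 tokens, so the engine refuses to start. A minimal sketch of the check vLLM performs in raise_if_cache_size_invalid (an illustrative reconstruction with made-up function names, not vLLM's actual code):

```python
# Illustrative reconstruction of vLLM's KV-cache capacity check;
# function names here are hypothetical, not vLLM internals.

def kv_cache_capacity(num_gpu_blocks: int, block_size: int) -> int:
    """Total tokens the paged KV cache can hold."""
    return num_gpu_blocks * block_size

def check_fits(max_model_len: int, num_gpu_blocks: int, block_size: int = 16) -> bool:
    """The engine can only start if every sequence fits in the cache."""
    return max_model_len <= kv_cache_capacity(num_gpu_blocks, block_size)

# Numbers from the log: yarn scaling extends 32768 * 4.0 -> 131072 tokens,
# but only 18176 / 16 = 1136 GPU blocks fit at gpu_memory_utilization=0.9.
max_model_len = int(32768 * 4.0)             # 131072
cache_tokens = kv_cache_capacity(1136, 16)   # 18176
print(max_model_len, cache_tokens, check_fits(max_model_len, 1136))
```

The 0.15.2 log below shows xinference previously passed 'max_model_len': 4096, which fits easily; 0.15.3 apparently lets the yarn-extended 131072 through instead, so launching with an explicit max_model_len (e.g. 4096) or a higher gpu_memory_utilization should work around the error, as the ValueError itself suggests.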

Expected behavior

Deploying qwen2_5-instruct-gptq-72b-Int8 on version 0.15.2 works; the log is:

Oct 01 18:02:55 uni0 xinference-local[3168834]: 2024-10-01 18:02:55,866 xinference.core.worker 3168834 INFO     [request 547b291e-7fdc-11ef-bf30-207bd2601f8f] Enter launch_builtin_model, args: <xinference.core.worker.WorkerActor object at 0x7f74be9f8ef0>, kwargs: model_uid=qwen2.5-instruct-1-0,model_name=qwen2.5-instruct,model_size_in_billions=72,model_format=gptq,quantization=Int8,model_engine=vLLM,model_type=LLM,n_gpu=auto,request_limits=None,peft_model_config=None,gpu_idx=[4, 5, 6, 7],download_hub=modelscope,model_path=None
Oct 01 18:02:55 uni0 xinference-local[3168834]: 2024-10-01 18:02:55,868 xinference.core.worker 3168834 INFO     You specify to launch the model: qwen2.5-instruct on GPU index: [4, 5, 6, 7] of the worker: 127.0.0.1:58180, xinference will automatically ignore the `n_gpu` option.
Oct 01 18:03:00 uni0 xinference-local[3168834]: 2024-10-01 18:03:00,064 xinference.model.llm.llm_family 3168834 INFO     Caching from Modelscope: qwen/Qwen2.5-72B-Instruct-GPTQ-Int8
Oct 01 18:03:00 uni0 xinference-local[3169096]: 2024-10-01 18:03:00,176 xinference.model.llm.vllm.core 3169096 INFO     Loading qwen2.5-instruct with following model config: {'tokenizer_mode': 'auto', 'trust_remote_code': True, 'tensor_parallel_size': 4, 'block_size': 16, 'swap_space': 4, 'gpu_memory_utilization': 0.9, 'max_num_seqs': 256, 'quantization': None, 'max_model_len': 4096}Enable lora: False. Lora count: 0.
Oct 01 18:03:00 uni0 xinference-local[3169096]: 2024-10-01 18:03:00,177 transformers.configuration_utils 3169096 INFO     loading configuration file /home/anxu/.xinference/cache/qwen2_5-instruct-gptq-72b-Int8/config.json
Oct 01 18:03:00 uni0 xinference-local[3169096]: 2024-10-01 18:03:00,177 transformers.configuration_utils 3169096 INFO     loading configuration file /home/anxu/.xinference/cache/qwen2_5-instruct-gptq-72b-Int8/config.json
Oct 01 18:03:00 uni0 xinference-local[3169096]: 2024-10-01 18:03:00,178 transformers.modeling_rope_utils 3169096 WARNING  Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'original_max_position_embeddings'}
Oct 01 18:03:00 uni0 xinference-local[3169096]: 2024-10-01 18:03:00,179 transformers.configuration_utils 3169096 INFO     Model config Qwen2Config {
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "_name_or_path": "/home/anxu/.xinference/cache/qwen2_5-instruct-gptq-72b-Int8",
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "architectures": [
Oct 01 18:03:00 uni0 xinference-local[3169096]:     "Qwen2ForCausalLM"
Oct 01 18:03:00 uni0 xinference-local[3169096]:   ],
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "attention_dropout": 0.0,
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "bos_token_id": 151643,
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "eos_token_id": 151645,
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "hidden_act": "silu",
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "hidden_size": 8192,
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "initializer_range": 0.02,
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "intermediate_size": 29696,
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "max_position_embeddings": 32768,
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "max_window_layers": 70,
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "model_type": "qwen2",
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "num_attention_heads": 64,
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "num_hidden_layers": 80,
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "num_key_value_heads": 8,
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "quantization_config": {
Oct 01 18:03:00 uni0 xinference-local[3169096]:     "batch_size": 1,
Oct 01 18:03:00 uni0 xinference-local[3169096]:     "bits": 8,
Oct 01 18:03:00 uni0 xinference-local[3169096]:     "block_name_to_quantize": null,
Oct 01 18:03:00 uni0 xinference-local[3169096]:     "cache_block_outputs": true,
Oct 01 18:03:00 uni0 xinference-local[3169096]:     "damp_percent": 0.01,
Oct 01 18:03:00 uni0 xinference-local[3169096]:     "dataset": null,
Oct 01 18:03:00 uni0 xinference-local[3169096]:     "desc_act": false,
Oct 01 18:03:00 uni0 xinference-local[3169096]:     "exllama_config": {
Oct 01 18:03:00 uni0 xinference-local[3169096]:       "version": 1
Oct 01 18:03:00 uni0 xinference-local[3169096]:     },
Oct 01 18:03:00 uni0 xinference-local[3169096]:     "group_size": 128,
Oct 01 18:03:00 uni0 xinference-local[3169096]:     "max_input_length": null,
Oct 01 18:03:00 uni0 xinference-local[3169096]:     "model_seqlen": null,
Oct 01 18:03:00 uni0 xinference-local[3169096]:     "module_name_preceding_first_block": null,
Oct 01 18:03:00 uni0 xinference-local[3169096]:     "modules_in_block_to_quantize": null,
Oct 01 18:03:00 uni0 xinference-local[3169096]:     "pad_token_id": null,
Oct 01 18:03:00 uni0 xinference-local[3169096]:     "quant_method": "gptq",
Oct 01 18:03:00 uni0 xinference-local[3169096]:     "sym": true,
Oct 01 18:03:00 uni0 xinference-local[3169096]:     "tokenizer": null,
Oct 01 18:03:00 uni0 xinference-local[3169096]:     "true_sequential": true,
Oct 01 18:03:00 uni0 xinference-local[3169096]:     "use_cuda_fp16": false,
Oct 01 18:03:00 uni0 xinference-local[3169096]:     "use_exllama": true
Oct 01 18:03:00 uni0 xinference-local[3169096]:   },
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "rms_norm_eps": 1e-06,
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "rope_scaling": {
Oct 01 18:03:00 uni0 xinference-local[3169096]:     "factor": 4.0,
Oct 01 18:03:00 uni0 xinference-local[3169096]:     "original_max_position_embeddings": 32768,
Oct 01 18:03:00 uni0 xinference-local[3169096]:     "rope_type": "yarn",
Oct 01 18:03:00 uni0 xinference-local[3169096]:     "type": "yarn"
Oct 01 18:03:00 uni0 xinference-local[3169096]:   },
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "rope_theta": 1000000.0,
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "sliding_window": null,
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "tie_word_embeddings": false,
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "torch_dtype": "float16",
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "transformers_version": "4.45.1",
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "use_cache": true,
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "use_sliding_window": false,
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "vocab_size": 152064
Oct 01 18:03:00 uni0 xinference-local[3169096]: }
Oct 01 18:03:00 uni0 xinference-local[3169096]: 2024-10-01 18:03:00,180 transformers.models.auto.image_processing_auto 3169096 INFO     Could not locate the image processor configuration file, will try to use the model config instead.
Oct 01 18:03:00 uni0 xinference-local[3169096]: 2024-10-01 18:03:00,226 vllm.model_executor.layers.quantization.gptq_marlin 3169096 INFO     The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
Oct 01 18:03:00 uni0 xinference-local[3169096]: 2024-10-01 18:03:00,296 vllm.config  3169096 INFO     Defaulting to use mp for distributed inference
Oct 01 18:03:00 uni0 xinference-local[3169096]: 2024-10-01 18:03:00,300 vllm.engine.llm_engine 3169096 INFO     Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='/home/anxu/.xinference/cache/qwen2_5-instruct-gptq-72b-Int8', speculative_config=None, tokenizer='/home/anxu/.xinference/cache/qwen2_5-instruct-gptq-72b-Int8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/anxu/.xinference/cache/qwen2_5-instruct-gptq-72b-Int8, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
Oct 01 18:03:00 uni0 xinference-local[3169096]: 2024-10-01 18:03:00,317 transformers.tokenization_utils_base 3169096 INFO     loading file vocab.json
Oct 01 18:03:00 uni0 xinference-local[3169096]: 2024-10-01 18:03:00,317 transformers.tokenization_utils_base 3169096 INFO     loading file merges.txt
Oct 01 18:03:00 uni0 xinference-local[3169096]: 2024-10-01 18:03:00,317 transformers.tokenization_utils_base 3169096 INFO     loading file tokenizer.json
Oct 01 18:03:00 uni0 xinference-local[3169096]: 2024-10-01 18:03:00,317 transformers.tokenization_utils_base 3169096 INFO     loading file added_tokens.json
Oct 01 18:03:00 uni0 xinference-local[3169096]: 2024-10-01 18:03:00,318 transformers.tokenization_utils_base 3169096 INFO     loading file special_tokens_map.json
Oct 01 18:03:00 uni0 xinference-local[3169096]: 2024-10-01 18:03:00,318 transformers.tokenization_utils_base 3169096 INFO     loading file tokenizer_config.json
Oct 01 18:03:00 uni0 xinference-local[3169096]: 2024-10-01 18:03:00,657 transformers.tokenization_utils_base 3169096 INFO     Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Oct 01 18:03:00 uni0 xinference-local[3169096]: 2024-10-01 18:03:00,675 transformers.generation.configuration_utils 3169096 INFO     loading configuration file /home/anxu/.xinference/cache/qwen2_5-instruct-gptq-72b-Int8/generation_config.json
Oct 01 18:03:00 uni0 xinference-local[3169096]: 2024-10-01 18:03:00,675 transformers.generation.configuration_utils 3169096 INFO     Generate config GenerationConfig {
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "bos_token_id": 151643,
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "do_sample": true,
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "eos_token_id": [
Oct 01 18:03:00 uni0 xinference-local[3169096]:     151645,
Oct 01 18:03:00 uni0 xinference-local[3169096]:     151643
Oct 01 18:03:00 uni0 xinference-local[3169096]:   ],
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "pad_token_id": 151643,
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "repetition_penalty": 1.05,
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "temperature": 0.7,
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "top_k": 20,
Oct 01 18:03:00 uni0 xinference-local[3169096]:   "top_p": 0.8
Oct 01 18:03:00 uni0 xinference-local[3169096]: }
Oct 01 18:03:00 uni0 xinference-local[3169096]: 2024-10-01 18:03:00,676 vllm.executor.multiproc_gpu_executor 3169096 WARNING  Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
Oct 01 18:03:00 uni0 xinference-local[3169096]: 2024-10-01 18:03:00,683 vllm.triton_utils.custom_cache_manager 3169096 INFO     Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
Oct 01 18:03:00 uni0 xinference-local[3169343]: (VllmWorkerProcess pid=3169343) 2024-10-01 18:03:00,873 vllm.executor.multiproc_worker_utils 3169343 INFO     Worker ready; awaiting tasks
Oct 01 18:03:00 uni0 xinference-local[3169342]: (VllmWorkerProcess pid=3169342) 2024-10-01 18:03:00,909 vllm.executor.multiproc_worker_utils 3169342 INFO     Worker ready; awaiting tasks
Oct 01 18:03:00 uni0 xinference-local[3169344]: (VllmWorkerProcess pid=3169344) 2024-10-01 18:03:00,914 vllm.executor.multiproc_worker_utils 3169344 INFO     Worker ready; awaiting tasks
Oct 01 18:03:02 uni0 xinference-local[3169096]: 2024-10-01 18:03:02,290 vllm.utils   3169096 INFO     Found nccl from library libnccl.so.2
Oct 01 18:03:02 uni0 xinference-local[3169342]: (VllmWorkerProcess pid=3169342) 2024-10-01 18:03:02,290 vllm.utils   3169342 INFO     Found nccl from library libnccl.so.2
Oct 01 18:03:02 uni0 xinference-local[3169096]: 2024-10-01 18:03:02,291 vllm.distributed.device_communicators.pynccl 3169096 INFO     vLLM is using nccl==2.20.5
Oct 01 18:03:02 uni0 xinference-local[3169342]: (VllmWorkerProcess pid=3169342) 2024-10-01 18:03:02,291 vllm.distributed.device_communicators.pynccl 3169342 INFO     vLLM is using nccl==2.20.5
Oct 01 18:03:02 uni0 xinference-local[3169344]: (VllmWorkerProcess pid=3169344) 2024-10-01 18:03:02,291 vllm.utils   3169344 INFO     Found nccl from library libnccl.so.2
Oct 01 18:03:02 uni0 xinference-local[3169344]: (VllmWorkerProcess pid=3169344) 2024-10-01 18:03:02,292 vllm.distributed.device_communicators.pynccl 3169344 INFO     vLLM is using nccl==2.20.5
Oct 01 18:03:02 uni0 xinference-local[3169343]: (VllmWorkerProcess pid=3169343) 2024-10-01 18:03:02,293 vllm.utils   3169343 INFO     Found nccl from library libnccl.so.2
Oct 01 18:03:02 uni0 xinference-local[3169343]: (VllmWorkerProcess pid=3169343) 2024-10-01 18:03:02,293 vllm.distributed.device_communicators.pynccl 3169343 INFO     vLLM is using nccl==2.20.5
Oct 01 18:03:03 uni0 xinference-local[3169344]: (VllmWorkerProcess pid=3169344) 2024-10-01 18:03:03,106 vllm.distributed.device_communicators.custom_all_reduce 3169344 WARNING  Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
Oct 01 18:03:03 uni0 xinference-local[3169343]: (VllmWorkerProcess pid=3169343) 2024-10-01 18:03:03,106 vllm.distributed.device_communicators.custom_all_reduce 3169343 WARNING  Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
Oct 01 18:03:03 uni0 xinference-local[3169342]: (VllmWorkerProcess pid=3169342) 2024-10-01 18:03:03,106 vllm.distributed.device_communicators.custom_all_reduce 3169342 WARNING  Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
Oct 01 18:03:03 uni0 xinference-local[3169096]: 2024-10-01 18:03:03,106 vllm.distributed.device_communicators.custom_all_reduce 3169096 WARNING  Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
Oct 01 18:03:03 uni0 xinference-local[3169096]: 2024-10-01 18:03:03,112 vllm.distributed.device_communicators.shm_broadcast 3169096 INFO     vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7f3ec330fa50>, local_subscribe_port=48945, remote_subscribe_port=None)
Oct 01 18:03:03 uni0 xinference-local[3169096]: 2024-10-01 18:03:03,126 vllm.worker.model_runner 3169096 INFO     Starting to load model /home/anxu/.xinference/cache/qwen2_5-instruct-gptq-72b-Int8...
Oct 01 18:03:03 uni0 xinference-local[3169342]: (VllmWorkerProcess pid=3169342) 2024-10-01 18:03:03,127 vllm.worker.model_runner 3169342 INFO     Starting to load model /home/anxu/.xinference/cache/qwen2_5-instruct-gptq-72b-Int8...
Oct 01 18:03:03 uni0 xinference-local[3169344]: (VllmWorkerProcess pid=3169344) 2024-10-01 18:03:03,127 vllm.worker.model_runner 3169344 INFO     Starting to load model /home/anxu/.xinference/cache/qwen2_5-instruct-gptq-72b-Int8...
Oct 01 18:03:03 uni0 xinference-local[3169343]: (VllmWorkerProcess pid=3169343) 2024-10-01 18:03:03,127 vllm.worker.model_runner 3169343 INFO     Starting to load model /home/anxu/.xinference/cache/qwen2_5-instruct-gptq-72b-Int8...
Oct 01 18:03:03 uni0 xinference-local[3169344]: (VllmWorkerProcess pid=3169344) 2024-10-01 18:03:03,130 vllm.model_executor.layers.quantization.gptq_marlin 3169344 INFO     Using MarlinLinearKernel for GPTQMarlinLinearMethod
Oct 01 18:03:03 uni0 xinference-local[3169096]: 2024-10-01 18:03:03,132 vllm.model_executor.layers.quantization.gptq_marlin 3169096 INFO     Using MarlinLinearKernel for GPTQMarlinLinearMethod
Oct 01 18:03:03 uni0 xinference-local[3169343]: (VllmWorkerProcess pid=3169343) 2024-10-01 18:03:03,135 vllm.model_executor.layers.quantization.gptq_marlin 3169343 INFO     Using MarlinLinearKernel for GPTQMarlinLinearMethod
Oct 01 18:03:03 uni0 xinference-local[3169342]: (VllmWorkerProcess pid=3169342) 2024-10-01 18:03:03,135 vllm.model_executor.layers.quantization.gptq_marlin 3169342 INFO     Using MarlinLinearKernel for GPTQMarlinLinearMethod
Oct 01 18:03:03 uni0 xinference-local[3169096]: [78B blob data]  (weight-loading progress bar; journald renders it as repeated "[86B/87B blob data]" entries, elided here)
Oct 01 18:03:11 uni0 xinference-local[3169342]: (VllmWorkerProcess pid=3169342) 2024-10-01 18:03:11,542 vllm.worker.model_runner 3169342 INFO     Loading model weights took 17.8716 GB
Oct 01 18:03:12 uni0 xinference-local[3169343]: (VllmWorkerProcess pid=3169343) 2024-10-01 18:03:12,325 vllm.worker.model_runner 3169343 INFO     Loading model weights took 17.8716 GB
Oct 01 18:03:12 uni0 xinference-local[3169096]: 2024-10-01 18:03:12,715 vllm.worker.model_runner 3169096 INFO     Loading model weights took 17.8716 GB
Oct 01 18:03:12 uni0 xinference-local[3169344]: (VllmWorkerProcess pid=3169344) 2024-10-01 18:03:12,849 vllm.worker.model_runner 3169344 INFO     Loading model weights took 17.8716 GB
Oct 01 18:03:16 uni0 xinference-local[3169096]: 2024-10-01 18:03:16,895 vllm.executor.distributed_gpu_executor 3169096 INFO     # GPU blocks: 889, # CPU blocks: 3276
Oct 01 18:03:19 uni0 xinference-local[3169344]: (VllmWorkerProcess pid=3169344) 2024-10-01 18:03:19,036 vllm.worker.model_runner 3169344 INFO     Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
Oct 01 18:03:19 uni0 xinference-local[3169344]: (VllmWorkerProcess pid=3169344) 2024-10-01 18:03:19,036 vllm.worker.model_runner 3169344 INFO     CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Oct 01 18:03:19 uni0 xinference-local[3169343]: (VllmWorkerProcess pid=3169343) 2024-10-01 18:03:19,085 vllm.worker.model_runner 3169343 INFO     Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
Oct 01 18:03:19 uni0 xinference-local[3169343]: (VllmWorkerProcess pid=3169343) 2024-10-01 18:03:19,088 vllm.worker.model_runner 3169343 INFO     CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Oct 01 18:03:19 uni0 xinference-local[3169096]: 2024-10-01 18:03:19,714 vllm.worker.model_runner 3169096 INFO     Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
Oct 01 18:03:19 uni0 xinference-local[3169342]: (VllmWorkerProcess pid=3169342) 2024-10-01 18:03:19,714 vllm.worker.model_runner 3169342 INFO     Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
Oct 01 18:03:19 uni0 xinference-local[3169096]: 2024-10-01 18:03:19,714 vllm.worker.model_runner 3169096 INFO     CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Oct 01 18:03:19 uni0 xinference-local[3169342]: (VllmWorkerProcess pid=3169342) 2024-10-01 18:03:19,714 vllm.worker.model_runner 3169342 INFO     CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Oct 01 18:03:40 uni0 xinference-local[3169343]: (VllmWorkerProcess pid=3169343) 2024-10-01 18:03:40,267 vllm.worker.model_runner 3169343 INFO     Graph capturing finished in 21 secs.
Oct 01 18:03:40 uni0 xinference-local[3169344]: (VllmWorkerProcess pid=3169344) 2024-10-01 18:03:40,268 vllm.worker.model_runner 3169344 INFO     Graph capturing finished in 21 secs.
Oct 01 18:03:40 uni0 xinference-local[3169096]: 2024-10-01 18:03:40,277 vllm.worker.model_runner 3169096 INFO     Graph capturing finished in 21 secs.
Oct 01 18:03:40 uni0 xinference-local[3169342]: (VllmWorkerProcess pid=3169342) 2024-10-01 18:03:40,278 vllm.worker.model_runner 3169342 INFO     Graph capturing finished in 21 secs.
Oct 01 18:03:40 uni0 xinference-local[3168834]: 2024-10-01 18:03:40,296 xinference.core.worker 3168834 INFO     [request 547b291e-7fdc-11ef-bf30-207bd2601f8f] Leave launch_builtin_model, elapsed time: 44 s
qinxuye commented 1 week ago

Since 0.15.3, the vLLM context length defaults to the model's full context length, which your GPUs cannot fit. If you launch from the command line, try:

xinference launch xxxx --max_model_len 4096

If you launch from the web UI, add `max_model_len` with a value of 4096 under the extra options.
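
A rough back-of-the-envelope check, using only the config values from the log above (80 layers, 8 KV heads, head_dim = 8192/64 = 128, float16 KV cache), illustrates why the full 32K context does not fit: the KV cache for a single 32768-token sequence alone needs about 10 GiB, on top of the ~71 GiB of Int8 weights reported during loading. This is an illustrative sketch, not part of Xinference or vLLM:

```python
# Back-of-the-envelope KV-cache sizing for qwen2_5-instruct-gptq-72b-Int8,
# using values from the Qwen2Config printed in the log above.
num_layers = 80          # num_hidden_layers
num_kv_heads = 8         # num_key_value_heads
head_dim = 8192 // 64    # hidden_size / num_attention_heads = 128
dtype_bytes = 2          # float16 KV cache

# Keys + values (factor of 2), per token, summed over all layers.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")

for seq_len in (32768, 4096):
    total_gib = seq_len * kv_bytes_per_token / 2**30
    print(f"{seq_len:>5} tokens -> {total_gib:.2f} GiB of KV cache")
```

With tensor_parallel_size=4 this cache is sharded across the four GPUs, but at 32768 tokens it still competes with the ~17.9 GiB of weights per GPU; capping `max_model_len` at 4096 shrinks the per-sequence cache to about 1.25 GiB.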