xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

Docker v0.15.4: launching glm4-chat with vLLM fails with "The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (53376)." #2440

Open Laremn opened 1 week ago

Laremn commented 1 week ago

System Info / 系統信息

Linux

Running Xinference with Docker? / 是否使用 Docker 运行 Xinference?

Version info / 版本信息

0.15.4 (latest)

The command used to start Xinference / 用以启动 xinference 的命令

Started via docker-compose:

xinference:
  image: xprobe/xinference:latest
  container_name: xinference
  ports:

Reproduction / 复现过程

2024-10-15 01:55:07,541 vllm.worker.model_runner 75563 INFO Starting to load model /usr/models/glm-4-9b-chat...
Loading safetensors checkpoint shards: 0% Completed | 0/10 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 10% Completed | 1/10 [00:00<00:02, 4.01it/s]
Loading safetensors checkpoint shards: 20% Completed | 2/10 [00:00<00:02, 3.58it/s]
Loading safetensors checkpoint shards: 30% Completed | 3/10 [00:00<00:02, 3.31it/s]
Loading safetensors checkpoint shards: 40% Completed | 4/10 [00:01<00:01, 3.38it/s]
Loading safetensors checkpoint shards: 50% Completed | 5/10 [00:01<00:01, 3.07it/s]
Loading safetensors checkpoint shards: 60% Completed | 6/10 [00:02<00:01, 2.70it/s]
Loading safetensors checkpoint shards: 70% Completed | 7/10 [00:02<00:01, 2.58it/s]
Loading safetensors checkpoint shards: 80% Completed | 8/10 [00:02<00:00, 2.56it/s]
Loading safetensors checkpoint shards: 90% Completed | 9/10 [00:03<00:00, 2.64it/s]
Loading safetensors checkpoint shards: 100% Completed | 10/10 [00:03<00:00, 2.70it/s]
Loading safetensors checkpoint shards: 100% Completed | 10/10 [00:03<00:00, 2.84it/s]

2024-10-15 01:55:11,396 vllm.worker.model_runner 75563 INFO Loading model weights took 17.5635 GB
2024-10-15 01:55:11,805 vllm.executor.gpu_executor 75563 INFO # GPU blocks: 3336, # CPU blocks: 6553
2024-10-15 01:55:11,811 xinference.core.worker 145 ERROR Failed to load model glm4-chat-1-0
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/worker.py", line 894, in launch_builtin_model
    await model_ref.load()
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 656, in send
    result = await self._run_coro(message.message_id, coro)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 367, in _run_coro
    return await coro
  File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 335, in load
    self._model.load()
  File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/vllm/core.py", line 261, in load
    self._engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 735, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 615, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 835, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 262, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 319, in __init__
    self._initialize_kv_caches()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 461, in _initialize_kv_caches
    self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 125, in initialize_cache
    self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 257, in initialize_cache
    raise_if_cache_size_invalid(num_gpu_blocks,
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 476, in raise_if_cache_size_invalid
    raise ValueError(
ValueError: [address=0.0.0.0:36921, pid=75563] The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (53376). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
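(Editor's note, not part of the original log: vLLM's KV cache can hold num_gpu_blocks × block_size tokens. Assuming the default block size of 16 tokens per block, the 3336 GPU blocks reported above work out to exactly the 53376-token limit in the error, which is far below glm4-chat's default 131072-token context. A quick sanity check:)

```python
# Back-of-the-envelope check of the numbers in the log above.
gpu_blocks = 3336        # "# GPU blocks: 3336" from the log
block_size = 16          # vLLM's default tokens per KV-cache block (assumption)
max_model_len = 131072   # glm4-chat's default max sequence length

kv_cache_tokens = gpu_blocks * block_size
print(kv_cache_tokens)                  # 53376, matching the error message
print(max_model_len > kv_cache_tokens)  # True -> vLLM refuses to start
```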

Expected behavior / 期待表现

Resolve this problem and fix the bug.

qinxuye commented 6 days ago

When launching the model, add max_model_len in the extra options and set it to 53376.
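(Editor's note, a minimal sketch rather than the maintainer's exact steps: if you launch models through the Xinference Python client instead of the web UI, the same extra option can be passed as a keyword argument, which Xinference forwards to the vLLM engine. The endpoint URL, model format, and size below are placeholders.)

```python
# Sketch: launch glm4-chat via the Xinference client and cap max_model_len
# so the requested context length fits the available KV cache.
from xinference.client import Client

client = Client("http://localhost:9997")  # adjust to your Xinference endpoint
model_uid = client.launch_model(
    model_name="glm4-chat",
    model_engine="vLLM",
    model_format="pytorch",
    model_size_in_billions=9,
    max_model_len=53376,  # extra engine option; keeps max seq len <= KV-cache capacity
)
print(model_uid)
```

Alternatively, as the error message itself suggests, increasing gpu_memory_utilization (or freeing GPU memory) raises the KV-cache capacity instead of lowering the context length.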

Laremn commented 6 days ago

When launching the model, add max_model_len in the extra options and set it to 53376.

Thanks, this approach runs successfully, but I have two questions I would like to ask you: