xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

Docker v0.15.4: launching glm4-chat with vLLM fails with "The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (53376)." #2440

Open Laremn opened 1 week ago

Laremn commented 1 week ago

System Info / 系統信息

Linux

Running Xinference with Docker? / 是否使用 Docker 运行 Xinference?

Version info / 版本信息

0.15.4 (latest)

The command used to start Xinference / 用以启动 xinference 的命令

Started via docker-compose:

xinference:
  image: xprobe/xinference:latest
  container_name: xinference
  ports:

Reproduction / 复现过程

2024-10-15 01:55:07,541 vllm.worker.model_runner 75563 INFO Starting to load model /usr/models/glm-4-9b-chat...
Loading safetensors checkpoint shards: 0% Completed | 0/10 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 10% Completed | 1/10 [00:00<00:02, 4.01it/s]
Loading safetensors checkpoint shards: 20% Completed | 2/10 [00:00<00:02, 3.58it/s]
Loading safetensors checkpoint shards: 30% Completed | 3/10 [00:00<00:02, 3.31it/s]
Loading safetensors checkpoint shards: 40% Completed | 4/10 [00:01<00:01, 3.38it/s]
Loading safetensors checkpoint shards: 50% Completed | 5/10 [00:01<00:01, 3.07it/s]
Loading safetensors checkpoint shards: 60% Completed | 6/10 [00:02<00:01, 2.70it/s]
Loading safetensors checkpoint shards: 70% Completed | 7/10 [00:02<00:01, 2.58it/s]
Loading safetensors checkpoint shards: 80% Completed | 8/10 [00:02<00:00, 2.56it/s]
Loading safetensors checkpoint shards: 90% Completed | 9/10 [00:03<00:00, 2.64it/s]
Loading safetensors checkpoint shards: 100% Completed | 10/10 [00:03<00:00, 2.70it/s]
Loading safetensors checkpoint shards: 100% Completed | 10/10 [00:03<00:00, 2.84it/s]

2024-10-15 01:55:11,396 vllm.worker.model_runner 75563 INFO Loading model weights took 17.5635 GB
2024-10-15 01:55:11,805 vllm.executor.gpu_executor 75563 INFO # GPU blocks: 3336, # CPU blocks: 6553
2024-10-15 01:55:11,811 xinference.core.worker 145 ERROR Failed to load model glm4-chat-1-0
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/worker.py", line 894, in launch_builtin_model
    await model_ref.load()
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 656, in send
    result = await self._run_coro(message.message_id, coro)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 367, in _run_coro
    return await coro
  File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 335, in load
    self._model.load()
  File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/vllm/core.py", line 261, in load
    self._engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 735, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 615, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 835, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 262, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 319, in __init__
    self._initialize_kv_caches()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 461, in _initialize_kv_caches
    self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 125, in initialize_cache
    self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 257, in initialize_cache
    raise_if_cache_size_invalid(num_gpu_blocks,
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 476, in raise_if_cache_size_invalid
    raise ValueError(
ValueError: [address=0.0.0.0:36921, pid=75563] The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (53376). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
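(Editor's note, not part of the original log: vLLM's KV cache can hold num_gpu_blocks × block_size tokens. Assuming the default block size of 16 tokens per block, the 3336 GPU blocks reported above work out to exactly the 53376-token limit in the error, which is far below glm4-chat's default 131072-token context. A quick sanity check:)

```python
# Back-of-the-envelope check of the numbers in the log above.
gpu_blocks = 3336        # "# GPU blocks: 3336" from the log
block_size = 16          # vLLM's default tokens per KV-cache block (assumption)
max_model_len = 131072   # glm4-chat's default max sequence length

kv_cache_tokens = gpu_blocks * block_size
print(kv_cache_tokens)                  # 53376, matching the error message
print(max_model_len > kv_cache_tokens)  # True -> vLLM refuses to start
```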

Expected behavior / 期待表现

Resolve this problem and fix the bug.

qinxuye commented 6 days ago

When launching the model, add max_model_len in the extra options and set it to 53376.
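(Editor's note, a minimal sketch rather than the maintainer's exact steps: if you launch models through the Xinference Python client instead of the web UI, the same extra option can be passed as a keyword argument, which Xinference forwards to the vLLM engine. The endpoint URL, model format, and size below are placeholders.)

```python
# Sketch: launch glm4-chat via the Xinference client and cap max_model_len
# so the requested context length fits the available KV cache.
from xinference.client import Client

client = Client("http://localhost:9997")  # adjust to your Xinference endpoint
model_uid = client.launch_model(
    model_name="glm4-chat",
    model_engine="vLLM",
    model_format="pytorch",
    model_size_in_billions=9,
    max_model_len=53376,  # extra engine option; keeps max seq len <= KV-cache capacity
)
print(model_uid)
```

Alternatively, as the error message itself suggests, increasing gpu_memory_utilization (or freeing GPU memory) raises the KV-cache capacity instead of lowering the context length.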

Laremn commented 6 days ago

When launching the model, add max_model_len in the extra options and set it to 53376.

Thanks, this approach runs successfully, but I have two questions I would like to ask you: