xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

Where is the size of the KV cache controlled? #1115

Closed · wuyeguo closed this issue 3 weeks ago

wuyeguo commented 5 months ago

Loading internlm-chat-20b-pytorch-20b fails on a 48 GB GPU. The error is as follows:

2024-03-11 02:55:27,495 xinference.model.llm.llm_family 80 INFO Cache /root/.xinference/cache/internlm-chat-20b-pytorch-20b exists
2024-03-11 02:55:27,496 xinference.model.llm.core 80 DEBUG Launching internlm-chat-20b-1-0 with VLLMChatModel
2024-03-11 02:55:27,501 xinference.model.llm.vllm.core 249 INFO Loading internlm-chat-20b with following model config: {'tokenizer_mode': 'auto', 'trust_remote_code': True, 'tensor_parallel_size': 1, 'block_size': 16, 'swap_space': 4, 'gpu_memory_utilization': 0.9, 'max_num_seqs': 256, 'quantization': None, 'max_model_len': 4096}
INFO 03-11 02:55:27 llm_engine.py:72] Initializing an LLM engine with config: model='/root/.xinference/cache/internlm-chat-20b-pytorch-20b', tokenizer='/root/.xinference/cache/internlm-chat-20b-pytorch-20b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, seed=0)
WARNING 03-11 02:55:27 tokenizer.py:64] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 03-11 02:56:57 llm_engine.py:322] # GPU blocks: 54, # CPU blocks: 218
2024-03-11 02:56:57,823 xinference.core.worker 80 ERROR Failed to load model internlm-chat-20b-1-0
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/xinference/core/worker.py", line 553, in launch_builtin_model
    await model_ref.load()
  File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
    return self._process_result_message(result)
  File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 657, in send
    result = await self._run_coro(message.message_id, coro)
  File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 368, in _run_coro
    return await coro
  File "/opt/conda/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 524, in xoscar.core._BaseActor.__on_receive__
    result = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/xinference/core/model.py", line 239, in load
    self._model.load()
  File "/opt/conda/lib/python3.10/site-packages/xinference/model/llm/vllm/core.py", line 139, in load
    self._engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 623, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 319, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 364, in _init_engine
    return engine_class(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 114, in __init__
    self._init_cache()
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 331, in _init_cache
    raise ValueError(
ValueError: [address=0.0.0.0:39505, pid=249] The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (864). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
2024-03-11 02:56:57,987 xinference.core.supervisor 80 DEBUG Enter terminate_model, args: (<xinference.core.supervisor.SupervisorActor object at 0x7fe8448e49f0>, 'internlm-chat-20b'), kwargs: {'suppress_exception': True}
2024-03-11 02:56:57,987 xinference.core.supervisor 80 DEBUG Leave terminate_model, elapsed time: 0 s
2024-03-11 02:56:57,991 xinference.api.restful_api 27 ERROR [address=0.0.0.0:39505, pid=249] The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (864). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.

Question 1: Is there a UI or command-line way for users to control the max_model_len setting?
Question 2: Where is the KV cache size (864) configured? Can users adjust it, or is it preset based on the available GPU memory? What is the logic behind the preset?
Question 3: For the built-in models, would it be possible to show an estimated GPU memory usage for each configuration in the UI or on the command line?
Question 4: Can the model version be configured by the user?

Thanks!

qinxuye commented 5 months ago

https://inference.readthedocs.io/zh-cn/latest/getting_started/using_xinference.html#run-llama-2

That page explains how to pass options through to the engine; please take a look.
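For reference, a minimal sketch of what that pass-through might look like with the Python client, assuming a local Xinference endpoint on the default port and that launch_model forwards extra keyword arguments (such as gpu_memory_utilization and max_model_len) to the vLLM engine, as the linked docs describe; the specific values here are illustrative only:

```python
# Sketch only: assumes extra kwargs are forwarded to the vLLM engine,
# per the "pass-through options" section of the linked docs.
from xinference.client import Client

client = Client("http://localhost:9997")

model_uid = client.launch_model(
    model_name="internlm-chat-20b",
    model_format="pytorch",
    model_size_in_billions=20,
    # vLLM engine options passed through at launch time (illustrative values):
    gpu_memory_utilization=0.95,  # leave more of the 48 GB card for the KV cache
    max_model_len=2048,           # cap the context so it fits in the cache
)
print(model_uid)
```

As far as I understand the same docs, equivalent key/value pairs can also be appended when launching from the xinference command line.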

linhao622 commented 5 months ago

Regarding question 2: the error means you set the maximum token count to 4096, but the remaining GPU memory only has room for 864 tokens in the KV cache. If you reduce the maximum token count when registering the model (for example to 512), it will start. Also, the KV cache size allocated at startup is proportional to the maximum token count you set when registering the model, and the per-token KV cache footprint differs between model families. In my experience, Qwen uses the most KV cache per token and Yi the least.
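To make the arithmetic concrete: in the log above, vLLM reports "# GPU blocks: 54" with block_size 16, so the cache can hold 54 × 16 = 864 tokens, which is exactly the number in the error. A rough per-token estimate is sketched below; the internlm-chat-20b hyperparameters are assumptions from memory, so verify them against the model's config.json:

```python
# Rough KV-cache sizing sketch (assumed hyperparameters, not authoritative).
num_layers   = 60    # assumed transformer layers for internlm-chat-20b
num_kv_heads = 40    # assumed KV heads (no grouped-query attention assumed)
head_dim     = 128   # assumed head dimension (hidden size 5120 / 40 heads)
dtype_bytes  = 2     # bfloat16, matching dtype=torch.bfloat16 in the log

# Both K and V are cached, for every layer, for every token.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
print(f"~{kv_bytes_per_token / 2**20:.2f} MiB of KV cache per token")

# From the log: 54 GPU blocks * block_size 16 = 864 cacheable tokens.
max_tokens = 54 * 16
print(f"{max_tokens} tokens -> ~{max_tokens * kv_bytes_per_token / 2**30:.1f} GiB of KV cache")
```

Under these assumptions, the bf16 weights of a 20B model take roughly 40 GB, so after applying gpu_memory_utilization=0.9 to 48 GB only about 2 GiB remains for the KV cache, which is why vLLM ends up with room for only 864 tokens, far short of the requested 4096.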

github-actions[bot] commented 4 weeks ago

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] commented 3 weeks ago

This issue was closed because it has been inactive for 5 days since being marked as stale.