xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

Where is the size of the KV cache controlled? #1115

Closed · wuyeguo closed this issue 3 weeks ago

wuyeguo commented 5 months ago

Loading internlm-chat-20b-pytorch-20b fails on a 48 GB GPU. The error is as follows:

2024-03-11 02:55:27,495 xinference.model.llm.llm_family 80 INFO Cache /root/.xinference/cache/internlm-chat-20b-pytorch-20b exists
2024-03-11 02:55:27,496 xinference.model.llm.core 80 DEBUG Launching internlm-chat-20b-1-0 with VLLMChatModel
2024-03-11 02:55:27,501 xinference.model.llm.vllm.core 249 INFO Loading internlm-chat-20b with following model config: {'tokenizer_mode': 'auto', 'trust_remote_code': True, 'tensor_parallel_size': 1, 'block_size': 16, 'swap_space': 4, 'gpu_memory_utilization': 0.9, 'max_num_seqs': 256, 'quantization': None, 'max_model_len': 4096}
INFO 03-11 02:55:27 llm_engine.py:72] Initializing an LLM engine with config: model='/root/.xinference/cache/internlm-chat-20b-pytorch-20b', tokenizer='/root/.xinference/cache/internlm-chat-20b-pytorch-20b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, seed=0)
WARNING 03-11 02:55:27 tokenizer.py:64] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 03-11 02:56:57 llm_engine.py:322] # GPU blocks: 54, # CPU blocks: 218
2024-03-11 02:56:57,823 xinference.core.worker 80 ERROR Failed to load model internlm-chat-20b-1-0
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/xinference/core/worker.py", line 553, in launch_builtin_model
    await model_ref.load()
  File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
    return self._process_result_message(result)
  File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 657, in send
    result = await self._run_coro(message.message_id, coro)
  File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 368, in _run_coro
    return await coro
  File "/opt/conda/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 524, in xoscar.core._BaseActor.__on_receive__
    result = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/xinference/core/model.py", line 239, in load
    self._model.load()
  File "/opt/conda/lib/python3.10/site-packages/xinference/model/llm/vllm/core.py", line 139, in load
    self._engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 623, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 319, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 364, in _init_engine
    return engine_class(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 114, in __init__
    self._init_cache()
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 331, in _init_cache
    raise ValueError(
ValueError: [address=0.0.0.0:39505, pid=249] The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (864). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
2024-03-11 02:56:57,987 xinference.core.supervisor 80 DEBUG Enter terminate_model, args: (<xinference.core.supervisor.SupervisorActor object at 0x7fe8448e49f0>, 'internlm-chat-20b'), kwargs: {'suppress_exception': True}
2024-03-11 02:56:57,987 xinference.core.supervisor 80 DEBUG Leave terminate_model, elapsed time: 0 s
2024-03-11 02:56:57,991 xinference.api.restful_api 27 ERROR [address=0.0.0.0:39505, pid=249] The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (864). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.

Question 1: Is there a UI or command-line way for users to control the max_model_len setting?
Question 2: Where is the KV cache size (864) configured? Can users adjust it, or is it preset based on the available GPU memory? What is the logic behind the preset?
Question 3: For the built-in models, would it be possible to show an estimated GPU memory usage for each configuration in the UI or on the command line?
Question 4: Can the model version be configured by the user?

Thanks!

qinxuye commented 5 months ago

https://inference.readthedocs.io/zh-cn/latest/getting_started/using_xinference.html#run-llama-2

That page explains how to pass options through to the engine; please take a look.
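For reference, a minimal sketch of what that pass-through might look like with the Python client, assuming a local Xinference endpoint on the default port and that launch_model forwards extra keyword arguments (such as gpu_memory_utilization and max_model_len) to the vLLM engine, as the linked docs describe; the specific values here are illustrative only:

```python
# Sketch only: assumes extra kwargs are forwarded to the vLLM engine,
# per the "pass-through options" section of the linked docs.
from xinference.client import Client

client = Client("http://localhost:9997")

model_uid = client.launch_model(
    model_name="internlm-chat-20b",
    model_format="pytorch",
    model_size_in_billions=20,
    # vLLM engine options passed through at launch time (illustrative values):
    gpu_memory_utilization=0.95,  # leave more of the 48 GB card for the KV cache
    max_model_len=2048,           # cap the context so it fits in the cache
)
print(model_uid)
```

As far as I understand the same docs, equivalent key/value pairs can also be appended when launching from the xinference command line.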

linhao622 commented 5 months ago

Regarding question 2: the error means you set the maximum token count to 4096, but the remaining GPU memory only has room for 864 tokens in the KV cache. If you reduce the maximum token count when registering the model (for example to 512), it will start. Also, the KV cache size allocated at startup is proportional to the maximum token count you set when registering the model, and the per-token KV cache footprint differs between model families. In my experience, Qwen uses the most KV cache per token and Yi the least.
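To make the arithmetic concrete: in the log above, vLLM reports "# GPU blocks: 54" with block_size 16, so the cache can hold 54 × 16 = 864 tokens, which is exactly the number in the error. A rough per-token estimate is sketched below; the internlm-chat-20b hyperparameters are assumptions from memory, so verify them against the model's config.json:

```python
# Rough KV-cache sizing sketch (assumed hyperparameters, not authoritative).
num_layers   = 60    # assumed transformer layers for internlm-chat-20b
num_kv_heads = 40    # assumed KV heads (no grouped-query attention assumed)
head_dim     = 128   # assumed head dimension (hidden size 5120 / 40 heads)
dtype_bytes  = 2     # bfloat16, matching dtype=torch.bfloat16 in the log

# Both K and V are cached, for every layer, for every token.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
print(f"~{kv_bytes_per_token / 2**20:.2f} MiB of KV cache per token")

# From the log: 54 GPU blocks * block_size 16 = 864 cacheable tokens.
max_tokens = 54 * 16
print(f"{max_tokens} tokens -> ~{max_tokens * kv_bytes_per_token / 2**30:.1f} GiB of KV cache")
```

Under these assumptions, the bf16 weights of a 20B model take roughly 40 GB, so after applying gpu_memory_utilization=0.9 to 48 GB only about 2 GiB remains for the KV cache, which is why vLLM ends up with room for only 864 tokens, far short of the requested 4096.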

github-actions[bot] commented 4 weeks ago

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] commented 3 weeks ago

This issue was closed because it has been inactive for 5 days since being marked as stale.