xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

VRAM usage spikes in the new 0.12 release, while the old 0.8 release used very little; model: Qwen1.5-14B-Chat-GPTQ-int4 #1797

Open worm128 opened 1 month ago

worm128 commented 1 month ago

Model: Qwen1.5-14B-Chat-GPTQ-int4
New xinference version: v0.12.3, container: docker pull xprobe/xinference:v0.12.3
Old xinference version: v0.8.5, container: docker pull xprobe/xinference:v0.8.5

I'm running the Docker container versions. Why does xinference v0.12 use as much as 22 GB of VRAM when loading Qwen1.5-14B-Chat-GPTQ-int4 with vLLM, when the old v0.8 only needed about 12 GB and also responded very quickly? The new version offers two loading backends, Transformers and vLLM: Transformers responds extremely slowly even though it uses little VRAM, while vLLM is fast but uses a huge 22 GB. I can't work out what changed in xinference; it feels like a step backwards, the old version was easier to use.

qinxuye commented 1 month ago

vLLM reserves 90% of GPU memory by default.

worm128 commented 1 month ago

vLLM reserves 90% of GPU memory by default.

Can this be manually set to a lower value?

Valdanitooooo commented 1 month ago

Yes, you can set it in the vLLM parameters when launching the model, for example: gpu_memory_utilization: 0.5
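
For reference, the same option can be passed outside the web UI as well. A minimal sketch using the xinference Python client, assuming launch_model forwards extra keyword arguments such as gpu_memory_utilization through to the vLLM engine, and that the model is registered under the name used in this thread:

```python
from xinference.client import Client

# Connect to the running Xinference server (address taken from this thread).
client = Client("http://192.168.2.58:9997")

# Extra kwargs such as gpu_memory_utilization are assumed to be forwarded
# to the vLLM engine when model_engine="vLLM".
model_uid = client.launch_model(
    model_name="Qwen1.5-14B-Chat-GPTQ-int4",
    model_engine="vLLM",
    model_format="gptq",
    model_size_in_billions=14,
    quantization="Int4",
    gpu_memory_utilization=0.5,
)
print(model_uid)
```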

worm128 commented 1 month ago

Yes, you can set it in the vLLM parameters when launching the model, for example: gpu_memory_utilization: 0.5

2024-07-11 18:01:48 INFO 07-11 10:01:48 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/data/Qwen1.5-14B-Chat-GPTQ-int4', speculative_config=None, tokenizer='/data/Qwen1.5-14B-Chat-GPTQ-int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/data/Qwen1.5-14B-Chat-GPTQ-int4)
2024-07-11 18:01:48 Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-07-11 18:01:48 WARNING 07-11 10:01:48 utils.py:451] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
2024-07-11 18:12:28 INFO 07-11 10:12:28 model_runner.py:146] Loading model weights took 9.2497 GB
2024-07-11 18:12:30 INFO 07-11 10:12:30 gpu_executor.py:83] # GPU blocks: 76, # CPU blocks: 327
2024-07-11 18:12:30 2024-07-11 10:12:30,915 xinference.core.worker 45 ERROR Failed to load model Qwen1.5-14B-Chat-GPTQ-int4-1-0
2024-07-11 18:12:30 Traceback (most recent call last):
2024-07-11 18:12:30   File "/opt/conda/lib/python3.10/site-packages/xinference/core/worker.py", line 673, in launch_builtin_model
2024-07-11 18:12:30     await model_ref.load()
2024-07-11 18:12:30   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
2024-07-11 18:12:30     return self._process_result_message(result)
2024-07-11 18:12:30   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
2024-07-11 18:12:30     raise message.as_instanceof_cause()
2024-07-11 18:12:30   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 659, in send
2024-07-11 18:12:30     result = await self._run_coro(message.message_id, coro)
2024-07-11 18:12:30   File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
2024-07-11 18:12:30     return await coro
2024-07-11 18:12:30   File "/opt/conda/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
2024-07-11 18:12:30     return await super().__on_receive__(message)  # type: ignore
2024-07-11 18:12:30   File "xoscar/core.pyx", line 558, in __on_receive__
2024-07-11 18:12:30     raise ex
2024-07-11 18:12:30   File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
2024-07-11 18:12:30     async with self._lock:
2024-07-11 18:12:30   File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
2024-07-11 18:12:30     with debug_async_timeout('actor_lock_timeout',
2024-07-11 18:12:30   File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
2024-07-11 18:12:30     result = await result
2024-07-11 18:12:30   File "/opt/conda/lib/python3.10/site-packages/xinference/core/model.py", line 278, in load
2024-07-11 18:12:30     self._model.load()
2024-07-11 18:12:30   File "/opt/conda/lib/python3.10/site-packages/xinference/model/llm/vllm/core.py", line 230, in load
2024-07-11 18:12:30     self._engine = AsyncLLMEngine.from_engine_args(engine_args)
2024-07-11 18:12:30   File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 386, in from_engine_args
2024-07-11 18:12:30     engine = cls(
2024-07-11 18:12:30   File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 340, in __init__
2024-07-11 18:12:30     self.engine = self._init_engine(*args, **kwargs)
2024-07-11 18:12:30   File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 462, in _init_engine
2024-07-11 18:12:30     return engine_class(*args, **kwargs)
2024-07-11 18:12:30   File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 235, in __init__
2024-07-11 18:12:30     self._initialize_kv_caches()
2024-07-11 18:12:30   File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 325, in _initialize_kv_caches
2024-07-11 18:12:30     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
2024-07-11 18:12:30   File "/opt/conda/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 86, in initialize_cache
2024-07-11 18:12:30     self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
2024-07-11 18:12:30   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 187, in initialize_cache
2024-07-11 18:12:30     raise_if_cache_size_invalid(num_gpu_blocks,
2024-07-11 18:12:30   File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 375, in raise_if_cache_size_invalid
2024-07-11 18:12:30     raise ValueError(
2024-07-11 18:12:30 ValueError: [address=0.0.0.0:45439, pid=66] The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (1216). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
2024-07-11 18:12:31 2024-07-11 10:12:31,058 xinference.api.restful_api 1 ERROR [address=0.0.0.0:45439, pid=66] The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (1216). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.

I set it to 0.5 and it failed to launch. I guess this value can't be set arbitrarily; does it need to be larger than the VRAM the model itself needs to load? With 24 GB of VRAM, 0.5 means roughly 12 GB, which apparently isn't enough, hence the error?
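
The numbers in the log explain why 0.5 fails: vLLM could only allocate 76 GPU blocks for the KV cache, and with a block size of 16 tokens (an assumption here; 16 is vLLM's default) that is only 1216 cacheable tokens, well below the model's max_seq_len of 4096. A quick check of that arithmetic:

```python
# Figures taken from the log above; BLOCK_SIZE = 16 is vLLM's default block size (assumed here).
BLOCK_SIZE = 16
gpu_blocks = 76        # "# GPU blocks: 76"
max_seq_len = 4096     # "max_seq_len=4096"

kv_cache_tokens = gpu_blocks * BLOCK_SIZE
print(kv_cache_tokens)                 # 1216, the number in the ValueError
print(kv_cache_tokens >= max_seq_len)  # False -> vLLM refuses to initialize
```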

worm128 commented 1 month ago

2024-07-11 18:12:30 ValueError: [address=0.0.0.0:45439, pid=66] The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (1216). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
2024-07-11 18:12:31 2024-07-11 10:12:31,058 xinference.api.restful_api 1 ERROR [address=0.0.0.0:45439, pid=66] The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (1216). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.

Focus on this error: either lower max_model_len or increase gpu_memory_utilization.
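
If the memory budget cannot be raised, the other option from the error message is to cap the context length. A sketch of the same launch with max_model_len added, assuming this xinference version forwards it to vLLM like any other engine kwarg (1024 is just an example value that fits inside the 1216-token cache from the log):

```python
from xinference.client import Client

client = Client("http://192.168.2.58:9997")
# max_model_len is a standard vLLM engine argument; whether it can be passed
# through launch_model in this xinference version is an assumption.
client.launch_model(
    model_name="Qwen1.5-14B-Chat-GPTQ-int4",
    model_engine="vLLM",
    model_format="gptq",
    model_size_in_billions=14,
    quantization="Int4",
    gpu_memory_utilization=0.5,
    max_model_len=1024,  # must not exceed what the KV cache can hold (1216 tokens at 0.5)
)
```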

worm128 commented 1 month ago
// Request captured from the Xinference web UI (http://192.168.2.58:9997/ui/):
// launch the model with "gpu_memory_utilization": 0.7 in the request body.
await fetch("http://192.168.2.58:9997/v1/models", {
    "credentials": "include",
    "headers": {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:127.0) Gecko/20100101 Firefox/127.0",
        "Accept": "*/*",
        "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
        "Content-Type": "application/json",
        "Priority": "u=1"
    },
    "referrer": "http://192.168.2.58:9997/ui/",
    "body": "{\"model_uid\":null,\"model_name\":\"Qwen1.5-14B-Chat-GPTQ-int4\",\"model_type\":\"LLM\",\"model_engine\":\"vLLM\",\"gpu_memory_utilization\":0.7,\"model_format\":\"gptq\",\"model_size_in_billions\":14,\"quantization\":\"Int4\",\"n_gpu\":1,\"replica\":1,\"request_limits\":null,\"worker_ip\":null,\"gpu_idx\":[0]}",
    "method": "POST",
    "mode": "cors"
});

With 0.7 the model loads. You need to estimate the model's actual VRAM footprint; if the value is set too low, it errors out.
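
A rough way to estimate a workable value before launching, using the figures from this thread (24 GB card, ~9.25 GB of weights per the log); the overhead figure is a guess for illustration, not a measurement:

```python
# Rough back-of-the-envelope check of how much room a given
# gpu_memory_utilization leaves for vLLM's KV cache.
total_vram_gb = 24.0   # GPU in this thread
weights_gb = 9.25      # "Loading model weights took 9.2497 GB"
overhead_gb = 1.5      # CUDA context, activations, etc. (assumed)

for util in (0.5, 0.7, 0.9):
    kv_cache_gb = total_vram_gb * util - weights_gb - overhead_gb
    print(f"gpu_memory_utilization={util}: ~{kv_cache_gb:.1f} GB left for the KV cache")

# 0.5 leaves barely over 1 GB (hence the tiny 1216-token cache and the startup error),
# while 0.7 leaves roughly 6 GB, which is why the launch above succeeds.
```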

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open for 7 days with no activity.