ValueError: [address=127.0.0.1:37657, pid=3125985] User-specified max_model_len (4096) is greater than the derived max_model_len (seq_length=2048 or model_max_length=None in model's config.json). This may lead to incorrect model outputs or CUDA errors. Make sure the value is correct and within the model context size. #1715
When running qwen14B-chat with vLLM, I always get this error:
ValueError: [address=127.0.0.1:37657, pid=3125985] User-specified max_model_len (4096) is greater than the derived max_model_len (seq_length=2048 or model_max_length=None in model's config.json). This may lead to incorrect model outputs or CUDA errors. Make sure the value is correct and within the model context size.
2024-06-25 16:32:00,563 xinference.api.restful_api 3118895 ERROR [address=127.0.0.1:37657, pid=3125985] User-specified max_model_len (4096) is greater than the derived max_model_len (seq_length=2048 or model_max_length=None in model's config.json). This may lead to incorrect model outputs or CUDA errors. Make sure the value is correct and within the model context size.
Traceback (most recent call last):
File "/www/NLP/inference/xinference/api/restful_api.py", line 770, in launch_model
model_uid = await (await self._get_supervisor_ref()).launch_builtin_model(
File "/home/ai/anaconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
return self._process_result_message(result)
File "/home/ai/anaconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/home/ai/anaconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 659, in send
result = await self._run_coro(message.message_id, coro)
File "/home/ai/anaconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
return await coro
File "/home/ai/anaconda3/envs/xinference/lib/python3.10/site-packages/xoscar/api.py", line 384, in on_receive
return await super().on_receive(message) # type: ignore
File "xoscar/core.pyx", line 558, in on_receive__
raise ex
File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.on_receive
async with self._lock:
File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.on_receive
with debug_async_timeout('actor_lock_timeout',
File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive
result = await result
File "/www/NLP/inference/xinference/core/supervisor.py", line 837, in launch_builtin_model
await _launch_model()
File "/www/NLP/inference/xinference/core/supervisor.py", line 801, in _launch_model
await _launch_one_model(rep_model_uid)
File "/www/NLP/inference/xinference/core/supervisor.py", line 782, in _launch_one_model
await worker_ref.launch_builtin_model(
File "xoscar/core.pyx", line 284, in pyx_actor_method_wrapper
async with lock:
File "xoscar/core.pyx", line 287, in xoscar.core.__pyx_actor_method_wrapper
result = await result
File "/www/NLP/inference/xinference/core/utils.py", line 45, in wrapped
ret = await func(*args, **kwargs)
File "/www/NLP/inference/xinference/core/worker.py", line 665, in launch_builtin_model
await model_ref.load()
File "/home/ai/anaconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
return self._process_result_message(result)
File "/home/ai/anaconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/home/ai/anaconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 659, in send
result = await self._run_coro(message.message_id, coro)
File "/home/ai/anaconda3/envs/xinference/lib/python3.10/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
return await coro
File "/home/ai/anaconda3/envs/xinference/lib/python3.10/site-packages/xoscar/api.py", line 384, in on_receive
return await super().on_receive(message) # type: ignore
File "xoscar/core.pyx", line 558, in on_receive
raise ex
File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive
async with self._lock:
File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.on_receive__
with debug_async_timeout('actor_lock_timeout',
File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.on_receive__
result = await result
File "/www/NLP/inference/xinference/core/model.py", line 278, in load
self._model.load()
File "/www/NLP/inference/xinference/model/llm/vllm/core.py", line 230, in load
self._engine = AsyncLLMEngine.from_engine_args(engine_args)
File "/home/ai/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 371, in from_engine_args
engine_config = engine_args.create_engine_config()
File "/home/ai/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 630, in create_engine_config
model_config = ModelConfig(
File "/home/ai/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/config.py", line 141, in init
self.max_model_len = _get_and_verify_max_len(
File "/home/ai/anaconda3/envs/xinference/lib/python3.10/site-packages/vllm/config.py", line 1317, in _get_and_verify_max_len
raise ValueError(
ValueError: [address=127.0.0.1:37657, pid=3125985] User-specified max_model_len (4096) is greater than the derived max_model_len (seq_length=2048 or model_max_length=None in model's config.json). This may lead to incorrect model outputs or CUDA errors. Make sure the value is correct and within the model context size.
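
For context, the check that raises this error derives a context-length limit from the model's local config.json and compares it with the user-specified value. A simplified sketch of that logic follows (paraphrasing `_get_and_verify_max_len` in `vllm/config.py`; the key list and helper name are illustrative, not the exact vLLM source):

```python
# Simplified sketch of the validation that produces the ValueError above
# (illustrative only; see vllm/config.py:_get_and_verify_max_len for the real code).
def derive_max_model_len(hf_config: dict, user_max_model_len: int) -> int:
    # vLLM scans config.json for context-length fields and keeps the smallest one found.
    possible_keys = ("max_position_embeddings", "seq_length", "model_max_length")
    found = [hf_config[k] for k in possible_keys if hf_config.get(k) is not None]
    derived = min(found) if found else None
    if derived is not None and user_max_model_len > derived:
        # This is the branch hit here: max_model_len=4096 was requested,
        # but the local Qwen-14B-Chat config.json reports seq_length=2048.
        raise ValueError(
            f"User-specified max_model_len ({user_max_model_len}) is greater "
            f"than the derived max_model_len ({derived})."
        )
    return user_max_model_len


# The traceback corresponds to something like:
derive_max_model_len({"seq_length": 2048}, 4096)  # raises ValueError
```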
However, when I use the Transformers engine it runs fine. I set model_max_length to 2048, and Qwen-14B's upper limit is 8192. Could anyone explain what is going on? This problem also shows up when I run qwen14B-int4 with Transformers.
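
For reference, a minimal sketch of keeping max_model_len within the derived limit when launching through the Xinference Python client, assuming extra keyword arguments to `Client.launch_model` are forwarded to the vLLM engine (parameter names may vary across Xinference versions):

```python
# Hypothetical illustration, not a verified fix: cap max_model_len to the
# seq_length found in the local config.json so vLLM's validation passes.
from xinference.client import Client

client = Client("http://127.0.0.1:9997")
model_uid = client.launch_model(
    model_name="qwen-chat",
    model_size_in_billions=14,
    model_format="pytorch",
    model_engine="vllm",
    max_model_len=2048,  # must not exceed the derived limit (seq_length=2048 here)
)
print(model_uid)
```

If the model is really the 8192-context Qwen-14B-Chat, the seq_length=2048 reported in the downloaded config.json may simply be from an older snapshot, in which case refreshing the model files would let a larger max_model_len pass the check.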