xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

logprobs is not supported for models created with logits_all=False #1911

Open sandro-qiang opened 4 months ago

sandro-qiang commented 4 months ago

System Info / 系統信息

NVIDIA-SMI 555.52.04 Driver Version: 555.52.04 CUDA Version: 12.5 Ubuntu 22.04

Running Xinference with Docker? / 是否使用 Docker 运行 Xinference?

Version info / 版本信息

0.13.2

The command used to start Xinference / 用以启动 xinference 的命令

xinference-local -H '0.0.0.0'

Reproduction / 复现过程

gemma-2-it, GGUF format, q4-k-m

Expected behavior / 期待表现

I'm using langflow's OpenAI node. Presumably it passes a non-None logprobs by default, while Xinference creates the llama.cpp context with logits_all=False, so the request fails. I'd suggest validating the parameters passed through the OpenAI-compatible API against each model backend. The call stack is below.

Also, the llama.cpp engine seems unable to use a system_prompt: setting it has no effect.

xinference  | 2024-07-21 07:12:16,092 xinference.api.restful_api 1 ERROR    Chat completion stream got an error: [address=0.0.0.0:41803, pid=54] logprobs is not supported for models created with logits_all=False
xinference  | Traceback (most recent call last):
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xinference/api/restful_api.py", line 1656, in stream_results
xinference  |     async for item in iterator:
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 340, in __anext__
xinference  |     return await self._actor_ref.__xoscar_next__(self._uid)
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 231, in send
xinference  |     return self._process_result_message(result)
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 102, in _process_result_message
xinference  |     raise message.as_instanceof_cause()
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 656, in send
xinference  |     result = await self._run_coro(message.message_id, coro)
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 367, in _run_coro
xinference  |     return await coro
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 384, in __on_receive__
xinference  |     return await super().__on_receive__(message)  # type: ignore
xinference  |   File "xoscar/core.pyx", line 558, in __on_receive__
xinference  |     raise ex
xinference  |   File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
xinference  |     async with self._lock:
xinference  |   File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
xinference  |     with debug_async_timeout('actor_lock_timeout',
xinference  |   File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
xinference  |     result = await result
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 431, in __xoscar_next__
xinference  |     raise e
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 417, in __xoscar_next__
xinference  |     r = await asyncio.to_thread(_wrapper, gen)
xinference  |   File "/usr/lib/python3.10/asyncio/threads.py", line 25, in to_thread
xinference  |     return await loop.run_in_executor(None, func_call)
xinference  |   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
xinference  |     result = self.fn(*self.args, **self.kwargs)
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 402, in _wrapper
xinference  |     return next(_gen)
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 318, in _to_generator
xinference  |     for v in gen:
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/utils.py", line 558, in _to_chat_completion_chunks
xinference  |     for i, chunk in enumerate(chunks):
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/ggml/llamacpp.py", line 212, in generator_wrapper
xinference  |     for index, _completion_chunk in enumerate(
xinference  |   File "/usr/local/lib/python3.10/dist-packages/llama_cpp/llama.py", line 1106, in _create_completion
xinference  |     raise ValueError(
xinference  | ValueError: [address=0.0.0.0:41803, pid=54] logprobs is not supported for models created with logits_all=False
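For reference, a minimal client-side sketch that should trigger the same error (the base_url, api_key, and model UID are assumptions; the point is that a non-None logprobs gets forwarded down to llama-cpp-python):

```python
from openai import OpenAI

# Point the OpenAI client at Xinference's OpenAI-compatible endpoint.
# The base_url, api_key, and model UID below are placeholders.
client = OpenAI(base_url="http://127.0.0.1:9997/v1", api_key="not-used")

# Passing a non-None logprobs (as langflow's OpenAI node apparently does by
# default) reaches llama-cpp-python and raises
# "logprobs is not supported for models created with logits_all=False"
# when the GGUF model was loaded without logits_all=True.
stream = client.chat.completions.create(
    model="gemma-2-it",
    messages=[{"role": "user", "content": "Hello"}],
    logprobs=True,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```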
qinxuye commented 4 months ago

Would you be interested in submitting a code fix for this issue?

sandro-qiang commented 4 months ago

I'll find some time tomorrow to look into it; if I can solve it, I'll open a PR.

sandro-qiang commented 4 months ago

When launching from the web UI, adding logits_all set to true under "Additional parameters passed to the inference engine" is enough; no code change is needed. Besides, llama-cpp-python has marked logits_all as deprecated, so there is no point touching it now.
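For anyone launching from the Python client instead of the UI, a rough equivalent is sketched below. It assumes extra keyword arguments to launch_model are forwarded to llama-cpp-python; the engine name, model size, and quantization string are also assumptions:

```python
from xinference.client import Client

# Connect to the locally running Xinference server (default port assumed).
client = Client("http://127.0.0.1:9997")

# Extra keyword arguments are assumed to be passed through to the inference
# engine, so logits_all=True should reach llama-cpp-python and re-enable
# logprobs, mirroring the UI's "Additional parameters" field.
model_uid = client.launch_model(
    model_name="gemma-2-it",
    model_engine="llama.cpp",     # engine name is an assumption
    model_format="ggufv2",
    model_size_in_billions=9,     # or 27, depending on the downloaded variant
    quantization="Q4_K_M",        # quantization spelling is an assumption
    logits_all=True,              # the actual workaround
)
print(model_uid)
```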

As for the system prompt not taking effect: gemma-2 itself does not support a system prompt, so that is not a bug.

github-actions[bot] commented 4 months ago

This issue is stale because it has been open for 7 days with no activity.