[Open] sunxichen opened this issue 6 months ago
I also think this feature would be very useful: it lets developers control what the model generates. https://github.com/xorbitsai/inference/blob/d76549b7533a3ed20e225548e1bb2bf89e7296c8/xinference/model/llm/vllm/core.py#L67 I think adding a few fields to this class might accomplish it, as sketched below.
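A minimal sketch of what those fields might look like, assuming the config is a `TypedDict` like the one in `core.py` (the class shape and existing fields here are illustrative guesses, not the actual Xinference code; only the `guided_*` fields are the proposal):

```python
from typing import Dict, List, Optional, TypedDict

# Hypothetical shape of xinference's vLLM generate config; the existing
# fields are illustrative, and the guided_* additions are the proposal.
class VLLMGenerateConfig(TypedDict, total=False):
    temperature: float
    top_p: float
    max_tokens: int
    stop: List[str]
    # Proposed additions mirroring vLLM's guided-generation parameters:
    guided_choice: Optional[List[str]]  # output must be one of these strings
    guided_json: Optional[Dict]         # output must conform to this JSON schema
    guided_regex: Optional[str]         # output must match this regular expression
    guided_grammar: Optional[str]       # output must follow this grammar
```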
There's a problem with that approach: vLLM's `SamplingParams` doesn't yet include anything related to guided grammar. https://docs.vllm.ai/en/stable/dev/sampling_params.html
Thanks. I found that vLLM's implementation is adapted from outlines, and outlines integrates many libraries besides vLLM, including llama.cpp and transformers, so I think this is a feature that could be implemented for all the backends. A sketch with outlines follows.
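For example, outlines can run the same constrained decoding directly on a transformers model; a sketch (API names as of outlines ~0.0.x, and the model id is just an example, so exact calls may vary by version):

```python
import outlines

# Load a transformers-backed model through outlines (model id is illustrative).
model = outlines.models.transformers("microsoft/phi-2")

# Constrain generation to a fixed set of choices.
generator = outlines.generate.choice(model, ["Positive", "Negative"])
answer = generator("Is 'I love this movie' positive or negative? Answer: ")
print(answer)  # guaranteed to be exactly "Positive" or "Negative"
```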
Hi, any updates on this feature? Will it be added in the future? It could be very useful, and for now I have to use vLLM directly for this. I really hope it can be integrated into Xinference.
Thanks, we will definitely support it; it's on our roadmap, but it will take some time until we have the resources to work on it. If anyone is interested in implementing this feature, please let me know.
This issue is stale because it has been open for 7 days with no activity.
vLLM now supports guided decoding in its sampling parameters, as in the sketch below. This feature will be supported soon with the vLLM backend.
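A minimal sketch of what this looks like on the vLLM side, assuming a recent version where `SamplingParams` accepts a `GuidedDecodingParams` object (the model id is just an example):

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# Constrain the output to one of two labels via guided decoding.
guided = GuidedDecodingParams(choice=["positive", "negative"])
params = SamplingParams(temperature=0.0, guided_decoding=guided)

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
outputs = llm.generate(["Sentiment of 'I love this!':"], params)
print(outputs[0].outputs[0].text)  # either "positive" or "negative"
```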
Is your feature request related to a problem? Please describe
Currently, when using Xinference with vLLM as the backend, users are unable to leverage vLLM's guided generation capabilities, which leads to less controlled outputs from large language models. This poses a challenge particularly in scenarios where precise control over model responses is crucial, such as classification tasks, structured data generation, or code generation. The lack of support makes it hard to ensure outputs adhere strictly to predefined formats or choices, leading to inefficient post-processing and potentially inaccurate results.
Describe the solution you'd like
I would like Xinference to integrate support for vLLM's guided generation features, specifically enabling the `guided_choice`, `guided_json`, `guided_regex`, and `guided_grammar` parameters through the `extra_body` option, like vLLM's OpenAI-compatible API does. This would empower users to define strict constraints on the generated outputs, ensuring they conform to specific requirements. Example of a vLLM OpenAI API request using guided generation:
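A minimal sketch based on vLLM's documented `extra_body` usage; the base URL and model name below are placeholders:

```python
from openai import OpenAI

# Point the OpenAI client at a local vLLM server (URL and model are placeholders).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="my-model",
    messages=[
        {"role": "user", "content": "Classify this sentiment: Xinference is wonderful!"}
    ],
    # vLLM-specific parameters go through extra_body:
    extra_body={"guided_choice": ["positive", "negative"]},
)
print(completion.choices[0].message.content)  # "positive" or "negative"
```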
Describe alternatives you've considered
As an alternative, I have considered manual post-processing of model outputs to enforce the desired formats, but this approach is inefficient and error-prone (see the sketch below). Alternatively, I can use vLLM directly as the model serving engine.
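For context, the post-processing fallback looks something like this hypothetical helper, which has to scrape and re-validate the model's free-form reply and can still fail:

```python
import json
import re

# Hypothetical fallback: scrape JSON out of an unconstrained model reply.
def extract_json(reply: str) -> dict:
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("model output contained no JSON object")
    return json.loads(match.group(0))  # can still raise on malformed JSON
```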
Additional context
For a comprehensive understanding of the desired functionality, please refer to vLLM's pull request #2819, which adds guided decoding support to the OpenAI API server. Furthermore, detailed documentation on the extra parameters supported by vLLM's chat API, including the guided generation mechanisms, can be found in vLLM's official documentation. Integrating these features into Xinference would significantly enhance the platform's utility for a wide range of applications requiring fine-grained control over LLM outputs.