xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

Integration of Guided Generation Features from vllm into XInference #1562

Open sunxichen opened 3 months ago

sunxichen commented 3 months ago

Is your feature request related to a problem? Please describe

Currently, when utilizing Xinference with vllm as the backend, users are unable to leverage vllm's advanced guided generation capabilities, which can lead to less controlled outputs from large language models. This poses a challenge particularly in scenarios where precise control over model responses is crucial, such as in classification tasks, structured data generation, or code generation. The lack of compatibility limits the potential for ensuring outputs adhere strictly to predefined formats or choices, leading to inefficiencies in post-processing and potentially inaccurate results.

Describe the solution you'd like

I would like Xinference to integrate support for vLLM's guided generation features, specifically enabling the guided_choice, guided_json, guided_regex, and guided_grammar parameters to be passed through the extra_body option, as in vLLM's OpenAI-compatible API. This would let users define strict constraints on the generated outputs, ensuring they conform to specific requirements:

Example of a vLLM OpenAI API request using guided generation:

curl --location --request POST 'http://ip:9997/v1/chat/completions' \
--header 'User-Agent: Apifox/1.0.0 (https://apifox.com)' \
--header 'Content-Type: application/json' \
--data '{
    "model": "qwen1.5-32b-chat-int4",
    "messages": [
        {
            "role": "user",
            "content": "“这家餐厅不好吃”属于正向评论还是负向评论?"
        }
    ],
    "temperature": 0,
    "max_tokens": 1000,
    "stream": true,
    "guided_choice": ["positive", "negative"]
}'
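
For reference, the same request can be made from the official openai Python client by passing the guided parameters through extra_body. This is only a sketch against a vLLM-style OpenAI-compatible endpoint; the base URL, api_key, and model name are placeholders, and an Xinference integration would presumably accept the same shape:

# Hedged sketch: guided_choice sent via extra_body to an OpenAI-compatible
# endpoint (vLLM-style). base_url, api_key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://ip:9997/v1", api_key="not-used")

response = client.chat.completions.create(
    model="qwen1.5-32b-chat-int4",
    messages=[
        {
            "role": "user",
            "content": "Is 'The food at this restaurant is bad' a positive or a negative review?",
        }
    ],
    temperature=0,
    max_tokens=1000,
    # Fields outside the official OpenAI schema go into extra_body;
    # vLLM's server reads guided_choice from there.
    extra_body={"guided_choice": ["positive", "negative"]},
)
print(response.choices[0].message.content)  # constrained to "positive" or "negative"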

Describe alternatives you've considered

As an alternative, I have considered manually post-processing the model outputs to enforce the desired formats, but this approach is inefficient and error-prone. Alternatively, I could use vLLM directly as the model-serving engine.

Additional context

For a comprehensive understanding of the desired functionality, please refer to vLLM's pull request #2819, which adds guided decoding support for the OpenAI API server. Detailed documentation on the extra parameters supported by vLLM's chat API, including the guided generation mechanisms, can be found in vLLM's official documentation. Integrating these features into Xinference would significantly enhance the platform's utility for the wide range of applications that require fine-grained control over LLM outputs.

zhanghx0905 commented 3 months ago

I also think this feature would be very useful; it would let developers control what the model generates. https://github.com/xorbitsai/inference/blob/d76549b7533a3ed20e225548e1bb2bf89e7296c8/xinference/model/llm/vllm/core.py#L67 I think adding a few fields to this class might accomplish it, roughly as sketched below.
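
A rough sketch of what those extra fields might look like, assuming the class at that line is a TypedDict-style generate config (the field names mirror vLLM's guided-decoding parameters; this is illustrative, not the actual Xinference code):

# Hedged sketch of extending Xinference's vLLM generate config with
# guided-decoding fields. Class and field names are illustrative; the real
# class lives in xinference/model/llm/vllm/core.py.
from typing import Dict, List, Optional, TypedDict

class VLLMGenerateConfig(TypedDict, total=False):
    # existing sampling fields (abridged)
    temperature: float
    top_p: float
    max_tokens: int
    # proposed guided-generation fields, mirroring vLLM's OpenAI server parameters
    guided_choice: Optional[List[str]]   # output must be one of these strings
    guided_json: Optional[Dict]          # output must satisfy this JSON schema
    guided_regex: Optional[str]          # output must match this regex
    guided_grammar: Optional[str]        # output must follow this grammar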

zhanghx0905 commented 3 months ago

There's a problem with the implementation: vLLM's SamplingParams don't yet include anything for guided generation. https://docs.vllm.ai/en/stable/dev/sampling_params.html
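
That matches how vLLM wires it up internally: the guided parameters are not fields on SamplingParams; the OpenAI server layer compiles them (via outlines) into a logits processor and attaches it through SamplingParams.logits_processors. A minimal sketch of the idea, with a pass-through stub standing in for the outlines-built processor and a placeholder model name:

# Hedged sketch: guided constraints enter vLLM through logits_processors on
# SamplingParams rather than as named sampling fields. The stub below is a
# stand-in for the outlines-based processor vLLM builds from guided_choice.
from vllm import LLM, SamplingParams

def guided_choice_stub(token_ids, logits):
    # A real guided_choice processor would mask every token that cannot
    # continue one of the allowed choices; here the logits are left untouched.
    return logits

llm = LLM(model="qwen1.5-32b-chat-int4")  # placeholder model name
params = SamplingParams(
    temperature=0,
    max_tokens=16,
    logits_processors=[guided_choice_stub],  # guided decoding plugs in here
)
outputs = llm.generate(["Is the review 'The food is bad' positive or negative?"], params)
print(outputs[0].outputs[0].text)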

qinxuye commented 2 months ago

Thanks. I found that vLLM's implementation is adapted from outlines, and outlines integrates with many libraries besides vLLM, including llama.cpp and transformers, so I think this is a feature that could be implemented for all the backends.
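
For context, outlines exposes the same kind of constraint directly on top of transformers, which is why a backend-agnostic integration looks feasible. A small sketch using outlines' choice generator (the model name is just an example):

# Hedged sketch: outlines (which vLLM's guided decoding builds on) applying a
# choice constraint directly to a transformers model, suggesting the feature
# could also be offered for non-vLLM backends. Model name is an example only.
import outlines

model = outlines.models.transformers("Qwen/Qwen1.5-0.5B-Chat")
classify = outlines.generate.choice(model, ["positive", "negative"])

answer = classify("Is the review 'The food at this restaurant is bad' positive or negative?")
print(answer)  # guaranteed to be exactly "positive" or "negative"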

sunxichen commented 1 month ago

Hi, any updates on this feature? Will it be added in the future? It could be very useful; for now I have to use vLLM directly for this. I really hope it can be integrated into Xinference.

qinxuye commented 1 month ago

Thanks, we definitely plan to support it; it's on our roadmap, but it will take some time before we have the resources to work on it. If anyone is interested in this feature, please let me know.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 7 days with no activity.