xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

Integration of Guided Generation Features from vllm into XInference #1562

Open sunxichen opened 3 months ago

sunxichen commented 3 months ago

Is your feature request related to a problem? Please describe

Currently, when utilizing Xinference with vllm as the backend, users are unable to leverage vllm's advanced guided generation capabilities, which can lead to less controlled outputs from large language models. This poses a challenge particularly in scenarios where precise control over model responses is crucial, such as in classification tasks, structured data generation, or code generation. The lack of compatibility limits the potential for ensuring outputs adhere strictly to predefined formats or choices, leading to inefficiencies in post-processing and potentially inaccurate results.

Describe the solution you'd like

I would like Xinference to integrate support for vLLM's guided generation features, specifically enabling the guided_choice, guided_json, guided_regex, and guided_grammar parameters to be passed through the extra_body option, as in vLLM's OpenAI-compatible API. This would let users define strict constraints on the generated outputs, ensuring they conform to specific requirements:

Example of a vLLM OpenAI API request using guided generation:

curl --location --request POST 'http://ip:9997/v1/chat/completions' \
--header 'User-Agent: Apifox/1.0.0 (https://apifox.com)' \
--header 'Content-Type: application/json' \
--data '{
    "model": "qwen1.5-32b-chat-int4",
    "messages": [
        {
            "role": "user",
            "content": "“这家餐厅不好吃”属于正向评论还是负向评论?"
        }
    ],
    "temperature": 0,
    "max_tokens": 1000,
    "stream": true,
    "guided_choice": ["positive", "negative"]
}'
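
For reference, the same request can be made from the official openai Python client by passing the guided parameters through extra_body. This is only a sketch against a vLLM-style OpenAI-compatible endpoint; the base URL, api_key, and model name are placeholders, and an Xinference integration would presumably accept the same shape:

# Hedged sketch: guided_choice sent via extra_body to an OpenAI-compatible
# endpoint (vLLM-style). base_url, api_key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://ip:9997/v1", api_key="not-used")

response = client.chat.completions.create(
    model="qwen1.5-32b-chat-int4",
    messages=[
        {
            "role": "user",
            "content": "Is 'The food at this restaurant is bad' a positive or a negative review?",
        }
    ],
    temperature=0,
    max_tokens=1000,
    # Fields outside the official OpenAI schema go into extra_body;
    # vLLM's server reads guided_choice from there.
    extra_body={"guided_choice": ["positive", "negative"]},
)
print(response.choices[0].message.content)  # constrained to "positive" or "negative"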

Describe alternatives you've considered

As an alternative, I have considered manually post-processing the model outputs to enforce the desired formats, but this approach is inefficient and error-prone. Alternatively, I could use vLLM directly as the model-serving engine.

Additional context

For a comprehensive understanding of the desired functionality, please refer to vLLM's pull request #2819, which adds guided decoding support for the OpenAI API server. Detailed documentation on the extra parameters supported by vLLM's chat API, including the guided generation mechanisms, can be found in vLLM's official documentation. Integrating these features into Xinference would significantly enhance the platform's utility for the wide range of applications that require fine-grained control over LLM outputs.

zhanghx0905 commented 3 months ago

I also think this feature would be very useful; it would let developers control what the model generates. https://github.com/xorbitsai/inference/blob/d76549b7533a3ed20e225548e1bb2bf89e7296c8/xinference/model/llm/vllm/core.py#L67 I think adding a few fields to this class might accomplish it, roughly as sketched below.
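
A rough sketch of what those extra fields might look like, assuming the class at that line is a TypedDict-style generate config (the field names mirror vLLM's guided-decoding parameters; this is illustrative, not the actual Xinference code):

# Hedged sketch of extending Xinference's vLLM generate config with
# guided-decoding fields. Class and field names are illustrative; the real
# class lives in xinference/model/llm/vllm/core.py.
from typing import Dict, List, Optional, TypedDict

class VLLMGenerateConfig(TypedDict, total=False):
    # existing sampling fields (abridged)
    temperature: float
    top_p: float
    max_tokens: int
    # proposed guided-generation fields, mirroring vLLM's OpenAI server parameters
    guided_choice: Optional[List[str]]   # output must be one of these strings
    guided_json: Optional[Dict]          # output must satisfy this JSON schema
    guided_regex: Optional[str]          # output must match this regex
    guided_grammar: Optional[str]        # output must follow this grammar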

zhanghx0905 commented 3 months ago

There's a problem with the implementation: vLLM's SamplingParams don't yet include anything for guided generation. https://docs.vllm.ai/en/stable/dev/sampling_params.html
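
That matches how vLLM wires it up internally: the guided parameters are not fields on SamplingParams; the OpenAI server layer compiles them (via outlines) into a logits processor and attaches it through SamplingParams.logits_processors. A minimal sketch of the idea, with a pass-through stub standing in for the outlines-built processor and a placeholder model name:

# Hedged sketch: guided constraints enter vLLM through logits_processors on
# SamplingParams rather than as named sampling fields. The stub below is a
# stand-in for the outlines-based processor vLLM builds from guided_choice.
from vllm import LLM, SamplingParams

def guided_choice_stub(token_ids, logits):
    # A real guided_choice processor would mask every token that cannot
    # continue one of the allowed choices; here the logits are left untouched.
    return logits

llm = LLM(model="qwen1.5-32b-chat-int4")  # placeholder model name
params = SamplingParams(
    temperature=0,
    max_tokens=16,
    logits_processors=[guided_choice_stub],  # guided decoding plugs in here
)
outputs = llm.generate(["Is the review 'The food is bad' positive or negative?"], params)
print(outputs[0].outputs[0].text)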

qinxuye commented 2 months ago

Thanks. I found that vLLM's implementation is adapted from outlines, and outlines integrates with many libraries besides vLLM, including llama.cpp and transformers, so I think this is a feature that could be implemented for all the backends.
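
For context, outlines exposes the same kind of constraint directly on top of transformers, which is why a backend-agnostic integration looks feasible. A small sketch using outlines' choice generator (the model name is just an example):

# Hedged sketch: outlines (which vLLM's guided decoding builds on) applying a
# choice constraint directly to a transformers model, suggesting the feature
# could also be offered for non-vLLM backends. Model name is an example only.
import outlines

model = outlines.models.transformers("Qwen/Qwen1.5-0.5B-Chat")
classify = outlines.generate.choice(model, ["positive", "negative"])

answer = classify("Is the review 'The food at this restaurant is bad' positive or negative?")
print(answer)  # guaranteed to be exactly "positive" or "negative"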

sunxichen commented 1 month ago

Hi, any updates on this feature? Will it be added in the future? It could be very useful; for now I have to use vLLM directly for this. I really hope it can be integrated into Xinference.

qinxuye commented 1 month ago

Thanks, we definitely plan to support it; it's on our roadmap, but it will take some time before we have the resources to work on it. If anyone is interested in this feature, please let me know.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 7 days with no activity.