xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

glm4-chat tool calling does not return a correct answer #2297

Open kingdomad opened 5 days ago

kingdomad commented 5 days ago

System Info

Ubuntu 22.04

Running Xinference with Docker?

Version info

0.15.0

The command used to start Xinference

xinference-local --host 0.0.0.0 --port 9997

Reproduction

Launch command

uid = client.launch_model(
    model_name="glm4-chat",
    model_type="LLM",
    model_engine="vllm",
    model_format="pytorch",
    model_uid="glm-4-9b-chat",
    model_path="/data/model/glm-4-9b-chat",
    max_num_seqs=10,
    gpu_memory_utilization=0.95,
    n_gpu=2,
    dtype="half",
    max_model_len=8192
)
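
(For context: two different clients appear in this repro. launch_model is presumably called on the Xinference client, while the chat request below goes through the server's OpenAI-compatible API; the snippet reuses the name client for both. A minimal sketch of how the two might be constructed, with the endpoint taken from the launch command and a placeholder api_key:)

from openai import OpenAI
from xinference.client import Client as XinferenceClient

# Client used for launch_model above (endpoint from the launch command).
client = XinferenceClient("http://localhost:9997")

# ... launch_model(...) as above, then rebind `client` to an OpenAI-compatible
# client for the chat request below; without auth enabled Xinference ignores
# the api_key, so any placeholder should work.
client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-used")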

Inference request

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA",
                    },
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]
messages = [{"role": "user", "content": "What's the weather like in Boston today?"}]
completion = client.chat.completions.create(
    model="glm-4-9b-chat", messages=messages, tools=tools, tool_choice="auto",
)

print(completion)

The response received:

ChatCompletion(id='chatcmpl-cb5bea27-6c40-48fd-80d0-e708270e585c', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{\'id\': \'dd7cb578-71a1-11ef-934d-50ebf6824ff7\', \'object\': \'text_completion\', \'created\': 1726212550, \'model\': \'glm-4-9b-chat\', \'choices\': [{\'text\': "\\nI\'m sorry, I can\'t provide real-time data like weather information. To find out the current weather in Boston, you can using a search engine, visit a weather website like National Weather Service or a weather apps on your smartphone.", \'index\': 0, \'logprobs\': None, \'finish_reason\': \'stop\'}], \'usage\': {\'prompt_tokens\': 16, \'completion_tokens\': 49, \'total_tokens\': 65}}', refusal=None, role='assistant', function_call=None, tool_calls=[]))], created=1726212550, model='glm-4-9b-chat', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=49, prompt_tokens=16, total_tokens=65))

The server log:

2024-09-13 15:29:10,322 xinference.model.llm.utils 106652 ERROR    Can't parse glm output: {'id': 'dd7cb578-71a1-11ef-934d-50ebf6824ff7', 'object': 'text_completion', 'created': 1726212550, 'model': 'glm-4-9b-chat', 'choices': [{'text': "\nI'm sorry, I can't provide real-time data like weather information. To find out the current weather in Boston, you can using a search engine, visit a weather website like National Weather Service or a weather apps on your smartphone.", 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 16, 'completion_tokens': 49, 'total_tokens': 65}}

Expected behavior

The content should be in the correct format: the tool call should be parsed into tool_calls, instead of the raw completion dict being dumped into content as a string.
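
For comparison, a well-formed tool-call response in the OpenAI-compatible format would look roughly like this (illustrative values only, not actual output):

ChatCompletion(
    id='chatcmpl-...',
    choices=[Choice(
        finish_reason='tool_calls',
        index=0,
        message=ChatCompletionMessage(
            content=None,
            role='assistant',
            tool_calls=[ChatCompletionMessageToolCall(
                id='call_...',
                function=Function(
                    name='get_current_weather',
                    arguments='{"location": "Boston, MA", "unit": "celsius"}',
                ),
                type='function',
            )],
        ),
    )],
    model='glm-4-9b-chat',
    object='chat.completion',
)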

kingdomad commented 5 days ago

After some debugging: there is a logic error in the async_chat method of xinference.model.llm.vllm.core.VLLMChatModel. The pre-processing step drops glm4's tools (they are only passed into the chat template for Qwen-family models), but the post-processing step still tries to parse tool calls from the output.

async def async_chat(
    self,
    messages: List[Dict],
    generate_config: Optional[Dict] = None,
    request_id: Optional[str] = None,
) -> Union[ChatCompletion, AsyncGenerator[ChatCompletionChunk, None]]:
    tools = generate_config.pop("tools", []) if generate_config else None
    model_family = self.model_family.model_family or self.model_family.model_name
    full_context_kwargs = {}
    # Pre-processing: tools are only injected into the chat template for
    # Qwen-family models, so glm4-chat's tools are silently dropped here.
    if tools and model_family in QWEN_TOOL_CALL_FAMILY:
        full_context_kwargs["tools"] = tools
    assert self.model_family.chat_template is not None
    full_prompt = self.get_full_context(
        messages, self.model_family.chat_template, **full_context_kwargs
    )

    generate_config = self._sanitize_chat_config(generate_config)
    stream = generate_config.get("stream", None)

    if stream:
        agen = await self.async_generate(
            full_prompt, generate_config, tools, request_id=request_id
        )
        assert isinstance(agen, AsyncGenerator)
        if tools:
            return self._async_to_tool_completion_chunks(agen)
        return self._async_to_chat_completion_chunks(agen)
    else:
        c = await self.async_generate(
            full_prompt, generate_config, request_id=request_id
        )
        assert not isinstance(c, AsyncGenerator)
        # Post-processing: tool-call parsing still runs whenever tools were
        # passed, even though they never reached the prompt for glm4-chat,
        # which is why the raw model text fails to parse.
        if tools:
            return self._tool_calls_completion(self.model_family, self.model_uid, c)
        return self._to_chat_completion(c)
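
One possible direction for a fix (a sketch only, not the actual upstream patch): extend the pre-processing check so glm4-family models also get their tools rendered into the chat template, assuming the glm4 chat template accepts a tools variable. GLM4_TOOL_CALL_FAMILY is a name invented here for illustration:

# Sketch of a fix, assuming glm4's chat template can render a `tools` variable.
# GLM4_TOOL_CALL_FAMILY is a hypothetical constant for this illustration.
GLM4_TOOL_CALL_FAMILY = ["glm4-chat", "glm4-chat-1m"]

if tools and model_family in QWEN_TOOL_CALL_FAMILY + GLM4_TOOL_CALL_FAMILY:
    full_context_kwargs["tools"] = tools

Alternatively, if tools cannot reach the prompt for a given family, the symmetric fix is to skip _tool_calls_completion for that family in post-processing, so the plain-text output is not mis-parsed as a tool call.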