xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

GLM-4 chat 9b: 'ChatGLMForConditionalGeneration' object has no attribute 'stream_chat' #1962

Open Dravenlll opened 1 month ago

Dravenlll commented 1 month ago

System Info

python 3.11.8

Running Xinference with Docker?

Version info

Xinference v0.13.3

The command used to start Xinference

xinference-local --host 0.0.0.0 --port 9997

Reproduction

1. Launch the model.
2. Chatting with it raises an error (screenshot attached).

Expected behavior

Inference should work normally.

qinxuye commented 1 month ago

Non-streaming support for GLM4-chat is broken in 0.13.3, but streaming should work correctly. Please confirm your server version.
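
For reference, the streaming path can be exercised through the Python client roughly as below. This is a minimal sketch, assuming the server from the launch command above and a hypothetical model UID of glm4-chat; adjust both to your deployment:

import sys

from xinference.client import RESTfulClient

# Points at a server started with: xinference-local --host 0.0.0.0 --port 9997
client = RESTfulClient("http://127.0.0.1:9997")
model = client.get_model("glm4-chat")  # hypothetical model UID

# "stream": True requests the streaming code path, which the maintainer says
# works in 0.13.3 (unlike the non-streaming path).
for chunk in model.chat("你好", generate_config={"stream": True}):
    delta = chunk["choices"][0].get("delta", {})
    sys.stdout.write(delta.get("content", ""))
    sys.stdout.flush()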

Dravenlll commented 1 month ago

(screenshot attached)

chinacqzgp commented 1 month ago

Same issue here, looking for a solution.

jhj033 commented 1 month ago

The problem is in the modeling_chatglm.py of the glm4-9b model you downloaded (~/.cache/modelscope/hub/ZhipuAI/glm-4-9b-chat/modeling_chatglm.py): it has no stream_chat function. I found a copy of the file on ModelScope that does include stream_chat, copied it in, and it worked:

# Imports this snippet relies on; a stock modeling_chatglm.py already imports
# torch and the typing names and defines InvalidScoreLogitsProcessor itself,
# so LogitsProcessorList is the one most likely to be missing:
from typing import Dict, List

import torch
from transformers import LogitsProcessorList

@torch.inference_mode()
def stream_chat(self, tokenizer, query: str, history: List[Dict] = None, role: str = "user",
                past_key_values=None, max_length: int = 8192, do_sample=True, top_p=0.8, temperature=0.8,
                logits_processor=None, return_past_key_values=False, **kwargs):
    if history is None:
        history = []
    if logits_processor is None:
        logits_processor = LogitsProcessorList()
    logits_processor.append(InvalidScoreLogitsProcessor())
    eos_token_id = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|user|>"),
                    tokenizer.convert_tokens_to_ids("<|observation|>")]
    gen_kwargs = {"max_length": max_length, "do_sample": do_sample, "top_p": top_p,
                  "temperature": temperature, "logits_processor": logits_processor, **kwargs}
    if past_key_values is None:
        inputs = tokenizer.apply_chat_template(history + [{"role": role, "content": query}],
                                               add_generation_prompt=True, tokenize=True, return_tensors="pt",
                                               return_dict=True)
    else:
        inputs = tokenizer.apply_chat_template([{"role": role, "content": query}], add_special_tokens=False,
                                               add_generation_prompt=True, tokenize=True, return_tensors="pt",
                                               return_dict=True)
    inputs = inputs.to(self.device)
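    # When reusing a KV cache, shift the position ids past the cached tokens
    # and extend the attention mask to cover them.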
    if past_key_values is not None:
        past_length = past_key_values[0][0].shape[2]
        inputs.position_ids += past_length
        attention_mask = inputs.attention_mask
        attention_mask = torch.cat((attention_mask.new_ones(1, past_length), attention_mask), dim=1)
        inputs['attention_mask'] = attention_mask
    history.append({"role": role, "content": query})
    for outputs in self.stream_generate(**inputs, past_key_values=past_key_values,
                                        eos_token_id=eos_token_id, return_past_key_values=return_past_key_values,
                                        **gen_kwargs):
        if return_past_key_values:
            outputs, past_key_values = outputs
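        # Drop the prompt tokens and the trailing stop token, then decode.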
        outputs = outputs.tolist()[0][len(inputs["input_ids"][0]):-1]
        response = tokenizer.decode(outputs)
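        # "�" means the last token ends mid multi-byte character; wait for more.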
        if response and response[-1] != "�":
            response, new_history = self.process_response(response, history)
            if return_past_key_values:
                yield response, new_history, past_key_values
            else:
                yield response, new_history
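
If the patch took, a quick smoke test along these lines should stream text. This is a sketch, assuming the ModelScope cache path mentioned above and that accelerate is installed for device_map="auto"; note that stream_chat yields the accumulated response on each step:

import os

from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical smoke test for the patched file; the cache path is the
# ModelScope location mentioned above.
path = os.path.expanduser("~/.cache/modelscope/hub/ZhipuAI/glm-4-9b-chat")
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    path, trust_remote_code=True, torch_dtype="auto", device_map="auto"
).eval()

printed = ""
for response, history in model.stream_chat(tokenizer, "你好"):
    # Each yielded response is cumulative; print only the new suffix.
    print(response[len(printed):], end="", flush=True)
    printed = response
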
chinacqzgp commented 1 month ago

After copying in the stream_chat function, I get an error: name 'LogitsProcessorList' is not defined

gqchen-dz commented 1 month ago

The problem is still there in the latest 0.14.0; any inference attempt reports "error during streaming".

qinxuye commented 1 month ago

Please post the error message you get with 0.14.0.

Belye commented 1 month ago

2024-08-03 18:02:53,419 xinference.api.restful_api 1 ERROR Chat completion stream got an error: [address=0.0.0.0:39701, pid=1105] 'ChatGLMForConditionalGeneration' object has no attribute 'stream_chat'
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/xinference/api/restful_api.py", line 1671, in stream_results
    iterator = await model.chat(
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 656, in send
    result = await self._run_coro(message.message_id, coro)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 367, in _run_coro
    return await coro
  File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/utils.py", line 45, in wrapped
    ret = await func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 90, in wrapped_func
    ret = await fn(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 462, in _wrapper
    r = await func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 523, in chat
    response = await self._call_wrapper_json(
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 393, in _call_wrapper_json
    return await self._call_wrapper("json", fn, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 114, in _async_wrapper
    return await fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 404, in _call_wrapper
    ret = await asyncio.to_thread(fn, *args, **kwargs)
  File "/usr/lib/python3.10/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/pytorch/chatglm.py", line 481, in chat
    stream_chat = self._model.stream_chat
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1709, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: [address=0.0.0.0:39701, pid=1105] 'ChatGLMForConditionalGeneration' object has no attribute 'stream_chat'

qinxuye commented 1 month ago

Try updating to the latest model files.

chinacqzgp commented 1 month ago

How do I try the updated model files with Xinference? Or do I start it via the demo.py that ships with the model files?

Dravenlll commented 1 month ago

> Try updating to the latest model files.

The latest model files were committed ten-odd days ago, and the problem is still there: the model's modeling_chatglm.py has no stream_chat function.

Dravenlll commented 1 month ago

> Same issue here, looking for a solution.

With transformers==4.41.2, copy the modeling_chatglm.py from the glm-4-9b-chat-1m model over the one in glm-4-9b-chat. If you then hit ValueError: too many values to unpack (expected 2), see https://huggingface.co/THUDM/glm-4-9b-chat/discussions/58
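
In script form, that file swap looks roughly like this. It is a sketch assuming the ModelScope cache layout quoted earlier in the thread; adjust the paths if your models live elsewhere:

import shutil
from pathlib import Path

# Assumed ModelScope cache layout, per the path mentioned earlier in this thread.
hub = Path.home() / ".cache" / "modelscope" / "hub" / "ZhipuAI"
src = hub / "glm-4-9b-chat-1m" / "modeling_chatglm.py"
dst = hub / "glm-4-9b-chat" / "modeling_chatglm.py"

shutil.copy(dst, dst.with_suffix(".bak"))  # keep a backup of the original file
shutil.copy(src, dst)
print(f"copied {src} -> {dst}")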

qinxuye commented 1 month ago

You can upgrade transformers to the latest version now; we've adapted to the latest model files and the latest transformers.

wzhty86 commented 1 month ago

> You can upgrade transformers to the latest version now; we've adapted to the latest model files and the latest transformers.

After upgrading Xinference to 0.14.0.post1, it still fails:

  File "/home/aiuser/.conda/envs/xinference/lib/python3.11/site-packages/xinference/model/llm/pytorch/chatglm.py", line 481, in chat
    stream_chat = self._model.stream_chat
                  ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/aiuser/.conda/envs/xinference/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: [address=0.0.0.0:33882, pid=34764] 'ChatGLMForConditionalGeneration' object has no attribute 'stream_chat'

Xinference version: 0.14.0.post1; transformers version: 4.42.4

DentistCode commented 4 weeks ago

My guess is you're in the same boat as me: a fine-tuned model that won't load. After several days of fiddling I found a workaround: just add it to the path directly (screenshot attached).

To save others from repeating them, the approaches I already tried:

lergliu commented 4 weeks ago

> My guess is you're in the same boat as me: a fine-tuned model that won't load. After several days of fiddling I found a workaround: just add it to the path directly (screenshot attached).

Is your modeling_chatglm.py the original file, or one you patched up step by step? I added the path on my side, and I still get 'GenerationConfig' object has no attribute '_eos_token_tensor'

DentistCode commented 4 weeks ago

@lergliu The original one; I never touched it.

lergliu commented 4 weeks ago

> @lergliu The original one; I never touched it.

I just tried the original file, and I still get 'ChatGLMForConditionalGeneration' object has no attribute 'stream_chat'

DentistCode commented 4 weeks ago

> I just tried the original file, and I still get 'ChatGLMForConditionalGeneration' object has no attribute 'stream_chat'

Oh, sorry; I just checked, and the modeling_chatglm.py I'm using is the one from the 1m variant.

lergliu commented 4 weeks ago

> Oh, sorry; I just checked, and the modeling_chatglm.py I'm using is the one from the 1m variant.

Still no luck here; I still get 'GenerationConfig' object has no attribute '_eos_token_tensor'

lergliu commented 4 weeks ago

> Oh, sorry; I just checked, and the modeling_chatglm.py I'm using is the one from the 1m variant.

Could you send me your file so I can take a look? My email is lergiu@126.com; please send it there if that's convenient.

vivien8261 commented 3 weeks ago

> Still no luck here; I still get 'GenerationConfig' object has no attribute '_eos_token_tensor'

Same problem here: inference fails with 'GenerationConfig' object has no attribute '_eos_token_tensor' (glm3-6b). Did you ever solve it?

lergliu commented 3 weeks ago

> Same problem here: inference fails with 'GenerationConfig' object has no attribute '_eos_token_tensor' (glm3-6b). Did you ever solve it?

Not yet.

vivien8261 commented 3 weeks ago

> Not yet.

Downgrading transformers to 4.42 fixes it, but it brings other problems; I've gone down to 4.41 for now. See here.

chinacqzgp commented 3 weeks ago

> You can upgrade transformers to the latest version now; we've adapted to the latest model files and the latest transformers.

Which transformers version did you adapt to? Has a glm4-chat deployment actually been tested successfully? When we install xinference, the transformers version is pulled in automatically.

lergliu commented 3 weeks ago

> Downgrading transformers to 4.42 fixes it, but it brings other problems; I've gone down to 4.41 for now. See here.

Both 4.42 and 4.41 are broken for me.

chinacqzgp commented 3 weeks ago

> With transformers==4.41.2, copy the modeling_chatglm.py from the glm-4-9b-chat-1m model over the one in glm-4-9b-chat. If you then hit ValueError: too many values to unpack (expected 2), see https://huggingface.co/THUDM/glm-4-9b-chat/discussions/58

After replacing the file, a chat test pops up the error: GenerationMixin._get_logits_warper() missing 1 required positional argument: 'device'

vivien8261 commented 3 weeks ago

@chinacqzgp Keep going down, to 4.36.

lergliu commented 3 weeks ago

> With transformers==4.41.2, copy the modeling_chatglm.py from the glm-4-9b-chat-1m model over the one in glm-4-9b-chat. If you then hit ValueError: too many values to unpack (expected 2), see https://huggingface.co/THUDM/glm-4-9b-chat/discussions/58

This approach works; after much fiddling, it's finally sorted. Thanks!

DentistCode commented 3 weeks ago

> This approach works; after much fiddling, it's finally sorted. Thanks!

As long as it works. I just redeployed from scratch, and it still runs fine.

The steps:

pip install "xinference[all]"
pip install "xinference[transformers]"
pip install "xinference[vllm]"
pip install tiktoken
pip install sentence-transformers
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/

The GPU is an L20.

I used the base glm4-9b-chat model, git-cloned from Hugging Face; plus git clone https://github.com/hiyouga/LLaMA-Factory.git followed by git lfs pull.