xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

Failed to do inference with latest GLM-4 chat 9b model #1882

Closed: jsyqrt closed this issue 1 month ago

jsyqrt commented 1 month ago

System Info

python --version
Python 3.11.0

Running Xinference with Docker?

Version info

xinference -v
xinference, version 0.13.1

The command used to start Xinference

XINFERENCE_MODEL_SRC=modelscope xinference-local --host 0.0.0.0 --port 9997

Reproduction

Launch the model with xinference launch --model-engine http://0.0.0.0:9997 -n glm4-chat -s 9 -f pytorch -q none -en transformers, then run inference (a request sketch follows below).
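For reference, the failing request can be reproduced with a minimal sketch like the one below, using Xinference's OpenAI-compatible endpoint (the prompt is illustrative; the endpoint and model name match the launch commands above):

```python
# Minimal reproduction sketch: a streaming chat completion against
# Xinference's OpenAI-compatible API. Streaming is what exercises the
# stream_chat code path that fails in the log below.
import openai

client = openai.OpenAI(base_url="http://0.0.0.0:9997/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="glm4-chat",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in response:
    # delta.content can be None on role/finish chunks, hence the fallback
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```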

Detailed xinference server-side log:

2024-07-17 10:36:26,322 transformers.configuration_utils 235951 INFO     loading configuration file /home/jason/.xinference/cache/glm4-chat-pytorch-9b/config.json
2024-07-17 10:36:26,323 transformers.configuration_utils 235951 INFO     loading configuration file /home/jason/.xinference/cache/glm4-chat-pytorch-9b/config.json
2024-07-17 10:36:26,323 transformers.configuration_utils 235951 INFO     Model config ChatGLMConfig {
  "_name_or_path": "/home/jason/.xinference/cache/glm4-chat-pytorch-9b",
  "add_bias_linear": false,
  "add_qkv_bias": true,
  "apply_query_key_layer_scaling": true,
  "apply_residual_connection_post_layernorm": false,
  "architectures": [
    "ChatGLMModel"
  ],
  "attention_dropout": 0.0,
  "attention_softmax_in_fp32": true,
  "auto_map": {
    "AutoConfig": "configuration_chatglm.ChatGLMConfig",
    "AutoModel": "modeling_chatglm.ChatGLMForConditionalGeneration",
    "AutoModelForCausalLM": "modeling_chatglm.ChatGLMForConditionalGeneration",
    "AutoModelForSeq2SeqLM": "modeling_chatglm.ChatGLMForConditionalGeneration",
    "AutoModelForSequenceClassification": "modeling_chatglm.ChatGLMForSequenceClassification"
  },
  "bias_dropout_fusion": true,
  "classifier_dropout": null,
  "eos_token_id": [
    151329,
    151336,
    151338
  ],
  "ffn_hidden_size": 13696,
  "fp32_residual_connection": false,
  "hidden_dropout": 0.0,
  "hidden_size": 4096,
  "kv_channels": 128,
  "layernorm_epsilon": 1.5625e-07,
  "model_type": "chatglm",
  "multi_query_attention": true,
  "multi_query_group_num": 2,
  "num_attention_heads": 32,
  "num_hidden_layers": 40,
  "num_layers": 40,
  "original_rope": true,
  "pad_token_id": 151329,
  "padded_vocab_size": 151552,
  "post_layer_norm": true,
  "rmsnorm": true,
  "rope_ratio": 500,
  "seq_length": 131072,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.42.4",
  "use_cache": true,
  "vocab_size": 151552
}

2024-07-17 10:36:26,688 transformers.modeling_utils 235951 INFO     loading weights file /home/jason/.xinference/cache/glm4-chat-pytorch-9b/model.safetensors.index.json
2024-07-17 10:36:26,688 transformers.modeling_utils 235951 INFO     Instantiating ChatGLMForConditionalGeneration model under default dtype torch.float32.
2024-07-17 10:36:26,689 transformers.generation.configuration_utils 235951 INFO     Generate config GenerationConfig {
  "eos_token_id": [
    151329,
    151336,
    151338
  ],
  "pad_token_id": 151329
}

Loading checkpoint shards: 100%|██████████████████████| 10/10 [00:02<00:00,  4.06it/s]
2024-07-17 10:36:29,209 transformers.modeling_utils 235951 INFO     All model checkpoint weights were used when initializing ChatGLMForConditionalGeneration.

2024-07-17 10:36:29,209 transformers.modeling_utils 235951 INFO     All the weights of ChatGLMForConditionalGeneration were initialized from the model checkpoint at /home/jason/.xinference/cache/glm4-chat-pytorch-9b.
If your task is similar to the task the model of the checkpoint was trained on, you can already use ChatGLMForConditionalGeneration for predictions without further training.
2024-07-17 10:36:29,211 transformers.generation.configuration_utils 235951 INFO     loading configuration file /home/jason/.xinference/cache/glm4-chat-pytorch-9b/generation_config.json
2024-07-17 10:36:29,212 transformers.generation.configuration_utils 235951 INFO     Generate config GenerationConfig {
  "do_sample": true,
  "eos_token_id": [
    151329,
    151336,
    151338
  ],
  "max_length": 128000,
  "pad_token_id": 151329,
  "temperature": 0.8,
  "top_p": 0.8
}

2024-07-17 10:37:14,866 xinference.api.restful_api 233925 ERROR    Chat completion stream got an error: [address=0.0.0.0:28743, pid=235951] 'ChatGLMForConditionalGeneration' object has no attribute 'stream_chat'
Traceback (most recent call last):
  File "/home/jason/.conda/envs/conda-env-for-xinference/lib/python3.11/site-packages/xinference/api/restful_api.py", line 1584, in stream_results
    async for item in iterator:
  File "/home/jason/.conda/envs/conda-env-for-xinference/lib/python3.11/site-packages/xoscar/api.py", line 340, in __anext__
    return await self._actor_ref.__xoscar_next__(self._uid)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jason/.conda/envs/conda-env-for-xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jason/.conda/envs/conda-env-for-xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/jason/.conda/envs/conda-env-for-xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 656, in send
    result = await self._run_coro(message.message_id, coro)
    ^^^^^^^^^^^^^^^^^
  File "/home/jason/.conda/envs/conda-env-for-xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 367, in _run_coro
    return await coro
  File "/home/jason/.conda/envs/conda-env-for-xinference/lib/python3.11/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
    ^^^^^^^^^^^^^^^^^
  File "/home/jason/.conda/envs/conda-env-for-xinference/lib/python3.11/site-packages/xoscar/api.py", line 431, in __xoscar_next__
    raise e
  File "/home/jason/.conda/envs/conda-env-for-xinference/lib/python3.11/site-packages/xoscar/api.py", line 417, in __xoscar_next__
    r = await asyncio.to_thread(_wrapper, gen)
    ^^^^^^^^^^^^^^^^^
  File "/home/jason/.conda/envs/conda-env-for-xinference/lib/python3.11/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
      ^^^^^^^^^^^^^^^^^
  File "/home/jason/.conda/envs/conda-env-for-xinference/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/home/jason/.conda/envs/conda-env-for-xinference/lib/python3.11/site-packages/xoscar/api.py", line 402, in _wrapper
    return next(_gen)
  File "/home/jason/.conda/envs/conda-env-for-xinference/lib/python3.11/site-packages/xinference/core/model.py", line 318, in _to_json_generator
    for v in gen:
  File "/home/jason/.conda/envs/conda-env-for-xinference/lib/python3.11/site-packages/xinference/model/llm/utils.py", line 558, in _to_chat_completion_chunks
    for i, chunk in enumerate(chunks):
    ^^^^^^^^^^^^^^^^^
  File "/home/jason/.conda/envs/conda-env-for-xinference/lib/python3.11/site-packages/xinference/model/llm/pytorch/chatglm.py", line 259, in _stream_generator
    for chunk_text, _ in self._model.stream_chat(
    ^^^^^^^^^^^^^^^^^
  File "/home/jason/.conda/envs/conda-env-for-xinference/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
    ^^^^^^^^^^^^^^^^^
AttributeError: [address=0.0.0.0:28743, pid=235951] 'ChatGLMForConditionalGeneration' object has no attribute 'stream_chat'

Expected behavior

Inference should run normally without raising an error.

KOBEBRYANTand commented 1 month ago

I'm hitting the same error; my xinference deployment has this problem. Has it been resolved?

jsyqrt commented 1 month ago

It seems the stream_chat function was deleted upstream:

https://modelscope.cn/models/ZhipuAI/glm-4-9b-chat/files/59a0d59f0befb468b895fcd204f4fd1f99c68fd6#diff_view_modeling_chatglm.py

(Screenshot: the modeling_chatglm.py diff showing the stream_chat method being removed.)
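Until a fix is released, one possible workaround is to stream through transformers directly instead of the removed stream_chat helper, for example with TextIteratorStreamer. A rough sketch under those assumptions (the cache path is taken from the server log above; the prompt and generation parameters are illustrative):

```python
# Sketch: stream GLM-4 output without the removed stream_chat helper,
# using transformers' TextIteratorStreamer with model.generate running
# in a background thread. Not necessarily Xinference's actual fix;
# see #1876 for that.
from threading import Thread

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_path = "/home/jason/.xinference/cache/glm4-chat-pytorch-9b"  # from the log above
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.float16, device_map="auto"
).eval()

# Build chat-formatted inputs with the tokenizer's chat template.
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello"}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

# The streamer yields decoded text pieces as generate() produces tokens.
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
Thread(
    target=model.generate,
    kwargs=dict(**inputs, streamer=streamer, max_new_tokens=256),
).start()
for chunk_text in streamer:
    print(chunk_text, end="", flush=True)
```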

qinxuye commented 1 month ago

This is addressed in #1876 and will be included in the next release. Feel free to reopen this issue if it still doesn't work once the new version is out.