msg_chatglm3.json:

{
  "messages": [
    {
      "role": "system",
      "content": "You will play the role of an interviewer for a technology company, examining the user's web front-end development skills and posing 5-10 sharp technical questions.\n\nPlease note:\n- Only ask one question at a time.\n- After the user answers a question, ask the next question directly, without trying to correct any mistakes made by the candidate.\n- If you think the user has not answered correctly for several consecutive questions, ask fewer questions.\n- After asking the last question, you can ask this question: Why did you leave your last job? After the user answers this question, please express your understanding and support.\n"
    },
    {"role": "user", "content": "你好"}
  ],
  "model": "chatglm3-32k",
  "max_tokens": 512,
  "stream": false,
  "temperature": 0.01,
  "top_p": 1,
  "user": "933ee52d-ae01-4704-9229-2b15c4a81571"
}
Command line:
curl -X POST -H "Content-Type: application/json" -d@msg_chatglm3.json http://172.16.1.76:9998/v1/chat/completions -s
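For reference, the same request can be built and sent from Python. This is a minimal sketch: the payload and server address are taken from the report above, the system prompt is abbreviated here, and the actual send is commented out because it needs the server to be running.

```python
import json

# Rebuild the body of msg_chatglm3.json programmatically
# (system prompt abbreviated; the full text is in the JSON above).
payload = {
    "messages": [
        {"role": "system", "content": "You will play the role of an interviewer ..."},
        {"role": "user", "content": "你好"},
    ],
    "model": "chatglm3-32k",
    "max_tokens": 512,
    "stream": False,
    "temperature": 0.01,
    "top_p": 1,
}
body = json.dumps(payload, ensure_ascii=False)
print(body)

# Equivalent to the curl invocation (requires the Xinference server to be up):
# import urllib.request
# req = urllib.request.Request(
#     "http://172.16.1.76:9998/v1/chat/completions",
#     data=body.encode("utf-8"),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode("utf-8"))
```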
2024-03-07 02:55:07,158 xinference.core.supervisor 80 DEBUG Enter describe_model, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f5042acc900>, 'chatglm3-32k'), kwargs: {}
2024-03-07 02:55:07,158 xinference.core.worker 80 DEBUG Enter describe_model, args: (<xinference.core.worker.WorkerActor object at 0x7f5042b189f0>,), kwargs: {'model_uid': 'chatglm3-32k-1-0'}
2024-03-07 02:55:07,158 xinference.core.worker 80 DEBUG Leave describe_model, elapsed time: 0 s
2024-03-07 02:55:07,158 xinference.core.supervisor 80 DEBUG Leave describe_model, elapsed time: 0 s
2024-03-07 02:55:07,165 xinference.core.supervisor 80 DEBUG Enter get_model, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f5042acc900>, 'chatglm3-32k'), kwargs: {}
2024-03-07 02:55:07,166 xinference.core.worker 80 DEBUG Enter get_model, args: (<xinference.core.worker.WorkerActor object at 0x7f5042b189f0>,), kwargs: {'model_uid': 'chatglm3-32k-1-0'}
2024-03-07 02:55:07,166 xinference.core.worker 80 DEBUG Leave get_model, elapsed time: 0 s
2024-03-07 02:55:07,166 xinference.core.supervisor 80 DEBUG Leave get_model, elapsed time: 0 s
2024-03-07 02:55:07,166 xinference.core.supervisor 80 DEBUG Enter describe_model, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f5042acc900>, 'chatglm3-32k'), kwargs: {}
2024-03-07 02:55:07,166 xinference.core.worker 80 DEBUG Enter describe_model, args: (<xinference.core.worker.WorkerActor object at 0x7f5042b189f0>,), kwargs: {'model_uid': 'chatglm3-32k-1-0'}
2024-03-07 02:55:07,166 xinference.core.worker 80 DEBUG Leave describe_model, elapsed time: 0 s
2024-03-07 02:55:07,166 xinference.core.supervisor 80 DEBUG Leave describe_model, elapsed time: 0 s
2024-03-07 02:55:07,168 xinference.core.model 99 DEBUG Enter wrapped_func, args: (<xinference.core.model.ModelActor object at 0x7f88c9563f60>, '你好', "You will play the role of an interviewer for a technology company, examining the user's web front-end development skills and posing 5-10 sharp technical questions.\n\nPlease note:\n- Only ask one question at a time.\n- After the user answers a question, ask the next question directly, without trying to correct any mistakes made by the candidate.\n- If you think the user has not answered correctly for several consecutive questions, ask fewer questions.\n- After asking the last question, you can ask this question: Why did you leave your last job? After the user answers this question, please express your understanding and support.\n", [], {'max_tokens': 512, 'temperature': 0.01, 'top_p': 1.0, 'stream': True}), kwargs: {}
2024-03-07 02:55:07,168 xinference.core.model 99 DEBUG Request chat, current serve request count: 0, request limit: None for the model chatglm3-32k
2024-03-07 02:55:07,169 xinference.core.model 99 DEBUG After request chat, current serve request count: 0 for the model chatglm3-32k
2024-03-07 02:55:07,169 xinference.core.model 99 DEBUG Leave wrapped_func, elapsed time: 0 s
--- Logging error ---
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/logging/handlers.py", line 73, in emit
if self.shouldRollover(record):
File "/opt/conda/lib/python3.10/logging/handlers.py", line 196, in shouldRollover
msg = "%s\n" % self.format(record)
File "/opt/conda/lib/python3.10/logging/__init__.py", line 943, in format
return fmt.format(record)
File "/opt/conda/lib/python3.10/logging/__init__.py", line 678, in format
record.message = record.getMessage()
File "/opt/conda/lib/python3.10/logging/__init__.py", line 368, in getMessage
msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
File "/opt/conda/lib/python3.10/threading.py", line 973, in _bootstrap
self._bootstrap_inner()
File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/opt/conda/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.10/concurrent/futures/thread.py", line 83, in _worker
work_item.run()
File "/opt/conda/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/opt/conda/lib/python3.10/site-packages/xoscar/api.py", line 402, in _wrapper
return next(_gen)
File "/opt/conda/lib/python3.10/site-packages/xinference/core/model.py", line 257, in _to_json_generator
for v in gen:
File "/opt/conda/lib/python3.10/site-packages/xinference/model/llm/utils.py", line 470, in _to_chat_completion_chunks
for i, chunk in enumerate(chunks):
File "/opt/conda/lib/python3.10/site-packages/xinference/model/llm/pytorch/chatglm.py", line 149, in _stream_generator
for chunk_text, _ in self._model.stream_chat(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 35, in generator_context
response = gen.send(None)
File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-32k-pytorch-6b/modeling_chatglm.py", line 1072, in stream_chat
for outputs in self.stream_generate(**inputs, past_key_values=past_key_values,
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 35, in generator_context
response = gen.send(None)
File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-32k-pytorch-6b/modeling_chatglm.py", line 1121, in stream_generate
logger.warn(
Message: 'Both `max_new_tokens` (=512) and `max_length`(=520) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)'
Arguments: (<class 'UserWarning'>,)
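The "Logging error" appears to originate in the model's own bundled code, not in Xinference's serving layer: the traceback ends at `modeling_chatglm.py:1121`, where `logger.warn(...)` is called with `UserWarning` as an extra positional argument (the `warnings.warn` signature, not the `logging` one). Since the message contains no `%`-placeholders, `logging` fails when it formats `msg % args`. A minimal sketch reproducing the same `TypeError`:

```python
import logging

# Reconstruct the LogRecord that logging builds when model code calls
# logger.warn(message, UserWarning): the message has no %-placeholders,
# yet UserWarning lands in record.args, so msg % args must fail.
record = logging.LogRecord(
    name="modeling_chatglm",
    level=logging.WARNING,
    pathname="modeling_chatglm.py",
    lineno=1121,
    msg="Both `max_new_tokens` (=512) and `max_length`(=520) seem to have been set.",
    args=(UserWarning,),
    exc_info=None,
)

try:
    # logging formats the message as: msg % args
    record.getMessage()
except TypeError as e:
    print(e)  # not all arguments converted during string formatting
```

If this reading is right, the fix belongs in the chatglm3 model repository (its remote code), while the underlying `max_new_tokens` vs `max_length` warning itself is harmless.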
Steps to reproduce:
- Deploy the model chatglm3-32k.
- POST the message above via the command line (curl).
- A response is returned, but the backend logs the error above.

Could you please help confirm whether this is a problem with the model itself or with Xinference?