msg_chatglm3.json:

{
  "messages": [
    {
      "role": "system",
      "content": "You will play the role of an interviewer for a technology company, examining the user's web front-end development skills and posing 5-10 sharp technical questions.\n\nPlease note:\n- Only ask one question at a time.\n- After the user answers a question, ask the next question directly, without trying to correct any mistakes made by the candidate.\n- If you think the user has not answered correctly for several consecutive questions, ask fewer questions.\n- After asking the last question, you can ask this question: Why did you leave your last job? After the user answers this question, please express your understanding and support.\n"
    },
    {"role": "user", "content": "你好"}
  ],
  "model": "chatglm3-32k",
  "max_tokens": 512,
  "stream": false,
  "temperature": 0.01,
  "top_p": 1,
  "user": "933ee52d-ae01-4704-9229-2b15c4a81571"
}
Command line:
curl -X POST -H "Content-Type: application/json" -d@msg_chatglm3.json http://172.16.1.76:9998/v1/chat/completions -s
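For reference, the same request can be built and sent from Python. This is a minimal sketch: the payload and server address are taken from the report above, the system prompt is abbreviated here, and the actual send is commented out because it needs the server to be running.

```python
import json

# Rebuild the body of msg_chatglm3.json programmatically
# (system prompt abbreviated; the full text is in the JSON above).
payload = {
    "messages": [
        {"role": "system", "content": "You will play the role of an interviewer ..."},
        {"role": "user", "content": "你好"},
    ],
    "model": "chatglm3-32k",
    "max_tokens": 512,
    "stream": False,
    "temperature": 0.01,
    "top_p": 1,
}
body = json.dumps(payload, ensure_ascii=False)
print(body)

# Equivalent to the curl invocation (requires the Xinference server to be up):
# import urllib.request
# req = urllib.request.Request(
#     "http://172.16.1.76:9998/v1/chat/completions",
#     data=body.encode("utf-8"),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode("utf-8"))
```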
2024-03-07 02:55:07,158 xinference.core.supervisor 80 DEBUG Enter describe_model, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f5042acc900>, 'chatglm3-32k'), kwargs: {}
2024-03-07 02:55:07,158 xinference.core.worker 80 DEBUG Enter describe_model, args: (<xinference.core.worker.WorkerActor object at 0x7f5042b189f0>,), kwargs: {'model_uid': 'chatglm3-32k-1-0'}
2024-03-07 02:55:07,158 xinference.core.worker 80 DEBUG Leave describe_model, elapsed time: 0 s
2024-03-07 02:55:07,158 xinference.core.supervisor 80 DEBUG Leave describe_model, elapsed time: 0 s
2024-03-07 02:55:07,165 xinference.core.supervisor 80 DEBUG Enter get_model, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f5042acc900>, 'chatglm3-32k'), kwargs: {}
2024-03-07 02:55:07,166 xinference.core.worker 80 DEBUG Enter get_model, args: (<xinference.core.worker.WorkerActor object at 0x7f5042b189f0>,), kwargs: {'model_uid': 'chatglm3-32k-1-0'}
2024-03-07 02:55:07,166 xinference.core.worker 80 DEBUG Leave get_model, elapsed time: 0 s
2024-03-07 02:55:07,166 xinference.core.supervisor 80 DEBUG Leave get_model, elapsed time: 0 s
2024-03-07 02:55:07,166 xinference.core.supervisor 80 DEBUG Enter describe_model, args: (<xinference.core.supervisor.SupervisorActor object at 0x7f5042acc900>, 'chatglm3-32k'), kwargs: {}
2024-03-07 02:55:07,166 xinference.core.worker 80 DEBUG Enter describe_model, args: (<xinference.core.worker.WorkerActor object at 0x7f5042b189f0>,), kwargs: {'model_uid': 'chatglm3-32k-1-0'}
2024-03-07 02:55:07,166 xinference.core.worker 80 DEBUG Leave describe_model, elapsed time: 0 s
2024-03-07 02:55:07,166 xinference.core.supervisor 80 DEBUG Leave describe_model, elapsed time: 0 s
2024-03-07 02:55:07,168 xinference.core.model 99 DEBUG Enter wrapped_func, args: (<xinference.core.model.ModelActor object at 0x7f88c9563f60>, '你好', "You will play the role of an interviewer for a technology company, examining the user's web front-end development skills and posing 5-10 sharp technical questions.\n\nPlease note:\n- Only ask one question at a time.\n- After the user answers a question, ask the next question directly, without trying to correct any mistakes made by the candidate.\n- If you think the user has not answered correctly for several consecutive questions, ask fewer questions.\n- After asking the last question, you can ask this question: Why did you leave your last job? After the user answers this question, please express your understanding and support.\n", [], {'max_tokens': 512, 'temperature': 0.01, 'top_p': 1.0, 'stream': True}), kwargs: {}
2024-03-07 02:55:07,168 xinference.core.model 99 DEBUG Request chat, current serve request count: 0, request limit: None for the model chatglm3-32k
2024-03-07 02:55:07,169 xinference.core.model 99 DEBUG After request chat, current serve request count: 0 for the model chatglm3-32k
2024-03-07 02:55:07,169 xinference.core.model 99 DEBUG Leave wrapped_func, elapsed time: 0 s
--- Logging error ---
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/logging/handlers.py", line 73, in emit
if self.shouldRollover(record):
File "/opt/conda/lib/python3.10/logging/handlers.py", line 196, in shouldRollover
msg = "%s\n" % self.format(record)
File "/opt/conda/lib/python3.10/logging/__init__.py", line 943, in format
return fmt.format(record)
File "/opt/conda/lib/python3.10/logging/__init__.py", line 678, in format
record.message = record.getMessage()
File "/opt/conda/lib/python3.10/logging/__init__.py", line 368, in getMessage
msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
File "/opt/conda/lib/python3.10/threading.py", line 973, in _bootstrap
self._bootstrap_inner()
File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/opt/conda/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.10/concurrent/futures/thread.py", line 83, in _worker
work_item.run()
File "/opt/conda/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/opt/conda/lib/python3.10/site-packages/xoscar/api.py", line 402, in _wrapper
return next(_gen)
File "/opt/conda/lib/python3.10/site-packages/xinference/core/model.py", line 257, in _to_json_generator
for v in gen:
File "/opt/conda/lib/python3.10/site-packages/xinference/model/llm/utils.py", line 470, in _to_chat_completion_chunks
for i, chunk in enumerate(chunks):
File "/opt/conda/lib/python3.10/site-packages/xinference/model/llm/pytorch/chatglm.py", line 149, in _stream_generator
for chunk_text, _ in self._model.stream_chat(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 35, in generator_context
response = gen.send(None)
File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-32k-pytorch-6b/modeling_chatglm.py", line 1072, in stream_chat
for outputs in self.stream_generate(**inputs, past_key_values=past_key_values,
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 35, in generator_context
response = gen.send(None)
File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-32k-pytorch-6b/modeling_chatglm.py", line 1121, in stream_generate
logger.warn(
Message: 'Both `max_new_tokens` (=512) and `max_length`(=520) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)'
Arguments: (<class 'UserWarning'>,)
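The "Logging error" appears to originate in the model's own bundled code, not in Xinference's serving layer: the traceback ends at `modeling_chatglm.py:1121`, where `logger.warn(...)` is called with `UserWarning` as an extra positional argument (the `warnings.warn` signature, not the `logging` one). Since the message contains no `%`-placeholders, `logging` fails when it formats `msg % args`. A minimal sketch reproducing the same `TypeError`:

```python
import logging

# Reconstruct the LogRecord that logging builds when model code calls
# logger.warn(message, UserWarning): the message has no %-placeholders,
# yet UserWarning lands in record.args, so msg % args must fail.
record = logging.LogRecord(
    name="modeling_chatglm",
    level=logging.WARNING,
    pathname="modeling_chatglm.py",
    lineno=1121,
    msg="Both `max_new_tokens` (=512) and `max_length`(=520) seem to have been set.",
    args=(UserWarning,),
    exc_info=None,
)

try:
    # logging formats the message as: msg % args
    record.getMessage()
except TypeError as e:
    print(e)  # not all arguments converted during string formatting
```

If this reading is right, the fix belongs in the chatglm3 model repository (its remote code), while the underlying `max_new_tokens` vs `max_length` warning itself is harmless.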
Steps to reproduce:
- Deploy the model chatglm3-32k.
- POST the message above via the command line (curl).
- A response is returned, but the backend logs the error above.

Could you please help confirm whether this is a problem with the model itself or with Xinference?