ValueError: `temperature` (=0.0) has to be a strictly positive float, otherwise your next token scores will be invalid. If you're looking for greedy decoding strategies, set `do_sample=False`. #1970
2024-07-29 20:19:15,166 transformers.generation.utils 38069 WARNING Both `max_new_tokens` (=4096) and `max_length`(=8192) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Exception in thread Thread-2 (generate):
Traceback (most recent call last):
File "/home/fangshun/miniconda3/envs/inference/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
self.run()
File "/home/fangshun/miniconda3/envs/inference/lib/python3.10/threading.py", line 946, in run
self._target(*self._args, **self._kwargs)
File "/home/fangshun/miniconda3/envs/inference/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/fangshun/miniconda3/envs/inference/lib/python3.10/site-packages/transformers/generation/utils.py", line 1900, in generate
self._get_logits_warper(generation_config, device=input_ids.device)
File "/home/fangshun/miniconda3/envs/inference/lib/python3.10/site-packages/transformers/generation/utils.py", line 761, in _get_logits_warper
warpers.append(TemperatureLogitsWarper(generation_config.temperature))
File "/home/fangshun/miniconda3/envs/inference/lib/python3.10/site-packages/transformers/generation/logits_process.py", line 292, in __init__
raise ValueError(except_msg)
ValueError: `temperature` (=0.0) has to be a strictly positive float, otherwise your next token scores will be invalid. If you're looking for greedy decoding strategies, set `do_sample=False`.
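The traceback above comes from `TemperatureLogitsWarper`, which divides the logits by `temperature` before sampling, so a value of 0.0 would divide by zero. The snippet below is a minimal, hypothetical re-implementation of that check (not the library's actual code) to illustrate why the value must be strictly positive when `do_sample` is on:

```python
# Hypothetical sketch of the transformers temperature check, for illustration
# only. Sampling scales logits by 1/temperature, so temperature must be a
# strictly positive float; greedy decoding (do_sample=False) skips this path.

def validate_temperature(temperature: float) -> float:
    if not isinstance(temperature, float) or temperature <= 0.0:
        raise ValueError(
            f"`temperature` (={temperature}) has to be a strictly positive "
            "float; for greedy decoding, set `do_sample=False` instead."
        )
    return temperature

def scale_logits(logits: list[float], temperature: float) -> list[float]:
    # temperature -> 0 would blow up the division, hence the hard error.
    t = validate_temperature(temperature)
    return [x / t for x in logits]
```

This is why a client that sends `temperature: 0.0` while the server keeps `do_sample=True` triggers the `ValueError` regardless of which GPU layout is used.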
llm:
api_type: "openai" # or azure / ollama / groq etc. Check LLMType for more options
model: "glm4-chat"
base_url: "http://10.168.3.164:9997/v1" # or forward url / other llm url
api_key: "EMPT"
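Until the server maps `temperature=0` to greedy decoding, one client-side workaround is to clamp a non-positive temperature to a tiny positive value before sending the request. The sketch below builds an OpenAI-style `/v1/chat/completions` request body for the `glm4-chat` model from the config above; the `1e-4` floor is an arbitrary choice, not an official recommendation:

```python
# Workaround sketch (assumption: any small positive temperature avoids the
# server-side ValueError). The payload fields follow the OpenAI chat
# completions request format.

def build_chat_payload(model: str, prompt: str, temperature: float = 0.0) -> dict:
    # Clamp non-positive temperatures so TemperatureLogitsWarper never sees 0.0.
    safe_temperature = max(temperature, 1e-4)
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": safe_temperature,
    }

payload = build_chat_payload("glm4-chat", "Hello", temperature=0.0)
print(payload["temperature"])  # 0.0001
```

Frameworks such as MetaGPT or GraphRAG that hard-code `temperature: 0.0` would need this clamp applied wherever they construct the request.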
System Info
Xinference is currently at 0.13.3 and transformers at 4.42.1; the GPUs are a 4090 and a 3090. With the transformers engine, deploying the GLM4-chat model on either a single GPU or two GPUs in parallel and accessing it through the OpenAI API (e.g., from MetaGPT or GraphRAG) always raises the exception shown above.
If I switch to the vLLM engine instead, everything works. However, vLLM does not split the model's VRAM usage evenly across the two cards the way the transformers engine does; each card ends up with its 24 GB nearly full. So I would much rather get GLM4-chat 9B running properly on two GPUs with the transformers engine.
I tried setting the do_sample parameter to false in ~/.xinference/cache/glm4-chat-pytorch-9b/generation_config.json, but it made no difference. Any help from the community would be appreciated, thanks!
Running Xinference with Docker?
Version info
Xinference 0.13.3, transformers 4.42.1
The command used to start Xinference
xinference-local -H 0.0.0.0
Reproduction
4. Run the test case:
Expected behavior
I hope that GLM4-chat 9B deployed on Xinference with the transformers engine can serve the OpenAI API normally, the same way it does with the vLLM engine.