xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

ValueError: `temperature` (=0.0) has to be a strictly positive float, otherwise your next token scores will be invalid. If you're looking for greedy decoding strategies, set `do_sample=False`. #1970

Open | readbyte-ai opened this issue 2 months ago

readbyte-ai commented 2 months ago

System Info

Xinference is currently at 0.13.3 and transformers at 4.42.1; the GPUs are a 4090 and a 3090. When the transformers engine is used:

xinference launch --model-engine transformers -u glm4-chat -n glm4-chat -s 9 -f pytorch --max_model_len 16608 --gpu-idx 0,1

then no matter whether GLM4-chat is deployed on a single card or on both cards in parallel, every request through the OpenAI API (e.g., from MetaGPT or GraphRAG) fails with the following exception:

2024-07-29 20:19:15,166 transformers.generation.utils 38069 WARNING  Both `max_new_tokens` (=4096) and `max_length`(=8192) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Exception in thread Thread-2 (generate):
Traceback (most recent call last):
  File "/home/fangshun/miniconda3/envs/inference/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
    self.run()
  File "/home/fangshun/miniconda3/envs/inference/lib/python3.10/threading.py", line 946, in run
    self._target(*self._args, **self._kwargs)
  File "/home/fangshun/miniconda3/envs/inference/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/fangshun/miniconda3/envs/inference/lib/python3.10/site-packages/transformers/generation/utils.py", line 1900, in generate
    self._get_logits_warper(generation_config, device=input_ids.device)
  File "/home/fangshun/miniconda3/envs/inference/lib/python3.10/site-packages/transformers/generation/utils.py", line 761, in _get_logits_warper
    warpers.append(TemperatureLogitsWarper(generation_config.temperature))
  File "/home/fangshun/miniconda3/envs/inference/lib/python3.10/site-packages/transformers/generation/logits_process.py", line 292, in __init__
    raise ValueError(except_msg)
ValueError: `temperature` (=0.0) has to be a strictly positive float, otherwise your next token scores will be invalid. If you're looking for greedy decoding strategies, set `do_sample=False`.
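
The temperature check itself lives in transformers and can be reproduced in isolation, without loading any model (the traceback above ends in exactly this constructor):

# Minimal reproduction of the transformers-side check; no model needed.
from transformers.generation.logits_process import TemperatureLogitsWarper

TemperatureLogitsWarper(0.7)  # fine: strictly positive float
TemperatureLogitsWarper(0.0)  # raises the ValueError quoted above

In other words, as long as the generation path keeps do_sample=True, any request that reaches generate() with temperature=0.0 will fail; either a strictly positive temperature or do_sample=False has to make it through.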

If I switch to the vLLM engine instead, everything works. But unlike the transformers engine, vLLM does not split the model's memory footprint evenly across the two cards; it nearly fills the 24 GB on each of them. So I would much prefer to get GLM4-chat 9B working properly on two cards with the transformers engine.
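
For what it's worth, the vLLM memory usage looks expected rather than broken: vLLM pre-allocates most of each card's memory for its KV cache up front (its gpu_memory_utilization setting defaults to 0.9). Assuming Xinference forwards extra engine arguments to vLLM as its docs suggest, something like the following should cap the allocation (an untested sketch):

xinference launch --model-engine vllm -u glm4-chat -n glm4-chat -s 9 -f pytorch --gpu-idx 0,1 --gpu_memory_utilization 0.5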

I tried changing the do_sample parameter to false in ~/.xinference/cache/glm4-chat-pytorch-9b/generation_config.json, but it made no difference. Any help from the community experts would be appreciated, thanks!
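
For reference, the change was just this one key, with every other key in the file left untouched:

{
  "do_sample": false
}

Since the error still reports `temperature` (=0.0), the per-request parameters sent over the OpenAI API apparently take precedence over this cached config, which would explain why the edit had no effect.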

Running Xinference with Docker?

Version info

Xinference 0.13.3, transformers 4.42.1.

The command used to start Xinference

xinference-local -H 0.0.0.0

Reproduction

  1. xinference launch --model-engine transformers -u glm4-chat -n glm4-chat -s 9 -f pytorch --max_model_len 16608
  2. Install MetaGPT or deploy GraphRAG, e.g.: pip install --upgrade metagpt
  3. Point ~/.metagpt/config2.yaml at the Xinference address, e.g.:
llm:
  api_type: "openai"  # or azure / ollama / groq etc. Check LLMType for more options
  model: "glm4-chat"
  base_url: "http://10.168.3.164:9997/v1"  # or forward url / other llm url
  api_key: "EMPT"

  4. Run the test case:

metagpt "Create a 2048 game"

Expected behavior

I hope GLM4-chat 9B deployed through the transformers engine on Xinference can serve the OpenAI API normally, just as it does with the vLLM engine.

qinxuye commented 2 months ago

We will see if we can reproduce the error.