xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

ValueError: `temperature` (=0.0) has to be a strictly positive float, otherwise your next token scores will be invalid. If you're looking for greedy decoding strategies, set `do_sample=False`. #1970

Open | readbyte-ai opened this issue 2 months ago

readbyte-ai commented 2 months ago

System Info

Xinference is currently at 0.13.3 and transformers at 4.42.1; the GPUs are a 4090 and a 3090. When the transformers engine is used:

xinference launch --model-engine transformers -u glm4-chat -n glm4-chat -s 9 -f pytorch --max_model_len 16608 --gpu-idx 0,1

then no matter whether GLM4-chat is deployed on a single card or on both cards in parallel, every request through the OpenAI API (e.g., from MetaGPT or GraphRAG) fails with the following exception:

2024-07-29 20:19:15,166 transformers.generation.utils 38069 WARNING  Both `max_new_tokens` (=4096) and `max_length`(=8192) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
Exception in thread Thread-2 (generate):
Traceback (most recent call last):
  File "/home/fangshun/miniconda3/envs/inference/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
    self.run()
  File "/home/fangshun/miniconda3/envs/inference/lib/python3.10/threading.py", line 946, in run
    self._target(*self._args, **self._kwargs)
  File "/home/fangshun/miniconda3/envs/inference/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/fangshun/miniconda3/envs/inference/lib/python3.10/site-packages/transformers/generation/utils.py", line 1900, in generate
    self._get_logits_warper(generation_config, device=input_ids.device)
  File "/home/fangshun/miniconda3/envs/inference/lib/python3.10/site-packages/transformers/generation/utils.py", line 761, in _get_logits_warper
    warpers.append(TemperatureLogitsWarper(generation_config.temperature))
  File "/home/fangshun/miniconda3/envs/inference/lib/python3.10/site-packages/transformers/generation/logits_process.py", line 292, in __init__
    raise ValueError(except_msg)
ValueError: `temperature` (=0.0) has to be a strictly positive float, otherwise your next token scores will be invalid. If you're looking for greedy decoding strategies, set `do_sample=False`.
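
The temperature check itself lives in transformers and can be reproduced in isolation, without loading any model (the traceback above ends in exactly this constructor):

# Minimal reproduction of the transformers-side check; no model needed.
from transformers.generation.logits_process import TemperatureLogitsWarper

TemperatureLogitsWarper(0.7)  # fine: strictly positive float
TemperatureLogitsWarper(0.0)  # raises the ValueError quoted above

In other words, as long as the generation path keeps do_sample=True, any request that reaches generate() with temperature=0.0 will fail; either a strictly positive temperature or do_sample=False has to make it through.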

If I switch to the vLLM engine instead, everything works. But unlike the transformers engine, vLLM does not split the model's memory footprint evenly across the two cards; it nearly fills the 24 GB on each of them. So I would much prefer to get GLM4-chat 9B working properly on two cards with the transformers engine.
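
For what it's worth, the vLLM memory usage looks expected rather than broken: vLLM pre-allocates most of each card's memory for its KV cache up front (its gpu_memory_utilization setting defaults to 0.9). Assuming Xinference forwards extra engine arguments to vLLM as its docs suggest, something like the following should cap the allocation (an untested sketch):

xinference launch --model-engine vllm -u glm4-chat -n glm4-chat -s 9 -f pytorch --gpu-idx 0,1 --gpu_memory_utilization 0.5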

I tried changing the do_sample parameter to false in ~/.xinference/cache/glm4-chat-pytorch-9b/generation_config.json, but it made no difference. Any help from the community experts would be appreciated, thanks!
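
For reference, the change was just this one key, with every other key in the file left untouched:

{
  "do_sample": false
}

Since the error still reports `temperature` (=0.0), the per-request parameters sent over the OpenAI API apparently take precedence over this cached config, which would explain why the edit had no effect.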

Running Xinference with Docker?

Version info

Xinference 0.13.3, transformers 4.42.1.

The command used to start Xinference

xinference-local -H 0.0.0.0

Reproduction

  1. xinference launch --model-engine transformers -u glm4-chat -n glm4-chat -s 9 -f pytorch --max_model_len 16608
  2. Install MetaGPT or deploy GraphRAG, e.g.: pip install --upgrade metagpt
  3. Point ~/.metagpt/config2.yaml at the Xinference address, e.g.:
llm:
  api_type: "openai"  # or azure / ollama / groq etc. Check LLMType for more options
  model: "glm4-chat"
  base_url: "http://10.168.3.164:9997/v1"  # or forward url / other llm url
  api_key: "EMPT"

  4. Run the test case:

metagpt "Create a 2048 game"

Expected behavior

I hope GLM4-chat 9B deployed through the transformers engine on Xinference can serve the OpenAI API normally, just as it does with the vLLM engine.

qinxuye commented 2 months ago

We will see if we can reproduce the error.