xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

Both `max_new_tokens` (=512) and `max_length` (=518) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. #1872

Open fpy10 opened 1 month ago

fpy10 commented 1 month ago

System Info

```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:36:15_Pacific_Daylight_Time_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
```

```
torch                    2.3.1+cu121
torchaudio               2.3.1+cu121
torchvision              0.18.1
vector-quantize-pytorch  1.15.3
```

Running Xinference with Docker?

Version info

pip install "xinference[transformers]"

The command used to start Xinference

xinference-local
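
For reference, a minimal sketch of reaching the same state from Python once `xinference-local` is up, assuming the default endpoint `http://127.0.0.1:9997` and the built-in `glm4-chat` model discussed below (exact launch parameters vary slightly across Xinference versions):

```python
from xinference.client import Client

# Connect to the locally running Xinference server (default port 9997).
client = Client("http://127.0.0.1:9997")

# Launch the built-in glm4-chat model; format and size follow the thread below.
model_uid = client.launch_model(
    model_name="glm4-chat",
    model_format="pytorch",
    model_size_in_billions=9,
)

# First chat turn -- the error in the next section appears after this call.
model = client.get_model(model_uid)
print(model.chat("你好"))
```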

Reproduction

After the first exchange with GLM4 the run errors out and the conversation cannot continue:

```
--- Logging error ---
Traceback (most recent call last):
  File "C:\Users\87952\miniconda3\envs\xinference\lib\logging\handlers.py", line 73, in emit
    if self.shouldRollover(record):
  File "C:\Users\87952\miniconda3\envs\xinference\lib\logging\handlers.py", line 196, in shouldRollover
    msg = "%s\n" % self.format(record)
  File "C:\Users\87952\miniconda3\envs\xinference\lib\logging\__init__.py", line 943, in format
    return fmt.format(record)
  File "C:\Users\87952\miniconda3\envs\xinference\lib\logging\__init__.py", line 678, in format
    record.message = record.getMessage()
  File "C:\Users\87952\miniconda3\envs\xinference\lib\logging\__init__.py", line 368, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "C:\Users\87952\miniconda3\envs\xinference\lib\threading.py", line 973, in _bootstrap
    self._bootstrap_inner()
  File "C:\Users\87952\miniconda3\envs\xinference\lib\threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "C:\Users\87952\miniconda3\envs\xinference\lib\threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\87952\miniconda3\envs\xinference\lib\concurrent\futures\thread.py", line 83, in _worker
    work_item.run()
  File "C:\Users\87952\miniconda3\envs\xinference\lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "C:\Users\87952\miniconda3\envs\xinference\lib\site-packages\xoscar\api.py", line 402, in _wrapper
    return next(_gen)
  File "C:\Users\87952\miniconda3\envs\xinference\lib\site-packages\xinference\core\model.py", line 318, in _to_json_generator
    for v in gen:
  File "C:\Users\87952\miniconda3\envs\xinference\lib\site-packages\xinference\model\llm\utils.py", line 558, in _to_chat_completion_chunks
    for i, chunk in enumerate(chunks):
  File "C:\Users\87952\miniconda3\envs\xinference\lib\site-packages\xinference\model\llm\pytorch\chatglm.py", line 259, in _stream_generator
    for chunk_text, _ in self._model.stream_chat(
  File "C:\Users\87952\miniconda3\envs\xinference\lib\site-packages\torch\utils\_contextlib.py", line 35, in generator_context
    response = gen.send(None)
  File "C:\Users\Administrator\.cache\huggingface\modules\transformers_modules\glm4-chat-pytorch-9b\modeling_chatglm.py", line 1139, in stream_chat
    for outputs in self.stream_generate(inputs, past_key_values=past_key_values,
  File "C:\Users\87952\miniconda3\envs\xinference\lib\site-packages\torch\utils\_contextlib.py", line 35, in generator_context
    response = gen.send(None)
  File "C:\Users\Administrator\.cache\huggingface\modules\transformers_modules\glm4-chat-pytorch-9b\modeling_chatglm.py", line 1188, in stream_generate
    logger.warn(
Message: 'Both max_new_tokens (=512) and max_length(=518) seem to have been set. max_new_tokens will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)'
Arguments: (<class 'UserWarning'>,)
C:\Users\Administrator\.cache\huggingface\modules\transformers_modules\glm4-chat-pytorch-9b\modeling_chatglm.py:271: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
  context_layer = torch.nn.functional.scaled_dot_product_attention(query_layer, key_layer, value_layer,
```

The following traceback is then printed repeatedly, with different `_overlapped.Overlapped` object addresses (0x000002357017F900, 0x000001B41194EAF0, 0x0000023570BDB750):

```
Traceback (most recent call last):
  File "C:\Users\87952\miniconda3\envs\xinference\lib\asyncio\windows_events.py", line 444, in select
    self._poll(timeout)
RuntimeError: <_overlapped.Overlapped object at 0x000002357017F900> still has pending operation at deallocation, the process may crash
```
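
For context, the `--- Logging error ---` block is a secondary failure in the logging call itself, not the crash. Judging from the `Arguments: (<class 'UserWarning'>,)` line, `modeling_chatglm.py` calls `logger.warn(message, UserWarning)` with a `warnings.warn`-style second argument, so the extra class ends up as a %-format argument for a message that has no placeholders. A minimal sketch reproducing that behavior:

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("demo")

# warnings.warn-style call applied to a logging.Logger: the extra
# UserWarning lands in record.args, and getMessage() then evaluates
# "msg % self.args" on a message with no % placeholders, raising
# "TypeError: not all arguments converted during string formatting"
# inside the handler (printed as "--- Logging error ---").
logger.warning("max_new_tokens will take precedence.", UserWarning)
```

The logging module catches this TypeError and keeps running, so the broken log record by itself does not stop generation.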

Expected behavior

The model should run normally.

qinxuye commented 1 month ago

The message in the title is only a warning; it does not affect usage.

What format and what quantization is your GLM4? Also, please share the output of nvidia-smi.
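
If the warning is bothersome anyway, one way to make the length limit unambiguous is to set `max_tokens` explicitly per request; a sketch assuming the standard Xinference Python client, with a hypothetical model UID:

```python
from xinference.client import Client

client = Client("http://127.0.0.1:9997")
model = client.get_model("glm4-chat-uid")  # hypothetical UID returned by launch_model

# Xinference forwards max_tokens as max_new_tokens, so the output length
# is set explicitly rather than left to two competing defaults.
completion = model.chat(
    "你好",
    generate_config={"max_tokens": 512},
)
print(completion["choices"][0]["message"]["content"])
```

Note this may not silence a warning emitted from the model's own bundled generation config; per the above, it is harmless either way.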

fpy10 commented 1 month ago

> The message in the title is only a warning; it does not affect usage.
>
> What format and what quantization is your GLM4? Also, please share the output of nvidia-smi.

(screenshot: 微信截图_20240716132533) It is glm4-chat, the default download from xinference. But after the first message to GLM4 it hangs, and further questions get no response. (screenshot: 微信截图_20240716133054)
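
On the hang itself: the repeated `_overlapped.Overlapped ... still has pending operation at deallocation` errors come from asyncio's default proactor event loop on Windows. A speculative, unverified mitigation is to force the selector-based loop instead; note it only takes effect if it runs inside the Xinference process before any event loop is created, e.g. in a small launcher script that starts Xinference in-process rather than via the `xinference-local` executable:

```python
import asyncio
import sys

# Speculative workaround (not verified for this issue): avoid the Windows
# proactor event loop, whose _overlapped handles produced the RuntimeErrors
# above, by switching to the selector-based event loop policy.
if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
```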

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 7 days with no activity.

koko426 commented 1 month ago

me too

fpy10 commented 3 weeks ago

Same here. Did you solve this problem? Can you tell me how you solved it?