Open mufenzhimi opened 3 months ago
../aten/src/ATen/native/cuda/Indexing.cu:1236: indexSelectSmallIndex: block: [19,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
2024-07-12 07:16:15,567 xinference.api.restful_api 92661 ERROR [address=0.0.0.0:40059, pid=93726] CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/xinference/api/restful_api.py", line 1566, in create_chat_completion
data = await model.chat(
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 227, in send
return self._process_result_message(result)
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 659, in send
result = await self._run_coro(message.message_id, coro)
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 370, in _run_coro
return await coro
File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 384, in on_receive
return await super().on_receive(message) # type: ignore
File "xoscar/core.pyx", line 558, in on_receive__
raise ex
File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.on_receive
async with self._lock:
File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.on_receive
with debug_async_timeout('actor_lock_timeout',
File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive
result = await result
File "/usr/local/lib/python3.10/dist-packages/xinference/core/utils.py", line 45, in wrapped
ret = await func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 90, in wrapped_func
ret = await fn(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 462, in _wrapper
r = await func(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 505, in chat
response = await self._call_wrapper(
File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 114, in _async_wrapper
return await fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 388, in _call_wrapper
ret = await asyncio.to_thread(fn, *args, **kwargs)
File "/usr/lib/python3.10/asyncio/threads.py", line 25, in to_thread
return await loop.run_in_executor(None, func_call)
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/pytorch/chatglm.py", line 315, in chat
response = self._model.chat(self._tokenizer, prompt, chat_history, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat-1m/modeling_chatglm.py", line 1104, in chat
outputs = self.generate(**inputs, **gen_kwargs, eos_token_id=eos_token_id)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1622, in generate
result = self._sample(
File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2791, in _sample
outputs = self(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat-1m/modeling_chatglm.py", line 1005, in forward
transformer_outputs = self.transformer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat-1m/modeling_chatglm.py", line 901, in forward
hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat-1m/modeling_chatglm.py", line 726, in forward
layer_ret = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat-1m/modeling_chatglm.py", line 647, in forward
layernorm_output = self.post_attention_layernorm(layernorm_input)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/glm-4-9b-chat-1m/modeling_chatglm.py", line 164, in forward
hidden_states = hidden_states * torch.rsqrt(variance + self.eps)
RuntimeError: [address=0.0.0.0:40059, pid=93726] CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
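Note for anyone debugging this: the indexSelectSmallIndex assertion at the top fires when an index handed to a CUDA embedding / index_select lookup is out of range, which for a language model usually means a token id that is >= the size of the embedding table. A minimal, hypothetical repro of the same device-side assert, independent of this model:

import torch

emb = torch.nn.Embedding(num_embeddings=10, embedding_dim=4).cuda()  # table with 10 rows on the GPU
ids = torch.tensor([12], device="cuda")  # 12 is out of range for a table of size 10
out = emb(ids)  # launches the kernel that asserts srcIndex < srcSelectDimSize
torch.cuda.synchronize()  # the "device-side assert triggered" error surfaces at the next sync point

Starting the server with CUDA_LAUNCH_BLOCKING=1 set in the environment, as the error text suggests, makes kernels report synchronously, so the traceback points at the real failing call instead of a later layernorm.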
When I use Dify together with Xinference and call the glm4-9b model through the Dify API, I also get a similar error. Not sure whether it's the same as yours?
Same error here, how did you solve it? The model is model = client.get_model("glm-4v-9b"), I pass in a base64-encoded image, and it fails with:
RuntimeError: Failed to generate chat completion, detail: [address=0.0.0.0:43623, pid=87138] CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
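For anyone trying to reproduce this, a rough sketch of the kind of request being described, going through the OpenAI-compatible endpoint that Xinference exposes; the image path, prompt, and exact message schema are illustrative assumptions, not taken from this report:

import base64
import openai

# Xinference serves an OpenAI-compatible API under /v1
client = openai.OpenAI(api_key="not-needed", base_url="http://localhost:9997/v1")

with open("example.jpg", "rb") as f:  # hypothetical local image
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="glm-4v-9b",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)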
Not solved yet, waiting for an official reply.
Not solved yet, waiting for an official reply. In my case it's a rerank model in Dify.
This issue is stale because it has been open for 7 days with no activity.
I'm hitting the same problem and don't know the cause or a fix.
System Info
Linux, single RTX 4090D GPU
Running Xinference with Docker?
Version info
0.12.0
The command used to start Xinference
XINFERENCE_MODEL_SRC=modelscope xinference-local --host 0.0.0.0 --port 9997
Reproduction
Expected behavior
Fix the bug.