xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

Deploying glm-4-chat-1m with transformers on xinference: function call requests sent via langchain + langchain-openai fail when the response is parsed #2257

Open pandaTED opened 1 week ago

pandaTED commented 1 week ago

System Info / 系統信息

xinference 0.15, langchain 0.2.14, langchain-core 0.2.35, langchain-experimental 0.0.58, langchain-openai 0.1.22

Running Xinference with Docker? / 是否使用 Docker 运行 Xinference?

Yes, via docker-compose (see below).

Version info / 版本信息

xinference 0.15, langchain 0.2.14, langchain-core 0.2.35, langchain-experimental 0.0.58, langchain-openai 0.1.22

The command used to start Xinference / 用以启动 xinference 的命令

docker-compose file:

services:
  xinference:
    image: xprobe/xinference:v0.15.0
    restart: always
    command: xinference-local -H 0.0.0.0
    ports:  # can be opened when not using host network
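For reference, a minimal sketch of how the model might then be launched against the running container, assuming the default Xinference port 9997 and the xinference.client Python API; the host/port below is an assumption, while the model name and engine mirror the report.

    # A minimal sketch, assuming the server is reachable on the default port 9997.
    from xinference.client import Client

    client = Client("http://localhost:9997")   # assumed host/port, not from the report
    model_uid = client.launch_model(
        model_name="glm4-chat-1m",             # the model deployed in the report
        model_engine="transformers",           # the report uses the transformers engine
    )
    print(model_uid)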

Reproduction / 复现过程

  1. Deploy glm4-chat-1m with transformers on xinference 0.15.
  2. Call the Xinference HTTP service's ip:port/v1 endpoint with langchain + langchain-openai, using llm_with_tools (a sketch of this client follows the list).
  3. The response fails with: 1 validation error for AIMessage tool_calls -> 0 -> args value is not a valid dict (type=type_error.dict)
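For concreteness, a minimal sketch of steps 2 and 3 with langchain-openai's ChatOpenAI pointed at the Xinference endpoint; the example tool, question text, and placeholder host/key are illustrative assumptions, not taken from the report.

    from langchain_core.tools import tool
    from langchain_openai import ChatOpenAI

    @tool
    def get_weather(city: str) -> str:
        """Look up the weather for a city."""        # illustrative tool, not from the report
        return f"Sunny in {city}"

    # Point the OpenAI-compatible client at the Xinference /v1 endpoint.
    llm = ChatOpenAI(
        base_url="http://<xinference-host>:9997/v1", # placeholder address
        api_key="not-used",                          # any value works when auth is disabled
        model="glm4-chat-1m",
    )
    llm_with_tools = llm.bind_tools([get_weather])

    # On xinference 0.15 this call fails while the client parses the response:
    #   1 validation error for AIMessage
    #   tool_calls -> 0 -> args  value is not a valid dict (type=type_error.dict)
    result = llm_with_tools.invoke("What is the weather in Beijing?")
    print(result.tool_calls)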

Expected behavior / 期待表现

Other versions, such as xinference 0.14, return the output correctly.
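For context, the shape that langchain-core expects for each entry of AIMessage.tool_calls; the validation error above suggests the args field is arriving as something other than a parsed dict (for example a JSON string). The values below are purely illustrative.

    # What a well-formed tool call looks like to langchain-core (illustrative values):
    ok_tool_call = {
        "name": "get_weather",
        "args": {"city": "Beijing"},      # a dict -> passes AIMessage validation
        "id": "call_0",
    }

    # A response whose arguments are left as a raw string would trigger
    # "tool_calls -> 0 -> args  value is not a valid dict (type=type_error.dict)":
    bad_tool_call = {
        "name": "get_weather",
        "args": '{"city": "Beijing"}',    # a str, not a dict -> rejected
        "id": "call_0",
    }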

pandaTED commented 1 week ago

In addition, when concurrent requests are sent to the ip:port/v1 endpoint, the call fails with: Error code: 500 - {'detail': '[address=0.0.0.0:35395, pid=157] probability tensor contains either inf, nan or element < 0'}

Traceback (most recent call last):
  File "D:\GLM-4-main\function_calling_demo\src\langchainClient_duojincheng.py", line 552, in process_string
    result = llm_with_tools.invoke(wenti2)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\panda\miniconda3\envs\chatchat\Lib\site-packages\langchain_core\runnables\base.py", line 5092, in invoke
    return self.bound.invoke(
           ^^^^^^^^^^^^^^^^^^
  File "C:\Users\panda\miniconda3\envs\chatchat\Lib\site-packages\langchain_core\language_models\chat_models.py", line 276, in invoke
    self.generate_prompt(
  File "C:\Users\panda\miniconda3\envs\chatchat\Lib\site-packages\langchain_core\language_models\chat_models.py", line 776, in generate_prompt
    return self.generate(prompt_messages, stop=stop, callbacks=callbacks, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\panda\miniconda3\envs\chatchat\Lib\site-packages\langchain_core\language_models\chat_models.py", line 633, in generate
    raise e
  File "C:\Users\panda\miniconda3\envs\chatchat\Lib\site-packages\langchain_core\language_models\chat_models.py", line 623, in generate
    self._generate_with_cache(
  File "C:\Users\panda\miniconda3\envs\chatchat\Lib\site-packages\langchain_core\language_models\chat_models.py", line 845, in _generate_with_cache
    result = self._generate(
             ^^^^^^^^^^^^^^^
  File "C:\Users\panda\miniconda3\envs\chatchat\Lib\site-packages\langchain_openai\chat_models\base.py", line 649, in _generate
    response = self.client.create(**payload)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\panda\miniconda3\envs\chatchat\Lib\site-packages\openai\_utils\_utils.py", line 274, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\panda\miniconda3\envs\chatchat\Lib\site-packages\openai\resources\chat\completions.py", line 668, in create
    return self._post(
           ^^^^^^^^^^^
  File "C:\Users\panda\miniconda3\envs\chatchat\Lib\site-packages\openai\_base_client.py", line 1260, in post
    return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\panda\miniconda3\envs\chatchat\Lib\site-packages\openai\_base_client.py", line 937, in request
    return self._request(
           ^^^^^^^^^^^^^^
  File "C:\Users\panda\miniconda3\envs\chatchat\Lib\site-packages\openai\_base_client.py", line 1026, in _request
    return self._retry_request(
           ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\panda\miniconda3\envs\chatchat\Lib\site-packages\openai\_base_client.py", line 1075, in _retry_request
    return self._request(
           ^^^^^^^^^^^^^^
  File "C:\Users\panda\miniconda3\envs\chatchat\Lib\site-packages\openai\_base_client.py", line 1026, in _request
    return self._retry_request(
           ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\panda\miniconda3\envs\chatchat\Lib\site-packages\openai\_base_client.py", line 1075, in _retry_request
    return self._request(
           ^^^^^^^^^^^^^^
  File "C:\Users\panda\miniconda3\envs\chatchat\Lib\site-packages\openai\_base_client.py", line 1041, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.InternalServerError: Error code: 500 - {'detail': '[address=0.0.0.0:35395, pid=157] probability tensor contains either inf, nan or element < 0'}
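The client-side frames above come from langchainClient_duojincheng.py calling llm_with_tools.invoke in parallel; below is a minimal sketch of how such concurrent requests could be issued, reusing the llm_with_tools object from the earlier sketch. The worker count and question list are illustrative assumptions.

    from concurrent.futures import ThreadPoolExecutor

    # llm_with_tools is built as in the earlier sketch.
    # Illustrative inputs; the report only shows a variable named wenti2.
    questions = ["What is the weather in Beijing?"] * 8

    def process_string(question: str):
        # Every worker hits ip:port/v1 at the same time; on 0.15 this intermittently
        # returns 500 "probability tensor contains either inf, nan or element < 0".
        return llm_with_tools.invoke(question)

    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(process_string, questions))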

pandaTED commented 1 week ago

Below is the error reported by Xinference itself when the backend receives the concurrent requests:

xinference-1 | 2024-09-09 01:41:20,027 xinference.api.restful_api 1 ERROR [address=0.0.0.0:35395, pid=157] CUDA error: device-side assert triggered
xinference-1 | CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
xinference-1 | For debugging consider passing CUDA_LAUNCH_BLOCKING=1
xinference-1 | Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
xinference-1 | Traceback (most recent call last):
xinference-1 |   File "/usr/local/lib/python3.10/dist-packages/xinference/api/restful_api.py", line 1720, in create_chat_completion
xinference-1 |     data = await model.chat(
xinference-1 |   File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 231, in send
xinference-1 |     return self._process_result_message(result)
xinference-1 |   File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 102, in _process_result_message
xinference-1 |     raise message.as_instanceof_cause()
xinference-1 |   File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 656, in send
xinference-1 |     result = await self._run_coro(message.message_id, coro)
xinference-1 |   File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 367, in _run_coro
xinference-1 |     return await coro
xinference-1 |   File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 384, in __on_receive__
xinference-1 |     return await super().__on_receive__(message)  # type: ignore
xinference-1 |   File "xoscar/core.pyx", line 558, in __on_receive__
xinference-1 |     raise ex
xinference-1 |   File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
xinference-1 |     async with self._lock:
xinference-1 |   File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
xinference-1 |     with debug_async_timeout('actor_lock_timeout',
xinference-1 |   File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive
xinference-1 |     result = await result
xinference-1 |   File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 96, in wrapped_func
xinference-1 |     ret = await fn(self, *args, **kwargs)
xinference-1 |   File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 462, in _wrapper
xinference-1 |     r = await func(self, *args, **kwargs)
xinference-1 |   File "/usr/local/lib/python3.10/dist-packages/xinference/core/utils.py", line 69, in wrapped
xinference-1 |     ret = await func(*args, **kwargs)
xinference-1 |   File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 560, in chat
xinference-1 |     response = await self._call_wrapper_json(
xinference-1 |   File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 407, in _call_wrapper_json
xinference-1 |     return await self._call_wrapper("json", fn, *args, **kwargs)
xinference-1 |   File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 120, in _async_wrapper
xinference-1 |     return await fn(*args, **kwargs)
xinference-1 |   File "/usr/local/lib/python3.10/dist-packages/xinference/core/model.py", line 418, in _call_wrapper
xinference-1 |     ret = await asyncio.to_thread(fn, *args, **kwargs)
xinference-1 |   File "/usr/lib/python3.10/asyncio/threads.py", line 25, in to_thread
xinference-1 |     return await loop.run_in_executor(None, func_call)
xinference-1 |   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
xinference-1 |     result = self.fn(*self.args, **self.kwargs)
xinference-1 |   File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/transformers/chatglm.py", line 365, in chat
xinference-1 |     inputs = inputs.to(self._model.device)
xinference-1 |   File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 803, in to
xinference-1 |     self.data = {k: v.to(device=device) for k, v in self.data.items() if v is not None}
xinference-1 |   File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 803, in <dictcomp>
xinference-1 |     self.data = {k: v.to(device=device) for k, v in self.data.items() if v is not None}
xinference-1 | RuntimeError: [address=0.0.0.0:35395, pid=157] CUDA error: device-side assert triggered
xinference-1 | CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
xinference-1 | For debugging consider passing CUDA_LAUNCH_BLOCKING=1
xinference-1 | Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

github-actions[bot] commented 2 days ago

This issue is stale because it has been open for 7 days with no activity.