xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

When too many requests hit the Xinference API service, the local API hangs #1889

Open zhaozhizhuo opened 1 month ago

zhaozhizhuo commented 1 month ago

System Info

CUDA 12.4, Transformers framework

Running Xinference with Docker?

No; Xinference is started locally with xinference-local (see the command below).

Version info

xinference=0.13.0

The command used to start Xinference

xinference-local -H 0.0.0.0
xinference launch --model-name qwen0.5b-langchain --model-format pytorch --model-engine Transformers --gpu-idx 0,1,2

Reproduction

1. Start Xinference.
2. Launch the qwen2-14b model.
3. Call the model through the API.
4. Wrap the API calls in a Flask app and keep feeding it questions (see the sketch after this list).
5. The service hangs and the running program cannot be stopped; the xinference-local -H 0.0.0.0 process reports connection failures.
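For context, here is a minimal reconstruction of that setup, assuming the default Xinference endpoint (http://127.0.0.1:9997) and its OpenAI-compatible /v1/chat/completions route; the model UID "qwen2-14b", the /ask route, and the request volume are illustrative placeholders, not the reporter's actual code:

```python
# Hypothetical sketch of the reproduction setup; endpoint, model UID,
# and route name are assumptions, not the reporter's actual code.
import threading

import requests
from flask import Flask, jsonify, request

XINFERENCE_URL = "http://127.0.0.1:9997/v1/chat/completions"  # assumed default port

app = Flask(__name__)

@app.route("/ask", methods=["POST"])
def ask():
    # Forward the question to Xinference's OpenAI-compatible chat endpoint.
    resp = requests.post(
        XINFERENCE_URL,
        json={
            "model": "qwen2-14b",  # hypothetical model UID
            "messages": [{"role": "user", "content": request.json["question"]}],
        },
        timeout=120,
    )
    return jsonify(resp.json())

def hammer(n_requests: int = 100) -> None:
    # Fire many concurrent requests at the wrapper, mimicking the
    # "feed it questions continuously" step of the reproduction.
    def one_call(i: int) -> None:
        requests.post("http://127.0.0.1:5000/ask", json={"question": f"question {i}"})

    threads = [threading.Thread(target=one_call, args=(i,)) for i in range(n_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    app.run(port=5000)
```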

Expected behavior

Is this an Xinference problem? When I deploy the model locally instead, the error does not occur; the only difference between the two setups is the extra API call that goes through Xinference. If it is an Xinference problem, how can it be resolved?

zhaozhizhuo commented 1 month ago

2024-07-18 09:57:28,011 xinference.api.restful_api 196054 INFO Disconnected from client (via refresh/close) Address(host='127.0.0.1', port=48320) during chat.
2024-07-18 09:57:28,023 xinference.api.restful_api 196054 INFO Disconnected from client (via refresh/close) Address(host='127.0.0.1', port=48336) during chat.
2024-07-18 09:57:28,035 xinference.api.restful_api 196054 INFO Disconnected from client (via refresh/close) Address(host='127.0.0.1', port=56848) during chat.
2024-07-18 09:57:28,043 xinference.api.restful_api 196054 INFO Disconnected from client (via refresh/close) Address(host='127.0.0.1', port=41498) during chat.
2024-07-18 09:57:28,052 xinference.api.restful_api 196054 INFO Disconnected from client (via refresh/close) Address(host='127.0.0.1', port=53198) during chat.
2024-07-18 09:57:28,058 xinference.api.restful_api 196054 INFO Disconnected from client (via refresh/close) Address(host='127.0.0.1', port=60506) during chat.
2024-07-18 09:57:28,066 xinference.api.restful_api 196054 INFO Disconnected from client (via refresh/close) Address(host='127.0.0.1', port=40334) during chat.
2024-07-18 09:57:28,072 xinference.api.restful_api 196054 ERROR Chat completion stream got an error: invalid state
Traceback (most recent call last):
  File "/copydata2/zhaozhizhuo/anaconda3/envs/langchain-qwen-inference/lib/python3.11/site-packages/xinference/api/restful_api.py", line 1537, in stream_results
    async for item in iterator:
  File "/copydata2/zhaozhizhuo/anaconda3/envs/langchain-qwen-inference/lib/python3.11/site-packages/xoscar/api.py", line 340, in __anext__
    return await self._actor_ref.__xoscar_next__(self._uid)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/copydata2/zhaozhizhuo/anaconda3/envs/langchain-qwen-inference/lib/python3.11/site-packages/xoscar/backends/context.py", line 226, in send
    result = await self._wait(future, actor_ref.address, send_message)  # type: ignore
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/copydata2/zhaozhizhuo/anaconda3/envs/langchain-qwen-inference/lib/python3.11/site-packages/xoscar/backends/context.py", line 115, in _wait
    return await future
           ^^^^^^^^^^^^
  File "/copydata2/zhaozhizhuo/anaconda3/envs/langchain-qwen-inference/lib/python3.11/site-packages/xoscar/backends/core.py", line 88, in _listen
    future.set_result(message)
asyncio.exceptions.InvalidStateError: invalid state

qinxuye commented 1 month ago

The Transformers engine may not be able to handle such high concurrency; try the vLLM engine.
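For reference, the engine is selected at launch time. A minimal sketch using the xinference Python client, assuming the default endpoint and the builtin qwen2-instruct model (adjust the name and size to whatever model you have registered):

```python
# Sketch of relaunching on the vLLM engine via the Python client.
# Endpoint, model name, and size are assumptions; adapt to your setup.
from xinference.client import Client

client = Client("http://127.0.0.1:9997")  # assumed default endpoint
model_uid = client.launch_model(
    model_name="qwen2-instruct",  # assumed builtin model; yours may differ
    model_engine="vLLM",          # the engine switch suggested above
    model_format="pytorch",
    size_in_billions=7,
    quantization="none",
)
model = client.get_model(model_uid)
print(model.chat("hello"))
```

The equivalent CLI form is the xinference launch command already shown in this thread, with --model-engine vLLM instead of Transformers.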

zhaozhizhuo commented 1 month ago

OK, I'll give that a try. Thanks a lot!

zhaozhizhuo commented 1 month ago

The command xinference launch --model-name qwen0.5b-langchain --model-format pytorch --model-engine Transformers --gpu-idx 0,1,2 loads the model fine with the Transformers engine, but when I swap Transformers for vllm the launch fails. What could be the reason?

qinxuye commented 1 month ago

Is there an error traceback?

zhaozhizhuo commented 1 month ago

Thanks, I seem to have solved the problem. I was loading models and deploying Xinference from two different environments, so they probably could not communicate, and the model could not be loaded into Xinference.

zhaozhizhuo commented 1 month ago

Hi, when I enable batching with XINFERENCE_TRANSFORMERS_ENABLE_BATCHING=1, I cannot load a model with launch. I used the vLLM launch command: xinference launch --model-engine vLLM --model-name qwen2-7b-instruct --size-in-billions 7 --model-format pytorch --quantization none --gpu-idx 2,3,4,6. It allocates about 480 MB of VRAM on each card and then appears to hang with no further activity; it still had not launched after three or four hours.
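Two hedged observations: judging purely by its name, XINFERENCE_TRANSFORMERS_ENABLE_BATCHING targets the Transformers engine and should not be expected to affect a vLLM launch; and since the launch call blocks, a client-side timeout can at least surface the hang quickly instead of after hours. A minimal sketch assuming the default endpoint (this does not fix the underlying stall, whose cause should appear in the server logs):

```python
# Bail out of a blocked launch instead of waiting indefinitely.
# Endpoint and model parameters mirror the command above; the timeout
# value is an arbitrary assumption (--gpu-idx from the CLI is omitted).
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FuturesTimeout

from xinference.client import Client

def launch_with_timeout(timeout_s: float = 600.0) -> str:
    client = Client("http://127.0.0.1:9997")  # assumed default endpoint
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(
        client.launch_model,
        model_name="qwen2-7b-instruct",
        model_engine="vLLM",
        model_format="pytorch",
        size_in_billions=7,
        quantization="none",
    )
    try:
        return future.result(timeout=timeout_s)
    except FuturesTimeout:
        # The stuck worker thread keeps running in the background; this
        # only lets the caller report the failure promptly.
        raise RuntimeError("model launch timed out; check the server logs")
    finally:
        pool.shutdown(wait=False)
```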

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 7 days with no activity.