xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

Calling the /v1/chat/completions endpoint under a 10-thread JMeter load test, xinference crashes after one minute (xinference==0.11.3) #1811

Open WangxuP opened 2 months ago

WangxuP commented 2 months ago

Describe the bug

While load-testing xinference on two V100 GPUs with the qwen-14b-chat model, we found that calling the /v1/chat/completions endpoint with stream=True under a 10-thread JMeter load test crashes xinference after about one minute. With stream=False everything works fine.
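
For anyone reproducing this outside JMeter, here is a minimal sketch of an equivalent load test in Python. The endpoint URL and model UID are assumptions (adjust to your deployment); it fires 10 concurrent streaming requests in a loop, mirroring the setup described above:

```python
# Minimal load-test sketch: 10 concurrent streaming chat requests, repeated.
# ASSUMPTIONS: xinference's OpenAI-compatible API at http://localhost:9997/v1,
# model deployed under the UID "qwen-14b-chat". Adjust both for your setup.
import asyncio
import httpx

BASE_URL = "http://localhost:9997/v1"
MODEL_UID = "qwen-14b-chat"
CONCURRENCY = 10

async def one_stream(client: httpx.AsyncClient) -> None:
    payload = {
        "model": MODEL_UID,
        "stream": True,
        "messages": [{"role": "user", "content": "Write a short story."}],
    }
    async with client.stream(
        "POST", f"{BASE_URL}/chat/completions", json=payload, timeout=120
    ) as resp:
        async for _ in resp.aiter_lines():
            pass  # consume SSE chunks as they arrive

async def main() -> None:
    async with httpx.AsyncClient() as client:
        for _ in range(60):  # roughly a minute of sustained load
            await asyncio.gather(
                *(one_stream(client) for _ in range(CONCURRENCY)),
                return_exceptions=True,
            )

asyncio.run(main())
```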

Error log

2024-07-08 11:34:32,621 xinference.api.restful_api 8 INFO     Disconnected from client (via refresh/close) Address(host='192.168.32.13', port=30733) during chat.
INFO 07-08 11:34:32 async_llm_engine.py:158] Aborted request fcdb2432-3cda-11ef-af98-7e88271d2e8e.
2024-07-08 11:34:32,630 xinference.api.restful_api 8 ERROR    Chat completion stream got an error: invalid state
Traceback (most recent call last):
  File "/app/xinference/xinference/api/restful_api.py", line 1554, in stream_results
    async for item in iterator:
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/api.py", line 340, in __anext__
    return await self._actor_ref.__xoscar_next__(self._uid)
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/context.py", line 226, in send
    result = await self._wait(future, actor_ref.address, send_message)  # type: ignore
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/context.py", line 115, in _wait
    return await future
  File "/opt/xinference/xinference_venv/lib/python3.10/site-packages/xoscar/backends/core.py", line 88, in _listen
    future.set_result(message)
asyncio.exceptions.InvalidStateError: invalid state
[The identical traceback repeats five more times within the same second, at 11:34:32,633 / ,635 / ,639 / ,641 / ,643.]
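
For context on the exception: asyncio raises InvalidStateError when set_result() is called on a future that is already done or cancelled. That appears to match what the log shows, where the client disconnects mid-stream and the pending future is cancelled just before the next chunk is delivered to it. A minimal illustration of the asyncio behavior (not xinference code):

```python
import asyncio

async def main() -> None:
    fut = asyncio.get_running_loop().create_future()
    fut.cancel()           # e.g. the awaiting side goes away (client disconnect)
    fut.set_result("msg")  # raises asyncio.exceptions.InvalidStateError

asyncio.run(main())
```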

requirements.txt

accelerate==0.30.1
addict==2.4.0
aiobotocore==2.7.0
aiofiles==23.2.1
aiohttp==3.9.5
aioitertools==0.11.0
aioprometheus==23.12.0
aiosignal==1.3.1
aliyun-python-sdk-core==2.15.1
aliyun-python-sdk-kms==2.16.3
altair==5.3.0
annotated-types==0.7.0
anyio==4.4.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
async-timeout==4.0.3
attrs==23.2.0
azure-core==1.30.1
azure-storage-blob==12.20.0
bcrypt==4.1.3
botocore==1.31.64
certifi==2024.6.2
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
cloudpickle==3.0.0
cmake==3.29.3
colorama==0.4.6
coloredlogs==15.0.1
contourpy==1.2.1
crcmod==1.7
cryptography==42.0.7
cycler==0.12.1
dataclasses-json==0.6.6
datasets==2.18.0
diffusers==0.28.2
dill==0.3.8
diskcache==5.6.3
distro==1.9.0
ecdsa==0.19.0
einops==0.8.0
environs==9.5.0
exceptiongroup==1.2.1
fastapi==0.110.3
ffmpy==0.3.2
filelock==3.14.0
flatbuffers==24.3.25
fonttools==4.53.0
frozenlist==1.4.1
fsspec==2023.10.0
gast==0.5.4
gradio==4.26.0
gradio_client==0.15.1
greenlet==3.0.3
grpcio==1.60.0
h11==0.14.0
httpcore==1.0.5
httptools==0.6.1
httpx==0.27.0
huggingface-hub==0.23.2
humanfriendly==10.0
idna==3.7
importlib_metadata==7.1.0
importlib_resources==6.4.0
interegular==0.3.3
isodate==0.6.1
jieba==0.42.1
Jinja2==3.1.4
jmespath==0.10.0
joblib==1.4.2
jsonpatch==1.33
jsonpointer==2.4
jsonschema==4.22.0
jsonschema-specifications==2023.12.1
kiwisolver==1.4.5
langchain==0.1.0
langchain-community==0.0.20
langchain-core==0.1.23
langsmith==0.0.87
lark==1.1.9
llvmlite==0.42.0
lm-format-enforcer==0.10.1
lxml==5.2.2
markdown-it-py==3.0.0
MarkupSafe==2.1.5
marshmallow==3.21.3
matplotlib==3.9.0
mdurl==0.1.2
minio==7.2.7
modelscope==1.14.0
mpmath==1.3.0
msgpack==1.0.8
multidict==6.0.5
multiprocess==0.70.16
mypy-extensions==1.0.0
nest-asyncio==1.6.0
networkx==3.3
ninja==1.11.1
numba==0.59.1
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==12.555.43
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.5.40
nvidia-nvtx-cu12==12.1.105
onnxruntime==1.15.0
openai==1.30.5
opencv-contrib-python==4.9.0.80
orjson==3.10.3
oss2==2.18.5
outlines==0.0.34
packaging==23.2
pandas==2.2.2
passlib==1.7.4
pdfminer.six==20231228
pdfplumber==0.11.0
peft==0.11.1
pillow==10.3.0
platformdirs==4.2.2
prometheus-fastapi-instrumentator==7.0.0
prometheus_client==0.20.0
protobuf==5.27.0
psutil==5.9.8
py-cpuinfo==9.0.0
pyarrow==16.1.0
pyarrow-hotfix==0.6
pyasn1==0.6.0
pycparser==2.22
pycryptodome==3.20.0
pydantic==2.7.2
pydantic_core==2.18.3
pydub==0.25.1
Pygments==2.18.0
pymilvus==2.4.0
pynvml==11.5.0
pyparsing==3.1.2
PyPDF2==3.0.1
pypdfium2==4.30.0
python-dateutil==2.9.0.post0
python-docx==1.1.2
python-dotenv==1.0.1
python-jose==3.3.0
python-multipart==0.0.9
pytz==2024.1
PyYAML==6.0.1
quantile-python==1.1
ray==2.23.0
referencing==0.35.1
regex==2024.5.15
requests==2.32.3
rich==13.7.1
rpds-py==0.18.1
rsa==4.9
ruff==0.4.7
s3fs==2023.10.0
safetensors==0.4.3
scikit-learn==1.5.0
scipy==1.13.1
semantic-version==2.10.0
sentence-transformers==3.0.0
sentencepiece==0.2.0
shellingham==1.5.4
simplejson==3.19.2
six==1.16.0
sniffio==1.3.1
sortedcontainers==2.4.0
SQLAlchemy==2.0.30
sse-starlette==2.1.0
starlette==0.37.2
sympy==1.12.1
tabulate==0.9.0
tblib==3.0.0
tenacity==8.3.0
threadpoolctl==3.5.0
tiktoken==0.6.0
timm==1.0.3
tokenizers==0.19.1
tomli==2.0.1
tomlkit==0.12.0
toolz==0.12.1
torch==2.3.0
torchvision==0.18.0
tqdm==4.66.4
transformers==4.41.0
triton==2.3.0
typer==0.11.1
typing-inspect==0.9.0
typing_extensions==4.12.1
tzdata==2024.1
ujson==5.10.0
urllib3==2.0.7
uvicorn==0.30.1
uvloop==0.19.0
vllm==0.4.3
vllm-flash-attn==2.5.8.post2
vllm_nccl_cu12==2.18.1.0.3.0
watchfiles==0.22.0
websockets==11.0.3
wrapt==1.16.0
xformers==0.0.26.post1
xinference==0.11.3
xoscar==0.3.0
xxhash==3.4.1
yapf==0.40.2
yarl==1.9.4
zipp==3.19.1

WangxuP commented 2 months ago

When I call vLLM's /v1/chat/completions endpoint directly, it works fine, and it is faster than xinference.

yunfwe commented 1 month ago

This is a bug in the xoscar library; the fix has been merged into version 0.3.2 (https://github.com/xorbitsai/xoscar/pull/87). Upgrade with `pip install xoscar==0.3.2` and run the load test again.

Dawnfz-Lenfeng commented 1 month ago

> This is a bug in the xoscar library; the fix has been merged into version 0.3.2 (xorbitsai/xoscar#87). Upgrade with `pip install xoscar==0.3.2` and run the load test again.

After upgrading, the problem seems to persist and the error messages are essentially the same. It looks like the bug is triggered whenever stream == True.

yunfwe commented 1 month ago

> This is a bug in the xoscar library; the fix has been merged into version 0.3.2 (xorbitsai/xoscar#87). Upgrade with `pip install xoscar==0.3.2` and run the load test again.
>
> After upgrading, the problem seems to persist and the error messages are essentially the same. It looks like the bug is triggered whenever stream == True.

Did you restart xinference after upgrading? Please paste the error log.
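
A quick way to confirm which versions the running environment actually picked up, using only the standard library:

```python
# Print the installed versions of the two relevant packages.
from importlib.metadata import version

for pkg in ("xinference", "xoscar"):
    print(pkg, version(pkg))
```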

Dawnfz-Lenfeng commented 1 month ago

> This is a bug in the xoscar library; the fix has been merged into version 0.3.2 (xorbitsai/xoscar#87). Upgrade with `pip install xoscar==0.3.2` and run the load test again.
>
> After upgrading, the problem seems to persist and the error messages are essentially the same. It looks like the bug is triggered whenever stream == True.
>
> Did you restart xinference after upgrading? Please paste the error log.

2024-07-19 15:01:38,014 transformers.models.llama.modeling_llama 63561 WARNING  We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
2024-07-19 15:01:49,005 xinference.model.llm.pytorch.utils 63561 INFO     Average generation speed: 3.22 tokens/s.
2024-07-19 15:01:50,378 xinference.model.llm.pytorch.utils 63561 INFO     Average generation speed: 20.42 tokens/s.
2024-07-19 15:01:50,871 xinference.model.llm.pytorch.utils 63561 INFO     Average generation speed: 15.04 tokens/s.
2024-07-19 15:02:03,688 xinference.model.llm.pytorch.utils 63561 INFO     Average generation speed: 25.22 tokens/s.
2024-07-19 15:02:18,889 xinference.api.restful_api 63191 INFO     Disconnected from client (via refresh/close) Address(host='127.0.0.1', port=36816) during chat.
2024-07-19 15:02:24,799 xinference.model.llm.pytorch.utils 63561 INFO     Average generation speed: 0.85 tokens/s.
2024-07-19 15:02:29,739 xinference.model.llm.pytorch.utils 63561 INFO     Average generation speed: 1.02 tokens/s.
2024-07-19 15:02:33,928 xinference.api.restful_api 63191 INFO     Disconnected from client (via refresh/close) Address(host='127.0.0.1', port=36870) during chat.
2024-07-19 15:02:33,939 xinference.api.restful_api 63191 INFO     Disconnected from client (via refresh/close) Address(host='127.0.0.1', port=36872) during chat.
2024-07-19 15:02:33,951 xinference.api.restful_api 63191 INFO     Disconnected from client (via refresh/close) Address(host='127.0.0.1', port=36874) during chat.
2024-07-19 15:02:33,955 xinference.api.restful_api 63191 INFO     Disconnected from client (via refresh/close) Address(host='127.0.0.1', port=36876) during chat.
2024-07-19 15:02:33,963 xinference.api.restful_api 63191 INFO     Disconnected from client (via refresh/close) Address(host='127.0.0.1', port=36864) during chat.
2024-07-19 15:02:33,978 xinference.api.restful_api 63191 INFO     Disconnected from client (via refresh/close) Address(host='127.0.0.1', port=36884) during chat.
2024-07-19 15:02:33,983 xinference.api.restful_api 63191 INFO     Disconnected from client (via refresh/close) Address(host='127.0.0.1', port=36888) during chat.
2024-07-19 15:02:33,987 xinference.api.restful_api 63191 INFO     Disconnected from client (via refresh/close) Address(host='127.0.0.1', port=36886) during chat.

Versions:

xinference                              0.13.1
xoscar                                  0.3.2

yunfwe commented 1 month ago

> [quotes the preceding exchange, error log, and versions]

What about switching to the vLLM engine? Previously, once InvalidStateError: invalid state occurred, the whole API would go down and stop responding to any requests, even though the inference engine itself was still healthy.
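
For reference, a hedged sketch of launching a model on the vLLM engine through the xinference Python client; the endpoint and parameter values below are assumptions and the exact signature may vary across xinference versions, so check the docs for your install:

```python
# Hedged sketch: request the vLLM backend when launching a model through the
# xinference Python client. Endpoint and parameter values are ASSUMPTIONS.
from xinference.client import Client

client = Client("http://localhost:9997")  # assumed supervisor address
model_uid = client.launch_model(
    model_name="qwen-chat",
    model_engine="vllm",            # use the vLLM engine instead of transformers
    model_size_in_billions=14,
    model_format="pytorch",
)
print("launched model uid:", model_uid)
```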

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 7 days with no activity.

vierachen commented 2 weeks ago

> [quotes the preceding exchange, including the error log and the suggestion to switch to the vLLM engine]

With the vLLM engine, the same problem occurs. Is there any solution?