Aborting a single request is not supported

xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.

Apache License 2.0


System Info

xinference==v0.13.3

Running Xinference with Docker?

[X] docker
[ ] pip install
[ ] installation from source

Version info

xinference==v0.13.3

The command used to start Xinference

/opt/conda/bin/python /opt/conda/bin/xinference-worker --metrics-exporter-port 9998 -e http://10.6.208.95:9997/ -H 10.6.208.95

Reproduction

For a single request, if the client disconnects on its own initiative, the request is not aborted; it keeps running until it completes.

Expected behavior

When the client disconnects, the server should promptly call engine.abort(request_id) to close the current request instead of wasting GPU inference resources.

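I read the Xinference code. At the moment, the Transformer engine appears to be the only one that supports batching. Its logic puts messages onto a Queue and then manages them through scheduler_actor. To abort an earlier request, you have to call the abort HTTP endpoint explicitly and pass in the request_id.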
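To make the explicit-abort flow concrete, here is a minimal sketch of a queue-based scheduler that drops a request once its id has been flagged. `ToyScheduler`, `submit`, and `step` are hypothetical names for illustration only; this is not Xinference's scheduler_actor code, just the general pattern as I understand it.

```python
import uuid
from collections import deque


class ToyScheduler:
    """Hypothetical stand-in for a batching scheduler (not Xinference's code)."""

    def __init__(self):
        self.waiting = deque()  # requests queued for the next batch step
        self.aborted = set()    # request ids flagged for cancellation
        self.tokens = {}        # request_id -> tokens generated so far

    def submit(self, prompt, max_tokens):
        request_id = uuid.uuid4().hex
        self.waiting.append((request_id, prompt, max_tokens))
        self.tokens[request_id] = []
        return request_id

    def abort(self, request_id):
        # Explicit abort: the caller must still know the request_id.
        self.aborted.add(request_id)

    def step(self):
        """Run one fake decode step per live request; drop aborted ones."""
        still_running = deque()
        while self.waiting:
            request_id, prompt, max_tokens = self.waiting.popleft()
            if request_id in self.aborted:
                self.aborted.discard(request_id)
                continue  # no further work is spent on this request
            out = self.tokens[request_id]
            out.append(f"tok{len(out)}")
            if len(out) < max_tokens:
                still_running.append((request_id, prompt, max_tokens))
        self.waiting = still_running


scheduler = ToyScheduler()
kept = scheduler.submit("hello", max_tokens=3)
dropped = scheduler.submit("world", max_tokens=3)
scheduler.step()
scheduler.abort(dropped)  # only possible because the id was remembered
scheduler.step()
scheduler.step()
```

The weak point is visible in the last lines: the abort only happens if the client kept the request_id and made a second call.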

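Looking at vLLM's OpenAI-compatible interface, its implementation monitors the connection of the FastAPI request. If the connection drops, it calls engine.abort(request_id) on its own, so the caller does not have to call an abort endpoint externally.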
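A rough sketch of that idea, using only asyncio so it runs standalone: generation is raced against a "client disconnected" signal, and the engine is aborted if the signal wins. `FakeEngine`, `serve`, and the `disconnected` event are hypothetical; this is not vLLM's actual code. In a real FastAPI/Starlette handler the signal would presumably come from polling `await request.is_disconnected()`.

```python
import asyncio


class FakeEngine:
    """Hypothetical engine; only the abort(request_id) idea is from the issue."""

    def __init__(self):
        self.aborted = []
        self.steps_done = 0

    async def generate(self, request_id, steps):
        for _ in range(steps):
            await asyncio.sleep(0.01)  # stands in for one decode step
            self.steps_done += 1
        return "done"

    async def abort(self, request_id):
        self.aborted.append(request_id)


async def serve(engine, request_id, disconnected, steps):
    """Race generation against the disconnect signal; abort if the client left."""
    gen = asyncio.ensure_future(engine.generate(request_id, steps))
    watcher = asyncio.ensure_future(disconnected.wait())
    done, _ = await asyncio.wait({gen, watcher}, return_when=asyncio.FIRST_COMPLETED)
    if gen in done:
        watcher.cancel()
        return gen.result()
    # The client went away first: stop generating and tell the engine.
    gen.cancel()
    await asyncio.gather(gen, return_exceptions=True)
    await engine.abort(request_id)
    return None


async def demo():
    engine = FakeEngine()
    disconnected = asyncio.Event()
    # Simulate the client hanging up after roughly 30 ms.
    asyncio.get_running_loop().call_later(0.03, disconnected.set)
    result = await serve(engine, "req-1", disconnected, steps=100)
    return engine, result


engine, result = asyncio.run(demo())
```

With this shape the client never needs to know the request_id; the server already has it and can release the request as soon as the connection is gone.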

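vLLM's approach is very useful. First, in real usage not every client can keep track of its request_id. Second, once the connection has dropped there is no reason for the engine to keep running inference; doing so only wastes resources. Finally, vLLM's OpenAI-compatible interface supports batching, yet when the vLLM engine is used inside Xinference, batched inference is not available. (That is a separate problem from this issue, but it also has a noticeable impact.) [I will submit a PR next week to fix the missing batching support for vLLM.]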

PS: vLLM's FastAPI interface implementation is probably more efficient than Xinference's. I think it is worth learning from.
