xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

Support Optional Configurations for Embedding models #1661

Open zhangever opened 3 months ago

zhangever commented 3 months ago

Is your feature request related to a problem? Please describe

I deployed the BCE embedding model with 3 replicas spread across three 3090 GPUs, but could not get 3x the throughput of a single replica. In practice, throughput is the same as with a single card.

Describe the solution you'd like

I would like optional configurations to be added, for example a request limit.
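For illustration, the requested behavior could be approximated today on the client side with a concurrency cap, so that no more than a fixed number of embedding requests are in flight at once. This is only a sketch: `MAX_INFLIGHT`, the 768-dim output, and the `embed` placeholder are assumptions, not part of Xinference's API.

```python
import asyncio

# Assumed cap on concurrent embedding requests (tune per replica count).
MAX_INFLIGHT = 8

sem = asyncio.Semaphore(MAX_INFLIGHT)

async def embed(text: str) -> list[float]:
    # Placeholder for a real call to the embedding endpoint;
    # sleeps briefly to simulate network + inference latency.
    await asyncio.sleep(0.01)
    return [0.0] * 768  # assumed embedding dimension

async def limited_embed(text: str) -> list[float]:
    # At most MAX_INFLIGHT embed() calls run concurrently;
    # the rest queue on the semaphore.
    async with sem:
        return await embed(text)

async def main() -> None:
    docs = [f"doc {i}" for i in range(100)]
    results = await asyncio.gather(*(limited_embed(d) for d in docs))
    print(len(results))  # all 100 requests complete

if __name__ == "__main__":
    asyncio.run(main())
```

A server-side request limit would serve the same purpose, but could also feed back into replica scheduling, which a pure client-side cap cannot do.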

Describe alternatives you've considered

N/A

Additional context

N/A

codingl2k1 commented 3 months ago

When you run 3 replicas on 3 cards, is any process maxing out a CPU core at 100%?
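One way to check this while the benchmark is running (a sketch, assuming a Linux host with GNU `ps`; note `%cpu` here is a lifetime average, so `top -b -n 1` gives a better instantaneous snapshot):

```shell
# Show the ten heaviest CPU consumers; a value near 100 means
# one process has a core saturated and may be the bottleneck.
ps -eo pid,comm,%cpu --sort=-%cpu | head -n 10
```

If a single dispatcher or tokenizer process sits at ~100% while GPU utilization stays low, the workload is CPU-bound on that process rather than spread across the replicas.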

zhangever commented 3 months ago

> When you run 3 replicas on 3 cards, is any process maxing out a CPU core at 100%?

Only the benchmark requests were running at the time, nothing else. GPU utilization was low: each card was busy only intermittently, peaking at under 10%.

codingl2k1 commented 3 months ago

I mean, is CPU utilization hitting 100% anywhere?

github-actions[bot] commented 2 months ago

This issue is stale because it has been open for 7 days with no activity.