xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

Docker image upgraded to 0.12.1: running Qwen1.5-14B-Chat-GPTQ-Int4 is much slower than on 0.11.0 #1650

Open WholeWorld-Timothy opened 2 months ago

WholeWorld-Timothy commented 2 months ago

Describe the bug

After upgrading the Docker image to 0.12.1, running Qwen1.5-14B-Chat-GPTQ-Int4 is much slower than it was on 0.11.0.

To Reproduce

Upgrade the Docker image to 0.12.1 and run Qwen1.5-14B-Chat-GPTQ-Int4; inference is much slower than on 0.11.0.

Expected behavior

The number of tokens per second after the upgrade should be the same as before the upgrade.

Additional context

Our startup parameter configuration: [screenshot]

worm128 commented 2 months ago

It's fast with vLLM, but Transformers is very slow, and I don't know why. I also don't know which backend the old version used — there was no such option back then.

WholeWorld-Timothy commented 2 months ago

Our GPU memory is limited, so vLLM isn't an option for us. This looks like a behavioral change — was there any specific modification? If nothing changed but it got slower, that would be strange; if something did change, please point out where, and I can build a custom version that reverts it.

worm128 commented 2 months ago

> Our GPU memory is limited, so vLLM isn't an option for us. This looks like a behavioral change — was there any specific modification? If nothing changed but it got slower, that would be strange; if something did change, please point out where, and I can build a custom version that reverts it.

I don't even know which version I had before, because I pulled the `latest` image. The old version was fast. After the upgrade, the vLLM and Transformers backends were split apart: vLLM uses more GPU memory but is fast, while Transformers is very slow, even though its memory usage is the same as the old version.
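Since the regression only surfaced after pulling `latest`, pinning the image to a known-good tag prevents an unnoticed upgrade from changing performance. A minimal compose sketch, assuming the image is published as `xprobe/xinference` on Docker Hub and that `v0.11.0` is a valid tag (verify both against Docker Hub before use):

```yaml
# docker-compose.yml fragment: pin the Xinference image to an explicit tag
# instead of tracking `latest`, so `docker pull` cannot silently change the
# version you are benchmarking against.
services:
  xinference:
    image: xprobe/xinference:v0.11.0   # assumed tag name; check Docker Hub
    command: xinference-local --host 0.0.0.0 --port 9997
    ports:
      - "9997:9997"
```

Recording the exact tag also makes regressions like this one easier to bisect, since both the fast and slow versions can be named precisely in the report.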

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open for 7 days with no activity.