xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

Docker image upgraded to 0.12.1: running Qwen1.5-14B-Chat-GPTQ-Int4 is much slower than on 0.11.0 #1650

Open WholeWorld-Timothy opened 2 months ago

WholeWorld-Timothy commented 2 months ago

Describe the bug

After upgrading the Docker image to 0.12.1, running Qwen1.5-14B-Chat-GPTQ-Int4 is much slower than it was on 0.11.0.

To Reproduce

Upgrade the Docker image to 0.12.1 and run Qwen1.5-14B-Chat-GPTQ-Int4; inference is much slower than on 0.11.0.

Expected behavior

The number of tokens per second after the upgrade should be the same as before the upgrade.

Additional context

Our startup parameter configuration: [screenshot]

worm128 commented 2 months ago

It's fast with vLLM, but Transformers is very slow, and I don't know why. I also don't know which backend the old version used — there was no such option back then.

WholeWorld-Timothy commented 2 months ago

Our GPU memory is limited, so vLLM isn't an option for us. This looks like a behavioral change — was there any specific modification? If nothing changed but it got slower, that would be strange; if something did change, please point out where, and I can build a custom version that reverts it.

worm128 commented 2 months ago

> Our GPU memory is limited, so vLLM isn't an option for us. This looks like a behavioral change — was there any specific modification? If nothing changed but it got slower, that would be strange; if something did change, please point out where, and I can build a custom version that reverts it.

I don't even know which version I had before, because I pulled the `latest` image. The old version was fast. After the upgrade, the vLLM and Transformers backends were split apart: vLLM uses more GPU memory but is fast, while Transformers is very slow, even though its memory usage is the same as the old version.
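Since the regression only surfaced after pulling `latest`, pinning the image to a known-good tag prevents an unnoticed upgrade from changing performance. A minimal compose sketch, assuming the image is published as `xprobe/xinference` on Docker Hub and that `v0.11.0` is a valid tag (verify both against Docker Hub before use):

```yaml
# docker-compose.yml fragment: pin the Xinference image to an explicit tag
# instead of tracking `latest`, so `docker pull` cannot silently change the
# version you are benchmarking against.
services:
  xinference:
    image: xprobe/xinference:v0.11.0   # assumed tag name; check Docker Hub
    command: xinference-local --host 0.0.0.0 --port 9997
    ports:
      - "9997:9997"
```

Recording the exact tag also makes regressions like this one easier to bisect, since both the fast and slow versions can be named precisely in the report.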

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open for 7 days with no activity.