xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

[Performance] bge-reranker-v2-minicpm-layerwise deployment performance issue #1377

Closed coswind closed 2 weeks ago

coswind commented 4 months ago

Describe the bug

Deploying bge-reranker-v2-minicpm-layerwise with the latest version of xinference: the model fails to download from ModelScope. After switching to Hugging Face, deployment succeeds, but inference is extremely slow, to the point of being practically unusable.

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/root/miniconda3/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2663: UserWarning: `max_length` is ignored when `padding`=`True` and there is no truncation strategy. To pad to max length, use `padding='max_length'`.
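For context on why latency grows so quickly: bge-reranker-v2-minicpm-layerwise is a cross-encoder-style reranker, so every (query, document) pair needs its own forward pass through the underlying MiniCPM model, and total time scales linearly with the number of candidate documents. A minimal pure-Python sketch of that cost model (the `score_pair` stub stands in for a real forward pass and is purely illustrative, not the actual model):

```python
def score_pair(query: str, doc: str) -> float:
    """Stand-in for one forward pass of a cross-encoder reranker.

    In the real model this is a full LLM forward pass over the
    concatenated (query, document) input, which is what makes
    reranking N documents roughly N times the cost of one pass.
    """
    # Illustrative score only: prefer longer documents.
    return float(len(doc))

def rerank(query: str, docs: list[str]) -> list[str]:
    # One scoring pass per document, then sort by score descending.
    return sorted(docs, key=lambda d: score_pair(query, d), reverse=True)

print(rerank("example query", ["a", "bbb", "cc"]))  # → ['bbb', 'cc', 'a']
```

Because each pair is scored independently, batching pairs per forward pass (and running on a GPU) is usually the main lever for bringing latency down.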

To Reproduce

To help us reproduce this bug, please provide the information below:

Python: 3.10.8
Xinference: v0.10.3

Additional information

I found the following lead in the Hugging Face discussion forum:

https://huggingface.co/BAAI/bge-reranker-v2-minicpm-layerwise/discussions/1

qinxuye commented 4 months ago

Unless FlagEmbedding ships a new release, adding this parameter can easily lead to errors.

github-actions[bot] commented 2 weeks ago

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] commented 2 weeks ago

This issue was closed because it has been inactive for 5 days since being marked as stale.