You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/root/miniconda3/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2663: UserWarning: `max_length` is ignored when `padding`=`True` and there is no truncation strategy. To pad to max length, use `padding='max_length'`.
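The warning above means that `padding=True` pads only to the longest sequence in the batch and silently ignores `max_length`, while `padding='max_length'` pads every sequence up to `max_length`. A minimal sketch of that distinction (a hypothetical `pad_batch` helper written for illustration, not the actual transformers implementation):

```python
# Sketch of the two padding strategies the warning distinguishes.
# PAD_ID and pad_batch are illustrative stand-ins, not transformers APIs.
PAD_ID = 0

def pad_batch(batch, padding=True, max_length=None):
    if padding == "max_length":
        # padding='max_length': pad every sequence up to max_length
        target = max_length
    else:
        # padding=True: pad to the longest sequence in the batch;
        # max_length is ignored, which is what triggers the UserWarning
        target = max(len(seq) for seq in batch)
    return [seq + [PAD_ID] * (target - len(seq)) for seq in batch]

batch = [[101, 7592, 102], [101, 102]]
print(pad_batch(batch, padding=True, max_length=8))         # padded to length 3
print(pad_batch(batch, padding="max_length", max_length=8))  # padded to length 8
```

To actually pad to a fixed length with a transformers tokenizer, pass `padding='max_length'` together with `max_length`, as the warning suggests.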
Describe the bug
Deployed bge-reranker-v2-minicpm-layerwise with the latest version of Xinference. The download from ModelScope failed; after switching to HuggingFace, deployment succeeded, but inference is extremely slow in use, to the point of being practically unusable.
To Reproduce
To help us to reproduce this bug, please provide information below:
Python 3.10.8 Xinference v0.10.3
Additional information
I found the following lead in the HuggingFace discussion community:
https://huggingface.co/BAAI/bge-reranker-v2-minicpm-layerwise/discussions/1