BUG: embedding and rerank models keep occupying GPU memory

xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.

Apache License 2.0

With Xinference 0.12.2 deployed via Docker, embedding and rerank models keep occupying GPU memory

Host CUDA version: 12.1
Operating system: Windows 10
Docker version: Docker Desktop 4.29.0 (145265)
GPU: RTX 3090, 24 GB

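With Xinference 0.12.1 and 0.12.2, a Docker deployment stops on its own after starting. Following the fix from the earlier issue, I added this install step:

RUN pip install -U "llama-cpp-python==0.2.77" --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121

I then rebuilt the image, and the new image starts without errors. I loaded the bge-reranker-v2-m3 model and call it through fastgpt + oneapi.

Main symptoms:
1. After every search that calls the model, GPU memory usage grows again. After 4 or 5 calls the 24 GB is full.
2. The GPU memory stays occupied and is never released automatically. It is freed immediately only when I manually shut down the model in question (bge-reranker-v2-m3 above).

Desired behavior:
1. Only one replica is created, as configured, and searches do not accumulate GPU memory.
2. GPU memory is released automatically when the model is idle, rather than being held indefinitely.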
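To make the first symptom easier to reproduce and report, the per-call growth can be measured directly. The following is a minimal sketch, not part of Xinference: the helper names (`parse_used_mib`, `read_used_mib`, `sample_around_calls`, `grows_every_call`) are hypothetical, and it assumes `nvidia-smi` is on the PATH of the machine where it runs. The `call` argument stands for whatever function sends one rerank request in your setup.

```python
import subprocess
from typing import Callable, List


def parse_used_mib(nvidia_smi_output: str) -> List[int]:
    """Parse one used-memory value (MiB) per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def read_used_mib(gpu_index: int = 0) -> int:
    """Read used memory (MiB) for one GPU. Assumes nvidia-smi is installed."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_used_mib(out)[gpu_index]


def sample_around_calls(call: Callable[[], None], n_calls: int,
                        read_used: Callable[[], int] = read_used_mib) -> List[int]:
    """Record a baseline reading, then one reading after each call."""
    samples = [read_used()]
    for _ in range(n_calls):
        call()  # e.g. one rerank request (supplied by the caller)
        samples.append(read_used())
    return samples


def grows_every_call(samples: List[int], tolerance_mib: int = 0) -> bool:
    """True if every reading exceeds the previous one by more than the tolerance."""
    return len(samples) >= 2 and all(
        later - earlier > tolerance_mib
        for earlier, later in zip(samples, samples[1:])
    )
```

Running `sample_around_calls(send_one_rerank_request, 5)` (where `send_one_rerank_request` is your own function) and attaching the resulting list to the issue shows whether memory climbs on every request or plateaus after the first few.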

BUG: embedding and rerank models keep occupying GPU memory #1741