xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

Web UI: when deploying small models, does one slot only hold a single model? Cannot deploy multiple models even with plenty of free GPU memory #2237

Closed Songjiadong closed 1 month ago

Songjiadong commented 2 months ago

System Info

Server error: 503 - [address=0.0.0.0:35434, pid=25922] No available slot found for the model

Running Xinference with Docker?

No; it runs from a local virtualenv (see the start command below).

Version info

xinference 0.13.3

The command used to start Xinference

cd /data/galileo/models/xinference
. xinference/bin/activate
xinference-local --host 0.0.0.0 --port 8888

Reproduction

When deploying models through the web UI, does one slot only hold a single model? I cannot deploy multiple models even though there is plenty of free GPU memory. I first deployed a qwen0.5b, after which GPU usage was 5999MiB / 49152MiB; when I tried to deploy a second qwen0.5b, I got the error above.
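For reference, the same two-step reproduction from the command line might look like the sketch below. The `xinference launch` flags mirror the ones quoted later in this thread; the endpoint and the `-s 0_5` size value for a Qwen2 0.5B model are assumptions.

```shell
# Hypothetical CLI reproduction; the endpoint and -s 0_5 size are assumptions.
# The first launch succeeds and occupies one worker "slot":
xinference launch -e http://0.0.0.0:8888 -n qwen2-instruct -s 0_5 -f pytorch
# Launching a second copy fails even with ~43 GiB of VRAM still free:
#   Server error: 503 - ... No available slot found for the model
xinference launch -e http://0.0.0.0:8888 -n qwen2-instruct -s 0_5 -f pytorch
```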

Expected behavior

Slots should be used sensibly, so that multiple models can share a GPU when memory permits.

Valdanitooooo commented 2 months ago

Same here; waiting for a new release. For now I run multiple models by starting multiple Xinference instances, as sketched below.
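A sketch of that workaround, assuming each instance only needs its own port (pinning both to the same card with CUDA_VISIBLE_DEVICES is optional):

```shell
# Two independent Xinference instances; each one then hosts its own model.
CUDA_VISIBLE_DEVICES=0 xinference-local --host 0.0.0.0 --port 8888 &
CUDA_VISIBLE_DEVICES=0 xinference-local --host 0.0.0.0 --port 8889 &
```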

wenzhaoabc commented 2 months ago

In the web UI's model launch parameters panel, forcing a specific gpu_index lets you run multiple models on a single card.

Songjiadong commented 2 months ago

@wenzhaoabc Try it with vLLM; it doesn't seem to work there.

zhangxianglink commented 2 months ago

> @wenzhaoabc Try it with vLLM; it doesn't seem to work there.

I tried: vLLM needs exclusive use of a card. After switching to Transformers, the two models below run together on a single 4090:

--model-engine Transformers --gpu-idx 1 -n qwen2-instruct -f pytorch --gpu_memory_utilization 0.7
--model-engine Transformers --gpu-idx 1 -n qwen2-instruct -f pytorch
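Spelled out as complete commands, a sketch: the `xinference launch` prefix, the endpoint, and the `-s 0_5` size value are assumptions not present in the comment; the engine flags are verbatim.

```shell
# Both launches pinned to the same card via --gpu-idx 1; if the server
# rejects a duplicate name, a distinct --model-uid can be supplied.
xinference launch -e http://0.0.0.0:8888 --model-engine Transformers --gpu-idx 1 \
    -n qwen2-instruct -s 0_5 -f pytorch --gpu_memory_utilization 0.7
xinference launch -e http://0.0.0.0:8888 --model-engine Transformers --gpu-idx 1 \
    -n qwen2-instruct -s 0_5 -f pytorch
```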

wenzhaoabc commented 2 months ago

By default, vLLM takes all of the GPU memory left over after loading the model and uses it for the KV cache. Its footprint can be capped with the --gpu-memory-utilization flag; the default is 0.9.

https://github.com/vllm-project/vllm/issues/2430
https://docs.vllm.ai/en/latest/models/engine_args.html
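For comparison, the same knob on a standalone vLLM server (a sketch using vLLM's OpenAI-compatible entrypoint; the model ID is illustrative):

```shell
# vLLM pre-allocates (--gpu-memory-utilization x total VRAM) and gives
# whatever is left after the weights to the KV cache; the default is 0.9.
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2-0.5B-Instruct \
    --gpu-memory-utilization 0.3
```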

Songjiadong commented 1 month ago

@wenzhaoabc Even setting --gpu-memory-utilization down to 0.2 doesn't help.

Songjiadong commented 1 month ago

@zhangxianglink Yes, the Transformers engine can run multiple instances.

guoping1127 commented 1 month ago

> Same here; waiting for a new release. For now I run multiple models by starting multiple Xinference instances.

Same for me: even after capping vLLM's GPU memory utilization, I still can't load another model.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] commented 1 month ago

This issue was closed because it has been inactive for 5 days since being marked as stale.

Songjiadong commented 1 month ago

When will this issue be fixed?