xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

ENH: enable multiple models running on a single device #531

Closed: UranusSeven closed this issue 3 weeks ago

UranusSeven commented 11 months ago

Is your feature request related to a problem? Please describe

Currently, our system assigns each model to a unique GPU device. While this approach protects against out-of-memory (OOM) errors, it is not an optimal allocation strategy and can leave GPU memory underutilized.

To enhance our GPU resource management and utilization:

By calculating and monitoring the GPU memory requirements of each model, we can intelligently allocate multiple models to a single GPU device when memory allows. This will ensure efficient GPU resource utilization without compromising performance or risking OOM errors.
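
A minimal sketch of what such memory-aware placement could look like, assuming per-model memory estimates are available and free GPU memory can be queried (e.g. via pynvml). The names `GpuSlot`, `pick_gpu`, `required_mb`, and `headroom_mb` are illustrative only and not part of Xinference's actual scheduler:

```python
# Hypothetical greedy placement of models onto GPUs by free memory.
# This is a sketch, not Xinference's implementation.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GpuSlot:
    index: int
    free_mb: int                      # free memory reported for this device
    models: List[str] = field(default_factory=list)


def pick_gpu(slots: List[GpuSlot], model_uid: str, required_mb: int,
             headroom_mb: int = 1024) -> Optional[int]:
    """Place the model on the GPU with the most free memory, keeping a safety margin."""
    candidates = [s for s in slots if s.free_mb - required_mb >= headroom_mb]
    if not candidates:
        return None                   # no device fits: reject or queue the launch
    best = max(candidates, key=lambda s: s.free_mb)
    best.free_mb -= required_mb
    best.models.append(model_uid)
    return best.index


# Example: two 24 GB GPUs shared by three models when memory allows.
slots = [GpuSlot(0, 24_000), GpuSlot(1, 24_000)]
for uid, mem in [("llama-2-7b", 14_000), ("bge-reranker", 2_000), ("whisper-large", 6_000)]:
    print(uid, "->", pick_gpu(slots, uid, mem))
```

The headroom margin is one way to keep the OOM protection the current one-model-per-GPU policy provides while still packing several smaller models onto the same device.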

BradKML commented 4 months ago

Would like to see this developed further, either as the default behavior or made easily manageable through Docker. Currently this feature (used with Docker) is not in the docs, and alternatives like Ollama do not include rerankers. Cross-referencing these two issues: https://github.com/xorbitsai/inference/issues/503 https://github.com/xorbitsai/inference/issues/1228

github-actions[bot] commented 4 weeks ago

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] commented 3 weeks ago

This issue was closed because it has been inactive for 5 days since being marked as stale.