xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

Model loading hangs in vLLM multi-GPU mode, GPU utilization stuck at 100% #2360

Open Dax-Zhang-pxy opened 4 days ago

Dax-Zhang-pxy commented 4 days ago

System Info / 系統信息

CUDA 12.3, RTX 4090 GPUs, Python 3.10, running locally (not via Docker).

Running Xinference with Docker? / 是否使用 Docker 运行 Xinference?

No.

Version info / 版本信息

The same problem occurs on both 0.15.0 and 0.15.2.

The command used to start Xinference / 用以启动 xinference 的命令

XINFERENCE_HOME=/data/program/xinference XINFERENCE_MODEL_SRC=modelscope xinference-local --host 0.0.0.0 --port 9997

Reproduction / 复现过程

The hang occurs whenever a model is loaded across two GPUs. (Two screenshots were attached but failed to upload.)
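For reference, a reproduction along these lines can be sketched with the Xinference CLI. The model name, size, and format below are placeholders I chose for illustration, not details from the report; `--n-gpu 2` is what spreads the model across both cards:

```shell
# Hypothetical reproduction: launch any model on the vLLM engine
# across two GPUs against the server started above.
xinference launch \
    --endpoint http://0.0.0.0:9997 \
    --model-engine vllm \
    --model-name qwen2-instruct \
    --size-in-billions 7 \
    --model-format pytorch \
    --n-gpu 2
```

With the reported setup, this command never returns: both GPUs sit at 100% utilization while the worker waits on an NCCL collective.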

Expected behavior / 期待表现

Hoping to find an effective fix.

Dax-Zhang-pxy commented 1 hour ago

The final fix was to set the following environment variables:

export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
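Spelled out, the workaround disables NCCL's InfiniBand and peer-to-peer transports before starting the server. A minimal sketch; the explanation in the comments (consumer RTX 4090s lack PCIe P2P support, so NCCL can spin waiting on a transfer that never completes) is my reading of why this helps, not something stated in the thread:

```shell
# Disable NCCL's InfiniBand transport (no IB fabric on this box)
# and peer-to-peer GPU transfers (unsupported on RTX 4090),
# forcing NCCL to fall back to transfers staged through host memory.
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1

# Then start Xinference as before:
# XINFERENCE_HOME=/data/program/xinference XINFERENCE_MODEL_SRC=modelscope \
#     xinference-local --host 0.0.0.0 --port 9997
```

Setting the variables in the shell that launches `xinference-local` is enough; the worker processes inherit them.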