xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0
4.99k stars · 396 forks
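Since Xinference exposes an OpenAI-compatible API, "changing a single line of code" usually means pointing the request's base URL at the local Xinference server. A minimal stdlib sketch (the port 9997 is Xinference's default; the model name `qwen2-instruct` is illustrative):

```python
import json
from urllib.request import Request

# Hypothetical local Xinference endpoint; 9997 is Xinference's default port.
BASE_URL = "http://localhost:9997/v1"

def chat_request(model: str, prompt: str) -> Request:
    """Build an OpenAI-style chat-completions request against Xinference."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# The request is built but not sent here, since sending it needs a running server.
req = chat_request("qwen2-instruct", "Hello")
print(req.full_url)
```

The same payload shape works against OpenAI's hosted API, which is why only the base URL needs to change.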

BUG: GPU memory is not released after closing a model from the UI #1682

Closed leoterry-ulrica closed 1 month ago

leoterry-ulrica commented 3 months ago

Describe the bug

Closing a model from the UI does not release GPU memory, or does not release it completely.

To Reproduce

To help us reproduce this bug, please provide the information below:

  1. Your Python version: Python 3.10
  2. The version of xinference you use: 0.12.0
  3. vllm version: 0.4.3

Expected behavior

GPU memory should be fully released after the model is terminated.

Additional context

LLM: qwen2-72b-4bit/8bit

(three screenshots attached)

ChengjieLi28 commented 3 months ago

@leoterry-ulrica Did you launch qwen2-72b-4bit/8bit across 4 GPUs? Please post a screenshot of the Running Models page after the model starts successfully, and also the logs after you click the Terminate button.

leoterry-ulrica commented 3 months ago

> @leoterry-ulrica Did you launch qwen2-72b-4bit/8bit across 4 GPUs? Please post a screenshot of the Running Models page after the model starts successfully, and also the logs after you click the Terminate button.

  1. Running Models page after a successful launch: (screenshot)

  2. Logs after clicking Terminate: (screenshot)

Resource usage (judging by the numbers, only one of the GPUs was freed): (screenshot)

github-actions[bot] commented 2 months ago

This issue is stale because it has been open for 7 days with no activity.

leoterry-ulrica commented 1 month ago

@ChengjieLi28 I can currently reproduce this issue on both 4x L20 and 8x H800 GPUs.

ChengjieLi28 commented 1 month ago

> @ChengjieLi28 I can currently reproduce this issue on both 4x L20 and 8x H800 GPUs.

First of all, I cannot reproduce the multi-GPU memory-release problem on my side. I suggest upgrading xinference and vllm first, and then testing as follows:

  1. Leave xinference out of it: load your model across 4 GPUs through the vllm API directly, run some inference, then kill it, and check whether vllm itself has the problem.
  2. "Kill" here means terminating the process directly.
  3. Try a few different vllm versions, if you can.

xinference does no extra GPU memory management; once the model is loaded, everything is handed over to vllm.
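The suggested isolation test could be sketched as follows (a command sketch requiring GPU hardware, so it cannot be run as-is; the model path is a placeholder, while `vllm.entrypoints.openai.api_server` and the `nvidia-smi` query flags are real interfaces in vLLM and the NVIDIA driver tools):

```shell
# Load the model with vLLM alone across 4 GPUs, with no xinference involved
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/qwen2-72b-int4 \
    --tensor-parallel-size 4 &
VLLM_PID=$!

# ...send a few test requests against http://localhost:8000/v1, then
# kill the process directly, as suggested above
kill -9 "$VLLM_PID"

# Check whether memory was actually freed on every GPU
nvidia-smi --query-gpu=index,memory.used --format=csv
```

If memory stays allocated on some GPUs after this, the leak is in vLLM itself rather than in xinference's model management.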

leoterry-ulrica commented 1 month ago

Tested with the new version v0.14.1 and the problem is resolved. Great work! @qinxuye @ChengjieLi28