xusenlinzy / api-for-open-llm

An OpenAI-style API for open large language models: use LLMs just like ChatGPT! Supports LLaMA, LLaMA-2, BLOOM, Falcon, Baichuan, Qwen, Xverse, SqlCoder, CodeLLaMA, ChatGLM, ChatGLM2, ChatGLM3, etc. A unified backend API for open-source large language models.
Apache License 2.0

GPU KV cache usage hits 100.0% and then the server hangs? #131

Closed BenRood8165290 closed 1 year ago

BenRood8165290 commented 1 year ago

The following items must be checked before submission

Type of problem

Model inference and deployment

Operating system

Linux

Detailed description of the problem

A Baichuan2 model (fine-tuned from the Base model with the LLaMA-Efficient-Tuning project) is loaded for inference via vLLM. Hardware is a V100-32G with CUDA 11.7, and conversation works normally at first. However, after continued dialogue it eventually hangs; the logs show it freezes once GPU KV cache usage reaches 100%. If I clear the history during the session, GPU KV cache usage drops. It seems the maximum context length is being exceeded. Is there somewhere to configure this?

CUDA_VISIBLE_DEVICES=3 python server.py &
streamlit run streamlit_app.py --server.port 7861
PORT=8000

# model related
MODEL_NAME=baichuan-13b
MODEL_PATH=/DaTa/.local/home/hai.li/dl/BaiCuan/B2-13b-left_sft
PROMPT_NAME=xdefault
EMBEDDING_NAME=

# api related
API_PREFIX=/v1

# vllm related
USE_VLLM=true
TRUST_REMOTE_CODE=true
TOKENIZE_MODE=slow
TENSOR_PARALLEL_SIZE=1
DTYPE=half
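
For context, the knobs that actually bound the KV cache live in vLLM's engine arguments rather than in this .env file. Below is a minimal sketch of a direct vLLM call with the relevant parameters; gpu_memory_utilization and max_model_len are vLLM's own arguments (max_model_len only exists in newer releases), and the model path and numbers are just this setup's values.

from vllm import LLM, SamplingParams

# Illustrative direct vLLM call, not part of server.py.
llm = LLM(
    model="/DaTa/.local/home/hai.li/dl/BaiCuan/B2-13b-left_sft",
    trust_remote_code=True,
    tokenizer_mode="slow",
    tensor_parallel_size=1,
    dtype="half",
    gpu_memory_utilization=0.9,  # fraction of GPU memory for weights + KV cache
    # max_model_len=2048,        # per-request context cap, newer vLLM releases only
)

outputs = llm.generate(["你好"], SamplingParams(max_tokens=512))
print(outputs[0].outputs[0].text)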

Dependencies

peft                          0.5.0
sentence-transformers         2.2.2
torch                         2.0.1
torchaudio                    0.12.1
torchvision                   0.15.2
transformers                  4.33.2
transformers-stream-generator 0.0.4

Runtime logs or screenshots

First run:

INFO 09-22 17:23:58 llm_engine.py:613] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 98.6%, CPU KV cache usage: 0.0%
INFO 09-22 17:24:03 llm_engine.py:613] Avg prompt throughput: 2642.7 tokens/s, Avg generation throughput: 1.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 100.0%, CPU KV cache usage: 0.0%

Second run; clearing the history partway through makes the GPU KV cache usage drop:

INFO 09-22 18:25:08 llm_engine.py:613] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 91.3%, CPU KV cache usage: 0.0%
INFO 09-22 18:25:13 llm_engine.py:613] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 17.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 100.0%, CPU KV cache usage: 0.0%
INFO 09-22 18:25:19 llm_engine.py:613] Avg prompt throughput: 2640.3 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 100.0%, CPU KV cache usage: 0.0%
BenRood8165290 commented 1 year ago

I found that running Baichuan2 on a V100-32G via vLLM can in practice only support a context window of roughly 2280 tokens (Chinese characters also consume comparatively many tokens). Limiting the total of history + prompt + max_tokens sent to the backend to about 2280 in streamlit-demo/streamlit_gallery/components/chat/streamlit_app.py (dropping the earliest history) basically solves the problem; see the sketch below. Strictly speaking this limit should be enforced on the server side in server.py, but I have not found a way to do that.
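
A minimal sketch of that client-side trimming, assuming the history is a list of {"role": ..., "content": ...} dicts and the model's tokenizer is available locally; the 2280 budget and the trim_history helper name are illustrative, not something the project provides.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/DaTa/.local/home/hai.li/dl/BaiCuan/B2-13b-left_sft", trust_remote_code=True
)

def n_tokens(text):
    return len(tokenizer.encode(text, add_special_tokens=False))

def trim_history(history, prompt, max_tokens, budget=2280):
    # Drop the oldest messages until history + prompt + max_tokens fits the budget.
    available = budget - n_tokens(prompt) - max_tokens
    trimmed = list(history)
    while trimmed and sum(n_tokens(m["content"]) for m in trimmed) > available:
        trimmed.pop(0)
    return trimmed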

liHai001 commented 11 months ago

How can the KV cache be cleared? I am running with streaming, and the KV cache keeps growing until GPU memory is exhausted.

Tendo33 commented 7 months ago

How can the KV cache be cleared? I am running with streaming, and the KV cache keeps growing until GPU memory is exhausted.

Same question here.
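
For what it's worth, vLLM only releases a request's KV cache blocks once that request finishes or is aborted, so a streaming client that disconnects mid-generation can leave blocks allocated until the server aborts the request. Below is a minimal sketch, assuming the backend drives vLLM's AsyncLLMEngine; the request_id handling is illustrative and may differ from what server.py actually does.

import uuid

from vllm import SamplingParams
from vllm.engine.async_llm_engine import AsyncLLMEngine

async def stream_completion(engine: AsyncLLMEngine, prompt: str):
    # Illustrative streaming helper, not the project's actual handler.
    request_id = str(uuid.uuid4())
    try:
        async for output in engine.generate(prompt, SamplingParams(max_tokens=512), request_id):
            yield output.outputs[0].text
    finally:
        # Aborting on early exit (e.g. client disconnect) returns the
        # request's KV cache blocks to the pool.
        await engine.abort(request_id)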