OpenAI-style API for open large language models: use open LLMs just like ChatGPT! Supports LLaMA, LLaMA-2, BLOOM, Falcon, Baichuan, Qwen, Xverse, SqlCoder, CodeLLaMA, ChatGLM, ChatGLM2, ChatGLM3, etc. A unified backend API for open-source large language models.
Apache License 2.0
ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (15248). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine. #259
The following items must be checked before submission
[X] Make sure you are using the latest code from the repository (git pull); some issues have already been addressed and fixed.
[X] I have read the project documentation and the FAQ section and searched the existing issues / discussions without finding a similar problem or solution.
Starting the server with `python server.py` fails with the following error:
ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (15248). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
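For context, the 15248-token figure in the error is roughly the GPU memory left over after loading the model weights, divided by the per-token KV-cache footprint. A minimal sketch of that arithmetic (the layer/head numbers below are illustrative assumptions, not values read from the Qwen1.5-14B config):

```python
def kv_cache_token_capacity(free_bytes: int, num_layers: int,
                            num_kv_heads: int, head_dim: int,
                            dtype_bytes: int = 2) -> int:
    """Tokens that fit in the KV cache: each token stores one key and one
    value vector per layer and per KV head, at dtype_bytes per element."""
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return free_bytes // per_token_bytes

# Toy numbers: 4 GiB free, 40 layers, 40 KV heads, head_dim 128, fp16.
capacity = kv_cache_token_capacity(4 * 1024**3, 40, 40, 128, 2)
print(capacity)  # well below max_model_len=32768, which triggers the ValueError
```

When the computed capacity is smaller than `max_model_len`, vLLM refuses to start, which is exactly the error above.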
Type of problem
Startup command
Operating system
Linux
Detailed description of the problem
The API is deployed with vLLM; the model is qwen1.5-14b-chat. The .env configuration is as follows:

PORT=8000

# model related
MODEL_NAME=qwen
MODEL_PATH=./models/qwen-1.5-14b-chat
PROMPT_NAME=
EMBEDDING_NAME=

# device related
# GPU parallelism strategy
DEVICE_MAP=auto
# number of GPUs
NUM_GPUs=2

# api related
API_PREFIX=/v1

# vllm related
ENGINE=vllm
TRUST_REMOTE_CODE=true
TOKENIZE_MODE=slow
TENSOR_PARALLEL_SIZE=1

# enable half precision to speed up inference and reduce GPU memory usage
DTYPE=half

# API_KEY: any string will do here
OPENAI_API_KEY=
Starting `python server.py` fails with: ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (15248). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine. I looked at some suggested fixes, such as `python server.py --max-model-len 24320`, but they had no effect.
Additionally, I set NUM_GPUs=2, but only one GPU is being used.
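A plausible mitigation, assuming the server forwards these .env keys to vLLM's engine arguments (MAX_MODEL_LEN and GPU_MEMORY_UTILIZATION are hypothetical names inferred from the error message; verify them against the project's .env.example). Note also that vLLM decides how many GPUs to shard across from its tensor-parallel setting, not from a GPU-count variable, so TENSOR_PARALLEL_SIZE=1 would keep the model on a single card regardless of NUM_GPUs:

```shell
# hypothetical .env keys — verify against the project's .env.example
MAX_MODEL_LEN=14000            # must not exceed the KV-cache capacity (15248)
GPU_MEMORY_UTILIZATION=0.95    # let vLLM claim more of each GPU for the KV cache
TENSOR_PARALLEL_SIZE=2         # vLLM shards the model across this many GPUs
```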
Dependencies
peft 0.10.0
sentence-transformers 2.6.1
torch 2.1.2
transformers 4.39.3
transformers-stream-generator 0.0.5
Runtime logs or screenshots