xusenlinzy / api-for-open-llm

OpenAI-style API for open large language models: use open LLMs just like ChatGPT! Supports LLaMA, LLaMA-2, BLOOM, Falcon, Baichuan, Qwen, Xverse, SqlCoder, CodeLLaMA, ChatGLM, ChatGLM2, ChatGLM3, etc. A unified backend interface for open-source large models.
Apache License 2.0

34B model with int4 quantization and vLLM inference on a single 24 GB 4090 GPU reports insufficient GPU memory #239

Closed haohuisss closed 2 months ago

haohuisss commented 4 months ago

The following items must be checked before submission

Type of problem

None

Operating system

Linux

Detailed description of the problem

# Paste the runtime code here (delete the code block if you don't have it)
API_PREFIX=/v1

# device related
DEVICE=cuda
GPUS=0
NUM_GPUs=1
DTYPE=half

# vllm related
ENGINE=vllm
TRUST_REMOTE_CODE=true
TOKENIZE_MODE=slow
TENSOR_PARALLEL_SIZE=1
LOAD_IN_4BIT=true

I'm using the full-precision weights and deploying with LOAD_IN_4BIT=true. In theory, 24 GB of GPU memory should be enough for int4-quantized inference of a 34B model.
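(Rough arithmetic behind that expectation: 34B parameters at 4 bits per weight is about 17 GB for the weights alone, leaving only around 7 GB of a 24 GB card for the KV cache, activations, and runtime overhead.)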

Dependencies

# Please paste the dependencies here

Runtime logs or screenshots

# Please paste the run log here

xusenlinzy commented 4 months ago

vLLM does not support bitsandbytes (bnb) 4-bit on-the-fly quantization.

haohuisss commented 4 months ago

vLLM does not support bitsandbytes (bnb) 4-bit on-the-fly quantization.

Does that mean I need to convert the model weights to an int4 format first and then run vLLM inference through this project in order to fit within 24 GB of GPU memory?

xusenlinzy commented 4 months ago

Yes, you can use GPTQ or AWQ weights.
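For reference, converting full-precision weights into an AWQ int4 checkpoint looks roughly like this with AutoAWQ (a minimal sketch; the paths and quant_config values are illustrative, not taken from this thread):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/data/llm_model/sus-chat-34b"      # hypothetical full-precision checkpoint
quant_path = "/data/llm_model/sus-chat-34b-awq"  # hypothetical output directory

# Typical 4-bit AWQ settings
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Calibrate and quantize, then save an int4 checkpoint that vLLM can load with AWQ quantization
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)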

haohuisss commented 4 months ago

Yes, you can use GPTQ or AWQ weights.

And the project supports loading GPTQ or AWQ checkpoints (with vLLM as the inference engine), right? So I just need to convert the model to GPTQ or AWQ int4 weights and then launch it with this project?

xusenlinzy commented 4 months ago

Yes.

haohuisss commented 4 months ago

Yes.

Hi, I used AutoAWQ to quantize my fine-tuned sus-chat-34b model to int4 and then launched it with this project's vLLM engine, but it still reports running out of GPU memory.

# model related
MODEL_NAME=sus-chat
MODEL_PATH=/data/llm_model
PROMPT_NAME=sus-chat

# api related
API_PREFIX=/v1

# device related
DEVICE=cuda
DEVICE_MAP=
GPUS=5
NUM_GPUs=1

# vllm related
ENGINE=vllm
TRUST_REMOTE_CODE=true
TOKENIZE_MODE=slow
TENSOR_PARALLEL_SIZE=1
# DTYPE=half
LOAD_IN_4BIT=true
GPU_MEMORY_UTILIZATION=1.0
QUANTIZATION_METHOD=awq
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 5.00 GiB. GPU 0 has a total capacty of 23.65 GiB of which 3.83 GiB is free. Process 44539 has 19.81 GiB memory in use. Of the allocated memory 18.19 GiB is allocated by PyTorch, and 1.04 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
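My reading of that log: the ~18 GiB already held by the process is roughly the size of the AWQ int4 34B weights, and the failed 5 GiB allocation is most likely vLLM pre-allocating KV-cache blocks up to GPU_MEMORY_UTILIZATION=1.0 at the model's default context length, which no longer fits on a 24 GB card.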
Tendo33 commented 2 months ago

With a GPTQ weight checkpoint, do I still need to set LOAD_IN_4BIT=true and QUANTIZATION_METHOD=awq in the env file?

haohuisss commented 2 months ago

With a GPTQ weight checkpoint, do I still need to set LOAD_IN_4BIT=true and QUANTIZATION_METHOD=awq in the env file?

I'm not too sure about that since I haven't used GPTQ weights, but the quantization method presumably still needs to be set, so check the available parameters. On my side the 34b-awq-int4 model is now running successfully; you need to cap the context length before deploying, and then it no longer runs out of memory.
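For context, the fix maps onto roughly the following at the vLLM engine level (a minimal sketch using vLLM's own LLM API rather than this project's env file; the checkpoint path, the 4096 context cap, and the 0.90 utilization value are illustrative):

from vllm import LLM, SamplingParams

# Load an AWQ int4 checkpoint with a capped context length so the
# pre-allocated KV cache fits alongside the weights on a 24 GB card.
llm = LLM(
    model="/data/llm_model/sus-chat-34b-awq",  # hypothetical AWQ checkpoint path
    quantization="awq",
    dtype="half",
    max_model_len=4096,            # cap the context length to bound KV-cache size
    gpu_memory_utilization=0.90,   # leave headroom rather than 1.0
)

print(llm.generate(["Hello"], SamplingParams(max_tokens=32)))

For a GPTQ checkpoint the corresponding vLLM-level setting would be quantization="gptq"; how that maps onto this project's LOAD_IN_4BIT and QUANTIZATION_METHOD variables is worth confirming in its documentation.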

Tendo33 commented 2 months ago

So the GPU memory footprint is that of the quantized model, right?

haohuisss commented 2 months ago

So the GPU memory footprint is that of the quantized model, right?

Yes, it's the quantized model, but vLLM allocates GPU memory at startup according to your parameters, and setting the context length too long will exceed the available memory.