PyTorch implementation of paper "Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline". [blog]
We find that large language models (LLMs) have the remarkable ability to perceive the length of their generated responses in advance. Leveraging this LLM ability, we propose a novel technique called Sequence Scheduling to improve the efficiency of LLM batch inference. By grouping queries with similar perceived response lengths together, we significantly reduce redundant computations and achieve an impressive 86% improvement in inference throughput without compromising performance.
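As a rough illustration (this is not the repo's actual code; `schedule` and its arguments are hypothetical), sequence scheduling can be sketched as sorting queries by their perceived response length and batching neighbors together, so that sequences in a batch finish at similar times and little decoding is wasted on padding:

```python
# Minimal sketch of sequence scheduling: group queries with similar
# perceived response lengths into the same batch.

def schedule(queries, perceived_lengths, batch_size):
    """Return batches of query indices, grouped by predicted response length."""
    order = sorted(range(len(queries)), key=lambda i: perceived_lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

queries = ["q0", "q1", "q2", "q3", "q4", "q5"]
perceived = [112, 10, 95, 8, 120, 15]  # predicted token counts per query
batches = schedule(queries, perceived, batch_size=2)
# Short responses are batched together, long ones together:
# [[3, 1], [5, 2], [0, 4]]
```

Within each batch, the longest predicted length bounds the decoding steps, so mixing a 10-token and a 120-token response in one batch is what wastes computation.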
Perception in advance asks the LLM to estimate the length of its response before generating it. LLMs (e.g., ChatGPT) are able to perform this task.
Perceived length: 10, real length: 6.
Perceived length: 112, real length: 119.
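A hedged sketch of what a perception-in-advance style prompt might look like (the exact wording used in the paper and repo may differ; `pia_prompt` is a hypothetical helper): the model is asked to estimate the length of its answer before producing it.

```python
# Illustrative prompt template asking the model to perceive its
# response length in advance, before answering the instruction.

def pia_prompt(instruction):
    return (
        "Before responding to the instruction below, first estimate how many "
        "words your answer will contain, then give the answer.\n\n"
        f"Instruction: {instruction}"
    )

print(pia_prompt("Give three tips for staying healthy."))
```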
Install the required packages.
```shell
pip install -r requirements.txt
```
Get the original LLaMA weights in the Hugging Face format by following the instructions here.
Get the Vicuna-7B weights by following the instructions here.
```shell
python3 -m fastchat.model.apply_delta \
    --base-model-path ./ckpts/llama-7b \
    --target-model-path ./ckpts/vicuna-7b \
    --delta-path lmsys/vicuna-7b-delta-v1.1
```
(Optional) Data Preparation: use the following command to generate `alpaca-train-10k.json` and `alpaca-val-10k.json`, or use the data in the `data` folder directly.

```shell
python3 -m src.sample
```
(Optional) Collect the training dataset: first, use the following command to run inference multiple times on the training dataset, or use `alpaca-train-10k-length.json` directly.

```shell
CUDA_VISIBLE_DEVICES=0 python -m src.lenpred
```
Then, use the following command to construct the training dataset for instruction tuning, or use `alpaca-train-10k-instruct.json` directly.

```shell
python3 -m src.construct
```
Instruction Tuning: use the following command to perform instruction tuning.

```shell
bash train.sh
```
Alternatively, you can download the LoRA weights from Hugging Face into `ckpts/vicuna-response-length-perception-module`:

```shell
git clone https://huggingface.co/zangwei/vicuna-response-length-perception-module ckpts/vicuna-response-length-perception-module
```
(Optional) Evaluation: use the following command to evaluate the length perception performance.
```shell
CUDA_VISIBLE_DEVICES=0 python3 -m src.eval
```
Run the following commands to benchmark the different inference strategies.
```shell
CUDA_VISIBLE_DEVICES=0 python -m src.benchmark --num-data 1024
CUDA_VISIBLE_DEVICES=0 python -m src.benchmark --num-data 1024 --strategy seqsch --vbs --fcr --lora-path ./ckpts/vicuna-response-length-perception-module
CUDA_VISIBLE_DEVICES=0 python -m src.benchmark --num-data 1024 --strategy seqsch --lora-path ./ckpts/vicuna-response-length-perception-module
CUDA_VISIBLE_DEVICES=0 python -m src.benchmark --num-data 1024 --strategy po
CUDA_VISIBLE_DEVICES=0 python -m src.benchmark --num-data 1024 --strategy gt --vbs --fcr
```
```bibtex
@article{zheng2023response,
    title={Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline},
    author={Zangwei Zheng and Xiaozhe Ren and Fuzhao Xue and Yang Luo and Xin Jiang and Yang You},
    journal={arXiv preprint arXiv:2305.13144},
    year={2023}
}
```