🔥 The First 1-bit LLM Quantization Method 🔥
👉🏻 The SOTA Method of 1-bit LLM Quantization (as of 2024.06.20) 👈🏻
😱 Compressing 90%+ Space yet Maintaining 80%+ Performance 😱
🤗 Welcome to Pull Requests and Build our OneBit Model Center 🤗
Please refer to our paper for more details.
[2024/05/20] We released the training code, evaluation code, and data for reproducing our results.
[2024/02/17] We released the preprint paper to share this OneBit discovery.
Please note that due to the use of different checkpoints, the evaluation results listed here differ very slightly from the results in our paper, but this does not affect the existing conclusions.
Method | Wiki2↓ | C4↓ | Winogrande | Hellaswag | PIQA | BoolQ | ARC-e | ARC-c | avg. |
---|---|---|---|---|---|---|---|---|---|
FP16 | 5.68 | 7.08 | 66.85 | 72.99 | 77.37 | 73.21 | 52.53 | 41.38 | 64.06 |
OmniQuant | 15.34 | 26.21 | 52.96 | 43.68 | 62.79 | **58.69** | 41.54 | 29.35 | 48.17 |
OneBit | **10.20** | **11.41** | **58.80** | **51.60** | **67.57** | 56.94 | **42.60** | **30.20** | **51.29** |
Method | Wiki2↓ | C4↓ | Winogrande | Hellaswag | PIQA | BoolQ | ARC-e | ARC-c | avg. |
---|---|---|---|---|---|---|---|---|---|
FP16 | 5.09 | 6.61 | 70.17 | 76.24 | 79.05 | 68.47 | 59.85 | 44.54 | 66.39 |
OmniQuant | 13.43 | 19.33 | 53.83 | 54.16 | 68.99 | 62.20 | **45.50** | 30.38 | 52.51 |
OneBit | **9.19** | **10.26** | **62.12** | **56.75** | **70.35** | **63.82** | 44.57 | **31.83** | **54.91** |
Method | Wiki2↓ | C4↓ | Winogrande | Hellaswag | PIQA | BoolQ | ARC-e | ARC-c | avg. |
---|---|---|---|---|---|---|---|---|---|
FP16 | 5.47 | 6.97 | 67.09 | 72.94 | 76.88 | 71.10 | 53.58 | 40.61 | 63.70 |
OmniQuant | 31.21 | 64.34 | 51.22 | 33.87 | 56.53 | 59.14 | 33.63 | 24.32 | 43.12 |
OneBit | **9.76** | **11.14** | **58.17** | **52.51** | **67.79** | **63.30** | **41.67** | **29.35** | **52.13** |
Method | Wiki2↓ | C4↓ | Winogrande | Hellaswag | PIQA | BoolQ | ARC-e | ARC-c | avg. |
---|---|---|---|---|---|---|---|---|---|
FP16 | 4.88 | 6.47 | 69.77 | 76.62 | 79.05 | 68.99 | 57.95 | 44.20 | 66.10 |
OmniQuant | 16.88 | 27.02 | 53.20 | 50.34 | 62.24 | 62.05 | 40.66 | 29.61 | 49.68 |
OneBit | **8.77** | **10.17** | **61.33** | **56.48** | **69.97** | **64.98** | **42.85** | **33.79** | **54.90** |
The best results among the quantized models are highlighted in bold.
Download and install the dependencies before using the code. Clone this repository to your local machine, enter the code folder, and install it. We suggest installing transformers first and llama_factory afterwards. Please note that PyTorch v2.0.0 is required to quantize the models, whereas the evaluation process does not require it.
git clone https://github.com/xuyuzhuang11/OneBit.git
cd OneBit/transformers
pip install -e .
cd ../llama_factory
pip install -r ../requirements.txt
The quantization process can be divided into 3 steps: building an initial checkpoint, training, and packaging. The packaged checkpoint reduces the occupied space by more than 90% and can be used for inference or deployment. We provide the code for this part in the `scripts` folder.
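For intuition about where this saving comes from, the sketch below is a back-of-the-envelope calculation for a single weight matrix, assuming a OneBit-style decomposition into a ±1 sign matrix plus two FP16 value vectors (as described in the paper); the 4096×4096 shape is only an illustrative assumption.

```python
# Back-of-the-envelope space estimate for one 4096x4096 linear layer, assuming
# a OneBit-style decomposition: a ±1 sign matrix stored at 1 bit per element
# plus two FP16 value vectors. Embeddings and norms stay in FP16, so the
# whole-model saving is somewhat lower than this per-layer figure.
m, n = 4096, 4096                               # out_features, in_features
fp16_bits = 16 * m * n                          # original FP16 weight matrix
onebit_bits = 1 * m * n + 16 * (m + n)          # sign matrix + two value vectors

print(f"average bits per weight: {onebit_bits / (m * n):.3f}")        # ~1.008
print(f"space saved vs FP16: {1 - onebit_bits / fp16_bits:.1%}")       # ~93.7%
```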
First, you can build a start checkpoint in the following recommended way. The start checkpoint will be saved in `/home/me/OneBit/scripts/start_llama_7b`, and `/home/me/llama_7b` is the path of the released model downloaded from the Internet. "llama_7b" is a name that you can set freely.
# example: python scripts/build_start_ckpt.py $NAME $FP16_MODEL_PATH $OUTPUT_CKPT_PATH
python scripts/build_start_ckpt.py llama_7b /home/me/llama_7b /home/me/OneBit/scripts/start_llama_7b
Then, you can use the example scripts we provide to complete the model distillation (knowledge transfer) process. Note that the arguments located at the beginning of the script need to be correctly set before running, as follows:
# arguments
MODEL_NAME=llama_7b
SCRIPT_DIR=$(dirname "$0")  # the scripts folder
DATASET=kd_132k  # our released self-generated data; no need to modify
TEACHER_MODEL_PATH=/home/xyz/models/llama-7b  # path of the FP16 teacher model
STUDENT_MODEL_PATH=$SCRIPT_DIR/ckpt_$MODEL_NAME  # where intermediate checkpoints will be saved
STUDENT_MODEL_START_PATH=$SCRIPT_DIR/start_$MODEL_NAME  # path of the start checkpoint
STUDENT_MODEL_EXIST_PATH=$SCRIPT_DIR/ckpt_$MODEL_NAME/checkpoint-5000  # set this to resume training from an intermediate checkpoint
# using slurm to train
sbatch llama_7b.sh
# using bash to train
bash llama_7b.sh
Finally, you can perform lossless compression and packaging on the final training checkpoint, which significantly reduces the space the model occupies at inference time. Similarly, you need to fill in the checkpoint paths correctly at the beginning of the script.
# arguments
ckpt_path = '/home/me/scripts/onebit_llama_7b_trainckpt' # final training checkpoint
inf_ckpt_path = '/home/me/checkpoints/onebit_llama_7b' # compressed checkpoint to be saved
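To illustrate the idea behind this step (a sketch only, not the repo's actual packaging script, whose layout and metadata format may differ), a ±1 sign matrix can be losslessly packed into one bit per element, e.g. with `numpy.packbits`:

```python
# Illustrative sketch of lossless 1-bit packing of a ±1 sign matrix.
import numpy as np
import torch

def pack_sign_matrix(w_sign: torch.Tensor) -> np.ndarray:
    """Map {-1, +1} entries to bits {0, 1} and pack 8 of them into each byte."""
    bits = (w_sign > 0).to(torch.uint8).cpu().numpy()
    return np.packbits(bits, axis=-1)               # shape: (out, ceil(in / 8))

def unpack_sign_matrix(packed: np.ndarray, in_features: int) -> torch.Tensor:
    """Inverse of pack_sign_matrix: recover the ±1 matrix exactly."""
    bits = np.unpackbits(packed, axis=-1)[..., :in_features]
    return torch.from_numpy(bits.astype(np.float16)) * 2 - 1   # bits -> ±1

w = torch.sign(torch.randn(4096, 4096))             # toy ±1 matrix
packed = pack_sign_matrix(w)                        # ~16x smaller than FP16
restored = unpack_sign_matrix(packed, w.shape[1]).to(w.dtype)
assert torch.equal(restored, w)                     # packing is lossless
```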
We also provide a DeepSpeed configuration file and a hostfile (for training models larger than 7B). Please refer to `ds_config.json` and `hostfile`.
To facilitate accurate reproduction of the experimental results presented in our paper and on this page, we provide the necessary evaluation code in the `evaluation` folder. You only need to fill in the arguments in `lm_eval.py` correctly to run the evaluation process automatically and obtain results.
# arguments
args = {
    "multigpu": False,  # this code supports multi-GPU evaluation
    "batch_size": 32,  # a larger batch size is only for compressed checkpoints :-)
    "net": "onebitllama",  # no need to modify for LLaMA models
    "model_family": "onebitllama",
    "eval_ppl": True,  # perform perplexity evaluation
    "seed": 1234,
    "model": '/home/xyz/checkpoints/onebit_llama_7b',  # path of the model checkpoint, important!
    # "tasks": '',
    "tasks": 'hellaswag,winogrande,piqa,boolq,arc_easy,arc_challenge',
    "num_fewshot": 0,  # zero-shot accuracy
    "limit": -1,
}
Then, you can run it as follows:
cd evaluation
CUDA_VISIBLE_DEVICES=0 python lm_eval.py
We warmly welcome friends from around the world to join us in building and researching extremely low-bit LLMs. 🤗🤗🤗 Let's build the open-source OneBit LLM center together! 🤗🤗🤗 If you want to use the OneBit framework for your own model, you need to define your model in the `transformers` package.
The model folder is `transformers/src/transformers/models`. Our model is in `bitllama`, and you can define your own model architecture as you want, such as `bitcat`, `bitdog`, etc.
If you need to perform efficient inference using compressed checkpoints, you should also define an inference class that recognizes the compressed model weights. In our models, the regular model class and the inference class correspond to `BitLlamaForCausalLM` and `BitLlamaForCausalLMInf`, respectively.
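For example, loading a compressed checkpoint for generation might look like the sketch below; the import path and class export are assumptions based on the folder layout above, so adjust them to match your local fork:

```python
# Minimal sketch of running inference with a compressed OneBit checkpoint.
# The import below assumes BitLlamaForCausalLMInf is exported from the bitllama
# model folder of the local transformers fork; adjust the path if it differs.
import torch
from transformers import AutoTokenizer
from transformers.models.bitllama import BitLlamaForCausalLMInf  # assumed export

ckpt = "/home/me/checkpoints/onebit_llama_7b"   # compressed checkpoint from packaging
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = BitLlamaForCausalLMInf.from_pretrained(ckpt, torch_dtype=torch.float16).cuda()

inputs = tokenizer("OneBit compresses large language models to", return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```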
Moreover, our core component, the 1-bit linear layer, is located at `transformers/src/transformers/models/bitnet.py`; you can refer to it for more details.
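As a rough picture of what such a layer computes, here is an illustrative PyTorch sketch of a OneBit-style 1-bit linear layer, following the decomposition described in the paper (a ±1 sign matrix plus two value vectors); it is a simplified stand-in, not the exact implementation in `bitnet.py`:

```python
# Illustrative OneBit-style 1-bit linear layer: y = ((x * g) @ sign(W)^T) * h + b.
# Simplified sketch only; on disk the sign matrix would be packed to 1 bit per
# element rather than stored as a dense tensor.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneBitLinearSketch(nn.Module):
    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        # ±1 sign matrix (kept dense here for simplicity)
        self.weight_sign = nn.Parameter(torch.sign(torch.randn(out_features, in_features)))
        self.g = nn.Parameter(torch.ones(in_features))    # input-side value vector (FP16 in the real model)
        self.h = nn.Parameter(torch.ones(out_features))   # output-side value vector (FP16 in the real model)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = F.linear(x * self.g, torch.sign(self.weight_sign))  # (x * g) @ sign(W)^T
        y = y * self.h                                          # rescale per output channel
        return y if self.bias is None else y + self.bias

# quick shape check
layer = OneBitLinearSketch(4096, 11008)
print(layer(torch.randn(2, 16, 4096)).shape)   # torch.Size([2, 16, 11008])
```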
This code and these checkpoints are released under the MIT license, allowing you to use them freely within its terms. However, please do not forget to cite the work as follows.
If you find this work useful, please consider citing:
@article{xu2024onebit,
  title={OneBit: Towards Extremely Low-bit Large Language Models},
  author={Xu, Yuzhuang and Han, Xu and Yang, Zonghan and Wang, Shuo and Zhu, Qingfu and Liu, Zhiyuan and Liu, Weidong and Che, Wanxiang},
  journal={arXiv preprint arXiv:2402.11295},
  year={2024}
}
Our code is developed based on the transformers and llama_factory repositories. We would like to thank them for their very important and meaningful work. 🙏🤗