Description

Recently, llama.cpp introduced importance matrix (imatrix)-aware quantization, which yields further improvements in PPL. Before quantization, the importance matrices are computed with the `imatrix` tool. We use the Chinese word segmentation training data PKU and iterate over 100 batches to obtain the imatrix.

During quantization, pass the generated imatrix file via `--imatrix` to enable imatrix-aware quantization. Note that the process takes longer than quantization without an imatrix.

Currently, we have converted all available models (K-quants only). You can download them directly from our Hugging Face model hub. Model names with the `-im` suffix denote the newly converted imatrix-aware models. These models can be used directly without further action.

The following are several PPL benchmarks. Generally speaking, imatrix-quantized models are better, but not always.
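As a concrete sketch of the two steps above, the commands below show one way to run them with the llama.cpp tools. The model and data filenames are illustrative placeholders, and flag names follow the llama.cpp binaries at the time of writing and may change between versions:

```shell
# Step 1: compute the importance matrix over the calibration data
# (here, a file with the PKU segmentation text), limiting the run
# to 100 chunks.
./imatrix -m chinese-alpaca-2-7b-f16.gguf -f pku-calibration.txt \
          -o imatrix.dat --chunks 100

# Step 2: quantize with --imatrix to produce an im-aware K-quant model.
./quantize --imatrix imatrix.dat \
           chinese-alpaca-2-7b-f16.gguf \
           chinese-alpaca-2-7b-q4_k_m-im.gguf Q4_K_M
```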
Benchmarked models:

- Chinese-Alpaca-2-7B-RLHF-GGUF (`-im`)
- Chinese-LLaMA-2-13B-GGUF (`-im`)

Related Issue
None.