Description

Recently, llama.cpp introduced importance matrix (imatrix)-aware quantization, which yields further improvements in PPL. Before quantization, the importance matrices are computed with the `imatrix` tool. We use the Chinese word segmentation training data PKU and iterate over 100 batches to obtain the imatrix.

During quantization, pass the generated imatrix file via `--imatrix` to enable imatrix-aware quantization. Note that the process takes longer than quantization without an imatrix.

Currently, we have converted all available models (K-quants only). You can download them directly from our Hugging Face model hub. Model names with the `-im` suffix denote the newly converted imatrix-aware models. These models can be used directly without further action.

The following are several PPL benchmarks. Generally speaking, imatrix-quantized models are better, but not always.
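As a concrete sketch of the two steps above, the commands below show one way to run them with the llama.cpp tools. The model and data filenames are illustrative placeholders, and flag names follow the llama.cpp binaries at the time of writing and may change between versions:

```shell
# Step 1: compute the importance matrix over the calibration data
# (here, a file with the PKU segmentation text), limiting the run
# to 100 chunks.
./imatrix -m chinese-alpaca-2-7b-f16.gguf -f pku-calibration.txt \
          -o imatrix.dat --chunks 100

# Step 2: quantize with --imatrix to produce an im-aware K-quant model.
./quantize --imatrix imatrix.dat \
           chinese-alpaca-2-7b-f16.gguf \
           chinese-alpaca-2-7b-q4_k_m-im.gguf Q4_K_M
```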
Benchmarked models:

- Chinese-Alpaca-2-7B-RLHF-GGUF (`-im`)
- Chinese-LLaMA-2-13B-GGUF (`-im`)

Related Issue
None.