BaohaoLiao closed this issue 7 months ago
Thanks for pointing this out. This happens because LoftQ/Llama-2-7b-hf-bit4-rank64 uses a self-implemented NF4 quantization method that is not exactly the same as the NF4 quantization in bitsandbytes. To fix it, please try LoftQ/Llama-2-7b-hf-4bit-64rank instead.
Moreover, our method is not constrained to a specific quantization method. Either the self-implemented one or the one in bitsandbytes can achieve on-par results.
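To see concretely why two NF4 implementations can disagree, here is a minimal blockwise fake-NF4 quantizer sketch in NumPy. The 16-level table is the published NF4 code from the QLoRA paper; the block size of 64 and plain absmax scaling are assumptions for illustration (the real bitsandbytes kernel additionally double-quantizes the absmax constants), so this is a sketch, not the library's implementation. Any variant that derives its levels or scaling even slightly differently will round some weights to different values.

```python
import numpy as np

# The 16 NF4 levels from the QLoRA paper (the table bitsandbytes uses).
NF4_LEVELS = np.array([
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
])

def nf4_fake_quantize(w: np.ndarray, block_size: int = 64) -> np.ndarray:
    """Blockwise fake NF4: scale each block into [-1, 1] by its absmax,
    round to the nearest NF4 level, then rescale back to float."""
    flat = w.reshape(-1, block_size)
    absmax = np.abs(flat).max(axis=1, keepdims=True)
    absmax = np.where(absmax == 0, 1.0, absmax)   # avoid division by zero
    normalized = flat / absmax
    idx = np.abs(normalized[..., None] - NF4_LEVELS).argmin(axis=-1)
    return (NF4_LEVELS[idx] * absmax).reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 64)).astype(np.float32)
q = nf4_fake_quantize(w)
print("max |W - Q|:", np.abs(w - q).max())
```

Note that this quantizer is idempotent: quantizing `q` again returns `q` unchanged, which is exactly the property that breaks when a *different* NF4 implementation re-quantizes it.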
Thank you for this clarification.
I understand your method is not limited to any particular quantization function. However, you still use bitsandbytes as the backend for memory-efficient fine-tuning. If you use a custom quantization (like the self-implemented NF4), doesn't that introduce a mismatch, since the quantization functions differ between fine-tuning and the custom LoRA initialization?
Say you obtain a perfect LoRA initialization W = Q + AB, where Q = self_implemented_nf4(W). When you then fine-tune with bitsandbytes, Q_new = bitsandbytes_nf4(Q), so W is no longer equal to Q_new + AB.
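This mismatch is easy to demonstrate with a toy sketch. The two grids below are hypothetical stand-ins for the self-implemented and bitsandbytes NF4 level sets (the real tables differ only slightly, but any difference has the same effect): re-quantizing Q with the *same* grid is a no-op, while re-quantizing with a *different* grid moves Q, breaking W = Q + AB.

```python
import numpy as np

def quantize_to_grid(w, levels):
    """Round each entry of w to the nearest value in `levels`."""
    levels = np.asarray(levels)
    idx = np.abs(w[..., None] - levels).argmin(axis=-1)
    return levels[idx]

# Two hypothetical 4-bit grids standing in for the two NF4 implementations.
grid_a = np.linspace(-1.0, 1.0, 16)          # "self-implemented" levels
grid_b = np.linspace(-1.0, 1.0, 16) + 0.01   # "bitsandbytes" levels (shifted)

rng = np.random.default_rng(0)
w = rng.uniform(-1, 1, size=256)

q = quantize_to_grid(w, grid_a)        # Q used for the LoRA init: W = Q + AB
q_same = quantize_to_grid(q, grid_a)   # same quantizer: Q is a fixed point
q_other = quantize_to_grid(q, grid_b)  # different quantizer: Q moves

print("same-grid drift: ", np.abs(q - q_same).max())
print("cross-grid drift:", np.abs(q - q_other).max())
```

The same-grid drift is exactly zero, while the cross-grid drift is nonzero, which is precisely the Q_new ≠ Q situation described above.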
In addition, may I ask what the default T is for llama?
LoftQ/Llama-2-7b-hf-4bit-64rank is quantized with the bitsandbytes method and has no discrepancy between true and fake quantization. The default alternating step T for Llama-2 is 5.
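For context, the alternating step T can be sketched as follows. This is a NumPy illustration of a LoftQ-style alternation, with a simple uniform absmax quantizer standing in for NF4; `loftq_init`, `fake_quant`, and the parameters here are illustrative, not the released code. Each round re-quantizes the residual W - AB and then refits A, B by a rank-r SVD of W - Q.

```python
import numpy as np

def fake_quant(w, bits=4):
    """Simple symmetric absmax uniform quantizer (stand-in for NF4)."""
    s = np.abs(w).max()
    if s == 0:
        return w
    n = 2 ** (bits - 1) - 1
    return np.round(w / s * n) / n * s

def loftq_init(w, rank=8, T=5, bits=4):
    """LoftQ-style alternating init (assumes T >= 1):
    Q_t = quant(W - A B);  A, B = rank-r SVD of (W - Q_t)."""
    a = np.zeros((w.shape[0], rank))
    b = np.zeros((rank, w.shape[1]))
    for _ in range(T):
        q = fake_quant(w - a @ b, bits)
        u, s, vt = np.linalg.svd(w - q, full_matrices=False)
        a = u[:, :rank] * s[:rank]   # A = U_r diag(S_r)
        b = vt[:rank]                # B = V_r^T
    return q, a, b

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
q1, a1, b1 = loftq_init(w, rank=8, T=1)
q5, a5, b5 = loftq_init(w, rank=8, T=5)
err1 = np.linalg.norm(w - q1 - a1 @ b1)
err5 = np.linalg.norm(w - q5 - a5 @ b5)
print(f"T=1 residual {err1:.4f}, T=5 residual {err5:.4f}")
```

The rank-r correction always leaves a smaller residual than Q alone, and in practice the residual shrinks further as T grows, which is why a small default like T=5 suffices.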
Hi,
As a debugging step, I want to check whether the fake- and true-quantized models' weights have the same values. Here is how I implement it:
Then I print out some weight values as:
print(loftq_fp16.state_dict()['model.layers.0.self_attn.q_proj.weight'])
The output is:
For loftq_fp4, I do it in this way:
The output is:
We can see they are quite different, which means the fake quantization does not truly reflect the behavior of the true quantization.