yuhuixu1993 / qa-lora

Official PyTorch implementation of QA-LoRA
MIT License

Merging problem #35

Open samuelqy opened 4 months ago

samuelqy commented 4 months ago

I'm very confused about the merging step. In Appendix B, the proof is solid; however, there is no guarantee that the new matrix B is in integer format. In standard linear quantization, zeros are represented as integers; you can't force the 'qzeros' to be a floating-point matrix. If I misunderstood, how do you do it? Thanks
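
(For concreteness, here is a minimal sketch of the merge being discussed, assuming group-wise asymmetric quantization `W ≈ scales * (W_q - zeros)` and a LoRA branch whose input is average-pooled per group, as in QA-LoRA. The names and sizes are illustrative, not taken from the repo; the point is that folding the LoRA product into the zeros reproduces the original output, but the new zeros are floating-point in general.)

```python
import torch

torch.manual_seed(0)
D_in, D_out, L, r = 64, 32, 16, 4      # toy sizes; L is the quantization group size
G = D_in // L                           # number of groups along the input dimension

W_q    = torch.randint(0, 16, (D_out, D_in)).float()   # INT4 codes
scales = torch.rand(D_out, G) * 0.1 + 0.01              # per-(row, group) scales
zeros  = torch.randint(0, 16, (D_out, G)).float()       # integer qzeros before merging

A = torch.randn(r, G) * 0.02            # LoRA A acts on the group-pooled input
B = torch.randn(D_out, r) * 0.02        # LoRA B

x = torch.randn(D_in)
pooled = x.view(G, L).mean(dim=1)       # per-group average pooling of the input

expand = lambda t: t.repeat_interleave(L, dim=1)        # broadcast group values to columns

# Before merging: dequantized weight plus the LoRA branch.
W_deq = (W_q - expand(zeros)) * expand(scales)
y_ref = W_deq @ x + (B @ A) @ pooled

# Merge: fold the LoRA product into the zeros. The new zeros are float in general.
new_zeros = zeros - (B @ A) / (L * scales)
y_merged  = ((W_q - expand(new_zeros)) * expand(scales)) @ x

print(torch.allclose(y_ref, y_merged, atol=1e-5))       # True: outputs match after merging
print((new_zeros == new_zeros.round()).all().item())    # False: the merged zeros are not integers
```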

xxw11 commented 4 months ago

Hi, you can use this version of the AutoGPTQ code: https://github.com/xxw11/AutoGPTQ_QALoRA. We've made adjustments to the original quantization process so that the quantized qzeros will be in floating-point format.

If you encounter any issues, please feel free to reach out.

samuelqy commented 4 months ago

Thanks for replying, but is it possible to make the new matrix B an integer matrix?

LuletterSoul commented 4 months ago

@xxw11 Hello, I believe that storing qzeros in floating-point format does not lead to improved hardware inference performance. During dequantization, the weights have already been quantized to INT4, and they need to be converted to fp16 format to be combined with the qzeros. Unless the qzeros are initially quantized to INT4, the paper has overlooked this step.

@yuhuixu1993 What do you think ?

yuhuixu1993 commented 4 months ago

> @xxw11 Hello, I believe that storing qzeros in floating-point format does not lead to improved hardware inference performance. During dequantization, the weights have already been quantized to INT4, and they need to be converted to fp16 format to be combined with the qzeros. Unless the qzeros are initially quantized to INT4, the paper has overlooked this step.
>
> @yuhuixu1993 What do you think?

Hi @LuletterSoul, the inference efficiency is not related to the qzeros. The insight of our paper is that our tuned models are still in INT4, which can be inferenced efficiently, while the tuned models of other LoRA-based methods are in fp16. Even though weight-only quantization needs to be dequantized during inference, it is still faster than fp16 models because of the so-called memory-I/O bottleneck of LLMs. Besides, kernels such as Marlin have been released that make weight-only quantization extremely fast.
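
(A rough back-of-envelope for the memory-I/O point, with assumed, illustrative numbers rather than measurements:)

```python
# Rough memory-traffic estimate for single-stream decoding (assumed numbers).
params = 7e9                           # e.g. a 7B-parameter model
bytes_per_token_fp16 = params * 2.0    # fp16 weights read once per decoded token (~14 GB)
bytes_per_token_int4 = params * 0.5    # int4 weights (+ small scale/zero overhead) (~3.5 GB)
hbm_bandwidth = 2e12                   # ~2 TB/s, an assumed accelerator memory bandwidth

print(f"fp16 weight traffic: {bytes_per_token_fp16 / hbm_bandwidth * 1e3:.1f} ms / token")
print(f"int4 weight traffic: {bytes_per_token_int4 / hbm_bandwidth * 1e3:.1f} ms / token")
# Decoding is dominated by reading the weights, so even with on-the-fly
# dequantization the int4 path has roughly 4x less memory traffic to hide.
```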

LuletterSoul commented 4 months ago

@yuhuixu1993 I apologize, but I may have misunderstood your question. I agree that QA-LoRA can help alleviate memory-bound issues. As you mentioned, the QA-LoRA weights are in INT4 format, and weight-only quantization requires dequantizing the weights to FP16 for inference, but the qzeros are in the floating-point domain while the weights are INT4. According to the dequantization formula (W_s4 - qzeros) * scales, I'm unsure how to perform the subtraction between W_s4 and the qzeros. One approach I can think of is either quantizing the qzeros to S4 or dequantizing W_s4 to FP16. The former may compromise accuracy, as the fine-tuning process did not introduce quantization error for the qzeros, while the latter may sacrifice performance due to the higher cost of floating-point operations.
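
(A small sketch of the two options being weighed here, with illustrative shapes and names not taken from the repo. In a typical weight-only kernel the subtraction happens after the INT4 codes are cast to fp16/fp32, so keeping the zeros in float leaves the dequantization arithmetic unchanged, while rounding the merged zeros back to INT4 adds an error the fine-tuning never saw:)

```python
import torch

torch.manual_seed(0)
D_out, G, L = 32, 4, 16
W_q    = torch.randint(0, 16, (D_out, G * L)).float()   # INT4 codes
scales = torch.rand(D_out, G) * 0.1 + 0.01
qzeros = torch.rand(D_out, G) * 15.0                     # float zeros after the QA-LoRA merge

expand = lambda t: t.repeat_interleave(L, dim=1)

# Keep the zeros in float: the subtraction already happens after the codes are
# cast to fp16/fp32, so the cost matches an integer-zero dequantization.
W_float_zero = (W_q - expand(qzeros)) * expand(scales)

# Round the zeros back to INT4 first: simpler packing, but it introduces an
# extra quantization error that the fine-tuning never compensated for.
W_int_zero = (W_q - expand(qzeros.round().clamp(0, 15))) * expand(scales)

print(f"max extra error from rounding the zeros: {(W_int_zero - W_float_zero).abs().max():.4f}")
```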

yuhuixu1993 commented 4 months ago

> @yuhuixu1993 I apologize, but I may have misunderstood your question. I agree that QA-LoRA can help alleviate memory-bound issues. As you mentioned, the QA-LoRA weights are in INT4 format, and weight-only quantization requires dequantizing the weights to FP16 for inference, but the qzeros are in the floating-point domain while the weights are INT4. According to the dequantization formula (W_s4 - qzeros) * scales, I'm unsure how to perform the subtraction between W_s4 and the qzeros. One approach I can think of is either quantizing the qzeros to S4 or dequantizing W_s4 to FP16. The former may compromise accuracy, as the fine-tuning process did not introduce quantization error for the qzeros, while the latter may sacrifice performance due to the higher cost of floating-point operations.

@LuletterSoul, weight-only quantization needs to be dequantized during inference no matter whether the qzeros are int or float. By the way, the scales of the quantized weights are float; in the original GPTQ code, the zeros are also float.