yuhuixu1993 / qa-lora

Official PyTorch implementation of QA-LoRA
MIT License

Merging problem #35

Open samuelqy opened 5 months ago

samuelqy commented 5 months ago

I'm very confused about the merging step. In Appendix B the proof is solid, but there is no guarantee that the new matrix B is in integer format. In standard linear quantization the zeros are represented as integers; you can't force the 'qzeros' to be a floating-point matrix. If I misunderstood, how do you do it then? Thanks
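To make my confusion concrete, here is how I read the merge (a toy sketch in my own notation, assuming group-wise quantization W ≈ scales * (W_q - qzeros) and QA-LoRA's group-averaged LoRA input; not the repo's code). The merged qzeros come out floating-point, which is exactly what puzzles me:

```python
import torch

D_out, D_in, L, r, s = 8, 16, 4, 2, 1.0              # group size L, LoRA rank r, LoRA scale s
G = D_in // L                                         # number of quantization groups

W_q    = torch.randint(0, 16, (D_out, D_in)).float()  # INT4 codes (kept as float for this toy example)
scales = torch.rand(D_out, G) + 0.5                   # per-group scales
qzeros = torch.randint(0, 16, (D_out, G)).float()     # original integer zero points
B      = torch.randn(D_out, r)                        # LoRA matrices; A acts on group-pooled inputs
A      = torch.randn(r, G)

x      = torch.randn(D_in)
x_pool = x.view(G, L).mean(dim=1)                     # group-wise average of the input

# Before merging: dequantized weight plus the LoRA branch.
W_deq = scales.repeat_interleave(L, dim=1) * (W_q - qzeros.repeat_interleave(L, dim=1))
y_ref = W_deq @ x + s * (B @ A) @ x_pool

# Merge: fold the LoRA product into the zero points. The result is floating point
# in general, which is why the merged qzeros cannot stay integers.
qzeros_new = qzeros - s * (B @ A) / (L * scales)
W_merged = scales.repeat_interleave(L, dim=1) * (W_q - qzeros_new.repeat_interleave(L, dim=1))
y_merged = W_merged @ x

print(torch.allclose(y_ref, y_merged, atol=1e-4))     # True: same output, but zeros are now float
```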

xxw11 commented 5 months ago

Hi, you can use this version of the auto_gptq code: https://github.com/xxw11/AutoGPTQ_QALoRA. We've adjusted the original quantization process so that the quantized qzeros are kept in floating-point format.
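Conceptually, the change amounts to not rounding the zero point when computing the quantization parameters. A simplified illustration (my own sketch, not the exact diff in that repository):

```python
import torch

def quantize_group(w: torch.Tensor, bits: int = 4):
    """Asymmetric quantization of one weight group; dequantize with scale * (w_q - zero)."""
    qmax = 2 ** bits - 1
    scale = (w.max() - w.min()).clamp(min=1e-8) / qmax
    # A standard GPTQ-style quantizer would round the zero point to an integer,
    # e.g. zero = torch.round(-w.min() / scale); keeping it unrounded is what allows
    # the QA-LoRA update to be folded into the zeros after fine-tuning.
    zero = -w.min() / scale
    w_q = torch.clamp(torch.round(w / scale + zero), 0, qmax)
    return w_q, scale, zero
```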

If you encounter any issues, please feel free to reach out.

samuelqy commented 5 months ago

Thanks for replying, but is it possible to make the new matrix B an integer matrix?

LuletterSoul commented 5 months ago

@xxw11 Hello, I believe that storing qzeros in floating-point format does not lead to improved hardware inference performance. During dequantization, the weights have already been quantized to INT4, and they need to be converted to FP16 to be combined with the qzeros. Unless the qzeros are themselves quantized to INT4, the paper seems to have overlooked this step.

@yuhuixu1993 What do you think ?

yuhuixu1993 commented 5 months ago

Hi @LuletterSoul, the inference efficiency is not related to the qzeros. The insight of our paper is that our tuned models are still in INT4, so they can be inferenced efficiently, while the tuned models of other LoRA-based methods are in FP16. Even though weight-only quantization needs to be dequantized during inference, it is still faster than FP16 models because of the so-called memory-I/O bottleneck of LLMs. Besides, kernels such as Marlin have been released that make weight-only quantization much faster.

LuletterSoul commented 5 months ago

@yuhuixu1993 I apologize, I may have misunderstood your question. I agree that QA-LoRA helps alleviate memory-bound issues. As you mentioned, the QA-LoRA weights are in INT4 format and weight-only quantization requires dequantizing them to FP16 for inference, but the qzeros are in the floating-point domain while the weights are INT4. Given the dequantization formula (W_s4 - qzeros) * scales, I'm unsure how to perform the subtraction between W_s4 and qzeros. I can think of two approaches: quantizing the qzeros to S4, or dequantizing W_s4 to FP16. The former may compromise accuracy, since the fine-tuning process never accounted for quantization error in the qzeros, while the latter may sacrifice performance due to the higher cost of floating-point operations.
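To illustrate the two options with toy tensors (a rough sketch, not AutoGPTQ code):

```python
import torch

w_s4   = torch.randint(0, 16, (4, 8)).to(torch.int8)   # unpacked INT4 codes
scales = torch.rand(4, 1) + 0.5                         # per-row scales (fp16 in the real kernels)
qzeros = torch.rand(4, 1) * 15                          # merged zeros, now floating point

# Option 1: re-quantize the merged zeros back to INT4. This introduces extra rounding
# error, because fine-tuning never accounted for quantizing the zeros.
qzeros_s4 = torch.clamp(torch.round(qzeros), 0, 15).to(torch.int8)
w_opt1 = (w_s4 - qzeros_s4).float() * scales

# Option 2: cast the INT4 codes to the floating-point compute dtype and subtract the
# float zeros directly, i.e. (W_s4 - qzeros) * scales with the cast made explicit.
w_opt2 = (w_s4.float() - qzeros) * scales
```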

yuhuixu1993 commented 5 months ago

@LuletterSoul, weight-only quantization needs to be dequantized during inference no matter whether the qzeros are int or float. By the way, the scales of the quantized weights are float anyway. In the original GPTQ code, the zeros are also float.

freeSoul-SNU commented 2 weeks ago

@xxw11 Thank you for sharing the related code. I have some questions about it.

1. Changes to CUDA files: In the modified qlinear_cuda_old.py, the forward method of QuantLinear calls self.autogptq_cuda.vecquant*matmul depending on the bit width. Is it okay not to modify these CUDA functions?

```python
if self.bits == 2:
    self.autogptq_cuda.vecquant2matmul(x.float(), self.qweight, out, self.scales.float(), self.qzeros, self.g_idx)
elif self.bits == 3:
    self.autogptq_cuda.vecquant3matmul(x.float(), self.qweight, out, self.scales.float(), self.qzeros, self.g_idx)
elif self.bits == 4:
    self.autogptq_cuda.vecquant4matmul(x.half(), self.qweight, out.half(), self.scales, self.qzeros, self.g_idx, self.infeatures // 2)
elif self.bits == 8:
    self.autogptq_cuda.vecquant8matmul(x.float(), self.qweight, out, self.scales.float(), self.qzeros, self.g_idx)
else:
```

2. Modification of qlinear_cuda.py: The shared repository only modifies qlinear_cuda_old.py; is it okay not to modify the similar qlinear_cuda.py as well?

xxw11 commented 2 weeks ago

Hello, this repository primarily focuses on modifying the data types in the GPTQ quantization algorithm. The original QA-LoRA code path doesn't go through these CUDA files, so they weren't modified. A comprehensive modification would require changing the forward passes of all three backends: CUDA, Triton, and PyTorch.
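For reference, the dequantization in the PyTorch fallback path roughly amounts to the following (a simplified sketch in my own notation, not the actual qlinear_cuda_old.py code). Keeping the zeros in floating point changes how they are stored and unpacked, not this arithmetic:

```python
import torch

def dequantize(w_q: torch.Tensor,       # (in_features, out_features) unpacked INT4 codes
               scales: torch.Tensor,    # (n_groups, out_features), fp16
               zeros: torch.Tensor,     # (n_groups, out_features); fp16 if qzeros stay float
               g_idx: torch.Tensor):    # (in_features,) group index of each input row
    # The codes are cast to the compute dtype before the subtraction, so storing the
    # zeros in floating point affects the packing/unpacking, not this subtraction.
    w = w_q.to(scales.dtype)
    return (w - zeros[g_idx]) * scales[g_idx]
```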

freeSoul-SNU commented 2 weeks ago

@xxw11 Thank you for your response. I checked the shared Git repository, and it seems to be code for keeping the zeros as floats in GPTQ.

However, if I fine-tune with QA-LoRA and then merge the adapter into the qzeros for model inference, wouldn't I also need to modify the CUDA files called by the forward function in GPTQ? Could I get some guidance on how to perform inference using the merged parameters?

freeSoul-SNU commented 2 weeks ago

@xxw11 I have another question.

If I use the version of AutoGPTQ that you shared, do I also need to apply these changes to the auto-gptq installed in my Python environment in order to perform quantization?

> Change the peft_utils.py in your own auto-gptq path (python path/auto_gptq/utils/peft_utils.py) with the new one. For the users of [GPTQLORA](https://github.com/qwopqwop200/gptqlora), you only need to change the peft_utils.py file.

When I tried to quantize the LLaMA-7B model with AutoGPTQ_QALoRA, the error below occurred and the quantization didn't proceed.

```
2024-10-29 15:29:34 INFO [auto_gptq.modeling._base] Start quantizing layer 1/32
2024-10-29 15:29:35 INFO [auto_gptq.modeling._base] Quantizing self_attn.k_proj in layer 1/32...
2024-10-29 15:29:36 INFO [auto_gptq.quantization.gptq] duration: 1.2955830097198486
2024-10-29 15:29:36 INFO [auto_gptq.quantization.gptq] avg loss: 765204325466112.0
2024-10-29 15:29:36 INFO [auto_gptq.modeling._base] Quantizing self_attn.v_proj in layer 1/32...
2024-10-29 15:29:38 INFO [auto_gptq.quantization.gptq] duration: 1.7773809432983398
2024-10-29 15:29:38 INFO [auto_gptq.quantization.gptq] avg loss: 766996199243776.0
2024-10-29 15:29:38 INFO [auto_gptq.modeling._base] Quantizing self_attn.q_proj in layer 1/32...
2024-10-29 15:29:40 INFO [auto_gptq.quantization.gptq] duration: 1.6986031532287598
2024-10-29 15:29:40 INFO [auto_gptq.quantization.gptq] avg loss: 764688862281728.0
2024-10-29 15:29:40 INFO [auto_gptq.modeling._base] Quantizing self_attn.o_proj in layer 1/32...
Traceback (most recent call last):
  File "quant_with_alpaca.py", line 178, in <module>
    main()
  File "quant_with_alpaca.py", line 121, in main
    model.quantize(
  File "/home/***/.conda/envs/qalora/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/***/.conda/envs/qalora/lib/python3.8/site-packages/auto_gptq/modeling/_base.py", line 361, in quantize
    scale, zero, g_idx = gptq[name].fasterquant(
  File "/home/***/.conda/envs/qalora/lib/python3.8/site-packages/auto_gptq/quantization/gptq.py", line 94, in fasterquant
    H = torch.linalg.cholesky(H)
torch._C._LinAlgError: linalg.cholesky: The factorization could not be completed because the input is not positive-definite (the leading minor of order 1 is not positive-definite).
```
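For context, the failing line is the Cholesky factorization of the layer Hessian in fasterquant. A simplified sketch of that step (based on the reference GPTQ code, not the exact auto_gptq source):

```python
import torch

def prepare_hessian(H: torch.Tensor, percdamp: float = 0.01) -> torch.Tensor:
    # The Hessian diagonal is dampened relative to its mean before factorization.
    damp = percdamp * torch.mean(torch.diag(H))
    idx = torch.arange(H.shape[0], device=H.device)
    H[idx, idx] += damp                     # H <- H + damp * I
    return torch.linalg.cholesky(H)         # raises if H is still not positive-definite
```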