yuhuixu1993 / qa-lora

Official PyTorch implementation of QA-LoRA

Equation, algorithm and experimental results question #9

Closed · MarsJacobs closed this issue 8 months ago

MarsJacobs commented 1 year ago

Thanks for sharing this great work! I learned a lot.

[Screenshot 2023-10-03 7:01 PM: the inline equation from the paper]

This equation appears inline at the very bottom of page 3 of the paper, in the sentence starting with "The computation". I'm curious why $\beta$ is not multiplied by $x$ in this equation. What am I missing?

Additionally, regarding the code linked below and merge_with_quantization in Algorithm 1, could you please explain more clearly why beta_new is calculated by subtracting s * (lora_B @ lora_A).transpose(0,1) from beta?

https://github.com/yuhuixu1993/qa-lora/blob/7470e48247522f2ad47edc602d3fe70991b778c7/merge.py#L12
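
For context, here is a minimal shape sketch of the line I am asking about (the PEFT-style shapes for lora_A and lora_B are my assumption; this is a paraphrase, not the actual merge.py code):

```python
import torch

# Hypothetical sizes: D_out output channels, L input groups, LoRA rank r.
D_out, L, r = 6, 4, 2
s = 0.5
beta = torch.randn(L, D_out)   # group-wise zero factors, one row per input group
lora_A = torch.randn(r, L)     # LoRA A acts on the group-pooled input
lora_B = torch.randn(D_out, r)

# (lora_B @ lora_A) has shape (D_out, L); transposing gives (L, D_out),
# which lines up with beta, so the LoRA product can be folded into it.
beta_new = beta - s * (lora_B @ lora_A).transpose(0, 1)
print(beta_new.shape)          # torch.Size([4, 6])
```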

yuhuixu1993 commented 1 year ago

First, about the equation: you are right, the equation in the paper is wrong. I will revise it in the new submission.

Second, some simple matrix computation explains the new beta; I will also add it to the appendix. A brief explanation: beta can be treated as a matrix whose entries are shared group-wise and multiplied by x. From another point of view, the entries of x within a group share the same weight, so we avg-pool (or sum) x and feed the result to the LoRA path. In the LoRA path, groups of x likewise share the same weight, so the LoRA product A*B is equivalent to a new beta / group-size (or a new beta, depending on whether you pool by averaging or by summing). If you work through a small example, you will see it.
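
To make this concrete, here is a small toy check (the shapes, the sum-pooling, and the plain additive convention are illustrative assumptions, not the exact merge.py code): folding s * (B @ A).t() into a group-wise shift of the weight reproduces the LoRA path applied to the group-summed input.

```python
import torch

torch.manual_seed(0)
D_in, D_out, r, L = 8, 6, 2, 4        # L groups along the input dimension
gs = D_in // L                        # group size

x = torch.randn(D_in)
W = torch.randn(D_in, D_out)          # stands in for the dequantized weight
A = torch.randn(r, L)                 # LoRA A acts on the pooled (L-dim) input
B = torch.randn(D_out, r)
s = 0.5

# QA-LoRA forward: base path plus LoRA on the group-summed input
x_pooled = x.view(L, gs).sum(dim=1)
y_lora = W.t() @ x + s * (B @ A) @ x_pooled

# Equivalent view: add s * (B @ A).t()[g, :] to every row of W in group g,
# i.e. a shift that is shared group-wise -- exactly what a new beta can absorb.
delta = s * (B @ A).t()                             # shape (L, D_out)
W_merged = W + delta.repeat_interleave(gs, dim=0)   # broadcast each group row
y_merged = W_merged.t() @ x

print(torch.allclose(y_lora, y_merged, atol=1e-5))  # True
```

Whether this shared shift then appears as an addition to beta or a subtraction from it only depends on the sign convention used when dequantizing (e.g. scale * (q - zero)).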

Thanks for pointing out my mistake!

MarsJacobs commented 1 year ago

Thank you for the prompt and kind response. To delve deeper into my second question, I was asking for further clarification on why a subtraction is applied when calculating beta_new from the LoRA weights. Even a brief explanation would greatly help me grasp the core idea of the paper.

I also have another question about the experimental results. Currently, QA-LoRA is compared with QLoRA + PTQ (GPTQ) at the same bit precision, but the group size used when applying GPTQ seems to be missing. For a fair comparison, shouldn't the GPTQ applied to QLoRA use the same group size as QA-LoRA at each bit precision? From my own experience, using too small a group size in GPTQ led to unstable convergence. Given that, wouldn't it be better to apply the state-of-the-art PTQ method AWQ instead?

Lastly, regarding the insight into QA-LoRA in Section 3.3, it is mentioned that the number of quantization parameters in QLoRA is $D_{out}$ pairs of scaling and zero factors. Strictly speaking, doesn't QLoRA inherently apply a group size of 64 at the kernel level? Therefore, I believe it would be more accurate to take the group size into account when counting the quantization parameters, rather than basing QLoRA's count on per-output-channel PTQ granularity.
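
As a rough illustration of what I mean by counting (a hypothetical 4096x4096 layer; QLoRA's double quantization is ignored here):

```python
# Hypothetical 4096x4096 linear layer.
D_in, D_out, block = 4096, 4096, 64
per_output_channel = D_out                   # one (scale, zero) pair per output channel
per_block_of_64 = (D_in * D_out) // block    # one quantization constant per block of 64 weights
print(per_output_channel, per_block_of_64)   # 4096 vs 262144
```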

yuhuixu1993 commented 12 months ago

@MarsJacobs, I have updated the paper and attached a simple proof in the appendix.