Closed · MarsJacobs closed this 8 months ago
Thanks for sharing this great work! I learned a lot.
This equation appears inline at the very bottom of page 3 of the paper, in the sentence starting with "The computation". I'm curious why $\beta$ is not multiplied by $x$ in this equation. What am I missing?
Additionally, in this code and in `merge_with_quantization` from Algorithm 1, could you please provide a clearer explanation of why `beta_new` is calculated by subtracting `s * (lora_B @ lora_A).transpose(0,1)` from `beta`?
https://github.com/yuhuixu1993/qa-lora/blob/7470e48247522f2ad47edc602d3fe70991b778c7/merge.py#L12
First, about the equation: you are right, the equation in the paper is wrong. I will revise it in the new submission.
Second, some simple matrix computation explains the new beta; I will add it to the appendix, but here is a brief explanation. The term $\beta$ can be treated as a matrix whose weights are shared group-wise and multiplied by $x$; equivalently, each group of entries in $x$ shares the same weight. That is why we avg-pool (or sum) $x$ within each group and feed the pooled result to the LoRA path, where, again, each group of $x$ shares the same weight. The LoRA product $BA$ is therefore equivalent to a new $\beta$ (or $\beta$ divided by the group size, if average pooling is used), so it can be folded into $\beta$ directly. If you work through a small numerical example, you will see this.
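For example, the following small sketch (toy sizes, the dequantization convention $\tilde{W} = \alpha\,(q - \beta)$, and sum pooling are assumptions for illustration, not the exact code in merge.py) checks numerically that folding the LoRA product into the zero points reproduces the two-path output:

```python
import torch

torch.manual_seed(0)

# Toy sizes and conventions, chosen for illustration only.
D_in, D_out, L, r = 8, 3, 4, 2          # L = quantization group size
G = D_in // L                            # number of groups along the input dimension
s = 0.5                                  # LoRA scaling factor

q     = torch.randint(0, 16, (D_in, D_out)).float()   # fake INT4 codes
alpha = torch.rand(G, D_out) + 0.1                     # group-wise scales
beta  = torch.rand(G, D_out)                           # group-wise zeros

A = torch.randn(r, G)                    # LoRA A acts on the group-pooled input
B = torch.randn(D_out, r)

x = torch.randn(D_in)
x_pooled = x.reshape(G, L).sum(dim=1)    # sum-pool x within each group
                                         # (avg-pooling would add a 1/L factor below)

expand = lambda p: p.repeat_interleave(L, dim=0)       # (G, D_out) -> (D_in, D_out)

# Two-path computation: dequantized base weight + LoRA side path on the pooled input.
W_deq = expand(alpha) * (q - expand(beta))
y_ref = x @ W_deq + s * (B @ A @ x_pooled)

# Merged computation: fold the LoRA product into the zero points.
# The subtraction appears because dequantization is alpha * (q - beta);
# dividing by alpha converts the LoRA delta into zero-point units.
beta_new = beta - s * (B @ A).transpose(0, 1) / alpha
y_merged = x @ (expand(alpha) * (q - expand(beta_new)))

print(torch.allclose(y_ref, y_merged, atol=1e-5))      # True
```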
Thanks for pointing out my mistake!
Thank you for the prompt and kind response. To delve deeper into my second question: I was asking for additional clarification on why a subtraction is applied when calculating `beta_new` from the LoRA weight product. Even a brief explanation would be a great help in grasping the core idea of the paper.
I also have another question regarding the experimental results. Currently, QA-LoRA is compared against QLoRA + PTQ (GPTQ) at the same bit precision, but the group size used when applying GPTQ seems to be missing. For a fair comparison, shouldn't the GPTQ applied to QLoRA use the same group size as QA-LoRA at each bit precision? To share my own experience, I found that using too small a group size in GPTQ led to unstable convergence. In that case, wouldn't it be better to use the state-of-the-art PTQ method, AWQ, as the PTQ technique?
Lastly, in Section 3.3 on the insight behind QA-LoRA, it is mentioned that the number of quantization parameters of QLoRA is the $D_{out}$ pairs of scaling and zero factors. Strictly speaking, doesn't QLoRA inherently apply a group size of 64 at the kernel level? If so, I believe it would be more accurate to account for the group size when counting the quantization parameters, rather than basing QLoRA's count on per-output-channel PTQ granularity.
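To make the two counts concrete (taking $D_{in} = D_{out} = 4096$ purely as an assumed example size for a 7B-scale projection layer): counting per output channel gives $D_{out} = 4096$ scale/zero pairs, whereas counting per block of 64 weight entries, as the kernel actually stores them, gives $D_{in} \times D_{out} / 64 = 262{,}144$ blocks, each with its own statistics, which is a very different parameter budget.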
@MarsJacobs, I have updated the paper and attached a simple proof in the appendix.