Closed StiphyJay closed 7 months ago
Thanks for your interest in our work. We don't merge the LoRA adapters into the quantized LLM. When doing inference, we compute $QX$ and $ABX$ separately and add them: $Y = QX + ABX$. We don't do $Y = (Q + AB) X$.
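A minimal NumPy sketch of this split (not the repository's actual code; the symmetric per-tensor INT8 quantization and all shapes here are illustrative assumptions): the base weight stays as an INT8 tensor `Q` with a Float16 scale, the low-rank factors `A` and `B` stay in Float16, and the two branches are computed separately and summed.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 8, 2

# Hypothetical base weight, quantized per-tensor symmetric to INT8
W = rng.standard_normal((d_out, d_in)).astype(np.float16)
scale = np.float16(np.abs(W).max() / 127.0)
Q = np.round(W.astype(np.float32) / np.float32(scale)).astype(np.int8)  # INT8

# LoRA factors and activations stay in Float16
A = (rng.standard_normal((d_out, r)) * 0.01).astype(np.float16)
B = (rng.standard_normal((r, d_in)) * 0.01).astype(np.float16)
X = rng.standard_normal((d_in, 4)).astype(np.float16)

# Y = QX + ABX: dequantize Q on the fly, keep the two branches separate
base = (Q.astype(np.float16) * scale) @ X   # quantized-base branch
lora = A @ (B @ X)                          # Float16 low-rank branch
Y = base + lora

# Merging would force a dequantized weight: (Q + AB) cannot stay INT8,
# but numerically the two orderings agree.
Y_merged = (Q.astype(np.float16) * scale + A @ B) @ X
```

This shows why the adapters are not folded into `Q`: the sum `Q + AB` is no longer representable in INT8, so keeping the branches separate preserves the quantized storage.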
Therefore, during inference, the AB matrix is in Float16, while the Q matrix is in INT8/INT4, and X is in Float16, right?
Yes, you are right.
Thanks for your help.
Thanks for your great work!
After reading your paper and code, I have a question: How do you merge LoRA weights to quantized LLM for inference?
Looking forward to your reply!
Regards!