Closed: XpracticeYSKM closed this issue 9 months ago
Hi, thanks for your interest in our project. We DO save GPU memory. Fake quantization is just an option for people who have sufficient GPU resources and want to fine-tune their models faster.
If you download the weights from the HuggingFace Hub (LoftQ) and pass load_in_4bit=True, you will get the same quantized backbone as QLoRA. If you can't find your target model on the Hub, you can run quantize.py, which saves the LoftQ-initialized weights to your local machine; you can then load the quantized weights from that local path. I understand this is not straightforward, so we are working on integrating LoftQ into the PEFT package and will release it soon.
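The loading step described above can be sketched as follows, assuming the standard transformers/bitsandbytes API; the repo id below is a placeholder, not a confirmed LoftQ model name.

```python
# Sketch: load a 4-bit NF4 backbone the same way QLoRA does.
# The repo id "LoftQ/<your-model>" is a placeholder -- substitute the
# actual Hub repo or your local path produced by quantize.py.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4 bits on load
    bnb_4bit_quant_type="nf4",              # NormalFloat4, as in QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "LoftQ/<your-model>",  # placeholder repo id or local path
    quantization_config=bnb_config,
)
```

Loading this way, the fp16/fp32 checkpoint on disk is converted to 4-bit weights as it moves onto the GPU, which is where the memory saving comes from.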
Also keep in mind that bitsandbytes does not yet support saving and loading 4-bit weights, so we have to save the weights in fp16/fp32 and convert them to 4 bits when loading them onto GPUs.
Thanks for your detailed answer. LoftQ is an improvement to QLoRA's initialization, so it can use the same quantized NF4 backbone as QLoRA. But I see the quantization bit settings include NF4 and NF2. For NF2, how do you save memory?
We provide two options: (1) use NF4 to realize NF2, which, unfortunately, does not halve the memory compared to NF4; (2) use the self-implemented QLinearLR (see Line 17), which does save memory but can be a bit slow.
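To see why option (1) does not save memory, here is a minimal sketch of 2-bit fake quantization: each weight is snapped to one of 4 levels, but the result is still stored in fp32, so the tensor occupies exactly as much memory as before. The level grid below is an arbitrary placeholder, not LoftQ's actual NF2 grid.

```python
import numpy as np

# Placeholder 2-bit level grid (4 values). LoftQ's real NF2 grid differs.
LEVELS = np.array([-1.0, -0.33, 0.33, 1.0], dtype=np.float32)

def fake_quantize_2bit(w: np.ndarray) -> np.ndarray:
    """Snap each weight to its nearest 2-bit level, but keep fp32 storage."""
    scale = np.abs(w).max()                              # per-tensor absmax scale
    idx = np.abs(w[..., None] / scale - LEVELS).argmin(axis=-1)
    return (LEVELS[idx] * scale).astype(np.float32)      # dequantized, still fp32

w = np.random.randn(4, 4).astype(np.float32)
wq = fake_quantize_2bit(w)

# At most 4 distinct values (2 bits of information per weight) ...
assert np.unique(wq).size <= 4
# ... yet the storage footprint is unchanged: this is why fake
# quantization does not save GPU memory.
assert wq.nbytes == w.nbytes
```

A real NF2 kernel would instead pack four 2-bit indices per byte and dequantize on the fly inside the matmul, which is what a custom layer like QLinearLR has to do to actually reduce memory.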
I will close this issue. If you have further questions, please feel free to re-open it.
It seems that you use NF fake quantization, so I guess you can't save GPU memory the way QLoRA does. Am I right?