yxli2123 / LoftQ


About the GPU memory #1

Closed. XpracticeYSKM closed this issue 9 months ago.

XpracticeYSKM commented 10 months ago

It seems that you use NF fake quantization, so I guess you can't save GPU memory the way QLoRA does. Am I right?

yxli2123 commented 10 months ago

Hi, thanks for your interest in our project. We DO save GPU memory. Fake quantization is just an option for people who have sufficient GPU resources and want to fine-tune their models faster.

If you download the weights from the HuggingFace Hub (LoftQ) and pass load_in_4bit=True, you will receive the same quantized backbone as QLoRA. If you can't find your target model on the Hub, you can run quantize.py, which saves the LoftQ-processed weights on your local machine; you can then load the quantized weights from that local path. I understand this isn't straightforward, so we are working on integrating LoftQ into the PEFT package and will release it soon.
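For reference, a minimal sketch of what that loading path can look like with transformers and peft; the Hub model id and the adapter subfolder name below are assumptions for illustration, not necessarily the exact names published on the Hub:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# Real 4-bit (NF4) quantization config, i.e. the same backbone format QLoRA uses.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Hypothetical Hub id of a LoftQ-quantized backbone (assumption).
model_id = "LoftQ/Llama-2-7b-hf-4bit-64rank"

base = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach the LoftQ-initialized LoRA adapters; the subfolder name is an assumption.
model = PeftModel.from_pretrained(
    base, model_id, subfolder="loftq_init", is_trainable=True
)
```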

Also keep in mind that bitsandbytes does not yet support saving and loading 4-bit weights, so we have to save the weights in fp16/fp32 and convert them to 4 bits when loading them onto the GPU.
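As an illustration of that on-the-fly conversion, here is a small sketch using bitsandbytes' Linear4bit, assuming the saved checkpoint is an ordinary fp16 linear layer; the quantization to NF4 happens when the module is moved to the GPU:

```python
import torch
import bitsandbytes as bnb

# A plain fp16 layer stands in for the saved fp16/fp32 checkpoint.
fp16_linear = torch.nn.Linear(4096, 4096, bias=False, dtype=torch.float16)

# A 4-bit NF4 layer; its weight starts out holding the fp16 data on CPU.
layer = bnb.nn.Linear4bit(
    4096, 4096, bias=False,
    compute_dtype=torch.bfloat16,
    quant_type="nf4",
)
layer.weight = bnb.nn.Params4bit(
    fp16_linear.weight.data, requires_grad=False, quant_type="nf4"
)

# Quantization to 4 bits happens here, as the weight is moved onto the GPU.
layer = layer.cuda()
```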

XpracticeYSKM commented 10 months ago

Thanks for your detailed answer. LoftQ is an improvement to the initialization of QLoRA, so it can use the same quantized NF4 backbone as QLoRA. But I see the quantization bit settings include both NF4 and NF2. For NF2, how can you save memory?

yxli2123 commented 10 months ago

We provide two options: (1) use NF4 storage to realize NF2, which, unfortunately, does not halve the memory compared to NF4; (2) use the self-implemented QLinearLR in Line 17, which does save the memory but can be a bit slow.
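To illustrate the second option, the sketch below shows the general idea of a 2-bit linear layer with low-rank adapters: the frozen backbone is packed four weights per byte and only the small LoRA factors stay in fp16. The class name, packing scheme, and interface here are illustrative assumptions, not the repository's actual QLinearLR implementation:

```python
import torch
import torch.nn as nn


class TwoBitLinearLR(nn.Module):
    """Illustrative 2-bit linear layer with low-rank adapters (not the repo's QLinearLR)."""

    def __init__(self, packed_weight, codebook, scales, rank, in_f, out_f):
        super().__init__()
        # Frozen backbone: four 2-bit codes per byte, a 4-entry codebook,
        # and per-row scales (all layout choices are assumptions).
        self.register_buffer("packed_weight", packed_weight)  # (out_f, in_f // 4), uint8
        self.register_buffer("codebook", codebook)            # (4,), fp16
        self.register_buffer("scales", scales)                # (out_f, 1), fp16
        # Trainable low-rank factors in fp16 (the LoftQ-initialized part).
        self.lora_A = nn.Parameter(torch.zeros(rank, in_f, dtype=torch.float16))
        self.lora_B = nn.Parameter(torch.zeros(out_f, rank, dtype=torch.float16))
        self.in_f, self.out_f = in_f, out_f

    def dequantize(self):
        # Unpack the 2-bit codes: each byte stores four codes in its bit pairs.
        shifts = torch.tensor([0, 2, 4, 6], device=self.packed_weight.device)
        codes = (self.packed_weight.unsqueeze(-1) >> shifts) & 0b11  # (out_f, in_f // 4, 4)
        codes = codes.reshape(self.out_f, self.in_f).long()
        return self.codebook[codes] * self.scales                    # transient fp16 weight

    def forward(self, x):
        # x is assumed to be fp16; the backbone is dequantized on the fly each call.
        w = self.dequantize()
        return x @ w.t() + (x @ self.lora_A.t()) @ self.lora_B.t()
```

The memory saving comes from keeping the frozen backbone packed at 2 bits per weight; the per-forward dequantization is also why this path can be slower.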

yxli2123 commented 9 months ago

I will close this issue. If you have further questions, please feel free to re-open it.