yxli2123 / LoftQ


Why are base weights on HF LoftQ models in 16-bit? #26

Open RonanKMcGovern opened 5 months ago

RonanKMcGovern commented 5 months ago

The script quantize_save_load.py generates a quantized model with LoRA adapters.

The base model is then saved and uploaded to LoftQ repos such as this one.

I'm puzzled as to why the base model weights are in 16 bits there, because that implies the base model is somehow upcast (dequantized) in the quantize_save_load.py script, but I don't see that anywhere.

My baseline expectation is that either:

a) the backbone would be stored in NF4 and then loaded with the 16-bit adapters on top, or
b) the backbone would be upcast to 16 bits and then re-quantized to NF4 upon loading, with the 16-bit adapters on top. [But then there should be some upcasting code in quantize_save_load.py, roughly along the lines of the sketch below.]

Could someone clarify? Thanks.
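
For concreteness, here is a rough sketch of what I would expect that upcasting step to look like (my own illustration, not code from the repo; it assumes bitsandbytes >= 0.41 on a CUDA device, uses dequantize_4bit and Linear4bit from bitsandbytes, and the helper name dequantize_backbone is made up):

```python
import torch
import bitsandbytes as bnb
import bitsandbytes.functional as F


def dequantize_backbone(model):
    """Collect 16-bit copies of the NF4 backbone weights of a 4-bit model."""
    fp16_state = {}
    for name, module in model.named_modules():
        if isinstance(module, bnb.nn.Linear4bit):
            # module.weight holds the packed NF4 data plus its quantization state.
            w = F.dequantize_4bit(module.weight.data, module.weight.quant_state)
            fp16_state[f"{name}.weight"] = w.to(torch.bfloat16).cpu()
    return fp16_state
```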

yxli2123 commented 5 months ago

Hi @RonanKMcGovern. In older versions of bitsandbytes it was not possible to save weights in NF4 format. We worked around this by saving the weights in 16 bits on disk and quantizing them to 4 bits when loading them onto the GPU. Now that bitsandbytes has been updated, it is possible to save weights in NF4 format. We will update the code soon.
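
For reference, the load path described above looks roughly like this (a minimal sketch using the standard transformers / peft / bitsandbytes APIs; the repo id and the loftq_init subfolder name are placeholders, so adjust them to whichever LoftQ checkpoint you actually use):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

MODEL_ID = "LoftQ/Mistral-7B-v0.1-4bit-64rank"  # placeholder: any LoftQ Hub repo

# The 16-bit backbone stored on the Hub is quantized to NF4 here, at load time.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)

# The 16-bit LoRA adapters from the same repo sit on top of the 4-bit backbone.
model = PeftModel.from_pretrained(
    base, MODEL_ID, subfolder="loftq_init", is_trainable=True
)
```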

RonanKMcGovern commented 5 months ago

OK, but are you actually running the NF4 quantization then?

Or are you just saving the bf16 weights directly? If so, there will be an error when the model is reloaded, because the saved bf16 weights should be the dequantized weights, not the originals...
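
To spell out the concern, here is a rough sketch (my own illustration, assuming a CUDA device and the bitsandbytes.functional quantize_4bit / dequantize_4bit API; the low-rank A and B are arbitrary stand-ins for a LoftQ correction). Re-quantizing the dequantized backbone should reproduce it, while re-quantizing the original weights gives a different backbone than the one the adapters were fitted against:

```python
import torch
import bitsandbytes.functional as F

torch.manual_seed(0)
W = torch.randn(1024, 1024, device="cuda")  # fp32 for a clean comparison

# Stand-in for a LoftQ step: the backbone is quantize(W - A @ B), which is
# generally NOT the same thing as quantize(W).
A = 0.25 * torch.randn(1024, 16, device="cuda")
B = 0.25 * torch.randn(16, 1024, device="cuda")
q, state = F.quantize_4bit(W - A @ B, quant_type="nf4")
backbone = F.dequantize_4bit(q, state)  # what should be saved in 16 bits

# Re-quantizing the saved (dequantized) backbone on load reproduces it.
q2, state2 = F.quantize_4bit(backbone, quant_type="nf4")
print((F.dequantize_4bit(q2, state2) - backbone).abs().max())  # expected: ~0

# Re-quantizing the *original* W instead yields a different backbone, so the
# saved 16-bit adapters no longer match the 4-bit weights they sit on.
q3, state3 = F.quantize_4bit(W, quant_type="nf4")
print((F.dequantize_4bit(q3, state3) - backbone).abs().max())  # clearly nonzero
```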

Something seems off to me, because even a single LoftQ iteration should improve results, yet I see results getting worse with 1 or more iterations (see this vid), as does kaitchup.substack.com.