Open EidosL opened 3 months ago
I'll try it! Can you give me more information about exl2?~
Thanks for the reply. Here is the repo of the library: https://github.com/turboderp/exllamav2
The Exllama v2 format is relatively new and people just have not really seen the benefits yet. In theory, it should be able to produce better-quality quantizations of models by allocating more bits per layer where they are needed the most. That's how you get a fractional bits-per-weight rating like 2.3 or 2.4, instead of the whole-number q3 or q4 of llama.cpp GGUF models.
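To make the fractional rating concrete, here is a minimal sketch of how a non-integer average bits-per-weight falls out of mixed per-layer precision. The layer names, sizes, and bit choices below are purely illustrative assumptions, not exllamav2's actual measurement pass:

```python
# Hypothetical per-layer bit allocation: (name, weight count, bits per weight).
# Sensitive layers get more bits, robust layers get fewer.
layers = [
    ("attention", 7_000_000, 4),
    ("mlp",       14_000_000, 2),
    ("embedding", 3_000_000, 3),
]

total_bits = sum(n * bits for _, n, bits in layers)
total_weights = sum(n for _, n, _ in layers)
avg_bpw = total_bits / total_weights

# The average lands between the integer bit widths used above.
print(f"average bpw: {avg_bpw:.2f}")  # → average bpw: 2.71
```

The whole-model rating (e.g. "2.4 bpw") is just this weighted average taken over every quantized tensor.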
According to Turboderp (the author of Exllama/Exllamav2), there is very little perplexity difference at 4.0 bpw and higher compared to full fp16 model precision. It's hard to make an apples-to-apples comparison of the different quantization methods (GPTQ, GGUF, AWQ, and exl2), but in theory, being smart about where you allocate your precious bits should improve the model's precision.
As you have discovered, one of the amazing benefits of exl2 is that you can run a 70B model on a single 3090 or 4090 card.
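A quick back-of-the-envelope check shows why that fits. This sketch assumes a 2.4 bpw quantization and counts only the weights (KV cache, activations, and framework overhead would come on top):

```python
# Rough VRAM estimate for the weights of a 70B-parameter model at 2.4 bpw.
params = 70e9       # parameter count
bpw = 2.4           # assumed average bits per weight after exl2 quantization
weight_bytes = params * bpw / 8  # 8 bits per byte

print(f"{weight_bytes / 1e9:.1f} GB")  # → 21.0 GB
```

21 GB of weights leaves a little headroom on a 24 GB 3090/4090, which is why low-bpw exl2 quants are popular for single-card 70B inference.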
It seems a bit difficult; I will need to work on the kernel and rewrite xbot.cpp before supporting it~ In the meantime, you can try this first: https://hf-mirror.com/Kooten/MiquMaid-v1-70B-IQ2-GGUF/tree/main
Thanks