polarathene opened 5 months ago
Model: https://huggingface.co/microsoft/Phi-3-vision-128k-instruct

It seems to assume it's an Orion model arch? I see there is a Phi3 arch already supported, but I'm not sure how that differs with a vision model variant. There doesn't seem to be a CLI option to retry with the Phi 3 support:

https://github.com/xhedit/quantkit/blob/0463293d9f15ea68e94191d3f281907b0abc85e2/quantkit/cli.py#L34-L47
Right now, vision models are generally not well supported in the various quantization methods. GGUF support for Phi3V is still being worked out: https://github.com/ggerganov/llama.cpp/pull/7705. Nothing prevents quantization from working on the LLM portion of these models, but there is a lack of skilled developer time, and most of the quantization libraries are community projects without serious corporate support.
Other queries with GGUF subcommand
https://github.com/xhedit/quantkit/blob/0463293d9f15ea68e94191d3f281907b0abc85e2/quantkit/cli.py#L39
Mentions f32 in the description, but not f16? I assume it's a similar concern, just not as much extra memory? Is the default u8/b16?
Llama.cpp/GGUF quantization requires the unquantized model weights to be converted to a GGUF file first and then quantized from that (with one exception: it is possible to quantize directly to Q8_0). These descriptions were written before llama.cpp added bf16 support; this flag was added to deal with the fact that bf16-native models lose precision when converted to fp16 (but not fp32). There is now code that converts bf16 models to a bf16 GGUF, so you shouldn't need the flag in most cases. A 70B model requires 280GB of disk space in fp32.
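As a quick sanity check on those sizes, here is some illustrative arithmetic only (dense weights, GGUF metadata/tokenizer overhead ignored):

```python
# Rough disk-size estimate for an unquantized GGUF: dense weights only,
# metadata/tokenizer overhead ignored (illustrative arithmetic).
BYTES_PER_PARAM = {"f32": 4.0, "f16": 2.0, "bf16": 2.0, "q8_0": 1.0625}  # q8_0 ~ 8.5 bits/weight

def model_size_gb(n_params: float, dtype: str) -> float:
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in ("f32", "bf16", "f16", "q8_0"):
    print(f"70B in {dtype}: ~{model_size_gb(70e9, dtype):.0f} GB")
# f32: ~280 GB (the figure above), bf16/f16: ~140 GB, q8_0: ~74 GB
```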
I've seen some GGUF models on HF with an imatrix.dat, I assume that's the pretrained one, or a separate artifact produced during the conversion? How do you identify when built-in is valid?
Creating GGUF quants with imatrix requires a calibration dataset; the built-in-imatrix flag uses calibration data from exllamav2 to generate imatrix.dat, which is a necessary artifact for creating an imatrix GGUF. Some GGUF quantizers upload that file to HF along with their quants, and it's possible to download and use it instead of generating one with the included calibration data. The code here does support imatrix, but without a llama-cpp-conv build compiled with hardware acceleration it is very slow. Wheels for various platforms are available on the llama-cpp-conv GitHub (https://github.com/xhedit/llama-cpp-conv), but they require manual installation.
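For reference, a rough sketch of what the imatrix step amounts to underneath, calling llama.cpp's imatrix tool on the unquantized GGUF plus a calibration text file. File names here are placeholders and the exact binary name/flags vary between llama.cpp versions; quantkit normally runs this step for you via llama-cpp-conv:

```python
import subprocess

# Sketch only: generate imatrix.dat from the unquantized GGUF plus a
# calibration text file. Flags follow current llama.cpp conventions and
# may differ between versions.
subprocess.run([
    "./llama-imatrix",
    "-m", "model-f16.gguf",    # unquantized GGUF from the convert step (placeholder name)
    "-f", "calibration.txt",   # calibration dataset (quantkit bundles exllamav2's)
    "-o", "imatrix.dat",       # the intermediate artifact discussed above
    "-ngl", "99",              # GPU offload; needs a hardware-accelerated build
], check=True)
```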
https://github.com/xhedit/quantkit/blob/0463293d9f15ea68e94191d3f281907b0abc85e2/quantkit/cli.py#L43
Is this option redundant/ignored if not using either of the earlier imatrix options?
* I see a README example that uses it with `--built-in` (_is that bundling the imatrix into the GGUF? Is the `imatrix.dat` file an alternative where that data is separate?_)
* In your example you set 200 layers, while I've noticed the 7-8B models I've tried previously have 32 layers. Does specifying more layers here affect that in some way, or is it like a `llama.cpp` setting where it sets a layer maximum but won't add/allocate any more than the memory for those 32 layers?
What is built in here is the calibration dataset used to create imatrix.dat, an intermediate artifact used while creating imatrix-enabled GGUFs. The imatrix itself has no use outside of creating imatrix-quantized GGUFs and can't be used at runtime at all. Yes, 200 layers was chosen because it is enough to fully offload any currently released model; there is no problem with specifying more than the model's actual layer count, as the number is just passed to llama.cpp's imatrix binary during that part of the process. As mentioned in my previous answer, you need llama-cpp-conv with hardware support for offloading to work at all.
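In other words, the layer count only caps how much can be offloaded; a toy illustration (not quantkit's actual code):

```python
# Toy illustration (not quantkit's actual code): the effective offload is
# capped by the model's own layer count, so over-asking is harmless.
def effective_offload(model_layers: int, n_gpu_layers: int) -> int:
    return min(model_layers, n_gpu_layers)

print(effective_offload(32, 200))  # 7-8B model -> 32, the whole model
print(effective_offload(80, 200))  # 70B model  -> 80, still fully offloaded
```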
https://github.com/xhedit/quantkit/blob/0463293d9f15ea68e94191d3f281907b0abc85e2/quantkit/cli.py#L38
I've not tried quantkit on a compatible model yet, and this isn't touched on in your README examples. What is the benefit of keeping the intermediate files? Can it benefit alternative conversions / quants, or is it fairly limited in where it can minimize conversion time/resources?
This option keeps the converted and unquantized GGUF that was generated as part of the quantization process. Sometimes you see people uploading fp16/bf16/fp32 GGUF models to HF, either for use or for debugging. Mac users with a lot of unified RAM might be interested.
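The practical benefit is that a kept intermediate GGUF lets you produce additional quant types later without redoing the safetensors-to-GGUF conversion. A rough sketch using llama.cpp's llama-quantize directly (file names are placeholders; the exact binary name/arguments may differ between llama.cpp versions):

```python
import subprocess

# Sketch only: turn one kept intermediate GGUF into several quant types
# without repeating the safetensors -> GGUF conversion.
for qtype in ("Q4_K_M", "Q5_K_M", "Q6_K"):
    subprocess.run([
        "./llama-quantize",
        "model-f16.gguf",       # the kept unquantized intermediate
        f"model-{qtype}.gguf",  # output file for this quant type
        qtype,
    ], check=True)
```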
Hardware requirements
In the README Hardware Requirements you have 7B models with 24GB vRAM. Do you know if the memory usage required to perform the conversion would be higher than it is to run the quantized model?
I definitely cannot run 7B .safetensors on an 8GB 4060, except when using the HF transformers loader with options like load-in-4-bit + use_double_quant with float type nf4 🤔 which is meant to be a way to get the benefits of quantization by converting at runtime instead of to separate formats like GGUF.

I'm not sure if the conversion process is the same as that feature (or if you would know, since I think quantkit is providing a unified CLI to delegate to different backends for conversion?).
This depends on the quantization method. Yes, quantkit is just a frontend for the various backends (AutoAWQ, AutoGPTQ, Exllamav2, HQQ, llama.cpp) and how each handles its own quantization process. GGUF quantization can be done on CPU (though imatrix is EXTREMELY slow there; expect it to take several days for larger models), while AutoAWQ and AutoGPTQ require the entire unquantized model to fit in VRAM. Exllamav2 should work if it can fit the individual tensor / matrix being operated on in VRAM.
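For contrast with offline GGUF quantization, the runtime path mentioned in the question (HF transformers + bitsandbytes NF4) quantizes weights at load time and writes nothing to disk; a minimal sketch with a placeholder model id:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Runtime 4-bit (NF4) quantization via bitsandbytes: weights are quantized
# as they are loaded into VRAM, and no new artifact is written to disk,
# unlike an offline GGUF/AWQ/GPTQ quant.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-7b-model",  # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
```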