xhedit / quantkit

cli tool to quantize gguf, gptq, awq, hqq and exl2 models

Support for `Phi-3-vision-128k-instruct` #5

Open · polarathene opened this issue 3 months ago

polarathene commented 3 months ago

Model: https://huggingface.co/microsoft/Phi-3-vision-128k-instruct

$ mkdir -p /models/phi-3-vision && cd /models/phi-3-vision
$ git clone https://huggingface.co/microsoft/Phi-3-vision-128k-instruct .
$ quantkit . Q4_K_M --output Phi-3-vision-128k-instruct-Q4_K_M.gguf

Traceback (most recent call last):
  File "/usr/local/bin/quantkit", line 5, in <module>
    from quantkit.cli import main
  File "/usr/local/lib/python3.10/dist-packages/quantkit/cli.py", line 4, in <module>
    from quantkit.quantkit import run_download, run_safetensor, run_gguf, run_awq, run_gptq, run_exl2, run_hqq
  File "/usr/local/lib/python3.10/dist-packages/quantkit/quantkit.py", line 13, in <module>
    from quantkit.convert_hf import do_gguf_conversion as do_gguf_conversion_hf
  File "/usr/local/lib/python3.10/dist-packages/quantkit/convert_hf.py", line 790, in <module>
    class OrionModel(Model):
  File "/usr/local/lib/python3.10/dist-packages/quantkit/convert_hf.py", line 791, in OrionModel
    model_arch = gguf.MODEL_ARCH.ORION
  File "/usr/lib/python3.10/enum.py", line 437, in __getattr__
    raise AttributeError(name) from None
AttributeError: ORION

It seems to assume it's an Orion model arch? I see there is a Phi3 arch already supported, but I'm not sure how that differs for a vision model variant.

There doesn't seem to be a CLI option to make it retry with the Phi 3 support:

https://github.com/xhedit/quantkit/blob/0463293d9f15ea68e94191d3f281907b0abc85e2/quantkit/cli.py#L34-L47

polarathene commented 3 months ago

Other queries with GGUF subcommand

https://github.com/xhedit/quantkit/blob/0463293d9f15ea68e94191d3f281907b0abc85e2/quantkit/cli.py#L39

The description mentions f32, but not f16? I assume it's a similar concern, just with less extra memory? Is the default u8/b16?

https://github.com/xhedit/quantkit/blob/0463293d9f15ea68e94191d3f281907b0abc85e2/quantkit/cli.py#L40-L41

I've seen some GGUF models on HF with an imatrix.dat; I assume that's the pretrained one, or a separate artifact produced during the conversion? How do you identify when the built-in one is valid?

https://github.com/xhedit/quantkit/blob/0463293d9f15ea68e94191d3f281907b0abc85e2/quantkit/cli.py#L43

Is this option redundant/ignored if not using either of the earlier imatrix options?

https://github.com/xhedit/quantkit/blob/0463293d9f15ea68e94191d3f281907b0abc85e2/quantkit/cli.py#L38

I've not tried quantkit on a compatible model yet, and this isn't touched on in your README examples. What is the benefit of keeping the intermediate files? Can it benefit alternative conversions/quants, or is it fairly limited in where it can reduce conversion time/resources?

Hardware requirements

In the README Hardware Requirements you list 7B models with 24GB VRAM. Do you know if the memory usage required to perform the conversion would be higher than it is to run the quantized model?

I definitely cannot run a 7B .safetensors model on an 8GB 4060, except when using the HF transformers loader with options like load-in-4-bit + use_double_quant with float type nf4 🤔, which is meant to be a way to get the benefits of quantization by converting at runtime instead of to a separate format like GGUF.

I'm not sure if the conversion process is the same as that feature (or if you would know, since I think quantkit is providing a unified CLI that delegates to different backends for conversion?).

xhedit commented 3 months ago

Right now, vision models are generally not well supported by the various quantization methods. GGUF support for Phi3V is still being worked out: https://github.com/ggerganov/llama.cpp/pull/7705. There isn't anything that prevents quantization from working on the LLM portion of these models, but there is a lack of skilled developer time, and most of the quantization libraries are community projects without serious corporate support.

> Other queries with GGUF subcommand
>
> https://github.com/xhedit/quantkit/blob/0463293d9f15ea68e94191d3f281907b0abc85e2/quantkit/cli.py#L39
>
> The description mentions f32, but not f16? I assume it's a similar concern, just with less extra memory? Is the default u8/b16?

Llama.cpp/GGUF quantization requires the unquantized model weights to be converted to a GGUF file first and then quantized from that (with one exception: it is possible to quantize directly to Q8_0). These descriptions were written before llama.cpp added bf16 support, and this flag was added to deal with the fact that bf16-native models lose precision when converted to fp16 (but not fp32). There is now code that converts bf16 models to a bf16 GGUF, so you shouldn't need the flag in most cases. A 70B model requires 280GB of disk space in fp32.
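
For context, this is roughly what the two-step flow looks like with upstream llama.cpp's own tooling (a sketch only: script and binary names differ between llama.cpp versions, and the model paths here are placeholders):

```sh
# Step 1: convert the HF model directory to an unquantized GGUF (bf16 here;
# --outtype q8_0 is the direct-to-Q8_0 exception mentioned above).
python convert_hf_to_gguf.py /models/my-model --outtype bf16 --outfile my-model-bf16.gguf

# Step 2: quantize that intermediate GGUF down to the target type.
llama-quantize my-model-bf16.gguf my-model-Q4_K_M.gguf Q4_K_M
```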

> https://github.com/xhedit/quantkit/blob/0463293d9f15ea68e94191d3f281907b0abc85e2/quantkit/cli.py#L40-L41
>
> I've seen some GGUF models on HF with an imatrix.dat; I assume that's the pretrained one, or a separate artifact produced during the conversion? How do you identify when the built-in one is valid?

Creating GGUF quants with imatrix requires a calibration dataset; the built-in-imatrix flag uses calibration data from exllamav2 to generate imatrix.dat, which is a necessary artifact for creating an imatrix GGUF. Some GGUF quantizers upload that file to HF along with their quants, and it's possible to download and use it instead of generating one with the included calibration data. The code here does support imatrix, but without a llama-cpp-conv that has been compiled with hardware acceleration it is very slow. There are wheels for various platforms available on the llama-cpp-conv GitHub (https://github.com/xhedit/llama-cpp-conv), but they require manual installation.
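
To illustrate the "download and reuse" path, a minimal sketch using upstream llama.cpp's quantize tool directly (binary name and exact invocation vary by version; file names are placeholders):

```sh
# Quantize using an imatrix.dat downloaded from HF instead of generating one
# from calibration data.
llama-quantize --imatrix imatrix.dat my-model-bf16.gguf my-model-IQ4_XS.gguf IQ4_XS
```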

> https://github.com/xhedit/quantkit/blob/0463293d9f15ea68e94191d3f281907b0abc85e2/quantkit/cli.py#L43
>
> Is this option redundant/ignored if not using either of the earlier imatrix options?
>
> * I see a README example that uses it with `--built-in` (_is that bundling the imatrix into the GGUF? Is the `imatrix.dat` file an alternative where that data is separate?_)
>
> * In your example you set 200 layers, while I've noticed the 7-8B models I've tried previously have 32 layers. Does specifying more layers here affect that in some way, or is it like the `llama.cpp` setting where it sets a layer maximum but won't allocate any more than the memory for those 32 layers?

What is built-in here is the calibration dataset used to create imatrix.dat, which is an intermediate artifact used while creating imatrix-enabled GGUFs. The imatrix itself has no use outside of creating imatrix-quantized GGUFs and can't be used at runtime at all. Yes, 200 layers was chosen because it is enough to fully offload any currently released model; there is no problem with specifying more than the number of layers in the model. It just passes the number to llama.cpp's imatrix binary during that part of the process. As my previous answer mentions, you need llama-cpp-conv with hardware support for offloading to work at all.
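
For reference, that imatrix step looks roughly like this when run against upstream llama.cpp directly (a sketch with placeholder file names; flags and binary name may differ by version):

```sh
# Build imatrix.dat from an unquantized GGUF plus a calibration text file,
# offloading up to 200 layers to the GPU (requesting more layers than the
# model actually has is harmless).
llama-imatrix -m my-model-bf16.gguf -f calibration.txt -o imatrix.dat -ngl 200
```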

> https://github.com/xhedit/quantkit/blob/0463293d9f15ea68e94191d3f281907b0abc85e2/quantkit/cli.py#L38
>
> I've not tried quantkit on a compatible model yet, and this isn't touched on in your README examples. What is the benefit of keeping the intermediate files? Can it benefit alternative conversions/quants, or is it fairly limited in where it can reduce conversion time/resources?

This option keeps the converted and unquantized GGUF that was generated as part of the quantization process. Sometimes you see people uploading fp16/bf16/fp32 GGUF models to HF, either for use or for debugging. Mac users with a lot of unified RAM might be interested.
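
One concrete use for a kept intermediate, sketched against upstream llama.cpp's quantize tool (placeholder file names, binary name varies by version): re-quantizing to additional target types without redoing the HF-to-GGUF conversion.

```sh
# Reuse the kept unquantized GGUF to produce several quant levels without
# converting from safetensors again.
llama-quantize my-model-bf16.gguf my-model-Q5_K_M.gguf Q5_K_M
llama-quantize my-model-bf16.gguf my-model-Q6_K.gguf Q6_K
```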

> Hardware requirements
>
> In the README Hardware Requirements you list 7B models with 24GB VRAM. Do you know if the memory usage required to perform the conversion would be higher than it is to run the quantized model?
>
> I definitely cannot run a 7B .safetensors model on an 8GB 4060, except when using the HF transformers loader with options like load-in-4-bit + use_double_quant with float type nf4 🤔, which is meant to be a way to get the benefits of quantization by converting at runtime instead of to a separate format like GGUF.
>
> I'm not sure if the conversion process is the same as that feature (or if you would know, since I think quantkit is providing a unified CLI that delegates to different backends for conversion?).

This depends on the quantization method. Yes, quantkit is just a frontend for the various backends (AutoAWQ, AutoGPTQ, Exllamav2, HQQ, llama.cpp), and each handles its own quantization process. GGUF quantization can be done on CPU (though imatrix is EXTREMELY slow there; expect it to take several days for larger models), while AutoAWQ and AutoGPTQ require the entire unquantized model to fit in VRAM. Exllamav2 should work as long as the individual tensor/matrix being operated on fits in VRAM.