wejoncy / QLLM

A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ, with easy export to ONNX/ONNX Runtime.
Apache License 2.0

Problem with exporting GPTQ model to ONNX #117

Closed Wendy-Xiao closed 5 months ago

Wendy-Xiao commented 6 months ago

Hi there,

I got the same error when exporting the Mistral GPTQ model to ONNX with python -m qllm --load TheBloke/Mistral-7B-Instruct-v0.2-GPTQ --export_onnx=./mistral-7b-chat-v2-gptq-onnx --pack_mode ORT, as specified in #81. It shows the following error:

onnxruntime.capi.onnxruntime_pybind11_state.InvalidGraph: [ONNXRuntimeError] : 10 : INVALID_GRAPH : Load model from /home/aiscuser/mistral-7b-chat-v2-gptq-onnx/decoder_merged.onnx failed:This is an invalid model. In Node, ("optimum::if", If, "", -1) : ("use_cache_branch": tensor(bool),) -> ("logits": tensor(float),"present.0.key": tensor(float16),"present.0.value": tensor(float16),"present.1.key": tensor(float16),"present.1.value": tensor(float16),"present.2.key": tensor(float16),"present.2.value": tensor(float16),"present.3.key": tensor(float16),"present.3.value": tensor(float16),"present.4.key": tensor(float16),"present.4.value": tensor(float16),"present.5.key": tensor(float16),"present.5.value": tensor(float16),"present.6.key": tensor(float16),"present.6.value": tensor(float16),"present.7.key": tensor(float16),"present.7.value": tensor(float16),"present.8.key": tensor(float16),"present.8.value": tensor(float16),"present.9.key": tensor(float16),"present.9.value": tensor(float16),"present.10.key": tensor(float16),"present.10.value": tensor(float16),"present.11.key": tensor(float16),"present.11.value": tensor(float16),"present.12.key": tensor(float16),"present.12.value": tensor(float16),"present.13.key": tensor(float16),"present.13.value": tensor(float16),"present.14.key": tensor(float16),"present.14.value": tensor(float16),"present.15.key": tensor(float16),"present.15.value": tensor(float16),"present.16.key": tensor(float16),"present.16.value": tensor(float16),"present.17.key": tensor(float16),"present.17.value": tensor(float16),"present.18.key": tensor(float16),"present.18.value": tensor(float16),"present.19.key": tensor(float16),"present.19.value": tensor(float16),"present.20.key": tensor(float16),"present.20.value": tensor(float16),"present.21.key": tensor(float16),"present.21.value": tensor(float16),"present.22.key": tensor(float16),"present.22.value": tensor(float16),"present.23.key": tensor(float16),"present.23.value": tensor(float16),"present.24.key": tensor(float16),"present.24.value": tensor(float16),"present.25.key": tensor(float16),"present.25.value": tensor(float16),"present.26.key": tensor(float16),"present.26.value": tensor(float16),"present.27.key": tensor(float16),"present.27.value": tensor(float16),"present.28.key": tensor(float16),"present.28.value": tensor(float16),"present.29.key": tensor(float16),"present.29.value": tensor(float16),"present.30.key": tensor(float16),"present.30.value": tensor(float16),"present.31.key": tensor(float16),"present.31.value": tensor(float16),) , Error Node (/model/layers.0/self_attn/q_proj/MatMulNBits) has input size 5 not in range [min=3, max=4].

==> Context: Bad node spec for node. Name: /model/layers.0/self_attn/q_proj/MatMulNBits OpType: MatMulNBits
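For reference, the input count of the failing node can be confirmed with the onnx Python package. The script below is only an illustrative sketch (not part of QLLM); it walks the exported graph, descending into the subgraphs of the optimum::if node, and prints how many inputs the first MatMulNBits node carries. The model path is the one from the error message.

```python
# Sketch: count the inputs of the MatMulNBits node that the error reports.
import onnx

def iter_nodes(graph):
    # Recursively yield nodes, descending into If/Loop/Scan subgraphs.
    for node in graph.node:
        yield node
        for attr in node.attribute:
            if attr.type == onnx.AttributeProto.GRAPH:
                yield from iter_nodes(attr.g)
            elif attr.type == onnx.AttributeProto.GRAPHS:
                for sub in attr.graphs:
                    yield from iter_nodes(sub)

model = onnx.load(
    "/home/aiscuser/mistral-7b-chat-v2-gptq-onnx/decoder_merged.onnx",
    load_external_data=False,  # weights are not needed for this check
)

for node in iter_nodes(model.graph):
    if node.op_type == "MatMulNBits":
        print(node.name, "has", len(node.input), "inputs")
        break
```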

Is there any way to solve the problem?

Thank you!

wejoncy commented 6 months ago

Same as #81. The right way to fix this is to upgrade onnxruntime to the latest main-branch build or the 1.17.3 release.
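After upgrading, you can sanity-check the fix by loading the exported decoder directly with onnxruntime. This is a minimal sketch, not part of QLLM; the model path is the one from the error above.

```python
# Minimal sketch: print the installed onnxruntime version and confirm the
# exported decoder now loads without the MatMulNBits input-count error.
import onnxruntime as ort

print("onnxruntime version:", ort.__version__)  # expect >= 1.17.3

sess = ort.InferenceSession(
    "/home/aiscuser/mistral-7b-chat-v2-gptq-onnx/decoder_merged.onnx",
    providers=["CPUExecutionProvider"],
)
print("Loaded OK; inputs:", [i.name for i in sess.get_inputs()])
```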

csetanmayjain commented 1 month ago

Hi, I'm also getting the same error, and updating the package to 1.17.3 doesn't work.

I'm trying to convert the Gemma 2B model to 2-bit. These are my commands:

python -m qllm --model google/gemma-2b --quant_method=hqq --groupsize=16 --wbits=2 --save ./google-gemma-2b_hqq_q2

python -m qllm --load ./google-gemma-2b_hqq_q2 --export_onnx=./google-gemma-2b_hqq_q2_onnx --pack_mode=ORT

Can you please share the fix? Also, could you upload a requirements.txt file with the package versions?

wejoncy commented 1 month ago

> Hi, I'm also getting the same error, and updating the package to 1.17.3 doesn't work ... Can you please share the fix? Also, could you upload a requirements.txt file with the package versions?

Hi @csetanmayjain, the tool can only convert 4-bit models to ONNX for now. Do you want to use the HQQ++ quantization algorithm? QLLM doesn't support that yet.
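For example, the 4-bit flow should work with the same flags you used, changing only --wbits (an untested sketch; the output directory names are just illustrative):

python -m qllm --model google/gemma-2b --quant_method=hqq --groupsize=16 --wbits=4 --save ./google-gemma-2b_hqq_q4

python -m qllm --load ./google-gemma-2b_hqq_q4 --export_onnx=./google-gemma-2b_hqq_q4_onnx --pack_mode=ORT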

csetanmayjain commented 1 month ago

Hi @wejoncy, thanks! Any algorithm works for me, as long as I can quantize the model to 2-bit, export it to ONNX, and run the ONNX model on a CPU.

wejoncy commented 1 month ago

AFAIK, ONNX/ORT doesn't support any 2-bit quantization algorithm for now. llama.cpp would be a good candidate.

BTW, there is a SOTA 2-bit quantization algorithm, and we will make it run in ONNX: VPTQ_arixv.pdf

csetanmayjain commented 1 month ago

I see, Thanks :)