wejoncy / QLLM

A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ, and easy export to ONNX/ONNX Runtime.
Apache License 2.0

Unsupported model IR version: 10, max supported IR version: 9 #126

Open FlexLaughing opened 1 week ago

FlexLaughing commented 1 week ago

Hi wejoncy, I hit an issue when converting a q4 model to ONNX on an NVIDIA 3090; the check after merging the ONNX model fails.

decoder_merged.onnx model properties: ONNX v10, optimum-onnx 0, ai.onnx v16, com.microsoft v1, merged

My conda env's onnx-related versions (I didn't see pinned versions in requirements.txt):

(quan_debug) ubuntu@ubuntu-Z690-UD-AX-DDR4:~$ pip list | grep onnx
onnx                1.16.1
onnx-graphsurgeon   0.3.12
onnxruntime         1.17.3
onnxruntime-gpu     1.16.3
onnxsim             0.4.8


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/home/ubuntu/data/miniconda3/envs/quan_debug/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/home/ubuntu/data/miniconda3/envs/quan_debug/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 472, in _create_inference_session
    sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from /home/ubuntu/data/wangyan/test_qllm_debug/onnx3//decoder_merged.onnx failed:/onnxruntime_src/onnxruntime/core/graph/model.cc:179 onnxruntime::Model::Model(onnx::ModelProto&&, const onnxruntime::PathString&, const onnxruntime::IOnnxRuntimeOpSchemaRegistryList*, const onnxruntime::logging::Logger&, const onnxruntime::ModelOptions&) Unsupported model IR version: 10, max supported IR version: 9

wejoncy commented 1 week ago

Hi @FlexLaughing. Most likely you have the latest onnx package installed alongside an older onnxruntime package. Downgrading onnx may resolve it.
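
For reference, a minimal sketch (using the model path from the traceback above) to compare the exported model's IR version with what the installed packages support; onnxruntime 1.16.x only reads IR versions up to 9:

import onnx
import onnxruntime

model_path = "/home/ubuntu/data/wangyan/test_qllm_debug/onnx3/decoder_merged.onnx"

# Skip external data; the IR version lives in the ModelProto header.
model = onnx.load(model_path, load_external_data=False)
print("model IR version:", model.ir_version)                         # 10 in this case
print("onnx", onnx.__version__, "writes IR version", onnx.IR_VERSION)
print("onnxruntime", onnxruntime.__version__)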

FlexLaughing commented 1 week ago

> Hi @FlexLaughing. Most likely you have the latest onnx package installed alongside an older onnxruntime package. Downgrading onnx may resolve it.

Thanks for the quick feedback. Is there any suggestion for which onnx and onnxruntime versions to use?

wejoncy commented 1 week ago

https://github.com/microsoft/onnxruntime/issues/16638

wejoncy commented 1 week ago

Basically, using the latest onnx and onnxruntime together will be compatible, unless onnx was released just days ago and onnxruntime has not yet caught up with the update.

wejoncy commented 1 week ago

onnxruntime-gpu 1.16.3 --> onnxruntime-gpu 1.18.0

FlexLaughing commented 1 week ago

Thanks, I will try to debug and record the results for the different combinations of onnx and onnxruntime versions. I hope to report back soon.

FlexLaughing commented 4 days ago

> onnxruntime-gpu 1.16.3 --> onnxruntime-gpu 1.18.0

Hi wejoncy, I tried different version combinations, and all of them seem to fail when checking the ONNX model. Do you have a recommended version? The model exports, but it fails the check: exporter.verify_correcness(model, sample_inputs, onnx_model_path, with_past)

Version 1 (fails) with onnx 1.15.0, onnx-graphsurgeon 0.3.12, onnxruntime 1.17.1, onnxruntime-gpu 1.17.1, onnxsim 0.4.8:

Node (optimum::if) Op (If) [TypeInferenceError] Graph attribute inferencing failed: Type Error: Type parameter (T1) of Optype (MatMulNBits) bound to different types (tensor(float) and tensor(float16) in node (/model/layers.0/self_attn/o_proj/MatMulNBits).

Version 2 (fails) with onnx 1.14.0, onnx-graphsurgeon 0.3.12, onnxruntime 1.17.1, onnxruntime-gpu 1.17.1, onnxsim 0.4.8:

Same MatMulNBits T1 type error as Version 1.

Version 3 (fails) with onnx 1.16.1, onnx-graphsurgeon 0.3.12, onnxruntime 1.18.0, onnxruntime-gpu 1.18.0, onnxsim 0.4.8:

Load model from /home/ubuntu/data/wangyan/test_qllm_debug/onnx4/decoder_merged.onnx failed: same MatMulNBits T1 type error as Version 1.

Version 4 (fails) with onnx 1.16.1, onnx-graphsurgeon 0.3.12, onnxruntime 1.17.3, onnxruntime-gpu 1.16.3, onnxsim 0.4.8:

Unsupported model IR version: 10, max supported IR version: 9

wejoncy commented 3 days ago

Could you upload the model "decoder_merged.onnx" or share the minimal code to reproduce it?

FlexLaughing commented 3 days ago

decoder_with_past.zip. The decoder_with_past.onnx_ext.data file is more than 1 GB, which is over the upload limit.

wejoncy commented 3 days ago

Hmm, it seems everything is fine. What is the qllm CLI command?

FlexLaughing commented 3 days ago

> Hmm, it seems everything is fine. What is the qllm CLI command?

python -m qllm --model /home/ubuntu/data/mjl/llm-awq-main/llama3-8b --method=awq --dataset=pileval --nsamples=16 --save ./Llama-3-8b_q4/ --export_onnx ./Llama-3-8b_q4_onnx/

It fails when I load the model with onnxruntime. Could you please try with decoder_with_past.onnx?

onnx_model_path = onnx_path_str + '/decoder.onnx'
session = onnxruntime.InferenceSession(onnx_model_path, providers=['CUDAExecutionProvider'])

wejoncy commented 3 days ago

(screenshot)

Everything is fine in my env:

conda create -n py310 python=3.10
pip install torch numpy qllm
git clone https://github.com/microsoft/onnxruntime.git && cd onnxruntime
bash build.sh --cmake_generator "Ninja" --config Release --cmake_extra_defines CMAKE_EXPORT_COMPILE_COMMANDS=ON --skip_tests  --build_wheel --use_cuda  --cudnn_home /usr/lib/x86_64-linux-gnu   --cuda_home /usr/local/cuda  --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES="80"
pip install build/Linux/Release/dist/onnxruntime_gpu-1.19.0-cp310-cp310-linux_x86_64.whl

FlexLaughing commented 3 days ago

> Everything is fine in my env:
>
> conda create -n py310 python=3.10
> pip install torch numpy qllm
> git clone https://github.com/microsoft/onnxruntime.git && cd onnxruntime
> bash build.sh --cmake_generator "Ninja" --config Release --cmake_extra_defines CMAKE_EXPORT_COMPILE_COMMANDS=ON --skip_tests --build_wheel --use_cuda --cudnn_home /usr/lib/x86_64-linux-gnu --cuda_home /usr/local/cuda --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES="80"
> pip install build/Linux/Release/dist/onnxruntime_gpu-1.19.0-cp310-cp310-linux_x86_64.whl

I tried to compile onnxruntime on my RTX 3090 machine (64 GB DDR), but the build seems to exhaust memory:

generating /home/ubuntu/onnxruntime/build/Linux/Release/_deps/onnx-build/onnx/onnx_operators_pb.py
[1138/1791] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/h...contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim192_bf16_sm80.cu.o

wejoncy commented 2 days ago

> I tried to compile onnxruntime on my RTX 3090 machine (64 GB DDR), but the build seems to exhaust memory.

Please add --parallel 4 --nvcc_threads 1 to the build command.

FlexLaughing commented 2 days ago

(screenshot attached) I compiled and installed it, but it still fails when creating the session. I debugged into _create_inference_session; is the available_providers value correct?

ubuntu@ubuntu-Z690-UD-AX-DDR4:~/onnxruntime$ pip list | grep onnxruntime
onnxruntime-gpu 1.19.0

wejoncy commented 1 day ago

> I compiled and installed it, but it still fails when creating the session. I debugged into _create_inference_session; is the available_providers value correct?

Yeah, it's correct. What's the error message when creating the session?
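
If it helps, a minimal sketch (the model path is hypothetical) that turns on verbose onnxruntime logging when creating the session, which usually prints the exact node and type that fail:

import onnxruntime

so = onnxruntime.SessionOptions()
so.log_severity_level = 0  # 0 = VERBOSE: logs graph resolution and EP placement details
session = onnxruntime.InferenceSession(
    "decoder.onnx", sess_options=so, providers=["CUDAExecutionProvider"]
)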

FlexLaughing commented 1 day ago

> Yeah, it's correct. What's the error message when creating the session?

Traceback (most recent call last):
  File "debug_onnx.py", line 14, in <module>
    session = onnxruntime.InferenceSession(onnx_model_path, providers=['CUDAExecutionProvider'])
  File "/home/ubuntu/.local/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 472, in _create_inference_session
    sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from /home/ubuntu/data/wangyan/test_qllm_debug/onnx4//decoder.onnx failed: Type Error: Type parameter (T1) of Optype (MatMulNBits) bound to different types (tensor(float) and tensor(float16) in node (/model/layers.0/self_attn/o_proj/MatMulNBits).

FlexLaughing commented 1 day ago

(screenshot attached)

FlexLaughing commented 1 day ago

Looks like an error similar to https://github.com/microsoft/onnxruntime/issues/5581. Does the scales data type only support float, while we generate it as float16?

wejoncy commented 1 day ago

No, I don't think so. The thing is that I can't repro it in my local environment.

wejoncy commented 1 day ago

Did you see any error message while exporting the ONNX model? I think qllm checks the ONNX model's output against the PyTorch one.

FlexLaughing commented 1 day ago

(auto_awq) ubuntu@ubuntu-Z690-UD-AX-DDR4:~$ pip list | grep onnx
onnx                  1.16.1
onnx-graphsurgeon     0.3.12
onnxruntime-gpu       1.19.0
onnxruntime-training  1.19.0+cu121
onnxsim               0.4.8

logfile_fail.log (attached)

wejoncy commented 1 day ago

I think I got it. (screenshot)

MatMulNBits requires input_x and scales to have the same dtype, which is constrained by T1. However, the attention output is float32, which is incompatible with MatMulNBits's scales.

What's your transformers version?
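
If you want to confirm it on your side, here is a minimal sketch (assuming the non-merged decoder.onnx; walking the If-node subgraphs of the merged model would need a bit more code) that lists the scales dtype of every MatMulNBits node:

import onnx

# data_type is stored in the initializer header, so external data can be skipped.
model = onnx.load("decoder.onnx", load_external_data=False)
init_dtypes = {init.name: init.data_type for init in model.graph.initializer}

for node in model.graph.node:
    if node.op_type == "MatMulNBits":
        scales_name = node.input[2]  # inputs: A, packed weights B, scales, ...
        dtype = init_dtypes.get(scales_name)
        label = onnx.TensorProto.DataType.Name(dtype) if dtype else "not an initializer"
        print(node.name, "scales:", label)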

FlexLaughing commented 1 day ago

> I think I got it. (screenshot)
>
> MatMulNBits requires input_x and scales to have the same dtype, which is constrained by T1. However, the attention output is float32, which is incompatible with MatMulNBits's scales.
>
> What's your transformers version?

(auto_awq) ubuntu@ubuntu-Z690-UD-AX-DDR4:~$ pip list | grep tran
pytorch-transformers            1.0.0
s3transfer                      0.6.0
sentence-transformers           2.2.2
transformers                    4.41.2
transformers-stream-generator   0.0.4

wejoncy commented 1 day ago

It's a bug in PyTorch 2.1 or below. Is your PyTorch up to date? SDPA attention used float32 as the output dtype. If you don't want to update PyTorch, you can set attention_impl to "eager".
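
For example, a minimal sketch of forcing eager attention when the HF model is loaded before quantization/export (the keyword name may differ between transformers releases; recent ones accept attn_implementation, the setting referred to as attention_impl above):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/home/ubuntu/data/mjl/llm-awq-main/llama3-8b",
    torch_dtype="auto",
    attn_implementation="eager",  # avoid the SDPA float32-output path during export
)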

FlexLaughing commented 1 day ago

> It's a bug in PyTorch 2.1 or below. Is your PyTorch up to date? SDPA attention used float32 as the output dtype. If you don't want to update PyTorch, you can set attention_impl to "eager".

Emm, got it! I updated PyTorch to 2.3 and hit an environment issue, which I am working on right now:

from torch._C import *  # noqa: F403
ImportError: libcupti.so.12: cannot open shared object file: No such file or directory

wejoncy commented 1 day ago

This seems to be an issue with your CUDA environment. Anyway, forcing transformers to use "eager" attention will work for you. https://github.com/huggingface/transformers/blob/048f599f3506e57e0a595b455d9d2834c8d45023/src/transformers/models/llama/modeling_llama.py#L666

FlexLaughing commented 1 day ago

> This seems to be an issue with your CUDA environment. Anyway, forcing transformers to use "eager" attention will work for you. https://github.com/huggingface/transformers/blob/048f599f3506e57e0a595b455d9d2834c8d45023/src/transformers/models/llama/modeling_llama.py#L666

Seems my env has some other errors. Is it possible to load the ONNX model on the CPU? (screenshot attached)

wejoncy commented 23 hours ago

Yes, you can use the CPU EP as long as the ONNX model is exported with dtype float32.
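
A minimal sketch (hypothetical path) of loading on the CPU EP:

import onnxruntime

# Works only if the exported model, including the MatMulNBits scales, is float32.
session = onnxruntime.InferenceSession(
    "decoder.onnx", providers=["CPUExecutionProvider"]
)
print(session.get_providers())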