FlexLaughing opened 1 week ago
Hi @FlexLaughing Most likely you installed the latest onnx package alongside an older onnxruntime package. Downgrading onnx may resolve it.
Thanks for the quick feedback. Is there a suggested onnx and onnxruntime version?
Basically, using the latest onnx and onnxruntime together keeps them compatible with each other, unless onnx was released just days ago and onnxruntime has not caught up with it yet.
onnxruntime-gpu 1.16.3 --> onnxruntime-gpu 1.18.0
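For example, something like this pins a matching pair (the exact versions here are only an illustration; pick the newest pair that the onnxruntime compatibility notes list as matching):
pip install --upgrade "onnx==1.16.1" "onnxruntime-gpu==1.18.0"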
Thanks, I will debug and record the different combinations of onnx and onnxruntime versions. I hope to report back soon.
Hi wejoncy, I tried combinations of different versions and they all seem to fail after the ONNX model check. Do you have a recommended version? It can export the ONNX model, but it fails in the check: exporter.verify_correcness(model, sample_inputs, onnx_model_path, with_past)
Version 1: fail with onnx 1.15.0, onnx-graphsurgeon 0.3.12, onnxruntime 1.17.1, onnxruntime-gpu 1.17.1, onnxsim 0.4.8
Node (optimum::if) Op (If) [TypeInferenceError] Graph attribute inferencing failed: Type Error: Type parameter (T1) of Optype (MatMulNBits) bound to different types (tensor(float) and tensor(float16) in node (/model/layers.0/self_attn/o_proj/MatMulNBits).
Version 2: fail with onnx 1.14.0, onnx-graphsurgeon 0.3.12, onnxruntime 1.17.1, onnxruntime-gpu 1.17.1, onnxsim 0.4.8
Node (optimum::if) Op (If) [TypeInferenceError] Graph attribute inferencing failed: Type Error: Type parameter (T1) of Optype (MatMulNBits) bound to different types (tensor(float) and tensor(float16) in node (/model/layers.0/self_attn/o_proj/MatMulNBits).
Version 3: fail with onnx 1.16.1, onnx-graphsurgeon 0.3.12, onnxruntime 1.18.0, onnxruntime-gpu 1.18.0, onnxsim 0.4.8
Load model from /home/ubuntu/data/wangyan/test_qllm_debug/onnx4/decoder_merged.onnx failed: Node (optimum::if) Op (If) [TypeInferenceError] Graph attribute inferencing failed: Type Error: Type parameter (T1) of Optype (MatMulNBits) bound to different types (tensor(float) and tensor(float16) in node (/model/layers.0/self_attn/o_proj/MatMulNBits).
Version 4: fail with onnx 1.16.1, onnx-graphsurgeon 0.3.12, onnxruntime 1.17.3, onnxruntime-gpu 1.16.3, onnxsim 0.4.8
...IOnnxRuntimeOpSchemaRegistryList*, const onnxruntime::logging::Logger&, const onnxruntime::ModelOptions&) Unsupported model IR version: 10, max supported IR version: 9
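As a quick diagnostic for the Version 4 failure, here is a minimal sketch (the model path is a placeholder) that prints the IR version and opsets of the exported model; per the error above, onnxruntime-gpu 1.16.3 only accepts IR version 9 or lower:

import onnx

model_path = "decoder_merged.onnx"  # placeholder, point at your exported model
model = onnx.load(model_path, load_external_data=False)  # skip the large external weight file
print("IR version:", model.ir_version)  # 10 here means this onnxruntime build cannot load it
print("Opset imports:", [(imp.domain, imp.version) for imp in model.opset_import])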
Could you upload the model "decoder_merged.onnx" or share the minimal code to reproduce it?
decoder_with_past.zip (the decoder_with_past.onnx_ext.data file is more than 1 GB, over the upload limit.)
Hmm, it seems everything is fine. What is the qllm CLI command?
python -m qllm --model /home/ubuntu/data/mjl/llm-awq-main/llama3-8b --method=awq --dataset=pileval --nsamples=16 --save ./Llama-3-8b_q4/ --export_onnx ./Llama-3-8b_q4_onnx/
It failed when I load the model with onnxruntime. Could you please try with decoder_with_past.onnx?
onnx_model_path = onnx_path_str + '/decoder.onnx'
session = onnxruntime.InferenceSession(onnx_model_path, providers=['CUDAExecutionProvider'])
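For reference, a self-contained version of that check (the path is a placeholder); printing the providers the session actually resolved helps tell a CUDA EP problem apart from a model problem:

import onnxruntime

onnx_model_path = "./Llama-3-8b_q4_onnx/decoder.onnx"  # placeholder, point at your export directory
print("available providers:", onnxruntime.get_available_providers())
session = onnxruntime.InferenceSession(
    onnx_model_path,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print("active providers:", session.get_providers())  # shows whether CUDA EP was really loaded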
Everything is fine in my env:
conda create -n py310 python=3.10
pip install torch numpy qllm
git clone https://github.com/microsoft/onnxruntime.git && cd onnxruntime
bash build.sh --cmake_generator "Ninja" --config Release --cmake_extra_defines CMAKE_EXPORT_COMPILE_COMMANDS=ON --skip_tests --build_wheel --use_cuda --cudnn_home /usr/lib/x86_64-linux-gnu --cuda_home /usr/local/cuda --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES="80"
pip install build/Linux/Release/dist/onnxruntime_gpu-1.19.0-cp310-cp310-linux_x86_64.whl
I tried to compile onnxruntime on my RTX 3090 with 64 GB DDR; it seems memory gets exhausted.
generating /home/ubuntu/onnxruntime/build/Linux/Release/_deps/onnx-build/onnx/onnx_operators_pb.py
[1138/1791] Building CUDA object CMakeFiles/onnxruntime_providers_cuda.dir/h...contrib_ops/cuda/bert/flash_attention/flash_fwd_split_hdim192_bf16_sm80.cu.o
Please add --parallel 4 --nvcc_threads 1 to the build command.
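For reference, the earlier build command with those two flags appended (everything else unchanged):

bash build.sh --cmake_generator "Ninja" --config Release --cmake_extra_defines CMAKE_EXPORT_COMPILE_COMMANDS=ON --skip_tests --build_wheel --use_cuda --cudnn_home /usr/lib/x86_64-linux-gnu --cuda_home /usr/local/cuda --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES="80" --parallel 4 --nvcc_threads 1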
I compiled and installed it, but it still fails when creating the session. I debugged into _create_inference_session; is the available_providers value correct?
ubuntu@ubuntu-Z690-UD-AX-DDR4:~/onnxruntime$ pip list |grep onnxruntime
onnxruntime-gpu 1.19.0
Yeah, it's correct. What's the error message during session creation?
Traceback (most recent call last):
File "debug_onnx.py", line 14, in
Looks like an error similar to https://github.com/microsoft/onnxruntime/issues/5581. Does the scales data type only support float? But we generate it with float16?
No, I don't think so. The thing is that I can't repro it in my local environment.
Did you see any error message during ONNX model export? I think qllm checks the ONNX model output against the PyTorch one.
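Conceptually that check just compares the two backends' outputs on the same inputs; a minimal sketch of the idea (the function name and tolerance are assumptions, not qllm's actual implementation):

import numpy as np
import torch

def outputs_match(torch_model, ort_session, sample_inputs, atol=1e-3):
    # Run the PyTorch model and the exported ONNX model on identical inputs
    with torch.no_grad():
        ref = torch_model(**sample_inputs).logits.float().cpu().numpy()
    ort_inputs = {k: v.cpu().numpy() for k, v in sample_inputs.items()}
    out = ort_session.run(None, ort_inputs)[0]
    # True only if the outputs agree within the tolerance
    return np.allclose(ref, out.astype(np.float32), atol=atol)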
(auto_awq) ubuntu@ubuntu-Z690-UD-AX-DDR4:~$ pip list |grep onnx
onnx 1.16.1
onnx-graphsurgeon 0.3.12
onnxruntime-gpu 1.19.0
onnxruntime-training 1.19.0+cu121
onnxsim 0.4.8
logfile_fail.log
I think I got it.
MatMulNBits requires input_x and scales to have the same dtype, which is constrained by T1. However, the attention output is float32, which is incompatible with MatMulNBits's scales.
What's your transformers version?
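If you want to confirm that mismatch in the exported graph, here is a minimal sketch (the path is a placeholder) that walks the graph, including the If branches of the merged decoder, and prints the input element types of every MatMulNBits node (elem_type 1 is float32, 10 is float16):

import onnx

def report_matmulnbits(graph, dtypes=None):
    # Collect element types declared on this graph's values and initializers
    dtypes = dict(dtypes or {})
    for vi in list(graph.value_info) + list(graph.input) + list(graph.output):
        dtypes[vi.name] = vi.type.tensor_type.elem_type
    for init in graph.initializer:
        dtypes[init.name] = init.data_type
    for node in graph.node:
        if node.op_type == "MatMulNBits":
            print(node.name, [dtypes.get(name, "?") for name in node.input])
        # Recurse into subgraphs held by If/Loop/Scan nodes
        for attr in node.attribute:
            if attr.type == onnx.AttributeProto.GRAPH:
                report_matmulnbits(attr.g, dtypes)

model = onnx.load("decoder_merged.onnx", load_external_data=False)  # placeholder path
report_matmulnbits(model.graph)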
(auto_awq) ubuntu@ubuntu-Z690-UD-AX-DDR4:~$ pip list |grep tran
pytorch-transformers 1.0.0
s3transfer 0.6.0
sentence-transformers 2.2.2
transformers 4.41.2
transformers-stream-generator 0.0.4
It's a bug in PyTorch 2.1 or below. Is your PyTorch up to date? SDPA attention used float32 as the output dtype. If you don't want to update PyTorch, you can set the attention implementation to "eager".
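If you go the "eager" route without updating PyTorch, a minimal sketch of what that looks like when loading the model with transformers (attn_implementation is the from_pretrained argument in recent transformers releases; the model path is the one from your qllm command):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/home/ubuntu/data/mjl/llm-awq-main/llama3-8b",  # local model path from the command above
    torch_dtype="auto",
    attn_implementation="eager",  # avoids the SDPA float32 output path on older PyTorch
)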
Emm, got it! I updated PyTorch to 2.3 and hit an env issue; I am working on it right now.
"from torch._C import *  # noqa: F403
ImportError: libcupti.so.12: cannot open shared object file: No such file or directory"
This seems to be an issue with your CUDA environment. Anyway, forcing transformers to use "eager" attention will work for you. https://github.com/huggingface/transformers/blob/048f599f3506e57e0a595b455d9d2834c8d45023/src/transformers/models/llama/modeling_llama.py#L666
Seems my env has some other errors. Is it possible to load the ONNX model on CPU?
Yes, you can use the CPU EP as long as the ONNX model is exported with float32 dtype.
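A minimal sketch of the CPU-only load (the path is a placeholder, and the export must be float32 for this to work):

import onnxruntime

session = onnxruntime.InferenceSession(
    "./Llama-3-8b_q4_onnx_fp32/decoder.onnx",  # placeholder path to a float32 export
    providers=["CPUExecutionProvider"],
)
print(session.get_providers())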
Hi wejoncy, I met an issue when we convert the q4 model to an ONNX model on an NVIDIA 3090; the check after merging the ONNX model fails.
decoder_merged.onnx model properties: ONNX v10, optimum-onnx 0, ai.onnx v16, com.microsoft v1, merged
My conda env's onnx-related versions; I did not see the exact versions in requirements.txt.
(quan_debug) ubuntu@ubuntu-Z690-UD-AX-DDR4:~$ pip list |grep onnx
onnx 1.16.1
onnx-graphsurgeon 0.3.12
onnxruntime 1.17.3
onnxruntime-gpu 1.16.3
onnxsim 0.4.8
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/home/ubuntu/data/miniconda3/envs/quan_debug/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 419, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/home/ubuntu/data/miniconda3/envs/quan_debug/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 472, in _create_inference_session
    sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from /home/ubuntu/data/wangyan/test_qllm_debug/onnx3//decoder_merged.onnx failed:/onnxruntime_src/onnxruntime/core/graph/model.cc:179 onnxruntime::Model::Model(onnx::ModelProto&&, const onnxruntime::PathString&, const onnxruntime::IOnnxRuntimeOpSchemaRegistryList*, const onnxruntime::logging::Logger&, const onnxruntime::ModelOptions&) Unsupported model IR version: 10, max supported IR version: 9