wejoncy / QLLM

A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ, with easy export to ONNX/ONNX Runtime.
Apache License 2.0

llama-2-7b-chat gptq quantize & onnx export fail: RuntimeError: The size of tensor a (4096) must match the size of tensor b (2) at non-singleton dimension 2 #139

Open lifelongeeek opened 1 month ago

lifelongeeek commented 1 month ago

Thanks for sharing your work on LLM quantization & ONNX export.

I followed the script in the 'Convert to onnx model' section and got the error below. Do you know any possible reason?

root@6779dc2e2500:/ssd1/geonmin.kim/QLLM/llama-2-7b-chat-4bit-gptq# python -m qllm --model  meta-llama/Llama-2-7b-chat-hf  --quant_method=gptq  --dataset=pileval --nsamples=16 --groupsize=-1 --save ./Llama-2-7b-chat-hf_gptq_q4/ --export_onnx ./Llama-2-7b-chat-hf_gptq_q4_onnx/
Namespace(quant_method='gptq', model='meta-llama/Llama-2-7b-chat-hf', tokenizer='', dataset='pileval', seed=0, nsamples=16, percdamp=0.01, static_groups=False, wbits=4, mix_qlayer_conf=None, groupsize=-1, eval=False, save='./Llama-2-7b-chat-hf_gptq_q4/', save_safetensors='', load='', sym=False, act_order=False, true_sequential=False, allow_mix_bits=False, export_onnx='./Llama-2-7b-chat-hf_gptq_q4_onnx/', use_plugin=False, pack_mode='AUTO')
2024-09-23 08:15:14,576 - qllm - INFO - loading model from meta-llama/Llama-2-7b-chat-hf
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  5.03it/s]
2024-09-23 08:15:16,175 - qllm - INFO - loading dataset from pileval
Repo card metadata block was not found. Setting CardData to empty.
Downloading data: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 471M/471M [00:05<00:00, 94.1MB/s]
Generating validation split: 214670 examples [00:10, 21305.16 examples/s]
Starting ...
Ready.
running GPTQ: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [06:46<00:00, 12.71s/it]
awq_inference_engine not found, will skip it.
ort_ops is not installed. Will fallback to Torch Backend
marlin_cuda is not installed. marlin_cuda is not use
Replacing linear layers...: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 455/455 [00:00<00:00, 914.37it/s]
Packing weights....: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 224/224 [00:08<00:00, 27.19it/s]
2024-09-23 08:22:37,915 - qllm - INFO - Finished quantization and packing weight, time cost:418.58441376686096
INFO:qllm:Finished quantization and packing weight, time cost:418.58441376686096
repacking model from pack_mode=`GPTQ` to `ORT`: 100%|██████████████████████████████████████████████████████████████████████| 224/224 [00:18<00:00, 12.38it/s]
2024-09-23 08:23:00,588 - qllm - INFO - Exporting onnx model ...
INFO:qllm:Exporting onnx model ...
Model_Size = 3.512451410293579 GB
total_mem_per_cpu = 23.69110107421875 GB
Export model on a single GPU
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/qllm/__main__.py", line 6, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/qllm/run.py", line 78, in main
    model_quanter.run(args)
  File "/usr/local/lib/python3.10/dist-packages/qllm/auto_model_quantization.py", line 242, in run
    self.export_onnx(model, args.export_onnx, inputs_dataloader[0], True)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/qllm/auto_model_quantization.py", line 160, in export_onnx
    onnx_model_path = exporter.export_onnx(model, onnx_path_str, sample_inputs, with_past, opset)
  File "/usr/local/lib/python3.10/dist-packages/qllm/utils/onnx/exporter.py", line 30, in export_onnx
    input_keys, onnx_inputs, past_key_value = large_model_exporter.retrieve_onnx_inputs(model, sample_inputs, with_past)
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/transformers/large_model_exporter.py", line 148, in retrieve_onnx_inputs
    out = model(sample_inputs[0], attention_mask=sample_inputs[1])
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1603, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 1189, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 1001, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 734, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 617, in forward
    query_states = self.q_proj(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/qllm/modeling/q_layers/quant_linear_onnxruntime.py", line 170, in forward
    out = QuantLinearTorchFunction_forward(
  File "/usr/local/lib/python3.10/dist-packages/qllm/modeling/q_layers/quant_linear_onnxruntime.py", line 48, in QuantLinearTorchFunction_forward
    out = QuantLinearTorchFunction().apply(inputs, qweight, scales, qzeros, g_idx, bits, groupsize, in_features, out_features)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 574, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/usr/local/lib/python3.10/dist-packages/qllm/modeling/q_layers/quant_linear_onnxruntime.py", line 36, in forward
    fp_weight = dequantize_blockwise_4bits(
  File "/usr/local/lib/python3.10/dist-packages/qllm/modeling/q_layers/quant_linear_onnxruntime.py", line 71, in dequantize_blockwise_4bits
    float_values = ((expand_quant_value - expand_zero_point) * aligned_scale).to(scale.dtype)
RuntimeError: The size of tensor a (4096) must match the size of tensor b (2) at non-singleton dimension 2
wejoncy commented 1 month ago

Hi @lifelongeeek, thanks for reporting the issue.

I have reproduced this problem and will work out a fix.

Could you work around it by setting groupsize to 128 or 256? For now, groupsize must be less than 2048.
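
For example, the same command from the log above, with only the group size changed, should avoid this error:

python -m qllm --model meta-llama/Llama-2-7b-chat-hf --quant_method=gptq --dataset=pileval --nsamples=16 --groupsize=128 --save ./Llama-2-7b-chat-hf_gptq_q4/ --export_onnx ./Llama-2-7b-chat-hf_gptq_q4_onnx/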