wejoncy / QLLM

A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ, and easy export to ONNX/ONNX Runtime.
Apache License 2.0

AWQ and GPTQ numerical consistency #26

Closed · fxmarty closed this 9 months ago

fxmarty commented 9 months ago

Hi @wejoncy, thank you for this great lib & conversion tools. I've been contributing (very irregularly) to AutoGPTQ and am wondering about kernel compatibility with AWQ models. I'm seeing some (sometimes large) numerical differences between an AWQ model run with the AWQ kernel and the same AWQ model converted to the GPTQ format and run with the GPTQ kernel (or a manual torch implementation).

See the following (using https://huggingface.co/TheBloke/Llama-2-7B-Chat-AWQ):

import os

import torch
import torch.nn as nn
import copy

from qllm.quant.quant_linear_awq import WQLinear_GEMM
from qllm.quant.quant_linear import QuantLinear

torch.set_printoptions(threshold=10000)

group_size = 128
bits = 4

m = 8
k = 11008
n = 4096
device = torch.device("cuda:0")

awq_linear = WQLinear_GEMM(w_bit=bits, group_size=group_size, in_features=k, out_features=n, bias=False)

from safetensors import safe_open

tensors = {}
with safe_open("/fsx/felix/llama_7b_awq_gemm/model.safetensors", framework="pt", device=0) as f:
    scales = f.get_tensor("model.layers.0.mlp.down_proj.scales").to(torch.float16)
    qweight = f.get_tensor("model.layers.0.mlp.down_proj.qweight")
    qzeros = f.get_tensor("model.layers.0.mlp.down_proj.qzeros")

print("awq_linear.qweight.dtype", awq_linear.qweight.dtype)
print("qweight.dtype", qweight.dtype)
print("awq_linear.qweight.shape", awq_linear.qweight.shape)
print("qweight.shape", qweight.shape)
assert awq_linear.qweight.shape == qweight.shape
assert awq_linear.qzeros.shape == qzeros.shape
assert awq_linear.scales.shape == scales.shape
assert awq_linear.qweight.dtype == qweight.dtype
assert awq_linear.qzeros.dtype == qzeros.dtype
assert awq_linear.scales.dtype == scales.dtype

awq_linear = awq_linear.to("cuda")
awq_linear.qweight = qweight.to("cuda")
awq_linear.qzeros = qzeros.to("cuda")
awq_linear.scales = scales.to("cuda")

awq_linear = awq_linear.eval()

inp = torch.rand(1, m, k, dtype=torch.float16).to(device)

with torch.no_grad():
    res_awq_original = awq_linear(inp).to(torch.float32)

# NOTE: Somehow we need this enabled here to get good results.
os.environ['load_from_autogptq'] = "1"

f16_weight, scales_unpack, zeros_unpack = awq_linear.unpack()

linear_unpacked = nn.Linear(k, n, bias=False).to("cuda").to(torch.float16)
linear_unpacked.weight = torch.nn.Parameter(f16_weight)

linear_unpacked = linear_unpacked.to("cpu")
scales_unpack = scales_unpack.to("cpu")
zeros_unpack = zeros_unpack.to("cpu")

qllm_linear = QuantLinear(bits, groupsize=group_size, infeatures=k, outfeatures=n, bias=False)
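# pack() requantizes/repacks the dequantized fp16 weights into the GPTQ layout;
# the unpacked AWQ scales/zeros are transposed here, presumably to match the
# layout pack() expects.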
qllm_linear.pack(linear_unpacked, scales_unpack.T, zeros_unpack.T, g_idx=None)

# NOTE: Somehow we need this disabled here to get good results.
os.environ['load_from_autogptq'] = "0"

qllm_linear = qllm_linear.to("cuda")

with torch.no_grad():
    res_qllm = qllm_linear(inp).to(torch.float32)

reldiff_qllm = (res_awq_original - res_qllm).abs() / (res_awq_original.abs() + 1e-15)

#print("Reldiff awq/gptq (qllm)", reldiff_qllm)
print("p90 reldiff awq/gptq", torch.quantile(reldiff_qllm, 0.9))
print("Median reldiff awq/gptq (qllm)", reldiff_qllm.median())
print("Mean reldiff awq/gptq (qllm)", reldiff_qllm.mean())
print("numel reldiff > 0.02:", torch.sum(reldiff_qllm > 5e-2).item(), f"out of total={res_awq_original.numel()}")
print("numel reldiff > 0.1:", reldiff_qllm[reldiff_qllm > 0.1].numel(), f"out of total={res_awq_original.numel()}")
print("numel reldiff > 0.5:", reldiff_qllm[reldiff_qllm > 0.5].numel(), f"out of total={res_awq_original.numel()}")

giving

qweight torch.Size([11008, 512])
awq_linear.qweight.dtype torch.int32
qweight.dtype torch.int32
awq_linear.qweight.shape torch.Size([11008, 512])
qweight.shape torch.Size([11008, 512])
------- qllm repacked
p90 reldiff awq/gptq tensor(0.0343, device='cuda:0')
Median reldiff awq/gptq (qllm) tensor(0.0052, device='cuda:0')
Mean reldiff awq/gptq (qllm) tensor(0.8043, device='cuda:0')
numel reldiff > 0.05: 2244 out of total=32768
numel reldiff > 0.1: 1171 out of total=32768
numel reldiff > 0.5: 271 out of total=32768

As we can see, the median relative difference is low (0.5%), and arguably the 90th percentile is low as well (90% of the output values have a relative difference below 3.5%). However, we still have a relatively large number of outliers where the relative difference is large, and the mean relative difference is large as well.

Do you have an idea why? Has this been an issue for you?

Thank you!

wejoncy commented 9 months ago

Hi @fxmarty, this gap shouldn't be there. I will figure it out and let you know.

wejoncy commented 9 months ago

I reproduced it offline. The AWQ kernel is what introduces the errors.

With the AWQ kernel, given the prompt compared with awq, gptq is, it outputs:

more efficient in terms of computational complexity.
However, gptq has some limitations. First, it requires a pre-trained language model to generate the next token, which can be computation

The GPTQ kernel, however, produces:

more efficient in terms of computational complexity.
However, gptq has some limitations. First, it requires a large amount of memory to store the entire training dataset, which can be a challenge

When compared against the fp16 reference linear_unpacked.cuda()(inp), the GPTQ kernel has smaller errors.
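
A sketch of that comparison (not the exact code I ran), reusing the variables from the script above and assuming linear_unpacked holds the dequantized weight in nn.Linear's (out_features, in_features) layout:

with torch.no_grad():
    # fp16 matmul on the dequantized weights, used as the reference
    res_ref = linear_unpacked.cuda()(inp).to(torch.float32)
    err_awq = (res_awq_original - res_ref).abs().mean()
    err_gptq = (res_qllm - res_ref).abs().mean()
print("mean abs err, AWQ kernel vs fp16 reference :", err_awq.item())
print("mean abs err, GPTQ kernel vs fp16 reference:", err_gptq.item())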

wejoncy commented 9 months ago

BTW, I pushed some fixes so you no longer have to set os.environ['load_from_autogptq'] explicitly.

fxmarty commented 9 months ago

Thank you @wejoncy. I wonder whether it is just a numerical artifact from one of the kernels, or somehow an issue in the conversion (unpacking/packing).

wejoncy commented 9 months ago

Hi @fxmarty, good question.

I checked it with a round-trip unpack/pack, and it ends up yielding exactly the same qweight, scales, and qzeros. So I think the unpack/pack is correct? Do you have suggestions on how to check correctness?
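
Just to illustrate the round-trip idea on raw 4-bit packing (a toy standalone example, independent of qllm's classes; eight 4-bit values per int32):

import torch

q = torch.randint(0, 16, (8, 128), dtype=torch.int32)   # original 4-bit values
packed = torch.zeros(1, 128, dtype=torch.int32)
for i in range(8):
    packed |= q[i] << (4 * i)                            # pack
unpacked = torch.stack([(packed >> (4 * i)) & 0xF for i in range(8)]).squeeze(1)  # unpack
print("round trip exact:", torch.equal(q, unpacked))     # True

The actual check on qllm's QuantLinear compares the packed qweight/scales/qzeros tensors in the same spirit.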

fxmarty commented 9 months ago

Thank you! No, doing a back-and-forth unpack/pack seems to be the way to go; if that works, the issue is not there. It could just be a kernel artifact then.

wejoncy commented 9 months ago

Thanks for your experiment on the conversion correctness. I am happy this tool is helpful in your work, and any suggestions are highly welcome. Thanks again.

fxmarty commented 9 months ago

Did you have a look at compatibility with the AutoGPTQ kernels (exllama, etc.)? For some reason, using zeros -= 1 in the pack method plus zeros = zeros + 1 in the forward generates <unk> tokens, even though simply looking at the GEMM output I get equivalent results, as in this post. If I open a PR in AutoGPTQ, do you mind having a look to see if anything is blatantly bad?

Edit: never mind, I somehow got it working with the manual implementation; the exllama kernel is still broken. It seems that torch.bitwise_and(zeros, (2 ** self.bits) - 1) is quite important.
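
To spell out why that mask matters (a standalone illustration, not code from either repo): with the zeros -= 1 pack convention and zeros + 1 in the forward, a zero point of 0 wraps to 15 at pack time and comes back as 16 after the +1 unless the result is masked back into 4 bits.

import torch

bits = 4
mask = (1 << bits) - 1                               # 0b1111, i.e. (2 ** bits) - 1

z = torch.tensor([0, 1, 7, 15], dtype=torch.int32)   # example 4-bit zero points

stored = (z - 1) & mask                              # what a zeros -= 1 pack writes out
recovered_no_mask = stored + 1                       # forward-side zeros + 1, no mask
recovered_masked = (stored + 1) & mask               # same, masked back into 4 bits

print("original          :", z.tolist())                 # [0, 1, 7, 15]
print("recovered, no mask:", recovered_no_mask.tolist())  # [16, 1, 7, 15]  <- 0 became 16
print("recovered, masked :", recovered_masked.tolist())   # [0, 1, 7, 15]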

wejoncy commented 9 months ago

Replied in https://github.com/PanQiWei/AutoGPTQ/pull/484#discussion_r1427450937