Hi @fxmarty This gap shouldn't be there. I'll figure it out and let you know.
I reproduced it offline. The AWQ kernel leads to the errors.
With the AWQ kernel, given the prompt `compared with awq, gptq is`, it outputs:

> more efficient in terms of computational complexity. However, gptq has some limitations. First, it requires a pre-trained language model to generate the next token, which can be computation

while the GPTQ kernel produces:

> more efficient in terms of computational complexity. However, gptq has some limitations. First, it requires a large amount of memory to store the entire training dataset, which can be a challenge

When compared against `linear_unpacked.cuda(inp)`, the GPTQ kernel has smaller errors.
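For context, here is a minimal sketch of the kind of error comparison I mean; `qlinear` and `linear_unpacked` are placeholder names for the quantized module under test and an fp16 `nn.Linear` rebuilt from the dequantized weights:

```python
import torch

# Placeholders: `qlinear` is the quantized linear (AWQ or GPTQ kernel),
# `linear_unpacked` is an fp16 nn.Linear built from dequantized weights.
inp = torch.rand(1, 32, 4096, dtype=torch.float16, device="cuda")

ref = linear_unpacked.cuda()(inp)  # dequantized fp16 reference
out = qlinear(inp)                 # quantized kernel output

abs_err = (out - ref).abs()
rel_err = abs_err / (ref.abs() + 1e-6)  # epsilon avoids division by zero
print(f"max abs err: {abs_err.max().item():.5f}")
print(f"median rel err: {rel_err.median().item():.4%}")
```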
BTW, I pushed some fixes so you don't have to explicitly set `os.environ['load_from_autogptq']` anymore.
Thank you @wejoncy. I wonder if it could be just numerical artifacts from one of the kernels, or somehow an issue in the conversion (unpacking/packing).
Hi @fxmarty, good question.
I checked it with a round-trip unpack-and-pack, and it ends up yielding exactly the same qweight, scales, and qzeros. So I think the unpack/pack is correct? Do you have suggestions on checking correctness?
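Concretely, the check was something along these lines (`layer`, `unpack_qlinear`, and `pack_qlinear` are hypothetical stand-ins for the repo's conversion helpers):

```python
import torch

# Snapshot the packed buffers of a quantized layer.
qweight0 = layer.qweight.clone()
qzeros0 = layer.qzeros.clone()
scales0 = layer.scales.clone()

# Round trip: unpack to int weights / zero points / fp scales, then repack.
intweight, zeros, scales = unpack_qlinear(layer)
pack_qlinear(layer, intweight, zeros, scales)

# The repacked buffers should be bit-identical to the originals.
assert torch.equal(layer.qweight, qweight0)
assert torch.equal(layer.qzeros, qzeros0)
assert torch.equal(layer.scales, scales0)
```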
Thank you! No, doing a back-and-forth unpack/pack does seem to be the way to go; if that works, the issue is not there. It could just be a kernel artifact then.
Thanks for your experiment on the conversion correctness. I am glad if this tool is helpful in your work, and any suggestions are highly welcome. Thanks again.
Did you have a look at compatibility with the AutoGPTQ kernels (exllama, etc.)? For some reason, using `zeros -= 1` in the pack method plus `zeros = zeros + 1` in the forward generates `<unk>` tokens, even though simply looking at the GEMM output I get equivalent results, as in this post. Would you mind taking a look if I open a PR in AutoGPTQ, in case you see anything blatantly wrong?

Edit: nvm, I somehow got it working with the manual implementation; the exllama kernel is still broken. It seems that `torch.bitwise_and(zeros, (2 ** self.bits) - 1)` is quite important.
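For illustration, here is a standalone sketch (not the AutoGPTQ code itself) of why that mask matters when unpacking 4-bit zero points from an int32 buffer: without it, the higher packed fields (and sign extension, when the int32 value is negative) bleed into each extracted value.

```python
import torch

bits = 4
# Eight 4-bit zero points (7, 6, ..., 1, 0) packed into one int32.
packed = torch.tensor([0x76543210], dtype=torch.int32)

shifts = torch.arange(0, 32, bits)
zeros = packed.unsqueeze(-1) >> shifts  # higher fields still occupy the upper bits
# e.g. the field at shift 0 currently reads 0x76543210, not 0x0.
zeros = torch.bitwise_and(zeros, (2 ** bits) - 1)
print(zeros)  # tensor([[0, 1, 2, 3, 4, 5, 6, 7]])
```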
Hi @wejoncy, thank you for this great lib & conversion tools. I've been contributing very irregularly to AutoGPTQ and am wondering about the kernel compatibility with AWQ models. I'm seeing some (sometimes large) numerical differences between an AWQ model run with the AWQ kernel, vs. the same model converted to GPTQ format and run with the GPTQ kernel (or a manual torch implementation).
See the following comparison (using https://huggingface.co/TheBloke/Llama-2-7B-Chat-AWQ), which gives the relative-difference statistics below.
As we can see, the median relative difference is low (0.5%), and arguably the 90th percentile is low as well (90% of the output values have a relative diff < 3.5%). However, we still have a relatively large number of outliers where the relative difference is large, and the mean relative difference is large as well.
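For reference, the statistics can be computed along these lines (`out_awq` and `out_gptq` being placeholders for the two paths' outputs on the same input):

```python
import torch

# Placeholders: fp16 outputs of the AWQ kernel and of the converted
# GPTQ path on the same input.
rel_diff = (out_awq - out_gptq).abs() / (out_awq.abs() + 1e-6)
rel_diff = rel_diff.float().flatten()  # torch.quantile expects float32/64

print(f"median: {rel_diff.median().item():.3%}")
print(f"p90:    {torch.quantile(rel_diff, 0.9).item():.3%}")
print(f"mean:   {rel_diff.mean().item():.3%}")
print(f"max:    {rel_diff.max().item():.3%}")
```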
Do you have an idea why? Has this been an issue for you?
Thank you!