Open clevercool opened 3 years ago
I have the same question about APoT.
Also, when you wrote (for regular PoT)
R1 * R2 = (S1 * S2) * (2^m * 2^n)
= (S1 * S2) * 2^(m + n)
are S1 and S2 integers? If not, S1 * S2 is a floating-point value, and I'm not sure how (S1 * S2) * 2^(m + n) can be done with bit-shift operations. This was one of the confusing parts of the paper for me (eqn 4): it seems to assume the activations are uniformly quantized to integers; otherwise I'm not sure how bit shifts can be used.
Edit: I missed the part about "fixed point representation." That might enable bit shifting for this use case.
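To make the fixed-point remark concrete, here is a minimal sketch of how (S1 * S2) * 2^(m + n) could be evaluated with shifts only. It assumes S1 * S2 is precomputed and stored as a fixed-point integer; `FRAC_BITS`, `to_fixed`, `shift_pow2`, and `pot_product_fixed` are hypothetical names for illustration and are not taken from the paper or the repository.

```python
FRAC_BITS = 16  # assumed number of fractional bits in the fixed-point format


def to_fixed(x: float, frac_bits: int = FRAC_BITS) -> int:
    """Round a real value to an integer carrying frac_bits fractional bits."""
    return round(x * (1 << frac_bits))


def shift_pow2(value: int, exp: int) -> int:
    """Multiply an integer by 2**exp using only a left or right shift."""
    return value << exp if exp >= 0 else value >> -exp


def pot_product_fixed(s1s2: float, m: int, n: int) -> int:
    """Evaluate (S1*S2) * 2**(m+n) in fixed point.

    The scale product is converted once (this could happen offline);
    the power-of-two factor is then applied as a single shift.
    """
    return shift_pow2(to_fixed(s1s2), m + n)


result = pot_product_fixed(0.75, -1, -2)  # 0.75 * 2**-3
print(result, result / (1 << FRAC_BITS))  # 6144 0.09375
```

A right shift drops low-order bits, so in this sketch the precision of the result is limited by the choice of `FRAC_BITS`.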
Hi,
Do you have the specific design of the MUL (Multiplication) unit for APOT quantization?
We know that uniform (INT) quantization and POT quantization are hardware-friendly.
Assume that:
R = real number
S = scale
T = quantized number
R1 = S1 * T1
R2 = S2 * T2
Uniform quantization simply adopts the INT MUL unit:
So, we have:
R1 * R2 = (S1 * S2) * (T1 * T2)
For POT (T1 = 2^m, T2 = 2^n):
So, we have:
R1 * R2 = (S1 * S2) * (2^m * 2^n) = (S1 * S2) * 2^(m + n)
The POT is similar to the only-exponent float MUL.
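As a rough illustration of the difference between the two units, the sketch below contrasts an integer multiply on the quantized values with a POT multiply that only adds exponents and combines signs. The (sign, exponent) pair encoding and the function names are assumptions made for this example, not a description of any particular hardware design.

```python
def uniform_mul(t1: int, t2: int) -> int:
    """Uniform quantization: the quantized integers go through an INT multiplier."""
    return t1 * t2


def pot_mul(sign1: int, m: int, sign2: int, n: int) -> tuple[int, int]:
    """POT quantization: 2**m * 2**n = 2**(m + n), so only an exponent add is needed.

    Each operand is a (sign, exponent) pair with sign in {+1, -1}.
    """
    return sign1 * sign2, m + n


def pot_value(sign: int, exp: int) -> float:
    """Real value of a (sign, exponent) pair, for checking only."""
    return sign * 2.0 ** exp


sign, exp = pot_mul(+1, -2, -1, -3)   # (+2**-2) * (-2**-3)
print(sign, exp, pot_value(sign, exp))  # -1 -5 -0.03125
print(uniform_mul(3, -5))               # -15
```

In both cases the scale product S1 * S2 is applied separately, after the quantized values have been combined.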
And the last two bits:
For the first two bits, the decoder table is not continuous: 0, -1, -3, -5.
Q1: How do you efficiently decode the binary code to the APOT, especially in the MUL unit?
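One possible answer to Q1, sketched under stated assumptions: since a 4-bit code has only 16 values, the non-contiguous exponents can be hidden behind small lookup tables that return fixed-point integers. The table for the first two bits follows the "0, -1, -3, -5" list above, reading the first entry as a zero term and the rest as exponents. The table for the last two bits is not given in this thread, so `LOW_LUT` below (zero, 2^0, 2^-2, 2^-4) is a hypothetical choice, as are all names in the block.

```python
FRAC_BITS = 5  # smallest assumed term is 2**-5, so 5 fractional bits are exact

# First (high) two bits: zero, 2**-1, 2**-3, 2**-5, stored as value * 2**FRAC_BITS.
HIGH_LUT = [0, 16, 4, 1]
# Last (low) two bits: ASSUMED to be zero, 2**0, 2**-2, 2**-4 (not from the thread).
LOW_LUT = [0, 32, 8, 2]


def decode_apot4(code: int) -> int:
    """Decode an unsigned 4-bit code into a fixed-point integer (two lookups, one add)."""
    high = (code >> 2) & 0b11
    low = code & 0b11
    return HIGH_LUT[high] + LOW_LUT[low]


# The whole decoder can also be flattened into a single 16-entry table.
FULL_LUT = [decode_apot4(c) for c in range(16)]

print(decode_apot4(0b0101) / (1 << FRAC_BITS))  # 2**-1 + 2**0  = 1.5
print(decode_apot4(0b1111) / (1 << FRAC_BITS))  # 2**-5 + 2**-4 = 0.09375
```

Once decoded this way, the operands are ordinary fixed-point integers, which trades the shift-only property for a standard integer multiply.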
Assume two numbers in APOT:
Obviously, the calculation needs 4x (9x) as many add operations as POT in the 4-bit (6-bit) case. And the result violates the definition of APOT, under which a number cannot contain the same additive term twice, such as 2^-5.
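The term count and the repeated-term problem can be checked with a small sketch. Each APOT number is modeled here as a list of exponents of its nonzero terms; the operand values are chosen only for illustration, and the helper names are hypothetical.

```python
from collections import Counter
from itertools import product


def apot_mul_exponents(a_exps: list[int], b_exps: list[int]) -> list[int]:
    """Expand (sum of 2**a) * (sum of 2**b) into one exponent per pairwise product.

    Two terms per operand give 4 exponent adds; three terms per operand give 9.
    """
    return [ea + eb for ea, eb in product(a_exps, b_exps)]


def value(exps: list[int]) -> float:
    """Real value of a sum of powers of two, for checking only."""
    return sum(2.0 ** e for e in exps)


a = [-1, -2]  # 2**-1 + 2**-2
b = [-3, -4]  # 2**-3 + 2**-4

terms = apot_mul_exponents(a, b)
repeated = {e: c for e, c in Counter(terms).items() if c > 1}

print(terms)     # [-4, -5, -5, -6]
print(repeated)  # {-5: 2}, i.e. 2**-5 appears twice in the product
print(value(terms) == value(a) * value(b))  # True
```

The product is numerically correct, but with 2^-5 appearing twice it is no longer a sum of distinct terms, so it would need to be re-quantized or accumulated in a wider format.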
Q2: How do you deal with the complex computation and the subnormal number for APOT?
One direct solution is to convert to float and use fake quantization. But is that a violation of the principle of quantization?