Open brisker opened 1 year ago
@xijiu9
Besides, in the grad_weight calculation, the code here does not seem to be an INT4 matmul, since `sample_x3` is divided by `norm_weight_loop` *after* being quantized to INT4. The code is a little confusing to me: `norm_weight_loop`, which has shape N×1, is involved in the backprop. Is your INT4 matmul per-channel quantization along the batch (N) axis? Even if so, this cannot be accelerated in hardware (it would lose the speed benefit of quantization), because in the grad_weight matmul between the Cout×N activation gradient and the N×Cin input activation, N is the reduction dimension, and per-channel scales along the reduction dimension cannot be factored out of the integer accumulation.
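To make the concern concrete, here is a small NumPy sketch (not the paper's code; all names and shapes are my own illustration) showing why per-tensor scales can be pulled out of an integer matmul while per-sample scales along the batch/reduction axis cannot:

```python
import numpy as np

# grad_weight = grad_output^T @ x, shapes (Cout, N) = (Cout, N) @ ... wait,
# concretely: (N, Cout)^T @ (N, Cin) -> (Cout, Cin); the batch axis N is
# the reduction axis of this matmul.
rng = np.random.default_rng(0)
N, Cout, Cin = 4, 3, 5
g = rng.standard_normal((N, Cout))   # activation gradient (illustrative)
x = rng.standard_normal((N, Cin))    # input activation (illustrative)

def quantize(t, scale):
    # symmetric fake-INT4 quantization; `scale` may broadcast per axis
    return np.clip(np.round(t / scale), -8, 7)

# Case 1: per-tensor scales. The integer matmul can run first and a single
# scalar rescale afterwards reproduces the dequantized product exactly.
sg = np.abs(g).max() / 7
sx = np.abs(x).max() / 7
qg, qx = quantize(g, sg), quantize(x, sx)
per_tensor = (qg.T @ qx) * (sg * sx)          # int matmul, then one rescale

# Case 2: per-sample scales, one per batch row n (shape (N, 1)).
# The factor sg_n * sx_n sits INSIDE the sum over n, so no single
# post-matmul rescale exists unless all sg_n * sx_n happen to be equal.
sg_n = np.abs(g).max(axis=1, keepdims=True) / 7
sx_n = np.abs(x).max(axis=1, keepdims=True) / 7
qg_n, qx_n = quantize(g, sg_n), quantize(x, sx_n)
# Correct dequantization must apply the scales BEFORE the contraction,
# i.e. it is no longer a pure integer matmul:
per_sample = (qg_n * sg_n).T @ (qx_n * sx_n)
```

So if the N×1 `norm_weight_loop` really acts as a per-batch-sample scale, the division has to happen before (or inside) the accumulation, which is why I doubt the grad_weight path is a genuine INT4 matmul.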
Nice work on this paper. I want to know: the paper mentions that all linear ops are quantized to INT4, but what about the matmul ops in the attention module? Is the activation gradient in those matmul ops float or INT4?