xijiu9 / Train_Transformers_with_INT4


The paper mentions that all linear ops are quantized to INT4, but what about the gradients of the mat-multiply ops in the attention module? Float or INT4? #4

Open brisker opened 1 year ago

brisker commented 1 year ago

Nice work in this paper. I would like to know: the paper mentions that all linear ops are quantized to INT4, but what about the mat-multiply ops in the attention module? Are the activation gradients in those matmul ops kept in float, or are they also quantized to INT4?
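
For concreteness, here is a minimal sketch (not the repo's code) of the two batched matmuls inside standard scaled dot-product attention whose backward passes the question refers to. All names are illustrative, and `quantize` would be a hypothetical stand-in for whatever INT4 quantizer the paper uses.

```python
import torch

def attention_forward(q, k, v):
    # scores: (B, H, N, N) = Q @ K^T   -- first matmul
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    p = scores.softmax(dim=-1)
    # out: (B, H, N, d) = P @ V        -- second matmul
    return p @ v, p

# In the backward pass, the gradients of these two matmuls are themselves
# matmuls, e.g. for out = P @ V:
#     dP = dOut @ V^T      and      dV = P^T @ dOut
# The question is whether dOut / dP here are quantized to INT4 (like the
# activation gradients of the linear layers) or kept in floating point.
```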

brisker commented 1 year ago

@xijiu9 Besides, in the grad_weight calculation the code here does not seem to be a true INT4 matmul, since sample_x3 is divided by norm_weight_loop after it has been quantized to INT4 here. The code is a little confusing to me: norm_weight_loop, which has shape N×1, is involved in the backprop, so is your INT4 matmul quantized per channel along the batch dimension? If so, this cannot be done in hardware (or it loses the acceleration benefit of quantization), because the Cout×N (activation gradient) by N×Cin (input activation) matmul cannot be per-channel quantized along N, since N is the reduction dimension of that matmul.
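
To illustrate the concern, here is a small numerical sketch (not the repo's code) assuming symmetric per-row INT4 quantization; all names and shapes are hypothetical. Because grad_weight = dY^T @ X reduces over the batch/token dimension N, a per-N scale sits inside the sum and cannot be pulled out of a pure integer matmul:

```python
import torch

N, Cout, Cin = 8, 4, 6
dY = torch.randn(N, Cout)   # activation gradient, shape N x Cout
X  = torch.randn(N, Cin)    # input activation,    shape N x Cin

def quant_int4_per_row(t):
    # symmetric per-row (per-N) INT4 quantization: scale has shape (N, 1)
    scale = t.abs().amax(dim=1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(t / scale), -8, 7)
    return q, scale

q_dY, s_dY = quant_int4_per_row(dY)
q_X,  s_X  = quant_int4_per_row(X)

# Exact dequantized grad_weight: the per-row scales multiply each term of
# the sum over N, so they must be applied before the reduction ...
gw_ref = (q_dY * s_dY).T @ (q_X * s_X)

# ... whereas a pure integer matmul followed by a single rescale needs the
# scales outside the sum, which only works if the scale is shared along N
# (i.e. not per-batch-channel). A crude single rescale does not match:
gw_int = (q_dY.T @ q_X) * (s_dY.mean() * s_X.mean())

print(torch.allclose(gw_ref, gw_int))  # False in general
```

This is just a sketch of why dividing by norm_weight_loop after quantization looks like it breaks the INT4 matmul; the repo's actual kernel may handle the scales differently.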