Open brisker opened 1 year ago
@xijiu9
Besides, in the grad_weight calculation, the code here does not seem to be an INT4 matmul, since `sample_x3` is divided by `norm_weight_loop` *after* being quantized to INT4. The code is a little confusing to me: `norm_weight_loop`, which has shape N×1, is involved in the backprop. Is your INT4 matmul per-channel quantization along the batch (N) axis? Even if so, this cannot be accelerated in hardware (it would lose the speed benefit of quantization), because in the grad_weight matmul between the Cout×N activation gradient and the N×Cin input activation, N is the reduction dimension, and per-channel scales along the reduction dimension cannot be factored out of the integer accumulation.
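To make the concern concrete, here is a small NumPy sketch (not the paper's code; all names and shapes are my own illustration) showing why per-tensor scales can be pulled out of an integer matmul while per-sample scales along the batch/reduction axis cannot:

```python
import numpy as np

# grad_weight = grad_output^T @ x, shapes (Cout, N) = (Cout, N) @ ... wait,
# concretely: (N, Cout)^T @ (N, Cin) -> (Cout, Cin); the batch axis N is
# the reduction axis of this matmul.
rng = np.random.default_rng(0)
N, Cout, Cin = 4, 3, 5
g = rng.standard_normal((N, Cout))   # activation gradient (illustrative)
x = rng.standard_normal((N, Cin))    # input activation (illustrative)

def quantize(t, scale):
    # symmetric fake-INT4 quantization; `scale` may broadcast per axis
    return np.clip(np.round(t / scale), -8, 7)

# Case 1: per-tensor scales. The integer matmul can run first and a single
# scalar rescale afterwards reproduces the dequantized product exactly.
sg = np.abs(g).max() / 7
sx = np.abs(x).max() / 7
qg, qx = quantize(g, sg), quantize(x, sx)
per_tensor = (qg.T @ qx) * (sg * sx)          # int matmul, then one rescale

# Case 2: per-sample scales, one per batch row n (shape (N, 1)).
# The factor sg_n * sx_n sits INSIDE the sum over n, so no single
# post-matmul rescale exists unless all sg_n * sx_n happen to be equal.
sg_n = np.abs(g).max(axis=1, keepdims=True) / 7
sx_n = np.abs(x).max(axis=1, keepdims=True) / 7
qg_n, qx_n = quantize(g, sg_n), quantize(x, sx_n)
# Correct dequantization must apply the scales BEFORE the contraction,
# i.e. it is no longer a pure integer matmul:
per_sample = (qg_n * sg_n).T @ (qx_n * sx_n)
```

So if the N×1 `norm_weight_loop` really acts as a per-batch-sample scale, the division has to happen before (or inside) the accumulation, which is why I doubt the grad_weight path is a genuine INT4 matmul.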
Nice work on this paper. I want to know: the paper mentions that all linear ops are quantized to INT4, but what about the matmul ops in the attention module? Is the activation gradient in those matmul ops float or INT4?