huningxin opened this issue 3 weeks ago
Thanks for the paper link. I'd be surprised if TFLite didn't have some blockwise support somewhere, but if not, it might need decomposition (e.g. `scale` and `zeroPoint` blockwise expanded up to the input shape via `tf.tile`, `tf.repeat`, `tf.image.resize`, or some other similar function, then `dq = (input - zeroPoint) * scale`).
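The decomposition above can be sketched in NumPy (standing in for the TF ops mentioned, with `np.repeat` playing the role of `tf.tile`/`tf.repeat`; the function name and layout here are my own illustration, not an existing API):

```python
import numpy as np

def blockwise_dequantize(input_q, scale, zero_point, axis=-1):
    """Dequantize `input_q`, where `scale` and `zero_point` hold one value
    per block along `axis`. The block size is implied by the shapes:
    block_size = input_size / scale_size along that axis."""
    block_size = input_q.shape[axis] // scale.shape[axis]
    # Expand the per-block scale and zeroPoint up to the input shape...
    scale_full = np.repeat(scale, block_size, axis=axis)
    zp_full = np.repeat(zero_point, block_size, axis=axis)
    # ...then apply dq = (input - zeroPoint) * scale elementwise.
    return (input_q.astype(np.float32) - zp_full) * scale_full

# Example: a 2x4 quantized tensor with block_size 2 along the last axis,
# so scale/zeroPoint are 2x2 (one entry per block).
x = np.array([[10, 12, 100, 104],
              [20, 24, 200, 208]], dtype=np.int32)
scale = np.array([[0.5, 0.25],
                  [0.5, 0.25]], dtype=np.float32)
zero_point = np.array([[10, 100],
                       [20, 200]], dtype=np.int32)
print(blockwise_dequantize(x, scale, zero_point))
# → [[0. 1. 0. 1.]
#    [0. 2. 0. 2.]]
```

The same expand-then-subtract-then-multiply shape works for the quantize direction by inverting the arithmetic.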
Block-wise quantization divides input tensors into smaller blocks that are quantized independently, resulting in faster optimization and higher-precision quantization. It is used by popular language models, such as the Phi-3 mini int4 quantized model.
Native ML APIs' support

- DML: `DML_OPERATOR_QUANTIZE` and `DML_OPERATOR_DEQUANTIZE`, introduced in Feature Level 6.3
- CoreML: `constexpr_blockwise_shift_scale`
- TFLite: ?

Proposal
No API signature changes are needed relative to @fdwr 's proposal of the `dequantizeLinear` and `quantizeLinear` ops. The `block_size` is an integer, implied by `block_size = input_size / scale_size` (where `input_size % scale_size == 0`) along a dimension. `zeroPoint` and `scale` should have the same shape.
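A minimal sketch of how the implied `block_size` could be derived per dimension under this proposal (the helper name and error messages are my assumptions, not spec text):

```python
def implied_block_sizes(input_shape, scale_shape, zero_point_shape):
    """Return the per-dimension block sizes implied by the shapes,
    enforcing the constraints stated in the proposal."""
    # zeroPoint and scale should have the same shape.
    if scale_shape != zero_point_shape:
        raise ValueError("scale and zeroPoint must have the same shape")
    if len(input_shape) != len(scale_shape):
        raise ValueError("scale rank must match input rank")
    block_sizes = []
    for dim, (input_size, scale_size) in enumerate(zip(input_shape, scale_shape)):
        # block_size = input_size / scale_size, requiring exact divisibility.
        if input_size % scale_size != 0:
            raise ValueError(f"input_size % scale_size != 0 along dimension {dim}")
        block_sizes.append(input_size // scale_size)
    return block_sizes

# Example: a [2, 4096] weight with per-row blocks of 32 along dimension 1.
print(implied_block_sizes([2, 4096], [2, 128], [2, 128]))
# → [1, 32]
```

Keeping `block_size` implicit this way avoids adding a new operand or attribute to the op signatures.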