webmachinelearning / webnn

🧠 Web Neural Network API
https://www.w3.org/TR/webnn/

Support block-wise quantization #779

Open huningxin opened 3 weeks ago

huningxin commented 3 weeks ago

Block-wise quantization divides input tensors into smaller blocks that are quantized independently, yielding faster optimization and higher-precision quantization. It is used for popular language models, such as the phi-3 mini int4 quantized model.
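
To make the idea concrete, here is a minimal TypeScript sketch of per-block affine quantization. The function name, the uint8 range, and the min/max parameter choice are illustrative assumptions, not part of the WebNN proposal.

```ts
// Hypothetical sketch: quantize a 1-D float32 array to uint8 in
// independent blocks, each with its own scale and zero point. An
// outlier only degrades precision within its own block.
function blockwiseQuantize(x: Float32Array, blockSize: number) {
  const numBlocks = x.length / blockSize; // assumes x.length % blockSize == 0
  const q = new Uint8Array(x.length);
  const scales = new Float32Array(numBlocks);
  const zeroPoints = new Uint8Array(numBlocks);
  for (let b = 0; b < numBlocks; ++b) {
    const block = x.subarray(b * blockSize, (b + 1) * blockSize);
    const min = Math.min(...block);
    const max = Math.max(...block);
    const scale = (max - min) / 255 || 1; // avoid divide-by-zero
    const zeroPoint = Math.round(-min / scale);
    scales[b] = scale;
    zeroPoints[b] = zeroPoint;
    for (let i = 0; i < blockSize; ++i) {
      const v = Math.round(block[i] / scale) + zeroPoint;
      q[b * blockSize + i] = Math.min(255, Math.max(0, v));
    }
  }
  return { q, scales, zeroPoints };
}
```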

Native ML APIs' support

DML: DML_OPERATOR_QUANTIZE and DML_OPERATOR_DEQUANTIZE, introduced in Feature Level 6.3
CoreML: constexpr_blockwise_shift_scale
TFLite: ?

Proposal

No API signature changes are needed relative to @fdwr's proposal of the dequantizeLinear and quantizeLinear ops.

MLOperand dequantizeLinear(MLOperand input, MLOperand scale, MLOperand zeroPoint, optional MLOperatorOptions options = {});
MLOperand quantizeLinear(MLOperand input, MLOperand scale, MLOperand zeroPoint, optional MLOperatorOptions options = {});

The block_size is an integer implied along each dimension by block_size = input_size / scale_size (where input_size % scale_size == 0). zeroPoint and scale should have the same shape.
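
For illustration, here is a rough TypeScript sketch of how a caller might express block-wise dequantization with these signatures. The concrete shapes, input names, and the availability of dequantizeLinear on MLGraphBuilder are assumptions for this example, not settled spec.

```ts
// Sketch: block-wise dequantization of a [64, 32] uint8 tensor with
// block_size = 16 along axis 0. scale and zeroPoint are both [4, 32]:
// 64 / 4 == 16 along axis 0 and 32 / 32 == 1 along axis 1.
const context = await navigator.ml.createContext();
const builder = new MLGraphBuilder(context);

const input = builder.input('input', { dataType: 'uint8', shape: [64, 32] });
const scale = builder.input('scale', { dataType: 'float32', shape: [4, 32] });
const zeroPoint =
    builder.input('zeroPoint', { dataType: 'uint8', shape: [4, 32] });

// block_size is never passed explicitly; it is implied per dimension
// by input_size / scale_size.
const dequantized = builder.dequantizeLinear(input, scale, zeroPoint);
```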

fdwr commented 3 weeks ago

Thanks for the paper link. I'd be surprised if TFLite didn't have some blockwise support somewhere, but if not, it might need decomposition (e.g. scale and zeroPoint blockwise-expanded up to the input shape via tf.tile, tf.repeat, tf.image.resize, or some other similar function, then dq = (input - zeroPoint) * scale).
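
That decomposition can be sketched with existing WebNN ops rather than the TF functions named above, reusing the shapes from the earlier example; this assumes expand, reshape, and cast behave per the current spec and is only an illustration of the idea.

```ts
// Blockwise-expand scale and zeroPoint up to the input shape, then
// apply dq = (input - zeroPoint) * scale elementwise.
// scale/zeroPoint: [4, 32] -> [4, 1, 32] -> [4, 16, 32] -> [64, 32].
const scaleFull = builder.reshape(
    builder.expand(builder.reshape(scale, [4, 1, 32]), [4, 16, 32]),
    [64, 32]);
const zpFull = builder.reshape(
    builder.expand(builder.reshape(zeroPoint, [4, 1, 32]), [4, 16, 32]),
    [64, 32]);

// Compute in float32: cast the quantized operands before subtracting.
const dq = builder.mul(
    builder.sub(builder.cast(input, 'float32'),
                builder.cast(zpFull, 'float32')),
    scaleFull);
```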