webmachinelearning / webnn

🧠 Web Neural Network API
https://www.w3.org/TR/webnn/

Support block-wise quantization #779

Open huningxin opened 3 weeks ago

huningxin commented 3 weeks ago

Block-wise quantization divides input tensors into smaller blocks that are quantized independently, yielding faster optimization and higher-precision quantization. It is used for popular language models, such as the phi-3 mini int4 quantized model.
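
To make the idea concrete, here is a minimal TypeScript sketch of per-block affine quantization. The function name, the uint8 range, and the min/max parameter choice are illustrative assumptions, not part of the WebNN proposal.

```ts
// Hypothetical sketch: quantize a 1-D float32 array to uint8 in
// independent blocks, each with its own scale and zero point. An
// outlier only degrades precision within its own block.
function blockwiseQuantize(x: Float32Array, blockSize: number) {
  const numBlocks = x.length / blockSize; // assumes x.length % blockSize == 0
  const q = new Uint8Array(x.length);
  const scales = new Float32Array(numBlocks);
  const zeroPoints = new Uint8Array(numBlocks);
  for (let b = 0; b < numBlocks; ++b) {
    const block = x.subarray(b * blockSize, (b + 1) * blockSize);
    const min = Math.min(...block);
    const max = Math.max(...block);
    const scale = (max - min) / 255 || 1; // avoid divide-by-zero
    const zeroPoint = Math.round(-min / scale);
    scales[b] = scale;
    zeroPoints[b] = zeroPoint;
    for (let i = 0; i < blockSize; ++i) {
      const v = Math.round(block[i] / scale) + zeroPoint;
      q[b * blockSize + i] = Math.min(255, Math.max(0, v));
    }
  }
  return { q, scales, zeroPoints };
}
```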

Native ML APIs' support

DML: DML_OPERATOR_QUANTIZE and DML_OPERATOR_DEQUANTIZE, introduced in Feature Level 6.3
CoreML: constexpr_blockwise_shift_scale
TFLite: ?

Proposal

No API signature changes are needed relative to @fdwr's proposal of the dequantizeLinear and quantizeLinear ops.

MLOperand dequantizeLinear(MLOperand input, MLOperand scale, MLOperand zeroPoint, optional MLOperatorOptions options = {});
MLOperand quantizeLinear(MLOperand input, MLOperand scale, MLOperand zeroPoint, optional MLOperatorOptions options = {});

The block_size is an integer implied along each dimension by block_size = input_size / scale_size (where input_size % scale_size == 0). zeroPoint and scale should have the same shape.
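
For illustration, here is a rough TypeScript sketch of how a caller might express block-wise dequantization with these signatures. The concrete shapes, input names, and the availability of dequantizeLinear on MLGraphBuilder are assumptions for this example, not settled spec.

```ts
// Sketch: block-wise dequantization of a [64, 32] uint8 tensor with
// block_size = 16 along axis 0. scale and zeroPoint are both [4, 32]:
// 64 / 4 == 16 along axis 0 and 32 / 32 == 1 along axis 1.
const context = await navigator.ml.createContext();
const builder = new MLGraphBuilder(context);

const input = builder.input('input', { dataType: 'uint8', shape: [64, 32] });
const scale = builder.input('scale', { dataType: 'float32', shape: [4, 32] });
const zeroPoint =
    builder.input('zeroPoint', { dataType: 'uint8', shape: [4, 32] });

// block_size is never passed explicitly; it is implied per dimension
// by input_size / scale_size.
const dequantized = builder.dequantizeLinear(input, scale, zeroPoint);
```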

fdwr commented 3 weeks ago

Thanks for the paper link. I'd be surprised if TFLite didn't have some blockwise support somewhere, but if not, it might need decomposition (e.g. scale and zeroPoint blockwise-expanded up to the input shape via tf.tile, tf.repeat, tf.image.resize, or some other similar function, then dq = (input - zeroPoint) * scale).
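
That decomposition can be sketched with existing WebNN ops rather than the TF functions named above, reusing the shapes from the earlier example; this assumes expand, reshape, and cast behave per the current spec and is only an illustration of the idea.

```ts
// Blockwise-expand scale and zeroPoint up to the input shape, then
// apply dq = (input - zeroPoint) * scale elementwise.
// scale/zeroPoint: [4, 32] -> [4, 1, 32] -> [4, 16, 32] -> [64, 32].
const scaleFull = builder.reshape(
    builder.expand(builder.reshape(scale, [4, 1, 32]), [4, 16, 32]),
    [64, 32]);
const zpFull = builder.reshape(
    builder.expand(builder.reshape(zeroPoint, [4, 1, 32]), [4, 16, 32]),
    [64, 32]);

// Compute in float32: cast the quantized operands before subtracting.
const dq = builder.mul(
    builder.sub(builder.cast(input, 'float32'),
                builder.cast(zpFull, 'float32')),
    scaleFull);
```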