webmachinelearning / webnn

šŸ§  Web Neural Network API
https://www.w3.org/TR/webnn/

Should `scale` and `bias` be required inputs for `batchNormalization` op? #481

Open huningxin opened 9 months ago

huningxin commented 9 months ago

(Thanks to @wacky6 for raising this issue while reviewing Chromium CL-5034594.)

According to the existing batchNormalization definition, scale and bias are optional members of the MLBatchNormalizationOptions dictionary. Per its calculation, if scale is not present, the element-wise multiplication can be eliminated, and if bias is not present, the element-wise addition can be eliminated as well.

// Assume input tensor is 4-D of the "nchw" layout.
const shape = [1, c, 1, 1];
let output = builder.div(
    builder.sub(input, builder.reshape(mean, shape)),
    builder.sqrt(builder.add(builder.reshape(variance, shape), builder.constant(options.epsilon))));
if (options.scale)
    output = builder.mul(builder.reshape(options.scale, shape), output);
if (options.bias)
    output = builder.add(builder.reshape(options.bias, shape), output);
return output;

However, optional scale and bias are not widely supported across frameworks and native ML APIs. This would make the implementation more complex for those native ML APIs that don't support optional scale and bias: if scale and bias are not present, the implementation would need to create, e.g., a bias tensor filled with 0s and a scale tensor filled with 1s at graph-building time.
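To see why the dummy-tensor substitution is valid, here is a plain numeric reference of the decomposition above (an illustrative sketch with nested arrays, not the WebNN implementation): omitting scale and bias produces exactly the same values as passing an all-ones scale and an all-zeros bias.

```javascript
// Numeric reference for the batchNormalization decomposition, assuming a
// 4-D "nchw" input as nested arrays. mean/variance/scale/bias are per-channel.
// Illustrative sketch only; function name and shape handling are not WebNN API.
function batchNorm(input, mean, variance, { scale, bias, epsilon = 1e-5 } = {}) {
  return input.map(batch =>
    batch.map((channel, c) =>
      channel.map(row =>
        row.map(x => {
          let y = (x - mean[c]) / Math.sqrt(variance[c] + epsilon);
          if (scale) y *= scale[c]; // eliminated when scale is absent
          if (bias) y += bias[c];   // eliminated when bias is absent
          return y;
        }))));
}

// shape [1, 2, 2, 2]: one batch, two channels
const input = [[[[1, 2], [3, 4]], [[5, 6], [7, 8]]]];
const mean = [2.5, 6.5], variance = [1.25, 1.25];

// Omitting scale/bias equals scale = all-ones, bias = all-zeros,
// which is what an implementation would synthesize for backends
// that require both tensors.
const withoutParams = batchNorm(input, mean, variance);
const withIdentity = batchNorm(input, mean, variance, { scale: [1, 1], bias: [0, 0] });
console.log(JSON.stringify(withoutParams) === JSON.stringify(withIdentity)); // true
```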

Frameworks:

Native ML APIs:

The proposal is to make the two operands required, for example:

dictionary MLBatchNormalizationOptions {
  unsigned long axis = 1;
  float epsilon = 1e-5;
  MLActivation activation;
};

partial interface MLGraphBuilder {
  MLOperand batchNormalization(MLOperand input, MLOperand mean, MLOperand variance,
                               MLOperand scale, MLOperand bias,
                               optional MLBatchNormalizationOptions options = {});
};

For models that don't use scale and bias at inference time, e.g., DenseNet 121, frameworks can set scale's values to 1 and bias's values to 0.
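With the required-operand signature, a framework converting such a model would synthesize identity parameters once at graph-building time. A minimal sketch (the helper name is hypothetical, not part of WebNN):

```javascript
// Hypothetical helper: build identity scale/bias values for a given channel
// count, so a model without them (e.g., DenseNet 121) can still satisfy the
// proposed required scale/bias operands.
function identityBatchNormParams(channels) {
  return {
    scale: new Float32Array(channels).fill(1), // multiplying by 1 is a no-op
    bias: new Float32Array(channels),          // zero-filled; adding 0 is a no-op
  };
}

const { scale, bias } = identityBatchNormParams(3);
console.log(Array.from(scale)); // [1, 1, 1]
console.log(Array.from(bias));  // [0, 0, 0]
```

The framework would then wrap these values as constant operands (e.g., via builder.constant) and pass them to batchNormalization.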

/cc @wchao1115 @fdwr

fdwr commented 9 months ago

@huningxin : Your analysis is persuasive. It might be convenient for callers to allow scale and bias to be optional, but if underlying backends do not support it (forcing implementations to add dummy 0 and 1 tensors), and frameworks are unlikely to generate such a call anyway, then making them required makes sense to me. (And yes, your reading of DML_BATCH_NORMALIZATION_OPERATOR_DESC is correct.)

wchao1115 commented 9 months ago

@huningxin If I read this correctly, are you saying that tensor params should never be optional b/c it causes the implementation to have to allocate unnecessary buffer resources for them when dealing with a platform API that already treats them as required?

huningxin commented 9 months ago

@wchao1115

@huningxin If I read this correctly, are you saying that tensor params should never be optional b/c it causes the implementation to have to allocate unnecessary buffer resources for them when dealing with a platform API that already treats them as required?

The buffer resources are less of a concern, because I suppose frameworks would have to allocate dummy 0 and 1 tensors anyway for models that don't need scale and bias, like DenseNet.

My point is that if the majority of frameworks and native ML APIs require scale and bias, WebNN might be worth aligning with them, because that would spare WebNN's implementation from handling this uncommon usage.

However, as I mentioned in the last WG call, this may on the other hand prevent a potential future optimization where a native implementation eliminates the unnecessary element-wise multiplication (for scale) and addition (for bias) when the two are not present. So I am wondering whether there is such a plan for that optimization in native implementations. We may want to make this interface future-proof.