webmachinelearning / webnn

🧠 Web Neural Network API
https://www.w3.org/TR/webnn/

Consider removing `MLActivation` parameters used for op fusion #658

a-sully closed this issue 3 weeks ago

a-sully commented 1 month ago

MLActivation currently has two uses (a short code sketch follows the table below):

  1. With operators like batchNormalization and conv2d, as a hint that this activation may be fusable with the given operator
  2. With recurrent operators like lstm and gru, as activations applied to gates within the given recurrent unit
| Operator | Recurrent? | Purpose |
| --- | --- | --- |
| batchNormalization | No | Maybe fusable |
| conv2d | No | Maybe fusable |
| convTranspose2d | No | Maybe fusable |
| gru | Yes | Applied to gates |
| gruCell | Yes | Applied to gates |
| lstm | Yes | Applied to gates |
| lstmCell | Yes | Applied to gates |
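
For concreteness, here is a minimal sketch of both usages, assuming the pre-removal shape of the API in which a no-input overload such as builder.relu() returns an MLActivation (illustrative only, not normative):

```ts
// Illustrative sketch of the pre-removal API, where builder.relu() called
// without an input operand returns an MLActivation rather than an MLOperand.
// `context`, `input`, `filter`, `x`, `weight`, `recurrentWeight`, `steps`, and
// `hiddenSize` are assumed to be defined elsewhere.
const builder = new MLGraphBuilder(context);

// Use (1): MLActivation as a fusion *hint* on a non-recurrent operator.
const conv = builder.conv2d(input, filter, {
  activation: builder.relu(),  // "this conv2d may be fusable with relu"
});

// Use (2): MLActivations applied to the gates inside a recurrent operator.
const outputs = builder.lstm(x, weight, recurrentWeight, steps, hiddenSize, {
  activations: [builder.sigmoid(), builder.tanh(), builder.tanh()],
});
```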

I have some thoughts on (2), but I'll leave that to another issue. Let's focus on (1) here :)

My claim is that there's no benefit to passing MLActivations as parameters to MLOperands for the purpose of op fusion. Here's why:

  1. False positives: For any given combination of operator, activation, and backend, op fusion may not be possible (in fact, support is quite unlikely)
    • CoreML does not natively support fusing activations with any of these operators
    • DirectML only supports fusing some activations with some operators. There's an existing Chromium bug to un-fuse MLActivations from their respective MLOperand if the combo is not supported for the given version of DML (a rough sketch of this fallback follows this list)
  2. False negatives: For any given operator which does not currently take an MLActivation in the WebNN spec, it may in fact be fusable with its input or output
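
For illustration, the un-fuse fallback described in the DirectML bullet above might look roughly like this (hypothetical names and graph representation, not the actual Chromium code):

```ts
// Hypothetical backend-side fallback, not the actual Chromium implementation:
// if the requested (operator, activation) combination is not natively fusable
// on the target backend, split it back into two separate nodes.
interface FusedNode {
  op: 'conv2d' | 'convTranspose2d' | 'batchNormalization';
  activation?: 'relu' | 'sigmoid' | 'tanh';
}

function lower(
  node: FusedNode,
  backendSupportsFusion: (node: FusedNode) => boolean,
): string[] {
  if (node.activation && !backendSupportsFusion(node)) {
    // False positive: the hint was given, but this backend/version cannot
    // honor it, so emit the base operator followed by a standalone activation.
    return [node.op, node.activation];
  }
  // Otherwise emit a single (possibly fused) native node.
  return [node.activation ? `${node.op}+${node.activation}` : node.op];
}
```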

What this means in practice is:

Whether a given operator can be fused with a given activation is a very backend-specific quirk. Presumably we don't want to plumb a new MLActivation parameter through to the web for every operator that some backend decides it can now fuse with some activation! This seems best left as an implementation detail, handled either by the user agent (as described above) or by the framework (who knows how much op fusion is happening in Core ML under the hood?! 🤔)

Thoughts?

fdwr commented 1 month ago

🤔 Having reviewed so many code reviews over the past year, these fields...

...have indeed proven to be problematic, in aspects including the following:

  • Spec validation issues - some operators (like softmax with the axis parameter) cannot be validated at creation time, because it's not until they are combined with a base operator that the input shape is known.

This added complexity is there for performance, and that performance is worthwhile. However, of the currently implemented backends, the one that benefits most from these fusions is the DirectML backend in Chromium, and that backend already does a one-operator look-ahead for simple adjacent fusions (like Conv+Relu). This means any WebNN caller can call either {conv, relu} or {conv+relu} and get the same performance benefit, with no fusion logic required from the caller at the WebNN API level. So, I support removing the activation fields from MLConv2dOptions, MLConvTranspose2dOptions, and MLBatchNormalizationOptions, as the intended benefit still happens, and it eases the implementation burden on the front end, back end, and conformance testing.
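
In other words, once the fields are removed, callers simply emit the adjacent pattern and rely on the backend's look-ahead (sketch; `builder`, `input`, and `filter` assumed as above):

```ts
// With the activation options removed, the caller just builds adjacent ops.
// A backend with one-operator look-ahead (such as the Chromium DirectML
// backend described above) can still fuse conv2d + relu internally.
const conv = builder.conv2d(input, filter);
const activated = builder.relu(conv);  // adjacent activation, fusable by the backend
```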

FYI @mingmingtasd.

(I also want to check with Chai and Rafael for their opinions...)

a-sully commented 1 month ago

Thanks for the support, @fdwr. I've marked #664 as ready for review!

huningxin commented 1 month ago

Thanks for the proposal @a-sully and discussion @fdwr !

I think it is good to leave activation function fusion to the underlying graph optimizer (either in the Chromium backends, such as the DirectML backend, or in the native ML frameworks, such as XNNPACK's xnn_subgraph_fusion()). That would make the WebNN API clearer.

Previously, when WebNN was used as a target backend by frameworks such as ONNX Runtime Web and the TFLite Wasm runtime, WebNN was likely to be given a convolution with a fused activation function thanks to the framework's graph optimizer. A WebNN backend could leverage that and map it directly to a native convolution operator that supports a fused activation function. Now the WebNN backend has to determine whether the native framework fuses activation functions itself. If the native framework doesn't fuse, the WebNN backend should add an optimization pass and do the fusion itself in order to avoid a performance regression.
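
A graph-level fusion pass of the kind described above could look roughly like this (hypothetical graph representation; a sketch of the idea, not any real backend's code):

```ts
// Hypothetical post-construction optimization pass: fold an elementwise
// activation into its producer when the producer supports fusion and the
// intermediate result has no other consumers. Graph outputs are not modeled.
interface Node {
  id: number;
  op: string;            // e.g. 'conv2d', 'relu'
  inputs: number[];      // ids of producer nodes
  fusedActivation?: string;
}

const FUSABLE_INTO = new Set(['conv2d', 'convTranspose2d', 'batchNormalization']);
const ACTIVATIONS = new Set(['relu', 'sigmoid', 'tanh']);

function fuseActivations(nodes: Node[]): Node[] {
  // Count how many nodes consume each node's output.
  const consumerCount = new Map<number, number>();
  for (const n of nodes) {
    for (const i of n.inputs) {
      consumerCount.set(i, (consumerCount.get(i) ?? 0) + 1);
    }
  }

  const byId = new Map<number, Node>();
  for (const n of nodes) byId.set(n.id, n);

  const removed = new Set<number>();
  for (const n of nodes) {
    if (!ACTIVATIONS.has(n.op) || n.inputs.length !== 1) continue;
    const producer = byId.get(n.inputs[0]);
    if (!producer || !FUSABLE_INTO.has(producer.op)) continue;
    if (producer.fusedActivation || consumerCount.get(producer.id) !== 1) continue;

    producer.fusedActivation = n.op;  // fold the activation into its producer
    removed.add(n.id);
    // Rewire consumers of the activation to read from the producer instead.
    for (const m of nodes) {
      m.inputs = m.inputs.map((i) => (i === n.id ? producer.id : i));
    }
  }
  return nodes.filter((n) => !removed.has(n.id));
}
```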

Would it be worth adding an implementation note?

a-sully commented 4 weeks ago

Thanks, @huningxin. I didn't mention anything about op fusion in #664 because there's already this text in the spec:

Before the execution, the computation graph that is used to compute one or more specified outputs needs to be compiled and optimized. The key purpose of the compilation step is to enable optimizations that span two or more operations, such as operation or loop fusion.

If you have more suggestions, please comment on that PR! :)

huningxin commented 4 weeks ago

@a-sully

I didn't mention anything about op fusion in #664 because there's already this text in the spec:

Before the execution, the computation graph that is used to compute one or more specified outputs needs to be compiled and optimized. The key purpose of the compilation step is to enable optimizations that span two or more operations, such as operation or loop fusion.

That one is good, but it is more about graph-compilation-time optimization, which is usually done by the native ML APIs. I think we won't want to miss the graph-construction-time optimization opportunity when it is natively supported, as the spec text mentioned for the previous MLActivation interface:

These activation function types are used to create other operations. One such use of this interface is for when an activation function is fused into another operation such as conv2d() or batchNormalization() during a graph construction session. Such fused activation functions can provide a significant performance improvement when supported natively by the underlying implementation. This is intended as an optimization opportunity for implementers.

The DirectML backend in Chromium exploits this op fusion opportunity when constructing the DirectML graph.

If you have more suggestions, please comment on that PR! :)

Will do!

mingmingtasd commented 4 weeks ago
  • Spec validation issues - some operators (like softmax with the axis parameter) cannot be validated at creation time because it's not until they are combined with a base operator that the input shape is known.

@fdwr Correct! Take softmax with axis, for example: if the base operator's output shape is unknown, we can't validate the axis and so can't fuse them.
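
To make that concrete, here is a hypothetical sketch (it assumes an axis-taking softmax overload that returns an MLActivation, which is the combination under discussion rather than shipped API; operands such as `input4d`, `input2d`, `filter`, `mean`, and `variance` are assumed to exist):

```ts
// Hypothetical: softmax(axis) used as an MLActivation. At creation time there
// is no input shape, so the axis cannot be range-checked yet.
const act = builder.softmax(/* axis */ 3);

// Only once the activation is combined with a base operator is the output
// shape known. For a conv2d producing a 4-D output, axis 3 is in range...
const ok = builder.conv2d(input4d, filter, { activation: act });

// ...but for a base operator producing a 2-D output (e.g. batchNormalization
// over a 2-D input), the same axis 3 is out of range, and that can only be
// detected here, not when the MLActivation was created.
const bad = builder.batchNormalization(input2d, mean, variance, { activation: act });
```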