webmachinelearning / webnn

🧠 Web Neural Network API
https://www.w3.org/TR/webnn/

Consider removing `MLActivation` parameters used for op fusion #658

a-sully closed this issue 3 weeks ago

a-sully commented 1 month ago

MLActivation currently has two uses (a short code sketch follows the table below):

  1. With operators like batchNormalization and conv2d, as a hint that this activation may be fusable with the given operator
  2. With recurrent operators like lstm and gru, as activations applied to gates within the given recurrent unit
| Operator | Recurrent? | Purpose |
| --- | --- | --- |
| batchNormalization | No | Maybe fusable |
| conv2d | No | Maybe fusable |
| convTranspose2d | No | Maybe fusable |
| gru | Yes | Applied to gates |
| gruCell | Yes | Applied to gates |
| lstm | Yes | Applied to gates |
| lstmCell | Yes | Applied to gates |
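
For concreteness, here is a minimal sketch of both usages, assuming the pre-removal shape of the API in which a no-input overload such as builder.relu() returns an MLActivation (illustrative only, not normative):

```ts
// Illustrative sketch of the pre-removal API, where builder.relu() called
// without an input operand returns an MLActivation rather than an MLOperand.
// `context`, `input`, `filter`, `x`, `weight`, `recurrentWeight`, `steps`, and
// `hiddenSize` are assumed to be defined elsewhere.
const builder = new MLGraphBuilder(context);

// Use (1): MLActivation as a fusion *hint* on a non-recurrent operator.
const conv = builder.conv2d(input, filter, {
  activation: builder.relu(),  // "this conv2d may be fusable with relu"
});

// Use (2): MLActivations applied to the gates inside a recurrent operator.
const outputs = builder.lstm(x, weight, recurrentWeight, steps, hiddenSize, {
  activations: [builder.sigmoid(), builder.tanh(), builder.tanh()],
});
```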

I have some thoughts on (2), but I'll leave that to another issue. Let's focus on (1) here :)

My claim is that there's no benefit to passing MLActivations as parameters to MLOperands for the purpose of op fusion. Here's why:

  1. False positives: For any given combination of operator, activation, and backend, op fusion may not be possible (in fact, support is quite unlikely)
    • CoreML does not natively support fusing activations with any of these operators
    • DirectML only supports fusing some activations with some operators. There's an existing Chromium bug to un-fuse MLActivations from their respective MLOperand if the combo is not supported for the given version of DML (a rough sketch of this fallback follows this list)
  2. False negatives: For any given operator which does not currently take an MLActivation in the WebNN spec, it may in fact be fusable with its input or output
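
For illustration, the un-fuse fallback described in the DirectML bullet above might look roughly like this (hypothetical names and graph representation, not the actual Chromium code):

```ts
// Hypothetical backend-side fallback, not the actual Chromium implementation:
// if the requested (operator, activation) combination is not natively fusable
// on the target backend, split it back into two separate nodes.
interface FusedNode {
  op: 'conv2d' | 'convTranspose2d' | 'batchNormalization';
  activation?: 'relu' | 'sigmoid' | 'tanh';
}

function lower(
  node: FusedNode,
  backendSupportsFusion: (node: FusedNode) => boolean,
): string[] {
  if (node.activation && !backendSupportsFusion(node)) {
    // False positive: the hint was given, but this backend/version cannot
    // honor it, so emit the base operator followed by a standalone activation.
    return [node.op, node.activation];
  }
  // Otherwise emit a single (possibly fused) native node.
  return [node.activation ? `${node.op}+${node.activation}` : node.op];
}
```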

What this means in practice is:

Whether a given operator can be fused with a given activation is a very backend-specific quirk. Presumably we don't want to plumb a new MLActivation parameter through to the web for every operator that some backend decides it can now fuse with some activation! This seems best left as an implementation detail, handled either by the user agent (as described above) or by the framework (who knows how much op fusion is happening in Core ML under the hood?! 🤔)

Thoughts?

fdwr commented 1 month ago

🤔 Having reviewed so many code reviews over the past year, these fields...

...have indeed proven to be problematic, in aspects including the following:

  • Spec validation issues - some operators (like softmax with the axis parameter) cannot be validated at creation time, because it's not until they are combined with a base operator that the input shape is known.

This added complexity is there for performance, and that performance is worthwhile. However, of the currently implemented backends, the one that benefits most from these fusions is the DirectML backend in Chromium, and that backend already does a one-operator look-ahead for simple adjacent fusions (like Conv+Relu). This means any WebNN caller can call either {conv, relu} or {conv+relu} and get the same performance benefit, with no fusion logic required from the caller at the WebNN API level. So, I support removing the activation fields from MLConv2dOptions, MLConvTranspose2dOptions, and MLBatchNormalizationOptions, as the intended benefit still happens, and it eases the implementation burden on the front end, back end, and conformance testing.
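
In other words, once the fields are removed, callers simply emit the adjacent pattern and rely on the backend's look-ahead (sketch; `builder`, `input`, and `filter` assumed as above):

```ts
// With the activation options removed, the caller just builds adjacent ops.
// A backend with one-operator look-ahead (such as the Chromium DirectML
// backend described above) can still fuse conv2d + relu internally.
const conv = builder.conv2d(input, filter);
const activated = builder.relu(conv);  // adjacent activation, fusable by the backend
```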

FYI @mingmingtasd.

(I also want to check with Chai and Rafael for their opinions...)

a-sully commented 1 month ago

Thanks for the support, @fdwr. I've marked #664 as ready for review!

huningxin commented 1 month ago

Thanks for the proposal @a-sully and discussion @fdwr !

I think it is good to leave activation function fusion to the underlying graph optimizer (either in the Chromium backends, such as the DirectML backend, or in the native ML frameworks, such as XNNPACK's xnn_subgraph_fusion()). That would make the WebNN API clearer.

Previously, when WebNN was used as a target backend by frameworks such as ONNX Runtime Web and the TFLite Wasm runtime, WebNN was likely to be given a convolution with a fused activation function thanks to the framework's graph optimizer. A WebNN backend could leverage that and map it directly to a native convolution operator that supports a fused activation function. Now the WebNN backend has to determine whether the native framework fuses activation functions itself. If the native framework doesn't fuse, the WebNN backend should add an optimization pass and do the fusion itself in order to avoid a performance regression.
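
A graph-level fusion pass of the kind described above could look roughly like this (hypothetical graph representation; a sketch of the idea, not any real backend's code):

```ts
// Hypothetical post-construction optimization pass: fold an elementwise
// activation into its producer when the producer supports fusion and the
// intermediate result has no other consumers. Graph outputs are not modeled.
interface Node {
  id: number;
  op: string;            // e.g. 'conv2d', 'relu'
  inputs: number[];      // ids of producer nodes
  fusedActivation?: string;
}

const FUSABLE_INTO = new Set(['conv2d', 'convTranspose2d', 'batchNormalization']);
const ACTIVATIONS = new Set(['relu', 'sigmoid', 'tanh']);

function fuseActivations(nodes: Node[]): Node[] {
  // Count how many nodes consume each node's output.
  const consumerCount = new Map<number, number>();
  for (const n of nodes) {
    for (const i of n.inputs) {
      consumerCount.set(i, (consumerCount.get(i) ?? 0) + 1);
    }
  }

  const byId = new Map<number, Node>();
  for (const n of nodes) byId.set(n.id, n);

  const removed = new Set<number>();
  for (const n of nodes) {
    if (!ACTIVATIONS.has(n.op) || n.inputs.length !== 1) continue;
    const producer = byId.get(n.inputs[0]);
    if (!producer || !FUSABLE_INTO.has(producer.op)) continue;
    if (producer.fusedActivation || consumerCount.get(producer.id) !== 1) continue;

    producer.fusedActivation = n.op;  // fold the activation into its producer
    removed.add(n.id);
    // Rewire consumers of the activation to read from the producer instead.
    for (const m of nodes) {
      m.inputs = m.inputs.map((i) => (i === n.id ? producer.id : i));
    }
  }
  return nodes.filter((n) => !removed.has(n.id));
}
```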

Would it be worth adding an implementation note?

a-sully commented 4 weeks ago

Thanks, @huningxin. I didn't mention anything about op fusion in #664 because there's already this text in the spec:

Before the execution, the computation graph that is used to compute one or more specified outputs needs to be compiled and optimized. The key purpose of the compilation step is to enable optimizations that span two or more operations, such as operation or loop fusion.

If you have more suggestions, please comment on that PR! :)

huningxin commented 4 weeks ago

@a-sully

I didn't mention anything about op fusion in #664 because there's already this text in the spec:

Before the execution, the computation graph that is used to compute one or more specified outputs needs to be compiled and optimized. The key purpose of the compilation step is to enable optimizations that span two or more operations, such as operation or loop fusion.

That one is good, but it is more about graph-compilation-time optimization, which is usually done by the native ML APIs. I think we won't want to miss the graph-construction-time optimization opportunity when it is natively supported, as the spec text mentioned for the previous MLActivation interface:

These activation function types are used to create other operations. One such use of this interface is for when an activation function is fused into another operation such as conv2d() or batchNormalization() during a graph construction session. Such fused activation functions can provide a significant performance improvement when supported natively by the underlying implementation. This is intended as an optimization opportunity for implementers.

The DirectML backend in Chromium exploits this op fusion opportunity when constructing the DirectML graph.

If you have more suggestions, please comment on that PR! :)

Will do!

mingmingtasd commented 4 weeks ago
  • Spec validation issues - some operators (like softmax with the axis parameter) cannot be validated at creation time because it's not until they are combined with a base operator that the input shape is known.

@fdwr Correct! Take softmax with axis, for example: if the base operator's output shape is unknown, we can't validate the axis and so can't fuse them.
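
To make that concrete, here is a hypothetical sketch (it assumes an axis-taking softmax overload that returns an MLActivation, which is the combination under discussion rather than shipped API; operands such as `input4d`, `input2d`, `filter`, `mean`, and `variance` are assumed to exist):

```ts
// Hypothetical: softmax(axis) used as an MLActivation. At creation time there
// is no input shape, so the axis cannot be range-checked yet.
const act = builder.softmax(/* axis */ 3);

// Only once the activation is combined with a base operator is the output
// shape known. For a conv2d producing a 4-D output, axis 3 is in range...
const ok = builder.conv2d(input4d, filter, { activation: act });

// ...but for a base operator producing a 2-D output (e.g. batchNormalization
// over a 2-D input), the same axis 3 is out of range, and that can only be
// detected here, not when the MLActivation was created.
const bad = builder.batchNormalization(input2d, mean, variance, { activation: act });
```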