webmachinelearning / webnn

🧠 Web Neural Network API
https://www.w3.org/TR/webnn/

Should softmax be a fused activation? #471

Closed mingmingtasd closed 3 months ago

mingmingtasd commented 10 months ago

(Raised by @junwei in Chromium CL review: https://chromium-review.googlesource.com/c/chromium/src/+/4920337/comment/6ed907ff_2f86a723/)

WebNN's softmax can be an MLActivation.

However, DirectML does not support fusing softmax, according to the doc: https://learn.microsoft.com/en-us/windows/ai/directml/dml-fused-activations
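For illustration, the fused form under discussion looks roughly like this at the API level (a minimal sketch; the interfaces below are local stubs standing in for the real WebNN ones, and the argument-less softmax() overload returning an MLActivation reflects my reading of the API shape at the time, not normative text):

```ts
// Minimal sketch of the MLActivation-based shape discussed in this issue.
// Local type stubs stand in for the real WebNN interfaces.
interface MLActivation {}
interface MLOperand {}
interface MLGraphBuilder {
  softmax(): MLActivation; // activation form, no input operand (assumed overload)
  conv2d(
    input: MLOperand,
    filter: MLOperand,
    options?: { activation?: MLActivation }
  ): MLOperand;
}

function buildConvWithFusedSoftmax(
  builder: MLGraphBuilder,
  input: MLOperand,
  filter: MLOperand
): MLOperand {
  const activation = builder.softmax(); // an MLActivation, not an MLOperand
  // The question here: can a backend such as DirectML actually fuse this
  // softmax into the conv2d, or must it run softmax as a separate op?
  return builder.conv2d(input, filter, { activation });
}
```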

/cc @fdwr @huningxin

fdwr commented 10 months ago

> However, DirectML does not support fusing softmax, according to the doc

It looks like a documentation issue, because the actual code accepts it. Will follow up with team and doc writer...

[update] It was added in DML 1.12. So for older DML targets, we could always issue a separate Softmax call upon seeing that activation.
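Roughly, that fallback could look like this (a backend-neutral sketch in TypeScript; the real backend is C++, and the node shapes and the supportsFusedSoftmax flag are made up for illustration):

```ts
// Hypothetical lowering: if the target (e.g. DML older than 1.12) cannot fuse
// softmax, emit conv2d and softmax as two separate nodes instead of one fused node.
type LoweredNode =
  | { op: "conv2d"; fusedActivation?: "softmax" }
  | { op: "softmax" };

function lowerConv2dWithSoftmaxActivation(
  supportsFusedSoftmax: boolean
): LoweredNode[] {
  if (supportsFusedSoftmax) {
    // Newer targets: keep the softmax attached as a fused activation.
    return [{ op: "conv2d", fusedActivation: "softmax" }];
  }
  // Older targets: issue a separate softmax call upon seeing that activation.
  return [{ op: "conv2d" }, { op: "softmax" }];
}
```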

inexorabletash commented 4 months ago

Given the above, implementations can either (1) only support backends that can fuse softmax (e.g. by bundling a backend, refusing to run on older backends, etc.) or (2) implement support in user-land, so I think we can close this out.

Do you agree, @fdwr?

fdwr commented 4 months ago

> Given the above, implementations can either (1) only support backends that can fuse softmax (e.g. by bundling a backend, refusing to run on older backends, etc.) or (2) implement support in user-land, so I think we can close this out.
>
> Do you agree, @fdwr?

@inexorabletash Yes, in any case, this is closeable because the backend can choose to execute it separately. I don't think it's good to croak on the model (your option 1), as backends should instead gracefully execute the convolution and the MLConv2dOptions::activation separately. @mingmingtasd: Concur?

huningxin commented 4 months ago

> as backends should instead gracefully execute the convolution and the MLConv2dOptions::activation separately.

Should we add a note to the spec mentioning this fallback path?

wacky6 commented 4 months ago

Purely as a thought experiment:

Softmax requires summing over a dimension (out[i] = exp(input[i]) / sum(exp(input))). In this sense, softmax feels more like a normalization than an activation.

Other activations defined in the spec are element-wise (out[i] = fn(input[i])), so they can be trivially applied when filling the output buffer.

Just curious, how could a fused softmax improve performance (or reduce memory consumption)? How is a fused softmax activation different from syntactic sugar for "Op + Softmax"?

Intuitively, the entire pre-softmax output (i.e. the intermediate result) needs to be computed before softmax can produce any result (because of the dependency on sum(exp(input))).

So it seems the intermediate result is unavoidable (as opposed to being inlined, as for activations such as ReLU), negating the benefit of a fused activation.
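To make the contrast concrete, here is a small sketch on flat arrays (plain TypeScript, not the WebNN API): an element-wise activation can be applied while each output element is written, whereas softmax cannot finalize any element until the whole row exists.

```ts
// Illustrative only. Element-wise fusion: apply fn as each output element is
// produced, so no separate intermediate buffer or extra pass is needed.
function fillWithElementwiseActivation(
  out: Float32Array,
  produce: (i: number) => number, // e.g. the conv result for element i
  fn: (x: number) => number       // e.g. ReLU: (x) => Math.max(0, x)
): void {
  for (let i = 0; i < out.length; i++) {
    out[i] = fn(produce(i));
  }
}

// Softmax over a row: the complete pre-activation row (and its max and sum)
// is required before any single output element can be finalized.
function softmaxRow(row: Float32Array): Float32Array {
  const max = row.reduce((a, b) => Math.max(a, b), -Infinity);
  let sum = 0;
  const exps = row.map((x) => {
    const e = Math.exp(x - max); // subtract max for numerical stability
    sum += e;
    return e;
  });
  return exps.map((e) => e / sum);
}
```

So the intermediate values do have to exist either way; presumably any benefit of fusing softmax would come from avoiding extra memory traffic over that intermediate rather than from skipping it.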

fdwr commented 4 months ago

> In this sense, softmax feels more like a normalization than an activation.

@wacky6: Softmax is also a normalization. I tried stuffing ML ops into a few tidy categories of my own arbitrary grouping here, but then I realized they overlap (e.g. softmax in the table).

fdwr commented 3 months ago

Related: https://github.com/webmachinelearning/webnn/issues/658

> How could a fused softmax improve performance (or reduce memory consumption)? How is a fused softmax activation different from syntactic sugar for "Op + Softmax"?

@wacky6: I haven't seen the algorithm firsthand, but I recall there's a clever optimization where GPU/NPU drivers can factor the reduction of softmax's denominator directly into the convolution pass, without needing a separate resummation.
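If it helps, the optimization being recalled sounds like the widely known "online softmax" recurrence: the running max and the denominator are accumulated in the same pass that produces each pre-softmax value, so no separate resummation pass over the intermediate tensor is needed afterwards. A minimal sketch, not tied to any particular driver or to what DML actually does:

```ts
// Sketch of the "online softmax" recurrence: fold the max and the denominator
// into the same pass that produces each pre-softmax value x[i]. Afterwards
// out[i] = exp(x[i] - m) / d, with no second reduction pass over the inputs.
function onlineSoftmax(produce: (i: number) => number, n: number): Float32Array {
  const x = new Float32Array(n);
  let m = -Infinity; // running max
  let d = 0;         // running denominator, kept relative to the current max
  for (let i = 0; i < n; i++) {
    x[i] = produce(i); // e.g. one convolution output element
    const mNew = Math.max(m, x[i]);
    d = d * Math.exp(m - mNew) + Math.exp(x[i] - mNew); // rescale, then accumulate
    m = mNew;
  }
  return x.map((xi) => Math.exp(xi - m) / d);
}
```

The intermediate values are still produced, as wacky6 notes, but the denominator reduction piggybacks on the pass that produces them instead of requiring a separate read of the intermediate tensor.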

> Should we add a note to the spec mentioning this fallback path?

@huningxin: Alternatively, if we delete the 3 activation fields per issue #658 from @a-sully, then this issue becomes moot, as we'd instead add a note about backend fusions rather than a note about backend unfusions.

a-sully commented 3 months ago

> @huningxin: Alternatively, if we delete the 3 activation fields per issue #658 from @a-sully, then this issue becomes moot, as we'd instead add a note about backend fusions rather than a note about backend unfusions.

Agreed. I propose we close this issue once #664 merges.

huningxin commented 3 months ago

> @huningxin: Alternatively, if we delete the 3 activation fields per issue #658 from @a-sully, then this issue becomes moot, as we'd instead add a note about backend fusions rather than a note about backend unfusions.

> Agreed. I propose we close this issue once #664 merges.

SGTM!

a-sully commented 3 months ago

#664 merged 🎉

fdwr commented 3 months ago

Closing, as the activation field is deleted now and the DML backend will decide whether to fuse based on the DML feature level (a tangentially related enabling CL: https://chromium-review.googlesource.com/c/chromium/src/+/5501114).