webmachinelearning / webnn

🧠 Web Neural Network API
https://www.w3.org/TR/webnn/

Consider removing `steepness` parameter of softplus #645

Closed shiyi9801 closed 4 months ago

shiyi9801 commented 4 months ago

This was raised by @a-sully in CL review, thanks!

Softplus calculates ln(1 + exp(steepness * x)) / steepness; when steepness is 0, this results in division by zero.

I tried PyTorch's torch.nn.Softplus(beta=0) and the results are all inf. TF and ONNX don't have this attribute, and DirectML doesn't support steepness < 1.0.
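
For example, a rough NumPy sketch of the decomposed formula (just the math, not any real implementation) shows the blow-up:

import numpy as np

def softplus_steepness(x, steepness):
    # Current WebNN definition: ln(1 + exp(steepness * x)) / steepness
    return np.log1p(np.exp(steepness * x)) / steepness

x = np.array([0.5, -0.5, 2.0], dtype=np.float32)
print(softplus_steepness(x, np.float32(1.0)))      # ordinary softplus values
with np.errstate(divide="ignore"):
    print(softplus_steepness(x, np.float32(0.0)))  # ln(2) / 0 -> inf everywhere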

huningxin commented 4 months ago

Does a negative steepness value make sense? Softplus should produce positive results (as a smooth approximation to relu), but a negative steepness would produce negative results.

a-sully commented 4 months ago

I think it's worth taking a step back and asking whether steepness is needed for this operator in the first place...

As mentioned above, TF and ONNX only support a more basic variant of softplus which computes log( 1 + e^x ) elementwise. This also matches the behavior of CoreML's softplus, though CoreML also supports a "parametric" variant which computes alpha_i * log( 1 + e^( beta_i * x_i ) ). This more generic operator can emulate the variant specified by DML without introducing the undefined behavior of division by 0, and it will also happily accept negative values for alpha and beta.
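
As a rough NumPy sketch of that mapping (just the math, not CoreML API calls), choosing alpha = 1/steepness and beta = steepness reproduces the steepness variant:

import numpy as np

def parametric_softplus(x, alpha, beta):
    # CoreML-style parametric form: alpha * ln(1 + exp(beta * x))
    return alpha * np.log1p(np.exp(beta * x))

steepness = np.float32(4.0)
x = np.array([0.25, -0.75, 1.5], dtype=np.float32)

# alpha = 1/steepness, beta = steepness reproduces
# ln(1 + exp(steepness * x)) / steepness without an explicit division op.
print(parametric_softplus(x, 1.0 / steepness, steepness))
print(np.log1p(np.exp(steepness * x)) / steepness)  # same values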

How important is steepness? Many ML frameworks don't support it - including ONNX, which (for now, at least) is the primary consumer of WebNN. What would be the impacts of removing it?

fdwr commented 4 months ago

The expected results of division by zero for floating point values are well defined: a positive finite value divided by zero yields +inf, a negative finite value yields -inf, and 0/0 yields NaN (unlike division by zero for integers, there's nothing ambiguous in the IEEE standard for floating point).
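
A quick NumPy check of those IEEE 754 rules (illustrative; any conformant floating point implementation behaves the same):

import numpy as np

with np.errstate(divide="ignore", invalid="ignore"):
    print(np.float32(1.0) / np.float32(0.0))   # inf
    print(np.float32(-1.0) / np.float32(0.0))  # -inf
    print(np.float32(0.0) / np.float32(0.0))   # nan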

Also DirectML doesn't support steepness < 1.0.

It does now 😉. Coincidentally we relaxed DirectML's SOFTPLUS validation in February to permit steepness < 1, including negative values and even 0, for the sake of PyTorch and potentially WebNN (but that version is not out yet, and the docs are still valid for DML 1.13).

Does negative steepness value make sense?

🤔 I don't know the use case, but like you say, the graph is smooth, and PyTorch supports it without complaint:

import torch

s = torch.nn.Softplus(beta=-1.0)
x = torch.tensor([0.5930860043, 0.9014285803, -0.6331304312, 0.4639878273], dtype=torch.float32)
y = s(x)

print("value:", y)
print("shape:", y.shape)
print("dtype:", y.dtype)

# value: tensor([-0.4399, -0.3407, -1.0590, -0.4878])
# shape: torch.Size([4])
# dtype: torch.float32

Other libraries would need to support it via decomposition (assuming the parameter were kept), in which case the same question would arise anyway, just in the div operator instead. I feel the cleanest way to answer questions like this for operators is to simply ask: what result would an equivalent decomposition produce?
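
For instance, a small illustrative PyTorch sketch of that equivalent decomposition (mul, then softplus, then div) matches the fused Softplus(beta=steepness):

import torch

steepness = -1.0
x = torch.tensor([0.5930860043, 0.9014285803, -0.6331304312, 0.4639878273])

# Fused: PyTorch's built-in Softplus with beta = steepness.
fused = torch.nn.Softplus(beta=steepness)(x)

# Equivalent decomposition: mul -> softplus -> div.
decomposed = torch.nn.functional.softplus(x * steepness) / steepness

print(fused)
print(decomposed)
print(torch.allclose(fused, decomposed))  # True, up to floating point rounding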

fdwr commented 4 months ago

I think it's worth taking a step back and asking whether steepness is needed for this operator in the first place... What would be the impacts of removing it?

Considerations I see include:

Semantics: I would be interested to know where this parameter came from in the first place, like maybe a paper that introduced it, and why PyTorch has it.

Front-end complexity: Currently the biggest known front end is the ORT WebNN EP graph builder, which just passes the default steepness (=1) to WebNN. Some small performance could be gained if the builder looked one operator before and after for a mul & div (or mul & recip+mul) pattern, but the salient question is how often that occurs (see below). If a web version of PyTorch called WebNN, having a compatible softplus would make it a little easier, but composing mul & softplus & div isn't hard.

Backend complexity and WPT complexity: If only one front-end caller (PyTorch) supports it and only one backend (DML) supports it, then keeping it is more dubious. Removing steepness simplifies WPT and conformance somewhat.

Usage: Scanning 700 models I have locally, I see very few that even use the softplus activation to begin with. A notable one is Yolo V4, but it just uses a steepness value of 1. Another internal product model uses a steepness value of 4 (which, when converted to ONNX, becomes a mul and recip & mul), but it only has 2 softplus nodes in the graph:

[screenshot: graph view showing the two softplus nodes]

*of course my little hard drive collection doesn't represent the full world of ML 🌍, but a 🍰 of it.

Performance: Since GPUs are primarily memory bound for very simple math operations, having 2 extra intermediate tensors to write out and read back reduces perf by roughly 3x for the pattern mul & softplus & div.

Precision: For float16 tensors, computing float32 intermediate values (for the ln(1 + exp(x)) part) and truncating them into a float16 intermediate tensor is lossier than computing the fused pattern entirely in float32. It's small though, probably not more than 2-3 ULP (see the sketch below).
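
A rough NumPy sketch of that precision point (values chosen arbitrarily for illustration), comparing the decomposed pattern with float16 intermediates against a single fused float32 computation:

import numpy as np

def softplus(x):
    # Naive form for illustration; fine here since inputs stay small.
    return np.log1p(np.exp(x))

steepness = 4.0
x16 = np.linspace(-4, 4, 1001).astype(np.float16)
x32 = x16.astype(np.float32)

# Fused pattern: everything computed in float32, one final cast to float16.
fused = (softplus(x32 * steepness) / steepness).astype(np.float16)

# Decomposed pattern: each op writes a float16 intermediate tensor.
mul16 = (x32 * steepness).astype(np.float16)
sp16 = softplus(mul16.astype(np.float32)).astype(np.float16)
decomposed = (sp16.astype(np.float32) / steepness).astype(np.float16)

# Compare both paths against a float64 reference.
ref = softplus(x16.astype(np.float64) * steepness) / steepness
print(np.max(np.abs(fused.astype(np.float64) - ref)))
print(np.max(np.abs(decomposed.astype(np.float64) - ref)))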

Weirdly, I feel like we already discussed this before, but I can't find the issue 🤷

Separate issue for steepness parameter removal? Or retitle this post? (because the original question about the expected value is answered)

huningxin commented 4 months ago

Separate issue for steepness parameter removal? Or retitle this post? (because the original question about the expected value is answered)

@fdwr , thanks for your nice summary. I think we can just retitle this post. And removing steepness makes sense to me.

wacky6 commented 4 months ago

:) I think removing steepness is fine.

From an API design perspective, adding it later is much easier than deprecating it (if we find steepness isn't a good fit down the line).