@cynthia As @anssiko pointed out, there are a lot of activation functions and new ones are discovered from time to time. However, not all activations are suitable for recurrent networks and other categories of networks, so having a single enum that defines every possible activation function would create a maintenance burden and could be misused, i.e. an unsupported activation in a network would then need to be handled as a runtime error, which complicates the caller's code somewhat.
@anssiko Can this issue be closed? Is there any other aspect of the original feedback?
so having a single enum that defines every possible activation function would create a maintenance burden and could be misused
Is there any prior art (as in, a neural network API) that restricts this though? I checked and activating a GRU with relu seemed to work fine in all of the frameworks I tried - even if it's nonsensical.
A problem with defining all the known activations as a single enum that is used everywhere is that every operation that uses it will be forced to either support all of them, or carry ever-changing logic that accepts some and rejects others. This is unpredictable and makes it hard to maintain the right set of expectations over time, from an API-longevity standpoint.
In the use case of gruCell, there is in fact a known use of the relu activation in place of tanh for the recurrent gate (also known as the 'new' gate), but stretching that to cover many more may be a bridge too far. That is why relu is currently part of the MLRecurrentNetworkActivation enum, but not others.
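For illustration, a rough sketch of what this looks like from the caller's side, assuming the draft shape of gruCell() and its activations option (both of which may have changed since):

```js
// Sketch only: argument order and the `activations` option follow the
// WebNN draft at the time and may not match the final spec.
const newHiddenState = builder.gruCell(
    input, weight, recurrentWeight, hiddenState, hiddenSize,
    { activations: ['sigmoid', 'relu'] });  // relu in place of tanh for the 'new' gate

// With a single all-encompassing activation enum, a call such as
//   builder.gruCell(..., { activations: ['sigmoid', 'hardSigmoid'] })
// would pass the type check but would have to be rejected at runtime.
```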
Remember that the main purpose of defining these networks, such as the recurrent ones, as standalone operations is to provide a pathway for the implementer to hook them up directly with the optimizations already built into many hardware platforms today. For the apps, these "big" operations are just shorthands for the common known cases. But from the implementer's point of view, it is a huge performance boost to be able to tap into the existing silicon blocks or optimized instructions specially built for them.
For a lesser-known use case that involves other, less common activations, e.g. hardSigmoid at the recurrent gate, there is nothing stopping the app from constructing the entire network out of the smaller operations (take a look at the note section of gruCell for an example).
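As a rough sketch of that composition route (builder method names such as matmul(), add(), mul() and hardSigmoid() are assumed to be available as standalone ops; the operands x, h, Wn, Rn, bn, rbn and the reset gate r are placeholders created elsewhere with the same builder):

```js
// Sketch only: the 'new' (recurrent) gate of a GRU cell composed from
// primitive ops, so that a less common activation can be used in place of tanh.
const newGate = builder.hardSigmoid(
    builder.add(
        builder.add(builder.matmul(x, Wn), bn),
        builder.mul(r, builder.add(builder.matmul(h, Rn), rbn))));

// The cell output then combines the update gate z with the new gate,
// h' = z * h + (1 - z) * newGate, again built from add/mul/sub primitives.
```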
So in short, I think helper operations such as gruCell should be defined with common known use cases in mind, as opposed to maximum API coverage, since it's already possible to achieve the latter through operation composition.
The fact that hardware-accelerated implementations of common blocks can have limitations is an implementation detail that should not be ossified into the design of the API. Instead of baking the silicon limitations into the API, shouldn't it throw based on the limitations of the context?
Given the GRU example, if the underlying silicon does not support a relu activation, then that restriction should be raised as an error. Right now that restriction is baked into the API; since hardware is constantly evolving, this limitation may become invalid in the foreseeable future, at which point we have to figure out how to evolve the API so that it is supported. Changing an API once it has shipped on the web is incredibly difficult if it was designed without flexibility for evolution in mind.
if the underlying silicon does not support a relu activation, then that restriction should be raised as an error
A hardware limitation should not be the reason a Web API fails. I only raised an example of the API's ability to utilize a specific hardware block in the context of performance, not functionality. The point is that a recurrent network API such as GRU should only support known use cases because it's actually a complete topology, and not just a single atomic operation. An uncommon use case can already be achieved through composition -- you have all the flexibility there.
This section in the explainer discusses the topic of level of abstraction reasonably clearly. Maybe that could help explain my rationale.
A hardware limitation should not be the reason a Web API fails.
There is precedent for such patterns, such as codec support (although this isn't always because of hardware).
@cynthia I believe you're referring to the canPlayType() precedent?
I think it'd be helpful for the group to discuss and document the pros and cons of adopting such a "failure is an option" pattern before proposing a resolution to this issue.
@cynthia's feedback suggests such a pattern would better future-proof the API and that there's precedent for this type of design on the web platform, while @wchao1115's comments suggest error handling would complicate the API caller's code, and make performance more unpredictable on today's hardware in particular.
Let's use this issue to come up with any other pros and cons I may have overlooked.
Please chime in @cynthia @wchao1115 @huningxin. I'll bring this issue up again on our next call to have a checkpoint.
@cynthia I believe you're referring to the canPlayType() precedent?
Yes. There are also numerous places in WebRTC that do similar things. (usually related to codecs)
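For concreteness, the shape of that precedent looks roughly like this (these are existing web APIs, shown only to illustrate the "ask first, failure is an option" pattern):

```js
// Media element capability probing: returns "probably", "maybe" or "".
const video = document.createElement('video');
const support = video.canPlayType('video/webm; codecs="vp9"');

// WebRTC exposes the supported codecs up front rather than failing later.
const caps = RTCRtpSender.getCapabilities('video');  // may be null
const videoCodecs = caps ? caps.codecs : [];
```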
The reason I think this pattern might be better is that it discourages preemptively implementing different code branches (e.g., if the accelerator is A at the time of implementation, ossifying the model to the capabilities of A at that point in time), much like how user-agent-based branching is abused today.
Having it throw would encourage developers to try accelerated paths and fall back to a hand-rolled implementation if the accelerator does not support them. This allows graceful upgrades in the event the underlying acceleration runtime later supports the accelerated path.
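A minimal sketch of that pattern, where buildAcceleratedGru() and buildComposedGru() are hypothetical helpers, and the assumption that an unsupported activation surfaces as an exception is exactly the design question at hand:

```js
// Hypothetical: prefer the fused helper, fall back to a graph composed
// from primitive ops if the requested activation is rejected.
let cell;
try {
  cell = buildAcceleratedGru(builder, { activation: 'hardSigmoid' });
} catch (e) {
  // Graceful degradation: same math, hand-rolled from smaller operations.
  cell = buildComposedGru(builder, { activation: 'hardSigmoid' });
}
```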
It is worth noting that this is speculative, since there isn't much to refer to in terms of prior art for accelerated neural networks.
In practice, an ML graph, once loaded and compiled, is expected to run successfully regardless of the underlying device, although the degree of efficiency may vary depending on the level of optimization supported. A kind of fallback that may happen in some cases, such as when an operator falls back to the CPU, does not alter the topology of the graph, and as such does not break the user's expectation. For context, the topology of the graph is either explicitly created or converted from a pre-trained model; in either case, it is finalized before it is eventually executed. The topology-altering kind of fallback at graph execution time, as suggested here, would be far too late for the users of ML models.
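To make the timing concrete, a rough sketch of the two phases; method names such as build() and compute(), and the operand newHiddenState, are only illustrative of the general shape of the draft API and are not authoritative:

```js
// Sketch only: the graph topology is finalized when the graph is built,
// well before any inputs are bound for execution.
const graph = await builder.build({ output: newHiddenState });

// Later, possibly many times, with different input buffers:
await context.compute(graph, { input: inputBuffer }, { output: outputBuffer });

// An error about an unsupported activation would therefore have to surface
// at build time; altering the topology at compute time would be too late.
```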
From the API design point of view, it is also a bit odd to allow use cases that are known to be invalid by deferring them to the implementer of the API. The implementer will have no choice but to fail on them, and the users will have no choice but to prepare for the failures even though they aren't going to be in a position to handle them well. This is different in nature from the codec failures.
Per https://www.w3.org/2021/05/13-webmachinelearning-minutes.html#t05 @wchao1115 to draft an informative note to be incorporated into the spec to explain current design principles around activations.
This issue is resolved by this PR #188 as linked here -- we no longer need any string enum to represent fused activation functions. Please take a look.
As discussed in the WebML WG Teleconference – 2 Sep 2021, the group feels this issue has been addressed by #188.
via https://github.com/w3ctag/design-reviews/issues/570#issuecomment-768875996
Some activations cannot be supported in GRU; will clarify in the spec.