Open a-sully opened 2 weeks ago
Thanks @a-sully for the proposal.
A couple of thoughts.
Allowing web developers to identify WebNN objects by name sounds like a good idea. However, I think we could make this even more useful by assigning labels to all created WebNN objects so that they appear consistently across all operations (similar to WebGPU).
Which WebNN backend is expected to fail after build() but before execution? Seems undesirable. Even if we capture errors occurring between the building and dispatch phases, there should be some guarantee for the web developer about which state is affected before they handle it.
Allowing web developers to identify WebNN objects by name sounds like a good idea. However, I think we could make this even more useful by assigning labels to all created WebNN objects so that they appear consistently across all operations (similar to WebGPU).
That's more or less what I've proposed :) See `MLObjectDescriptorBase` (and its usages) in the Tentative IDL section.
Which WebNN backend is expected to fail after build() but before execution? Seems undesirable.
The bigger problem we're seeing right now is backends failing during graph execution. That being said, there's a class of failures where an inconsistency in system state (or assumed system state, in the example below) between `build()` and `dispatch()` leads to failures such that `build()` succeeds and `dispatch()` will always fail. From the Observations section:
...but some of these "bugs" are unavoidable. The problem of resources being (assumed to be) available during compilation not being available during graph inference is a generic TOCTOU issue...
I agree it's undesirable, but I argue that it's unavoidable:
- Even if the classes of errors described in cases 3 and 4 are eliminated, case 2 errors are more or less impossible to prevent
Even if we capture errors occurring between the building and dispatch phases, there should be some guarantee for the web developer about which state is affected before they handle it.
Could you elaborate on what you mean by this?
@a-sully , thank you very much for putting this together.
Re: Labeling objects. I am always in favor of giving web developers a way to label objects and use those labels in subsequent diagnostic output or errors flagged by the browser. Should we derive `MLTensor` and `MLGraph` off of `MLObjectBase` as well?
In the example above, should graph1 be put into an errored state, too?
For the scenario you outlined where build succeeds but dispatch fails, is the failure a product of the input being bad or the graph being bad? Would failing dispatches subsequently succeed if you used an input with different values or is the input object doomed to fail no matter what graph you use it with? Knowing this would inform which object we should put into an error state, or propagating error state to.
When the errors happen, are they recoverable by retrying some or all of the previous steps they took to get to that point? What guidance should we provide as to what they should try next?
Should we derive `MLTensor` and `MLGraph` off of `MLObjectBase` as well?
Yes, I should have been more explicit; I meant to propose this in the Tentative IDL section. I proposed adding `MLObjectDescriptorBase` to the creation of each of these objects, but they should also be extended by a corresponding `MLObjectBase`. So this:
partial dictionary MLTensorDescriptor : MLObjectDescriptorBase {};
partial interface MLGraphBuilder {
  // To label the resulting MLGraph.
  Promise<MLGraph> build(
      MLNamedOperands outputs,
      optional MLObjectDescriptorBase options = {});
};
should be augmented by this:
interface mixin MLObjectBase {
attribute USVString label;
};
MLTensor includes MLObjectBase;
MLGraph includes MLObjectBase;
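To make the intent of `label` concrete, here is a rough sketch of how a user agent might fold labels into diagnostic output. `LabeledObject` and `describeFailure` are hypothetical stand-ins invented for illustration; they are not part of WebNN or of the proposed IDL.

```javascript
// Hypothetical stand-in for an object that includes MLObjectBase.
class LabeledObject {
  constructor(kind, descriptor = {}) {
    this.kind = kind;
    // Mirrors the proposed `label` attribute; assumed to default to ''.
    this.label = descriptor.label ?? '';
  }
}

// Builds a diagnostic string that names every object involved.
function describeFailure(operation, objects) {
  const names = objects.map(
      (o) => `${o.kind} "${o.label || '(unlabeled)'}"`);
  return `${operation} failed; involved: ${names.join(', ')}`;
}

const labeledGraph = new LabeledObject('MLGraph', {label: 'mobilenet'});
const unlabeledTensor = new LabeledObject('MLTensor');
const message = describeFailure('dispatch', [labeledGraph, unlabeledTensor]);
console.log(message);
```

The exact message format would remain implementation-defined; the point is only that a label set at creation time can be echoed back wherever the object shows up in an error.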
For the scenario you outlined where build succeeds but dispatch fails, is the failure a product of the input being bad or the graph being bad?
The former is case 4 and the latter is case 3 from the State of the World section. Notably, it's hard to distinguish these cases at runtime:
- Graph execution fails due to a runtime error inherent in running the compiled graph in the current environment, meaning that executing this graph will always fail...
- you may not know whether you're actually in case 4
- Graph execution fails due to a runtime error caused by the specific graph inputs and outputs...
- you may not know whether you're actually in case 3
I'm tempted to say we should always invalidate the `MLGraph` if it is the cause of a `dispatch()` failure. While this may lead to some false positives, in practice I expect many "transient" issues (such as bumping up on memory limits or issues specific to a given input) are not as transient as they seem. Some snippets from above:
- it may be reasonable to just blow away the `MLGraph` e.g. if you suspect the graph may always be too large to ever execute successfully. This would free up some memory and possibly prevent an otherwise imminent OOM
- it may also be reasonable to assume that the website may attempt to `dispatch()` with similar inputs, in which case you could end up in this scenario again soon. It may be reasonable to preemptively blow away the `MLGraph`
So, in the example above we'd invalidate `graph1` (and free its associated resources) but not `graph2`, which failed only due to propagation of a cascading failure.
When the errors happen, are they recoverable by retrying some or all of the previous steps they took to get to that point? What guidance should we provide as to what they should try next?
This relates to this open question:
Do we need a more structured format for reporting errors?
- I think rejecting the promise with an implementation-defined error message should be sufficient, at least for now. User agents are welcome to make this error message as detailed as they like.
Ideally the error message should point to the cause of the failure. If we always invalidate the cause of a failure, then developers can use string-matching to identify that `graph1` is invalidated... which isn't great. It seems nice to provide a more structured error format, but `readTensor()` currently uses promise rejection to report an error, which returns a string.
The least-bad option I can think of (suggestions welcome!) is to store the "last error" on the `MLTensor` and allow the developer to check it after a promise rejection. For instance:
partial interface MLTensor {
attribute USVString causeOfLastError;
};
Alternatively we could add some sort of error-checking getter to `MLTensor` and `MLGraph`, but that may be misleading due to TOCTOU issues. For example:
// If this dispatch fails, `graph1` is invalidated.
context.dispatch(graph1, inputs, {'out': tensorA}, {label: 'foo'});
// This cannot be known synchronously and seems likely to be misused.
// If this is async, it is unnecessarily expensive.
graph1.isValid();
WDYT?
Having `MLObjectBase` take a label SGTM!
Could you elaborate on what you mean by this?
The "isValid" approach is similar to getError() in WebGL. This approach was abandoned in WebGPU because web developers often couldn't determine which object had caused an error (context or graph?). This uncertainty made it difficult for sites to respond appropriately and frequently led to excessive error checks scattered throughout the code.
Another common approach is to register a callback which is invoked for the specific errors the web developer can react to; if an error goes unhandled, it propagates to become a context loss.
let errorQueue = ctx.createErrorQueue();
errorQueue.pushErrorFilter('internal');
ctx.dispatch(graph, inputs, outputs);
errorQueue.popErrorFilter(internalErrorHandler);
@a-sully for dispatch specific errors: having a pushErrorScope/popErrorScope as @bbernhar describes is similar to what WebGPU does and seems like it gives us the best of both worlds. Though it can be improved by additionally providing the labels of the objects involved. The WebGPU Error Handling best practices is a good article on the subject.
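For readers less familiar with the scope idea, the routing logic can be sketched with a small mock. Everything here is hypothetical: `ErrorScopeStack` and `report()` are invented for illustration, WebNN has no such API today, and unlike WebGPU's `popErrorScope()` (which returns a promise) this mock is synchronous to keep it short.

```javascript
// Hypothetical mock of scope-based error routing; not a real WebNN API.
class ErrorScopeStack {
  constructor(onUncaptured) {
    this.scopes = [];
    this.onUncaptured = onUncaptured;
  }

  pushErrorScope(filter) {
    this.scopes.push({filter, error: null});
  }

  // Returns the first error captured by the innermost scope, or null.
  popErrorScope() {
    if (this.scopes.length === 0) throw new Error('no error scope to pop');
    return this.scopes.pop().error;
  }

  // Routes an error to the innermost scope whose filter matches its type.
  report(error) {
    for (let i = this.scopes.length - 1; i >= 0; i--) {
      const scope = this.scopes[i];
      if (scope.filter === error.type) {
        if (scope.error === null) scope.error = error;
        return;
      }
    }
    // No scope claimed the error, so it escalates (e.g. to context loss).
    this.onUncaptured(error);
  }
}

const uncaptured = [];
const scopes = new ErrorScopeStack((e) => uncaptured.push(e));

scopes.pushErrorScope('internal');
// Simulate a failed dispatch that names the labeled objects involved.
scopes.report({type: 'internal', labels: ['graph1', 'foo']});
const captured = scopes.popErrorScope();

// With no scope active, the same kind of error falls through.
scopes.report({type: 'internal', labels: ['graph2']});
```

Carrying the labels on the captured error is what would let a site tell which graph or dispatch was responsible without string-matching a message.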
For case 4, is it possible that a particular `MLTensor` can cause failures when used as input for one graph but be fine to use in a different graph? If so, seems like we shouldn't condemn the `MLTensor` object to be invalid for the rest of its life.
If there exist platforms where `MLGraph` can become invalid but the context from which it came is otherwise fine to use, I would be fine with there being a promise on the object which resolves when it becomes invalid along with a `valid` attribute that becomes false when the promise resolves.
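A minimal sketch of that shape, under the assumption that the promise is exposed as an `invalidated` attribute. The class name `InvalidatableGraph`, the `invalidated` name, and the `invalidate()` hook are all hypothetical and only stand in for whatever the implementation would do internally.

```javascript
// Hypothetical mock of a graph with an invalidation promise.
class InvalidatableGraph {
  constructor(label) {
    this.label = label;
    this.valid = true;
    // Resolves once, when the graph becomes unusable.
    this.invalidated = new Promise((resolve) => {
      this._resolveInvalidated = resolve;
    });
  }

  // Called by the (mock) implementation when the graph is found unusable.
  invalidate(reason) {
    if (!this.valid) return;  // Already invalid; nothing more to report.
    this.valid = false;
    this._resolveInvalidated({label: this.label, reason});
  }
}

const demoGraph = new InvalidatableGraph('graph1');
const observed = [];
demoGraph.invalidated.then((info) => observed.push(info));

demoGraph.invalidate('dispatch failed');
demoGraph.invalidate('ignored second call');
console.log(demoGraph.valid);  // false
```

In this mock `valid` flips synchronously inside `invalidate()`, so it is already false by the time any handler attached to the promise runs.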
For case 4, is it possible that a particular `MLTensor` can cause failures when used as input for one graph but be fine to use in a different graph? If so, seems like we shouldn't condemn the `MLTensor` object to be invalid for the rest of its life.
Yes. I didn't explain this well, but I've been using "invalid" and "errored" to represent different error states. This proposal was initially aimed at the latter, but once we started talking about invalidating `MLGraph`s we started muddling the two.
"Invalid" objects could be specified to behave as if their respective `destroy()` method was called. We'd invalidate the object only if it's the cause of the failure, so we can reasonably extrapolate that the object is no longer usable. Concretely:
- `MLGraph` which fails to `dispatch()`
- `MLTensor` which fails to `writeTensor()`
Meanwhile, "errored" objects (for now, only `MLTensor`s) are affected by this initial failure since they now contain junk data. The idea is that using "errored" objects as inputs to subsequent operations should cascade this error state to the outputs of these operations. Implementations may also be able to short-circuit these subsequent operations, but that's an implementation detail. But as you mention, there's no reason to condemn this tensor forever, at least once the junk data is overwritten. I call out in the proposal:
- An object's errored state may be reset if it is the output of a successful operation
  - e.g. `writeTensor()` writes new data
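The distinction between the two states can be sketched with a toy model. This is only an illustration of the semantics described above: `MockTensor`, `MockGraph`, the `alwaysFails` flag, and the string return values are all hypothetical, and real `dispatch()` is asynchronous and returns nothing.

```javascript
// Hypothetical mock objects; these are not the real WebNN interfaces.
class MockTensor {
  constructor(label) {
    this.label = label;
    this.errored = false;  // "errored": contents are junk until overwritten
  }
}

class MockGraph {
  constructor(label, {alwaysFails = false} = {}) {
    this.label = label;
    this.alwaysFails = alwaysFails;
    this.invalid = false;  // "invalid": behaves as if destroy() was called
  }
}

// A successful write overwrites the junk data and resets the errored state.
function writeTensor(tensor) {
  tensor.errored = false;
}

function dispatch(graph, inputs, outputs) {
  if (graph.invalid) {
    throw new Error(`graph "${graph.label}" is invalid`);
  }
  const outs = Object.values(outputs);
  // Errored inputs cascade to the outputs; the graph itself is not blamed.
  if (Object.values(inputs).some((t) => t.errored)) {
    outs.forEach((t) => { t.errored = true; });
    return 'cascaded';
  }
  // The graph is the cause of the failure, so it is invalidated.
  if (graph.alwaysFails) {
    graph.invalid = true;
    outs.forEach((t) => { t.errored = true; });
    return 'failed';
  }
  outs.forEach((t) => { t.errored = false; });
  return 'ok';
}

const g1 = new MockGraph('graph1', {alwaysFails: true});
const g2 = new MockGraph('graph2');
const tA = new MockTensor('tensorA');
const tB = new MockTensor('tensorB');

const first = dispatch(g1, {}, {'out': tA});
const second = dispatch(g2, {'in': tA}, {'out': tB});
writeTensor(tA);
const third = dispatch(g2, {'in': tA}, {'out': tB});
console.log(first, second, third);
```

Here `graph1` ends up invalid while `graph2` stays usable, and `tensorB` leaves the errored state once it is the output of a successful dispatch.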
If there exist platforms where `MLGraph` can become invalid but the context from which it came is otherwise fine to use, I would be fine with there being a promise on the object which resolves when it becomes invalid along with a `valid` attribute that becomes false when the promise resolves.
I think this is true of all the WebNN backends in the current Chromium implementation? It's true of CoreML and TFLite, and I assume it's also true of DML for case 2 failures?
Having a promise similar to what we currently have on the `MLContext` SGTM to surface newly "invalid" objects...
...there's then a question of whether we need to care about "errored" objects at all. The original reasoning for using this cascading error failure mechanism was to:
`dispatch()` failures,
If we invalidate the `MLGraph` when `dispatch()` fails, the first case is covered by the invalidation promise. From the perspective of the web developer, this is arguably less ergonomic in the standard `writeTensor()` + `dispatch()` + `readTensor()` flow (with the error being observed from `readTensor()`), but if the invalidation promise points to the `label` passed to `dispatch()` then this seems workable...
The question is then whether we care to avoid exposing the contents of `MLTensor`s from failed operations, or allow implementations the ability to early-exit from operations with junk inputs. I expect the output tensor of a failed `dispatch()` to typically be unmodified, but given the implementation-defined nature of the failure cases, I don't expect we can make that assertion. For instance:
// If this dispatch fails:
// - `graph1` is invalidated and can no longer be used. This resolves
// an invalidation promise on `graph1`
// - `tensorA` is probably unmodified... but we should probably treat
// its contents as undefined?
context.dispatch(graph1, inputs, {'out': tensorA}, {label: 'foo'});
// What should happen here?
context.dispatch(graph2, {'in': tensorA}, {'out': tensorB}, {label: 'bar'});
@bbernhar @RafaelCintron what does WebGPU do to resources involved in non-fatal errors?
@bbernhar @RafaelCintron what does WebGPU do to resources involved in non-fatal errors?
Good question. WebGPU resources can be invalidated if they cannot be created (e.g., due to OOM) or if the device is lost or destroyed. If the operation is non-fatal (i.e., validation failure or OOM), they could also remain valid. However, 'non-fatal' does not include internal errors raised during queue operations; pipeline creation is the only exception where an 'internal error' is not considered a device loss AFAIK. Similarly, dispatch() could raise an `MLGraphCompilationError`, allowing tensors to be reused in another graph. The nice thing about error queues is that they avoid errors leaking to/from unrelated code.
@bbernhar @RafaelCintron what does WebGPU do to resources involved in non-fatal errors?
Nothing. If a validation error happens in the GPU process, the error is raised to the error scope and the call is ignored.
[Rafael] If there exist platforms where MLGraph can become invalid but the context from which it came is otherwise fine to use, I would be fine with there being a promise on the object which resolves when it becomes invalid along with a valid attribute that becomes false when the promise resolves.
[Austin] I think this is true of all the WebNN backends in the current Chromium implementation? It's true of CoreML and TFLite, and I assume it's also true of DML for case 2 failures?
In Chromium, there are places in the DML backend where errors during graph building can cause the `build` command to fail with an error string. We've been gradually eliminating these as they can be difficult for web developers to reason about, especially when the errors are machine specific and do not happen during local development. Failures during dispatch and tensor reading/writing, on the other hand, usually result in the context becoming lost and the GPU process ending.
If there exist platforms where an error during dispatch or readTensor/writeTensor results in undefined tensor output and the browser is able to detect this has happened, we can have the browser clear the output tensors to defined values such as zeros. If we can subsequently determine with certainty that the graph will never produce valid output again, we should mark it as invalid and leave it in the same effective state as a destroyed graph.
If the system gets into a state where it is not clear whether forward progress can be made and random tensors can be in undefined states, then marking the context as "lost" and starting over might be the safest option.
WebNN has been good at surfacing errors as early as possible during graph building. If that's not always possible due to platform limitations, then introducing an "errorScope" (like WebGPU does) or error queue with errors referring to labeled objects seems like the best alternative.
The Problem (see #477)
Our current method for surfacing `dispatch()` errors is to "lose" the `MLContext`. As I mentioned in https://github.com/webmachinelearning/webnn/pull/754#discussion_r1747441955 I don't think it makes sense for this to be the only option for surfacing errors from `dispatch()`:
Losing the `MLContext` is a very heavy-handed failure mode. In the current Chromium implementation (which, to be fair, is an implementation detail), this means killing the GPU process, which impacts the system well beyond the scope of WebNN. I don't think the `MLContext` is always the right blast radius for a `dispatch()` error.
There is also no way whatsoever to surface an error from `writeTensor()`!
State of the World
Here are examples of how I've observed `dispatch()` fail in the current Chromium implementation:
`MLContext` may indeed be the only option
`MLContext` e.g. if you assume an OOM is imminent,
it may be reasonable to just blow away the `MLGraph` e.g. if you suspect the graph may always be too large to ever execute successfully. This would free up some memory and possibly prevent an otherwise imminent OOM
`MLGraphBuilder.build()`, but unfortunately this is not always the case. This is currently the most common failure mode for Chromium's CoreML backend. Some thoughts on how to react:
`MLContext` is not a useful option
`MLGraph`, especially if you're confident it will never execute successfully
`gather` ops (see #486), and so Chromium's TFLite backend must address this, which it does not (yet). Some thoughts on how to react:
`MLContext` is not a useful option
it may also be reasonable to assume that the website may attempt to `dispatch()` with similar inputs, in which case you could end up in this scenario again soon. It may be reasonable to preemptively blow away the `MLGraph`
Observations
`MLContext` (or the entire GPU process) would be useful
`dispatch()` failures) depending on whether the GPU process is sandboxed. In this case it seems like a bug in the framework: the graph compiler is not aware of user agent sandboxing and incorrectly assumes (without verifying) certain resources are accessible
`where` operator may fail to hit the affected branch(es).
`MLGraph` is a reasonable (though not strictly necessary) response to examples 2, 3, and 4
Failures are cascading
`dispatch()` fails but its output tensors are never read back...
`dispatch()` fails but its output tensors are later overwritten by new data...
`readTensor()`
`importExternalBuffer()`
Proposal
`writeTensor()`, `dispatch()`) catastrophically fails, continue to lose the `MLContext`
`MLTensor`s, though possibly also an `MLGraph`, TBD) are put into an errored state
`writeTensor()` writes new data
Example:
Open Questions
`graph1` be put into an errored state, too?
`graph1` will always fail to execute?
`importExternalBuffer()` method?
`GPUError` scopes will be able to handle this case
`createBuffer()` be made synchronous and use this error reporting mechanism?
`MLTensor.error`), since the errored state exists on the WebNN timeline. Is that sufficient?
Tentative IDL: