Proposal: Report non-fatal errors from the WebNN timeline

The Problem (see #477)

Our current method for surfacing dispatch() errors is to "lose" the MLContext. As I mentioned in https://github.com/webmachinelearning/webnn/pull/754#discussion_r1747441955 I don't think it makes sense for this to be the only option for surfacing errors from dispatch():

I don't think we can assume every failed dispatch() results in a lost MLContext, especially considering platforms where an MLContext is not so closely tied to a single GPU

Losing the MLContext is a very heavy-handed failure mode. In the current Chromium implementation (which, to be fair, is an implementation detail), this means killing the GPU process, which impacts the system well beyond the scope of WebNN. I don't think the MLContext is always the right blast radius for a dispatch() error.

There is also no way whatsoever to surface an error from writeTensor()!

State of the World

Here are examples of how I've observed dispatch() fail in the current Chromium implementation:

The process executing the graph OOMs and crashes
- blowing away the MLContext may indeed be the only option
Some resource allocation while executing the graph fails and aborts gracefully. Some thoughts on how to react:
- it may be reasonable to blow away the entire MLContext e.g. if you assume an OOM is imminent,
- it may be reasonable to just blow away the MLGraph e.g. if you suspect the graph may always be too large to ever execute successfully. This would free up some memory and possibly prevent an otherwise imminent OOM
- it may be reasonable to assume this issue is transitory
Graph execution fails due to a runtime error inherent in running the compiled graph in the current environment, meaning that executing this graph will always fail. Ideally this type of failure would be surfaced earlier during MLGraphBuilder.build(), but unfortunately this is not always the case. This is currently the most common failure mode for Chromium's CoreML backend. Some thoughts on how to react:
- blowing away the entire MLContext is not a useful option
- it may be reasonable to blow away the MLGraph, especially if you're confident it will never execute successfully
- it's possible - though it seems unlikely - this issue is transitory
- you may not know whether you're actually in case 4
Graph execution fails due to a runtime error caused by the specific graph inputs and outputs. From what I can tell, this is always(?) due to issues with the user agent implementation. For example, TFLite does not support negative or OOB indices for gather ops (see #486), and so Chromium's TFLite backend must address this, which it does not (yet). Some thoughts on how to react:
- blowing away the entire MLContext is not a useful option
- it may be reasonable to assume this issue is transitory...
- ...though it may also be reasonable to assume that the website may attempt to dispatch() with similar inputs, in which case you could end up in this scenario again soon. It may be reasonable to preemptively blow away the MLGraph
- you may not know whether you're actually in case 3

Observations

In each of cases 2, 3, and 4, a mechanism to signal failure without escalating by destroying the MLContext (or the entire GPU process) would be useful
Ideally case 3 would not exist. It is unfortunate that frameworks like CoreML "successfully" compile graphs which will never run, though user agents should also do more to cover for these bugs
- One example we've observed in Chromium is behavioral differences at runtime (including dispatch() failures) depending on whether the GPU process is sandboxed. In this case it seems like a bug in the framework: the graph compiler is not aware of user agent sandboxing and incorrectly assumes (without verifying) certain resources are accessible
- In the long run, I expect frameworks to address some of these bugs...
- ...but some of these "bugs" are unavoidable. The problem of resources being (assumed to be) available during compilation not being available during graph inference is a generic TOCTOU issue...
- ...and since some frameworks and drivers are only updated with the OS, in practice I expect this to be an issue we'll have to contend with for quite a long time, even considering user agent workarounds
- More sophisticated techniques to work around these bugs come with drawbacks. For example, the user agent could execute the compiled graph with dummy inputs to probe for runtime errors. However, this may be expensive and the dummy inputs may not even exercise the problematic code path(s). For example, graphs with the where operator may fail to hit the affected branch(es).
Eventually, case 4 should not exist. This requires all WebNN operators to be well-defined and implementations work around the quirks of the underlying platform however is necessary to comply with these specs. The WebNN spec and Chromium implementation are immature, but I expect we'll get there
Even if the classes of errors described in cases 3 and 4 are eliminated, case 2 errors are more or less impossible to prevent
Blowing away the MLGraph is a reasonable (though not strictly necessary) response to examples 2, 3, and 4

Failures are cascading

// If this dispatch fails...
context.dispatch(graph, inputs, {'output': intermediateTensor});

// Any operations which depend on `intermediateTensor` should also fail.
context.dispatch(graph, {'input': intermediateTensor}, outputs);

Tracking down the source of cascading failures is challenging
Failures only matter if they're observable
- If a tree falls in a forest...
- If a dispatch() fails but its output tensors are never read back...
- If a dispatch() fails but its output tensors are later overwritten by new data...
Results of operations on the WebNN timeline are observable to script only via a limited set of async APIs:
- readTensor()
- (eventually) importExternalBuffer()

Proposal

If an operation on the WebNN timeline (writeTensor(), dispatch()) catastrophically fails, continue to lose the MLContext
If the failure is not catastrophic, the affected objects (usually MLTensors, though possibly also an MLGraph, TBD) are put into an errored state
An object's errored state may be reset if it is the output of a successful operation
- e.g. writeTensor() writes new data
Any operations which take an errored object as an input will propagate this error to its outputs
Any promise-bearing operations which take an errored object as an input will reject the promise
Use labels to improve debuggability by attributing failures to specific operations

Example:

// If this dispatch fails, `tensorA` is put into an errored state.
context.dispatch(graph1, inputs, {'out': tensorA}, {label: 'foo'});

// An operation dependent on `tensorA` will fail.
context.dispatch(graph2, {'in': tensorA}, {'out': tensorB}, {label: 'bar'});

// This promise will reject with a implementation-defined message which at
// minumum mentions 'foo'.
context.readTensor(tensorA)
    .catch(error => ...);

// This promise will reject with a implementation-defined message which at
// minumum mentions 'bar' (though perhaps also points back to 'foo').
context.readTensor(tensorB)
    .catch(error => ...);

// Clears the errored state of `tensorA` if the write is successful.
context.writeTensor(tensorA);

Open Questions

In the example above, should graph1 be put into an errored state, too?
- Or only if the user agent believes graph1 will always fail to execute?
Do we need a more structured format for reporting errors?
- I think rejecting the promise with an implementation-defined error message should be sufficient, at least for now. User agents are welcome to make this error message as detailed as they like.
How will errors be reported with a sync importExternalBuffer() method?
- I'm tentatively hoping GPUError scopes will be able to handle this case
Should createBuffer() be made synchronous and use this error reporting mechanism?
This proposal does not include a way to query for this errored state on the affected objects from script (e.g. MLTensor.error), since the errored state exists on the WebNN timeline. Is that sufficient?

Tentative IDL:

dictionary MLObjectDescriptorBase {
  USVString label = "";
};

// Add labels to operations on the WebNN timeline.
void dispatch(
    MLGraph graph,
    MLNamedTensors inputs,
    MLNamedTensors outputs,
    optional MLObjectDescriptorBase options = {});

void writeTensor(
    MLTensor tensor,
    AllowSharedBufferSource inputData,
    optional MLObjectDescriptorBase options = {});

// Add labels to objects which may be used on the WebNN timeline.
partial dictionary MLContextOptions : MLObjectDescriptorBase {}

partial dictionary MLTensorDescriptor : MLObjectDescriptorBase {}

partial interface MLGraphBuilder {
  // To label the resulting MLGraph.
  Promise<MLGraph> build(
    MLNamedOperands outputs,
    optional MLObjectDescriptorBase options = {});
};

Thanks @a-sully for the proposal.

A couple of thoughts.

Allowing web developers to identify WebNN objects by name sounds like a good idea. However, I think we could make this even more useful by assigning labels to all created WebNN objects so that they appear consistently across all operations (similar to WebGPU).

Which WebNN backend is expected to fail after build() but before execution? Seems undesirable. Even if we capture errors occurring between the building and dispatch phases, there should be some guarantee for the web developer about which state is affected before they handle it.

Allowing web developers to identify WebNN objects by name sounds like a good idea. However, I think we could make this even more useful by assigning labels to all created WebNN objects so that they appear consistently across all operations (similar to WebGPU).

That's more or less what I've proposed :) See MLObjectDescriptorBase (and its usages) in the Tentative IDL section.

Which WebNN backend is expected to fail after build() but before execution? Seems undesirable.

The bigger problem we're seeing right now is backends failing during graph execution. That being said, there's a class of failures where an inconsistency in system state (or assumed system state, in the example below) between build() and dispatch() leads to failures such that build() succeeds and dispatch() will always fail. From the Observations section:

...but some of these "bugs" are unavoidable. The problem of resources being (assumed to be) available during compilation not being available during graph inference is a generic TOCTOU issue...

I agree it's undesirable, but I argue that it's unavoidable:

Even if the classes of errors described in cases 3 and 4 are eliminated, case 2 errors are more or less impossible to prevent

Even if we capture errors occurring between the building and dispatch phases, there should be some guarantee for the web developer about which state is affected before they handle it.

Could you elaborate on what you mean by this?

@a-sully , thank you very much for putting this together.

Re: Labeling object. I am always in favor giving web developers a way to label objects and use those labels in subsequent diagnostic output, or errors flagged by the browser. Should we derive MLTensor and MLGraph off of MLObjectBase as well?

In the example above, should graph1 be put into an errored state, too?

For the scenario you outlined where build succeeds but dispatch fails, is the failure a product of the input being bad or the graph being bad? Would failing dispatches subsequently succeed if you used an input with different values or is the input object doomed to fail no matter what graph you use it with? Knowing this would inform which object we should put into an error state, or propagating error state to.

When the errors happen, are they recoverable by retrying some or all of the previous steps they took to get to that point? What guidance should we provide as to what they should try next?

Should we derive MLTensor and MLGraph off of MLObjectBase as well?

Yes, I should have been more explicit meant to proposing this in the Tentative IDL section. I proposed adding MLObjectDescriptorBase to each the creation of these objects, but they should also be extended by a corresponding MLObjectBase. So this:

partial dictionary MLTensorDescriptor : MLObjectDescriptorBase {}

partial interface MLGraphBuilder {
  // To label the resulting MLGraph.
  Promise<MLGraph> build(
    MLNamedOperands outputs,
    optional MLObjectDescriptorBase options = {});
};

should be augmented by this:

interface mixin MLObjectBase {
    attribute USVString label;
};

MLTensor includes MLObjectBase;
MLGraph includes MLObjectBase;

For the scenario you outlined where build succeeds but dispatch fails, is the failure a product of the input being bad or the graph being bad?

The former is case 4 and the latter is case 3 from the State of the World section. Notably, it's hard to distinguish these cases at runtime:

Graph execution fails due to a runtime error inherent in running the compiled graph in the current environment, meaning that executing this graph will always fail...

you may not know whether you're actually in case 4

Graph execution fails due to a runtime error caused by the specific graph inputs and outputs...

you may not know whether you're actually in case 3

I'm tempted to say we should always invalidate the MLGraph if it is the cause of a dispatch() failure. While this may lead to some false-positives, in practice I expect many "transient" issues (such as bumping up on memory limits or issues specific to a given input) are not as transient as they seem. Some snippets from above:

it may be reasonable to just blow away the MLGraph e.g. if you suspect the graph may always be too large to ever execute successfully. This would free up some memory and possibly prevent an otherwise imminent OOM

it may also be reasonable to assume that the website may attempt to dispatch() with similar inputs, in which case you could end up in this scenario again soon. It may be reasonable to preemptively blow away the MLGraph

So, in the example above we'd invalidate graph1 (and free its associated resources) but not graph2, which failed only due to propagation of a cascading failure.

When the errors happen, are they recoverable by retrying some or all of the previous steps they took to get to that point? What guidance should we provide as to what they should try next?

This relates to this open question:

Do we need a more structured format for reporting errors?

I think rejecting the promise with an implementation-defined error message should be sufficient, at least for now. User agents are welcome to make this error message as detailed as they like.

Ideally the error message should point to the cause of the failure. If we always invalidate the cause of a failure, then developers can use string-matching to identify that graph1 is invalidated... which isn't great. It seems nice to provide a more structured error format, but readTensor() currently uses promise rejection to report an error, which returns a string

The least-bad option I can think of (suggestions welcome!) is to store the "last error" on the MLTensor and allow the developer to check it after a promise rejection. For instance:

partial interface MLTensor {
  attribute USVString causeOfLastError;
};

Alternatively we could add some sort of error-checking getter to MLTensor and MLGraph, but that may be misleading due to TOCTOU issues. For example:

// If this dispatch fails, `graph1` is invalidated.
context.dispatch(graph1, inputs, {'out': tensorA}, {label: 'foo'});

// This cannot be known synchronously and seems likely to be misused.
// If this is async, it is unnecessarily expensive.
graph1.isValid();

WDYT?

Having MLObjectBase take a label SGTM!

Could you elaborate on what you mean by this?

The "isValid" approach is similar to getError() in WebGL. This approach was abandoned in WebGPU because web developers often couldn't determine which object had caused an error (context or graph?). This uncertainty made it difficult for sites to respond appropriately and frequently led to excessive error checks scattered throughout the code.

Another common approach is to register a call-back which is invoked for the specific errors the web developer can react to and if it goes unhandled, propagates to become a context lost.

let errorQueue = ctx.createErrorQueue();
errorQueue.pushErrorFilter('internal');
ctx.dispatch(graph, inputs, outputs);
errorQueue.popErrorFilter(internalErrorHandler);

@a-sully for dispatch specific errors: having a pushErrorScope/popErrorScope as @bbernhar describes is similar to what WebGPU does and seems like it gives us the best of both worlds. Though it can be improved by additionally providing the labels of the objects involved. The WebGPU Error Handling best practices is a good article on the subject.

For case 4, is it possible that a particular MLTensor can cause failures when used as input for one graph but be fine to use in a different graph? If so, seems like we shouldn't condemn the MLTensor object to be invalid for the rest of its life.

If there exist platforms where MLGraph can become invalid but the context from which it came is otherwise fine to use, I would be fine with a there being a promise on the object which resolves when it becomes invalid along with a valid attribute that becomes false when the promise resolves.

For case 4, is it possible that a particular MLTensor can cause failures when used as input for one graph but be fine to use in a different graph? If so, seems like we shouldn't condemn the MLTensor object to be invalid for the rest of its life.

Yes. I didn't explain this well, but I've been using "invalid" and "errored" to represent different error states. This proposal was initially aimed at the latter, but once we started talking about invalidating MLGraphs then we started muddling the two.

"Invalid" objects could be specified to behave as if their respective destroy() method was called. We'd invalidate the object only if it's the cause of the failure, so we can reasonably extrapolate that the object is no longer usable. Concretely:

an MLGraph which fails to dispatch()
an MLTensor which fails to writeTensor()

Meanwhile, "errored" objects (for now, only MLTensors) are affected by this initial failure since they now contain junk data. The idea is that using "errored" objects as inputs to subsequent operations should cascade this error state to the outputs of these operations. Implementations may also be able to short-circuit these subsequent operations, but that's an implementation detail. But as you mention, there's no reason to condemn this tensor forever, at least once the junk data is overwritten. I call out in the proposal:

An object's errored state may be reset if it is the output of a successful operation

e.g. writeTensor() writes new data

If there exist platforms where MLGraph can become invalid but the context from which it came is otherwise fine to use, I would be fine with a there being a promise on the object which resolves when it becomes invalid along with a valid attribute that becomes false when the promise resolves.

I think this is true of all the WebNN backends in the current Chromium implementation? It's true of CoreML and TFLite, and I assume it's also true of DML for case 2 failures?

Having a promise similar to what we currently have on the MLContext SGTM to surface newly "invalid" objects...

...there's then a question of whether we need to care about "errored" objects at all. The original reasoning for using this cascading error failure mechanism was to:

surface errors from dispatch() failures,
avoid surfacing junk data to the web, and
(maybe) avoid unnecessary work on junk data

If we invalidate the MLGraph when dispatch() fails, the first case is covered by the invalidation promise. From the perspective of the web developer, this is arguably less ergonomic in the standard writeTensor() + dispatch() + readTensor() flow (with the error being observed from readTensor()), but if the invalidation promise points to the label passed to dispatch() then this seems workable...

The question is then whether we care to avoid exposing the contents of MLTensors from failed operations, or allow implementations the ability to early-exit from operations with junk inputs. I expect the output tensor of a failed dispatch() to typically be unmodified, but given the implementation-defined nature of the failure cases, I don't expect we can make that assertion. For instance:

// If this dispatch fails:
//  - `graph1` is invalidated and can no longer be used. This resolves
//     an invalidation promise on `graph1`
//  - `tensorA` is probably unmodified... but we should probably treat
//     its contents as undefined?
context.dispatch(graph1, inputs, {'out': tensorA}, {label: 'foo'});

// What should happen here?
context.dispatch(graph2, {'in': tensorA}, {'out': tensorB}, {label: 'bar'});

@bbernhar @RafaelCintron what does WebGPU do to resources involved in non-fatal errors?

@bbernhar @RafaelCintron what does WebGPU do to resources involved in non-fatal errors?

Good question. WebGPU resources can be invalidated if they cannot be created (e.g., due to OOM) or if the device is lost or destroyed. If the operation is non-fatal (i.e., validation failure or OOM), they could also remain valid. However, 'non-fatal' does not include internal errors raised during queue operations—pipeline creation is the only exception where an 'internal error' is not considered a device loss AFAIK. Similarly, dispatch() could raise an MLGraphCompilationError, allowing tensors to be reused in another graph. The nice thing about error queues is that it avoids errors leaking to/from unrelated code.

@bbernhar @RafaelCintron what does WebGPU do to resources involved in non-fatal errors?

Nothing. If a validation error happens in the GPU process, the error is raised to the error scope and the call is ignored.

[Rafael] If there exist platforms where MLGraph can become invalid but the context from which it came is otherwise fine to use, I would be fine with a there being a promise on the object which resolves when it becomes invalid along with a valid attribute that becomes false when the promise resolves.

[Austin] I think this is true of all the WebNN backends in the current Chromium implementation? It's true of CoreML and TFLite, and I assume it's also true of DML for case 2 failures?

In Chromium, there are places in the DML backend where errors during graph building can cause the build command to fail with an error string. We've been gradually eliminating these as they can be difficult for web developers to reason about, especially when the errors are machine specific and do not happen during local development. Failures during dispatch and tensor reading/writing, on the other hand, usually result in the context becoming lost and the GPU process ending.

If there exist platforms where an error during dispatch and readTensor/writeTensor, results in undefined tensor output and the browser is able to detect this has happened, we can have the browser clear the output tensors to defined values such as zeros. If we can subsequently determine with certainty that the graph will no longer produce valid output ever again, we should mark it as invalid and leave it in the same effective state as a destroyed graph.

If the system gets into a state where it is not clear whether forward progress can be made and random tensors can be in undefined states, then making the context as "lost" and starting over might be the safest option.

WebNN has been good at surfacing errors as early as possible during graph building. If that's not always possible due to platform limitations, then introducing an "errorScope" (like WebGPU does) or error queue with errors referring to labeled objects seems like the best alternative.

webmachinelearning / webnn