webmachinelearning / webnn

🧠 Web Neural Network API
https://www.w3.org/TR/webnn/
Other
382 stars 46 forks source link

MLBuffer destruction timeline #716

Open reillyeon opened 4 months ago

reillyeon commented 4 months ago

While prototyping an implementation of the MLBuffer interface for Chromium's TFLite-based WebNN backend I've encountered a question: What should happen if a developer executes the following code?

let buffer = context.createBuffer(...);
context.writeBuffer(...);
let promise = context.readBuffer();
buffer.destroy();

As currently implemented promise will never be resolved. This is likely undesirable. We probably desire one of the following behaviors:

  1. promise resolves with the contents of the buffer. More generally, the buffer is not destroyed until all pending operations referencing it are completed.
  2. promise rejects (probably with an AbortError or InvalidStateError). More generally, any incomplete operation referencing the buffer is aborted (if possible).

Option 1 is consistent with the idea that readBuffer(), writeBuffer() and dispatch() all queue tasks to run on a "timeline" separate from JavaScript execution. destroy() would be one more such method. Option 2 seems useful for quickly releasing resources, but means the developer must take care to only call destroy() when they are truly done with the buffer (e.g. after awaiting the Promise returned by readBuffer()).

bbernhar commented 3 months ago

I'm in favor of Option 1. This was the original proposal intent #542:

Destroy() gets called on the context timeline but doesn't actually release until the device signals completion.

huningxin commented 3 months ago
let buffer = context.createBuffer(...);
context.writeBuffer(...);
let promise = context.readBuffer();
buffer.destroy();

In the current Chromium prototype, the promise is never resolved nor rejected, because destroy() resets the mojo remote of the buffer. WebNN service still reads the data from GPU but fails (silently) to send the data back to renderer and never resolves the promise in JavaScript.

I guess destory() can be fixed by rejecting the pending promises that basically implements Option 2. WebNN service still needs to wait for GPU readback is done and releases the readback buffer. The backing resource of the destroyed buffer would be released earlier when handling the destory() / disconnection message.

e.g. after awaiting the Promise returned by readBuffer()

I suppose this is the typical usage.

bbernhar commented 3 months ago

Updated https://github.com/webmachinelearning/webnn/issues/543 with additional clarification on content/script timeline: destroy() must reject pending readBuffer promises.

reillyeon commented 3 months ago

The updated proposal defines what happens to the promises returned by readBuffer(), but what about pending dispatch operations? If only one buffer involved in a dispatch is destroyed do we cancel the dispatch and reject from any readBuffer() calls any of the output buffers?

bbernhar commented 3 months ago

If only one buffer involved in a dispatch is destroyed do we cancel the dispatch and reject from any readBuffer() calls any of the output buffers?

Thanks for raising the question, Reilly. I don't believe its possible to stop or cancel GPUs from executing in-flight commands without resetting the device. If a dispatch is pending (or still running on-device) then we can allow it to complete which should (on the device timeline) safely release any destroyed buffers used as input/output.

reillyeon commented 3 months ago

Thanks for raising the question, Reilly. I don't believe it's possible to stop or cancel GPUs from executing in-flight commands without resetting the device. If a dispatch is pending (or still running on-device) then we can allow it to complete which should (on the device timeline) safely release any destroyed buffers used as input/output.

In that case I'm not sure of the value of causing readBuffer() to reject since the compute is still going to happen and we don't actually free any resources immediately.

bbernhar commented 3 months ago

I don't feel strongly either way. However, I think it's atypical to call destroy() without first waiting on readBuffer(). Therefore, I'm okay with disallowing it and rejecting the current approach.

reillyeon commented 3 months ago

Thinking about this more with the additional context of MLContext.destroy() I think rejecting the promises immediately is the right behavior as it will make these two methods consistent, even though it makes the overall behavior somewhat inconsistent with the concept that MLBuffer operations all occur on the device timeline.

bbernhar commented 3 months ago

even though it makes the overall behavior somewhat inconsistent with the concept that MLBuffer operations all occur on the device timeline.

Could we have MLBuffer operations be defined across two timelines: script and device/queue. Then calling destroy() would reject pending promises and as its last step, issue the release operation for the device/queue timeline, as its first step.

MLContext.destroy() could effectively call MLBuffer.destroy()... But this leads me to wonder, why are MLGraph(s) not destroyable too? Like MLBuffer, MLGraph can hold copies of buffers (ie. a device-internal representation) which could benefit from predictable release (and consistent destruction).

reillyeon commented 3 months ago

Could we have MLBuffer operations be defined across two timelines: script and device/queue. Then calling destroy() would reject pending promises and as its last step, issue the release operation for the device/queue timeline, as its first step.

Yes, I think explicitly splitting the timelines in the method's algorithm will make this clear for developers and implementers.

MLContext.destroy() could effectively call MLBuffer.destroy()... But this leads me to wonder, why are MLGraph(s) not destroyable too? Like MLBuffer, MLGraph can hold copies of buffers (ie. a device-internal representation) which could benefit from predictable release (and consistent destruction).

+1 to adding MLGraph.destroy() and MLGraphBuilder.destroy() methods.