bbernhar closed this issue 5 months ago.
@bbernhar I'm still not quite clear on the problem statement. Can you please clarify on what we think is the problem here?
@wchao1115 Sure.
WebNN (as spec'd) and WebGPU lack a way of sharing tensor data on-device directly with each other: GPUBuffer
is inaccessible to WebNN and WebGPU does not support NPU buffers. I believe a sharable NPU or GPU buffer type that WebGPU could use is the only path forward for us to support WebGPU interop for WebNN (ex. custom ops). The other problem is chained inferences: WebNN has no means to re-use existing GPU or NPU results between compute()
calls without copying everything back to the CPU.
- Give the WebNN developer control of device storage to avoid round-trips to/from the CPU.
I think this feature would be critical for some language models' performance on device (GPU/NPU) where the outputs of the previous inference, e.g., hidden state or KV pairs, will be used as inputs of the next inference. For such use case, frameworks usually allow to allocate on-device tensors and use these tensors as inputs/outputs for model inference, for example ONNXRuntime I/O Binding.
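For illustration, a rough sketch of that pattern against the API shapes proposed below (createBuffer/dispatch/readBuffer; mlContext, decoderGraph, tokenBuffer and the buffer sizes are assumed to exist, and the input/output names are invented) might look like:
// Hypothetical sketch: keep the KV cache on-device across decode steps
// instead of reading it back to the CPU between inferences.
let kvIn = mlContext.createBuffer({size: kvCacheSize});
let kvOut = mlContext.createBuffer({size: kvCacheSize});
const logitsBuffer = mlContext.createBuffer({size: logitsSize});
for (let step = 0; step < numSteps; ++step) {
  mlContext.dispatch(
    decoderGraph,
    /*inputs=*/{tokens: tokenBuffer, kv_in: kvIn},
    /*outputs=*/{logits: logitsBuffer, kv_out: kvOut},
  );
  [kvIn, kvOut] = [kvOut, kvIn];  // reuse device memory, no CPU round-trip
}
const logits = await mlContext.readBuffer(logitsBuffer);  // only the final read-back touches the CPU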
Since an MLBuffer
is created from an MLContext
, I infer from the proposal that the buffer is always bound to the context. Is that correct?
If so, the read/write operations could be simplified by moving them onto the interface rather than needing to pass the buffer.
Nitpick on proposed API shape: MLBuffer? createBuffer(...)
implies null
can be returned; exceptions should be preferred for synchronous error cases.
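In other words (a sketch only; the descriptor contents and error types here are assumptions), the failure mode would be an exception rather than a null check:
try {
  const buffer = mlContext.createBuffer({size: bufferSize});
  // use buffer ...
} catch (e) {
  // e.g. a TypeError or OperationError for an invalid descriptor,
  // instead of having to test `buffer === null`
}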
Thanks @inexorabletash for the feedback.
Read/write ops must occur in the domain of the context because only the context determines device-execution order, not MLBuffer
. We could consider supporting a direct-mapped MLBuffer, which would support that line of thinking.
If so, the read/write operations could be simplified by moving them onto the interface rather than needing to pass the buffer.
My thoughts, too, but MLContext is indeed the main encapsulator here, and IIUC MLBuffer
is a control object used for identification and indirect control of the underlying opaque data. So passing an MLBuffer
as an argument does not necessarily involve any copying, just recording what to do with the given buffer. It is the explicit methods that control the content of the buffer.
Ah, thank you for clarifying. The timeline mentions are subtle; I hadn't internalized that yet. One more thing to specify concretely in the spec. :)
FWIW, it's best to make API proposals by leading with examples of the JS code using the proposed API, and only worry about providing the IDL later.
FWIW, it's best to make API proposals by leading with examples of the JS code using the proposed API, and only worry about providing the IDL later.
Unfortunately, no code example tells you which timeline gets used where; only the WebNN spec can describe this behavior: which state is available to which operations. The WebNN programming model should probably concretely define these "timelines" and then describe the entire API in terms of them.
Hi @bbernhar, thanks for this proposal (and the Chromium prototype)! I agree that specifying a clear WebNN <-> WebGPU interop is needed. I have a comment and a few questions for ya.
WebNN (as spec'd) and WebGPU lack a way of sharing tensor data on-device directly with each other:
GPUBuffer
is inaccessible to WebNN and WebGPU does not support NPU buffers. I believe a sharable NPU or GPU buffer type that WebGPU could use is the only path forward for us to support WebGPU interop for WebNN (ex. custom ops). The other problem is chained inferences: WebNN has no means to re-use existing GPU or NPU results between compute()
calls without copying everything back to the CPU.
This proposal includes a clear way to read back to JS/CPU using readBuffer()
and writeBuffer()
, but doesn't mention how data is expected to be passed to/from WebGPU. Could you elaborate on the plans to interop MLBuffer
with WebGPU?
In particular, I'm curious about:
- MLBuffer creation? e.g. converting a WebGPU buffer to an MLBuffer or importing a GPUExternalTexture (perhaps as a not-yet-created GPUExternalBuffer) are the first things which come to mind. Meanwhile, we should consider how this might interact with e.g. the Wasm Memory Control proposal (@dtig FYI)
- As @inexorabletash mentioned, code snippets showing WebGPU <-> WebNN interop would be extremely helpful here :)
a sharable NPU or GPU buffer type that WebGPU could use
Are we expecting to provide any guarantees about where this buffer resides? Would an MLBuffer
live on CPU if created from an MLContext
with a CPU backend, for example?
We need to also take into consideration other platforms where ML execution is not so closely tied to a single "device" as DirectML is (e.g. Core ML on Mac - this is related to discussions in https://github.com/webmachinelearning/webnn/pull/322 around MLContext
creation). I assume we're not expecting to promise zero-copy buffer transfer between WebNN and WebGPU in all scenarios, right?
For synchronous compute, use the read-back functions for window and workers, async and sync, respectively.
Just a heads up that with JSPI coming soon, I would expect pushback on adding this sync interface, even in a worker :)
Some questions:
I think read/writeBuffer
makes sense for XPU <-> CPU transfer.
Perhaps a bit future looking, how do we support GPU <-> NPU/GPU transfers? (e.g. GPU/NPU cooperation, iGPU <-> dGPU, multi-GPU)
From the current design, it looks like developers need to:
- Read GPUBuffer to CPU, block until completion
- Write the buffer on CPU to NPUBuffer, block until completion
- Use NPUBuffer

Is there a faster path (or do we anticipate one) for inter-device transfer? Can we use Intel GPU <-> NPU transfer as an example?
Should we change compute()
and dispatch()
to accept only MLBuffer (i.e. drop TypedArray and ArrayBuffers)?
Is MLBuffer only permitted for binding input/output buffers to a built-graph during compute?
Can MLBuffer be used where a MLOperand is accepted, like in conv2d(inputNode, /*filters=*/ mlBuffer)
?
Should these be defined on the MLBuffer themselves? Looks like read/write operations are dependent on the context the MLBuffer is associated with.
Defining read/write on MLBuffer removes the need to check MLBuffer.context == thisContext
in MLContext.read/writeBuffer.
When is MLBuffer's memory allocated on device? Is it during writeBuffer?
Should MLBuffer.destroy() return a Promise to tell the caller that the memory has been deallocated?
I also wonder if the continuous memory model is too simplified. What if different devices use different channel ordering or endianness?
Are we expecting developers to perform said conversion on CPU manually?
Thanks, @a-sully for raising these questions.
How do we plan to prevent WebNN and WebGPU from stomping over shared resources
If we allow MLBuffer
to be sharable or GPU-transferable then, unlike GPUBuffer, it can synchronize itself when used in WebGPU operations.
Then I think adding a new WebGPU API, GPUDevice.importExternalBuffer
could convert an MLBuffer
directly to GPUBuffer
, leaving the MLBuffer
"detached" or as if it was destroyed. To restore access [to WebNN], one could re-import it using MLContext.importExternalBuffer
or dispose of the GPUBuffer
(exact names are TBD).
That code could look like this:
// Create sharable buffer in WebNN
ml_context = ML.createContext(wgpuDevice);
ml_buffer = ml_context.createBuffer({size:size, forExport:true});
// Import buffer to WebGPU
gpu_buffer = wgpuDevice.importExternalBuffer(ml_buffer);
pipeline = wgpuDevice.createComputePipeline(/* pipeline with compute shader that updates gpu_buffer */);
bind_group = wgpuDevice.createBindGroup(/* create bind group for gpu_buffer */);
command_encoder = wgpuDevice.createCommandEncoder();
pass = command_encoder.beginComputePass();
pass.setPipeline(pipeline);
pass.setBindGroup(/* buffer index in shader */, bind_group);
pass.dispatchWorkgroups(/* sizes */);
pass.end();
wgpuDevice.queue.submit([command_encoder.finish()]);
// Export buffer from WebGPU
ml_buffer = ml_context.importExternalBuffer(gpu_buffer);
Are we expecting to provide any guarantees about where this buffer resides?
Yes, it will reside on the same device used to create the MLContext. If it is a CPU-only ML context, then WebNN should create a CPU-backed MLBuffer.
I assume we're not expecting to promise zero-copy buffer transfer between WebNN and WebGPU in all scenarios, right?
Right. If you are on the same GPU, the result allows zero-copy. Otherwise, a GPU copy is usually required for NPU-to-GPU or iGPU-to-dGPU or video-to-tensor conversions.
Just a heads up that with JSPI coming soon, I would expect pushback on adding this sync interface, even in a worker :)
Thanks for the heads up.
@wacky6 Great questions, my thoughts below.
Can we use Intel GPU <-> NPU transfer as an example?
The only true "no copy" path I'm aware of is CPU/iGPU. I believe all other scenarios require GPU/NPU copy.
Should we change compute() and dispatch() to accept only MLBuffer (i.e. drop TypedArray and ArrayBuffers)?
I am also in favor of using MLBuffer
everywhere and only having dispatch()
. However, I'm told WebNN developers may prefer to stick with ArrayBuffers or TypedArrays since using MLBuffer everywhere creates an inconvenience.
Is MLBuffer only permitted for binding input/output buffers to a built-graph during compute?
Currently, yes. In the future, MLBuffer
could also be permitted for interop API. See https://github.com/webmachinelearning/webnn/issues/482#issuecomment-1894189126.
Should these be defined on the
MLBuffer
themselves?
Perhaps the earlier response addresses this? https://github.com/webmachinelearning/webnn/issues/482#issuecomment-1856445691.
When is MLBuffer's memory allocated on device? Is it during writeBuffer?
No, it would be on buffer creation. This avoids generating a fatal OOM at a point where the WebNN developer wouldn't expect it.
Should MLBuffer.destroy() return a Promise to tell the caller that the memory has been deallocated?
MLBuffer.destroy()
would not necessarily guarantee de-allocation since it's preferable to re-use memory.
continuous memory model is too simplified
I will follow up with our Intel NPU teams on whether they plan to introduce complex formats.
@wacky6 and @a-sully , thank you for your feedback.
wacky6 wrote:
How to transfer MLBuffer between devices? I think read/writeBuffer makes sense for XPU <-> CPU transfer. Perhaps a bit future looking, how do we support GPU <-> NPU/GPU transfers? (e.g. GPU/NPU cooperation, iGPU <-> dGPU, multi-GPU)
From the current design, it looks like developers need to:
- Read GPUBuffer to CPU, block until completion
- Write the buffer on CPU to NPUBuffer, block until completion
- Use NPUBuffer Is there a faster path (or do we anticipate one) for inter-device transfer? Can we use Intel GPU <-> NPU transfer as an example?
In the current proposal, there is no "block until completion" on the JS side for steps 1 or 2. After developers call writeBuffer
to transfer memory, they're free to use the MLBuffer
in one or more dispatch
calls before calling readBuffer
. For WebGPU interop, readBuffer
may not be necessary depending on the scenario.
The proposal does not talk about WebGPU/WebNN interop but I agree with Bryan about having an importExternalBuffer
API in WebGPU which will turn an MLBuffer
into a GPUBuffer
. The implementation of importExternalBuffer
would handle synchronizing the NPU and GPU device such that WebGPU reads see WebNN writes. While the buffer is imported as a GPUBuffer
, developers would not be able to use MLBuffer
until it is relinquished by WebGPU
using a different API. Similar synchronization would need to happen such that WebNN reads see WebGPU writes.
FYI, WebGPU already has a similar relationship with other web APIs such as video frames. See Importing External Textures.
wacky6 wrote:
MLBuffer usage scope Is MLBuffer only permitted for binding input/output buffers to a built-graph during compute? Can MLBuffer be used where a MLOperand is accepted, like in conv2d(inputNode, /*filters=*/ mlBuffer)?
As Bryan says, MLBuffer
can only be used as an input/output of a graph. If we allow an MLBuffer
to be used as an MLOperand
down the road, we need to make the spec clear (as is already the case for JS arrays) that a copy is made of the contents of the MLBuffer
at compilation time. Since graph compilation uses the contents of the buffer to make optimizations, any writeBuffer
changes made to the MLBuffer
after compilation would be ignored.
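A sketch of the pitfall being described, assuming a hypothetical conv2d overload that accepts an MLBuffer as the filter (not part of the current proposal):
const filterBuffer = mlContext.createBuffer({size: filterSize});
mlContext.writeBuffer(filterBuffer, /*dstOffset=*/0, /*srcData=*/initialWeights);
// Hypothetical: pass an MLBuffer where an MLOperand is accepted.
const output = builder.conv2d(inputOperand, /*filters=*/filterBuffer);
const graph = await builder.build({output});  // filter contents are copied at compilation time
// Ignored by `graph`: compilation already snapshotted the filter contents.
mlContext.writeBuffer(filterBuffer, /*dstOffset=*/0, /*srcData=*/updatedWeights);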
wacky6 wrote:
read/writeBuffer definition Should these be defined on the MLBuffer themselves? Looks like read/write operations are dependent on the context the MLBuffer is associated with.
Defining read/write on MLBuffer removes the need to check MLBuffer.context == thisContext in MLContext.read/writeBuffer.
I would prefer that we keep read/writeBuffer on the context so that it is clearer to web developers that those operations are queued relative to dispatch operations. WebGPU works in a similar manner. See GPUQueue.
wacky6 wrote:
MLBuffer memory management When is MLBuffer's memory allocated on device? Is it during writeBuffer? Should MLBuffer.destroy() return a Promise to tell the caller that the memory has been deallocated?
I agree with Bryan the memory should be allocated when the buffer is created.
Both WebGPU and WebGL have similar destroy
methods. In neither case is a promise returned. When do you expect a WebNN developer would use this?
@bbernhar and @RafaelCintron thanks for the explanations! The code example is very helpful
TLDR I'd like to raise some issues which I think are necessary to resolve before this proposal can move forward. I don't have any concrete proposals since I'm still learning this area, but I would appreciate confirmation that the raised issues do need to be tackled. I'm also very happy to help work these out together :)
The proposal does not talk about WebGPU/WebNN interop
I believe that if we're to go forward with MLBuffer
, WebGPU interop needs to be considered from the start rather than assuming we can patch it on later. If I'm reading between the lines correctly here, the WebNN "timelines" mentioned above will have to be closely integrated with WebGPU timelines. Also, given that we'll be hooking into the internals of WebGPU, we need to play by WebGPU's rules, e.g. around buffer usage. I think we need to at minimum:
(1) define the usage of an MLBuffer at creation, and (2) specify how synchronization with WebGPU will work.

With regards to (1), let's look at a snippet from the example above:
// ...
wgpuDevice.queue.submit([command_encoder.finish()]);
// Export buffer from WebGPU
ml_buffer = ml_context.importExternalBuffer(gpu_buffer);
Presumably this code does not suggest that we are synchronously mapping/copying the GPUBuffer
into the MLBuffer
(or else some synchronization would be required in JS between these statements), but rather that the gpu_buffer
will be mapped/copied to ml_buffer
once gpu_buffer
's contents are ready to be accessed. So a contrived example to read the contents of the GPU buffer via an MLBuffer
might look like:
// `gpuBuffer` is used in some WebGPU work submitted here
wgpuDevice.queue.submit([commandEncoder.finish()]);
// Inform WebGPU to map/copy `gpuBuffer` to `mlBuffer` once
// `gpuBuffer`'s contents are ready to be accessed.
const mlBuffer = mlContext.importExternalBuffer(gpuBuffer);
// Queue this work behind the importExternalBuffer() call on a WebNN timeline.
// This implicitly awaits all WebGPU work involving `gpuBuffer`
const gpuBufferContentsCopiedToJsBuffer = await mlContext.readBuffer(mlBuffer);
Note that readBufferSync()
would be functionally equivalent to the much-discussed GPUBuffer.mapSync()
if the import doesn't require a copy, which is why I expect it to receive pushback :)
What's actually happening here? How can the user agent know whether it can map gpuBuffer
to mlBuffer
or whether it will need to make a copy? This operation should only be valid if:
- the GPUBuffer is read-only,
- the GPUBuffer is invalidated afterwards, or
- ...

Since the usages of a GPUBuffer are known, we may be able to do this. That being said, mapping a GPUBuffer to an MLBuffer will still require abiding by all of WebGPU's constraints - e.g. that these usage flags must be constant within a usage scope, and changing the state of a GPUBuffer will need to be scheduled on a queue timeline.
Let's think about the reverse scenario of WebNN -> WebGPU mapping:
// Inform WebNN to map/copy `mlBuffer` to `gpuBuffer` once
// `mlBuffer`'s contents are ready to be accessed
const gpuBuffer = wgpuDevice.importExternalBuffer(mlBuffer);
How can the user agent know whether it can map mlBuffer
to gpuBuffer
or whether it will need to make a copy?
When importing a GPUExternalTexture
, that texture is a snapshot which may not change. Presumably the imported MLBuffer
must be guaranteed to be read-only to be mapped to a WebGPU buffer, as well?
Buffer usage is always assumed on first access (ex. passed as outputs assumes output usage).
This does not seem feasible - especially if we expect the MLBuffer's memory to be allocated on buffer creation. For example, what's the implied usage here?
mlContext.dispatch(
graph,
/*inputs=*/{buffer: someMlBuffer},
/*outputs=*/{buffer: someMlBuffer},
);
Edge cases aside, let's look at an example of chained inference - the other use case for MLBuffer
:
const inputMlBuffer = mlContext.createBuffer({size: inputSize});
const intermediateMlBuffer = mlContext.createBuffer({size: intermediateSize});
const outputMlBuffer = mlContext.createBuffer({size: outputSize});
mlContext.writeBuffer(
inputMlBuffer,
/*dstOffset=*/0,
/*srcData=*/someJsArrayBuffer,
);
mlContext.dispatch(
graph,
/*inputs=*/{buffer: inputMlBuffer},
/*outputs=*/{buffer: intermediateMlBuffer},
);
// Feed the output of one execution as the input to the next. Chained inference!
mlContext.dispatch(
graph,
/*inputs=*/{buffer: intermediateMlBuffer},
/*outputs=*/{buffer: outputMlBuffer},
);
const resultBuffer = await mlContext.readBuffer(outputMlBuffer);
Seems great! Now, where exactly will these buffers be allocated?
This snippet from the WebGPU explainer gives us a hint that we can't both (1) allocate on creation and (2) not know the usage upfront - at least, not without sacrificing something (e.g. performance, extra copies):
The physical memory location for a GPUBuffer’s underlying buffer depends on whether it should be mappable and whether it is mappable for reading or writing
To make this concrete - the Chromium prototype's DML implementation allocates memory for an MLBuffer
in the GPU process using the D3D12_RESOURCE_STATE_UNORDERED_ACCESS
flag and with D3D12_HEAP_FLAG_NONE
. Depending on the usage (and the device architecture), this may be suboptimal. For example, on discrete GPUs, generally resources that can be mapped to the CPU will not perform well when used as D3D12_RESOURCE_STATE_UNORDERED_ACCESS
on the GPU. And if the MLBuffer
is to be used for mapping to CPU buffers, "upload" or "readback" heaps are more appropriate.
This proposal doesn't use the words "mapping", but what's being proposed here is effectively mapping for MLBuffers:
| Buffer type | From JS to *Buffer | From *Buffer to JS |
| --- | --- | --- |
| GPUBuffer | GPUQueue.writeBuffer() | GPUBuffer.mapAsync() |
| MLBuffer | MLContext.writeBuffer() | MLContext.readBuffer() |
It seems clear to me that we need to define usage of MLBuffer
at creation. How we define this mapping might be different if we're designing exclusively for WebNN <-> CPU (JS) interop vs. if we want to support mapping buffers between WebNN <-> WebGPU, which is why I think we should take WebGPU into account early on.
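For example, a hypothetical usage-at-creation shape (the MLBufferUsage flag names below are invented for illustration, loosely mirroring GPUBufferUsage) could let the implementation pick an appropriate heap up front:
// Hypothetical: usage declared at creation so the implementation can choose the heap.
const inputMlBuffer = mlContext.createBuffer({
  size: inputSize,
  usage: MLBufferUsage.WRITE_TO | MLBufferUsage.INPUT,   // CPU upload + graph input
});
const outputMlBuffer = mlContext.createBuffer({
  size: outputSize,
  usage: MLBufferUsage.READ_FROM | MLBufferUsage.OUTPUT, // graph output + CPU readback
});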
With regards to (2), let's take the real-time video processing use case as another example. Using MLBuffer
, this might look like:
const applyEffectToFrame = () => {
// Get the frame data as a GPU buffer
// Some way to import directly into an MLBuffer would avoid this step
const gpuExternalBuffer = device.importExternalBuffer({source: video});
// Get the frame data into WebNN. The imported buffer is read-only, so this should
// hopefully not require a copy if `mlContext` is tied to the same GPU as `gpuExternalBuffer`
const inputMlBuffer = mlContext.importExternalBuffer(gpuExternalBuffer);
const outputMlBuffer = mlContext.createBuffer({size: inputMlBuffer.size});
// Perform some effects described by `graph` on the frame (e.g. background blur)
const inputs = {buffer: inputMlBuffer};
const outputs = {buffer: outputMlBuffer};
mlContext.dispatch(graph, inputs, outputs);
// Inform WebNN to map/copy `outputMlBuffer` - which contains the resulting
// frame after effects have been applied - to `gpuBufferToRender` once
// `outputMlBuffer`'s contents are ready to be accessed
//
// To avoid a copy, `outputMlBuffer`'s contents must be guaranteed not to change
const gpuBufferToRender = wgpuDevice.importExternalBuffer(outputMlBuffer);
// create a bind group for `gpuBufferToRender`, create a command encoder, etc.
// asking WebGPU to render `gpuBufferToRender`
// ...
// These queued commands must block on completion of the `dispatch()` call above
wgpuDevice.queue.submit([commandEncoder.finish()]);
// Call this method for each frame
video.requestVideoFrameCallback(applyEffectToFrame);
}
Without any additional synchronization, the commands submitted to the GPUQueue
must block on completion of MLContext.dispatch()
. It seems that the GPUQueue
must either:
- wait for a signal from the MLContext that outputMlBuffer is available, or
- have insight into whichever queue/timeline the MLContext is running on?

My understanding is that https://github.com/webmachinelearning/webnn/issues/264 was attempting to specify the latter by describing WebNN execution in an MLCommandEncoder
to be submitted to a GPUQueue
. That would naturally queue the commandEncoder
's commands behind the WebNN workload, which would effectively handle the synchronization. But how would this work if the WebNN workload cannot be expressed in terms of GPU commands?
FYI, WebGPU already has a similar relationship with other web APIs such as video frames. See Importing External Textures.
My (limited) understanding of WebGPU's relationship with video frames is that the former behavior does not exist? Consider a basic rendering loop with WebGPU:
const render = () => {
// Get the frame data as a GPU buffer
const gpuExternalBuffer = device.importExternalBuffer({source: video});
// create a bind group for `gpuExternalBuffer`, create a command encoder,
// beginRenderPass, etc
// ...
// Queue a bunch of commands to the GPUQueue, which will eventually render to
// a WebGPU canvas
wgpuDevice.queue.submit([commandEncoder.finish()]);
// This method registers a callback which will be fired once a new frame is
// sent to the compositor
video.requestVideoFrameCallback(render);
}
The encoded GPU commands will eventually "update the rendering of a WebGPU canvas", which in turn calls these steps in the HTML spec, which in turn (eventually) runs the animation frame or video request frame callbacks... which triggers the render()
function and so forth. There are hooks into other APIs, but (as far as I'm aware) the GPUQueue
does not pause execution while waiting on other APIs.
I think we need more details as to how WebGPU and WebNN synchronization will work :)
I had a couple thoughts while looking through @a-sully's comment:
- MLContext.importExternalBuffer() should be an async method with semantics equivalent to mapAsync() + getMappedRange() + set(). That is, it should wait for WebGPU to be done with the buffer, map it into a context where it can be read by the device the MLContext is associated with, and then either copy it into a new MLBuffer or wrap it with an MLBuffer until it is destroyed or converted back to a GPUBuffer by a call to GPUDevice.importExternalBuffer() (equivalent to unmap()).
- GPUDevice.importExternalBuffer() is easier because, assuming that MLContext.dispatch() has similar semantics to MLContext.compute(), the buffer would be in some kind of "transferred" state and couldn't be passed back to WebGPU until WebNN is done with it.
- Given that an MLBuffer might be in non-CPU memory associated with an NPU (similar to how a GPUBuffer might be in GPU memory), I recommend not using the suggested readBuffer() and writeBuffer() methods but instead copying the mapAsync() and getMappedRange() design from WebGPU so that developers don't have to learn two different ways of interacting with external buffers.

Both WebGPU and WebGL have similar destroy methods. In neither case is a promise returned. When do you expect a WebNN developer would use this?
I'm thinking about scenarios where explicit memory management is needed. Say, for example, a developer wants to ensure the GPU memory used by WebNN has been deallocated before allocating another chunk of memory on the GPU (e.g. call WebGPU/WebGL immediately after they finish WebNN operations).
My question is whether WebNN needs to provide an explicit synchronization point mechanism to the developer.
Or do we expect GPU service / GPU drivers to handle this transparently? Could "queue for webnn memory for deallocation, allocate webgpu memory (could this OOM?), webnn memory deallocated" happen?
@a-sully Appreciate the questions and help, responses below.
@a-sully wrote
It seems clear to me that we need to define usage of MLBuffer at creation.
Good point to clarify. I think it's easiest to spec MLBuffer to have both input and output usage at creation. This is equivalent to GPUBuffer's resource usage bits (storage | storage_read), which gets re-created from MLBuffer's resource upon importExternalBuffer()
. Note: no mapping or copy is required to transfer/import a GPU resource from the same device.
@a-sully wrote
How can the user agent know whether it can map mlBuffer to gpuBuffer or whether it will need to make a copy?
An MLBuffer could be transferred/imported as-is without a copy if the User Agent determines the MLContext and GPUDevice support zero-copy (ex. same adapter).
@a-sully wrote
Seems great! Now, where exactly will these buffers be allocated?
The MLBuffer allocates its "default" device resource persistently from the device/context used to create it. The exact location is opaque to the WebNN developer since it's runtime-managed. Similarly, "upload" and "readback" heaps are allocated/managed from MLContext upon writeBuffer()
and readBuffer()
, respectively. This is what my full proof-of-concept does [1].
But how would this work if the WebNN workload cannot be expressed in terms of GPU commands?
In order to transfer MLBuffer, MLContext must be flushed prior, which occurs on importExternalBuffer()
. So in effect, WebGPU borrows the buffer WebNN rents out: the queue maintains exclusive access. Because this synchronization happens upon explicit resource use, there is no need to teach WebGPU about new commands. This is a similar approach to how video import works.
[1] https://chromium-review.googlesource.com/c/chromium/src/+/5101781
Bryan
But how would this work if the WebNN workload cannot be expressed in terms of GPU commands?
In order to transfer MLBuffer, MLContext must be flushed prior, which occurs on
importExternalBuffer()
. So in effect, WebGPU borrows the buffer WebNN rents out: the queue maintains exclusive access. Because this synchronization happens upon explicit resource use, there is no need to teach WebGPU about new commands. This is a similar approach to how video import works.
Just to clarify, are you suggesting the synchronization strategy is to pause work from one API while the other API is using the buffer? e.g. WebNN must pause all work (from the MLContext
the MLBuffer
was created from, at least) while WebGPU holds the buffer, to ensure it does not change out from under WebGPU (which would violate WebGPU's expectations).
What about the reverse? This is where WebNN is different from the video import example, as far as I can tell. Do we expect WebGPU to block all execution while a buffer is rented out to WebNN?
This is related to the point I was trying to make in this comment (though in that case WebNN is renting to WebGPU):
How can the user agent know whether it can map mlBuffer to gpuBuffer or whether it will need to make a copy?
The user agent needs to know whether mlBuffer
will be modified while WebGPU holds gpuBuffer
; otherwise a copy would have to be made... But of course that's only relevant if WebNN is allowed to do work while WebGPU holds the buffer :)
Just to clarify, are you suggesting the synchronization strategy is to pause work from one API while the other API is using the buffer?
Yup. MLBuffer
cannot be written to from multiple APIs simultaneously. We could allow simultaneous read access for MLBuffer
. WebNN must ensure any enqueued writes are completed prior to import. However, WebNN could continue work on another MLBuffer
that is not imported.
Do we expect WebGPU to block all execution while a buffer is rented out to WebNN?
WebGPU doesn't rent-out MLBuffer
, only WebNN does that. Once WebGPU has finished using the GPUBuffer it created from the import, it could throw it away or export it back. The difference between WebNN and video is that MLBuffer
is restored instead of recycled.
@reillyeon thanks for the comments.
@reillyeon wrote
Given that an MLBuffer might be in non-CPU memory associated with an NPU (similar to how a GPUBuffer might be in GPU memory) I recommend not using the suggested readBuffer() and writeBuffer() methods but instead copying the mapAsync() and getMappedRange() design from WebGPU so that developers don't have to learn two different ways of interacting with external buffers.
MLContext.importExternalBuffer
would be more semantically equivalent to GPUDevice.importExternalTexture
, whereas GPUBuffer.mapAsync
+ GPUBuffer.getMappedRange
is semantically equivalent to MLContext.readBuffer
and MLContext.writeBuffer
. FYI, this is why WebGPU has GPUQueue.writeBuffer.
WebGPU has a concept of CPU-accessible device memory (eg. staging buffer) which relies on copyBufferToBuffer
to get this data into a GPU-accessible buffer. The WebNN context is more like a WebGL context here, which has glReadPixels
, because otherwise it's not obvious when CPU data is applied w/o explicit queue operations.
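To make the comparison concrete: reading back a storage buffer in WebGPU today requires an explicit staging copy on the queue, whereas the proposal folds that staging step into one context method. In this sketch, storageBuffer, size, mlBuffer and mlContext are assumed to exist; the WebGPU calls are the existing API.
// WebGPU: copy into a MAP_READ staging buffer via the queue, then map it.
const staging = wgpuDevice.createBuffer({
  size,
  usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
});
const encoder = wgpuDevice.createCommandEncoder();
encoder.copyBufferToBuffer(storageBuffer, 0, staging, 0, size);
wgpuDevice.queue.submit([encoder.finish()]);
await staging.mapAsync(GPUMapMode.READ);
const gpuResult = staging.getMappedRange();
// Proposed WebNN equivalent: the context performs the staging copy internally.
const mlResult = await mlContext.readBuffer(mlBuffer);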
@wacky6 appreciate the comments, responses below
@wacky6 wrote
My question is that whether WebNN need to provide an explicit synchronization point mechanism to the developer.
If the web developer forgets to synchronize MLBuffer
, we can wind up with undefined behavior. I prefer we synchronize on their behalf to avoid such calamity.
@wacky6 wrote
Or do we expect GPU service / GPU drivers to handle this transparently?
Yup. I would expect WebNN, like WebGPU, to manage resources on the web developer's behalf (or by the GPU service).
@wacky6 wrote
Could "queue for webnn memory for deallocation, allocate webgpu memory (could this OOM?), webnn memory deallocated" happen?
Yes, it could. WebNN memory would get deallocated, say upon calling MLBuffer.destroy(), before WebGPU executes the submitted work using it, after importExternalTexture(). Internally, MLBuffer provides that synchronization point (i.e. a fence) to ensure the GPU wait is satisfied first. If you're curious to see how this works, see https://dawn-review.googlesource.com/c/dawn/+/171160.
[@wacky6] My question is that whether WebNN need to provide an explicit synchronization point mechanism to the developer. [@bbernhar] If the web developer forgets to synchronize MLBuffer, we can wind up with undefined behavior. I prefer we synchronize on their behalf to avoid such calamity.
+1 to @bbernhar's reply.
I think that having an explicit transfer (via import APIs) between WebGPU and WebNN should be enough to satisfy our requirements. What we spec should clearly state that only one API should have access to the buffer at a time. While an MLBuffer is transferred to WebGPU, queueing new work to it via WebNN should be an error. Similarly, while an MLBuffer has been transferred back to WebNN, queuing new work to it via WebGPU should be an error.
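A sketch of that ownership rule (method names follow the discussion above; the exact error behavior is TBD):
// MLBuffer is now owned by WebGPU; using it from WebNN should fail.
const gpuBuffer = wgpuDevice.importExternalBuffer(mlBuffer);
mlContext.dispatch(graph, {input: mlBuffer}, {output: otherMlBuffer}); // error: buffer is transferred
// ... record and submit WebGPU work that uses gpuBuffer ...
// Ownership returns to WebNN; further WebGPU use should fail.
mlContext.importExternalBuffer(gpuBuffer);
wgpuDevice.queue.writeBuffer(gpuBuffer, 0, someData); // error: buffer was transferred back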
Note that for scenarios where developers want to source input from 2D raster-type data (image elements, media, camera, canvas elements, image bitmaps, etc.) there will already need to be a conversion (tensorization) step from the raster image to a buffer for ingestion into WebNN. You can't really "zero copy optimize" this transfer operation. The web developer must tell the browser (via a WebGPU compute shader) how they'd like the raster image data to be arranged in the buffer. The same thing is true if you want to visualize the result of the ML operation via a 2D image. You'll need to import the buffer to WebGPU so that you can convert it to a GPUTexture
via a WebGPU compute shader.
[@wacky6] Or do we expect GPU service / GPU drivers to handle this transparently? Could "queue for webnn memory for deallocation, allocate webgpu memory (could this OOM?), webnn memory deallocated" happen?
In the current proposal, MLBuffer destruction happens when you call the destroy method on the object. WebGPU has a similar setup where GPUBuffer destruction is done via a destroy method on GPUBuffer instead of queuing such an operation. See buffer destruction. Once you destroy a buffer, you can not queue new operations with it. Any inflight operations using the buffer complete before storage for the buffer is released. The browser is responsible for ensuring everything happens in a defined manner with no crashes or use-after-frees.
Similar to how video textures work in WebGPU, calling GPUBuffer.destroy
on the WebGPU side should not release memory for the MLBuffer. The memory is owned by WebNN, same as video textures are owned by the media engine.
[@reillyeon] Given that an MLBuffer might be in non-CPU memory associated with an NPU (similar to how a GPUBuffer might be in GPU memory) I recommend not using the suggested readBuffer() and writeBuffer() methods but instead copying the mapAsync() and getMappedRange() design from WebGPU so that developers don't have to learn two different ways of interacting with external buffers.
As @bbernhar pointed out, WebGPU already has a writeBuffer API which has the same parameters as this proposal. For WebNN, readBuffer
is different, but simpler. If people feel strongly that we replace readBuffer
with exactly the same API shape as WebGPU, I would be OK with that.
[@reillyeon] MLContext.importExternalBuffer() should be an async method with semantics equivalent to mapAsync() + getMappedRange() + set(). That is, it should wait for WebGPU to be done with the buffer, map it into a context where it can be read by the device the MLContext is associated with, and then either copied into a new MLBuffer or wrapped by an MLBuffer until it is destroyed or converted back to a GPUBuffer by a call to GPUDevice.importExternalBuffer() (equivalent to unmap()).
MLContext.importExternalBuffer
is asynchronous in the sense that calling it enqueues work in another process which runs on the device (NPU or GPU) timeline. The implementation will ensure that inflight WebGPU writes to the buffer finish before WebNN reads the buffer. GPUDevice.importExternalBuffer
, likewise, will ensure that all WebNN writes to the buffer finish before any WebGPU reads of the buffer. Unless I am missing something, I do not think the above requires that either import method be a promise-returning one that causes the JS thread to wait.
The only time the JS thread will need to wait is when you call readBuffer
to ask for the contents of the buffer on the CPU. Since the device could be busy with the buffer, readBuffer
must return a promise to avoid jank.
Thank you for your patience while I learn enough about the WebGPU primitives here to be able to appreciate the complexity behind this proposal. Overall I think this is the right direction. At this point my feedback is mainly editorial. I'd like to see as much symmetry with the WebGPU APIs as possible, which mainly means removing the readBuffer()
/readBufferSync()
methods and copying WebGPU's mapping concept to support reading buffer contents from script instead. I better understand now the need for writeBuffer()
.
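Concretely, the symmetry being asked for might look something like this (mapAsync/getMappedRange/unmap on MLBuffer are hypothetical, mirroring GPUBuffer; readBuffer is the shape currently in this proposal):
// Hypothetical mapping-style read, mirroring GPUBuffer:
await mlBuffer.mapAsync(/* read */);
const view = new Float32Array(mlBuffer.getMappedRange());
// ... consume results ...
mlBuffer.unmap();
// Shape currently proposed in this issue:
const resultArrayBuffer = await mlContext.readBuffer(mlBuffer);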
I think the semantics discussed above around serializing WebNN and WebGPU operations which touch the same buffer make sense. To actually specify them I think we need to update the specification to be significantly more specific about timelines and introduce the pattern common to the WebGPU specification of providing "content timeline" and "device timeline" steps. Something I don't think the WebGPU specification has had to deal with yet is the potential for multiple interacting device timelines, as we would see in the case of WebGPU interacting with a non-GPU MLContext
. I think this is all plausible with the appropriate use of fences. It just needs to be specified very explicitly, though we may need an MLQueue
concept to match the GPUQueue
concept. I'm still unsure about that part.
Since the importExternalBuffer()
methods assume that the buffer was originally created as an MLBuffer
, I think it would help if they didn't look so symmetrical. For example, the functionality could be centralized in the MLBuffer
or MLContext
interfaces as a pair of mapAsGpuBuffer()
and unmapFromGpuBuffer()
methods.
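For example, the asymmetric shape suggested here might read as follows (a sketch; whether these methods return promises and whether they live on MLBuffer or MLContext is still open):
// WebNN lends the buffer to WebGPU...
const gpuBuffer = await mlBuffer.mapAsGpuBuffer(wgpuDevice);
// ... record and submit WebGPU work that uses gpuBuffer ...
// ...and takes it back; gpuBuffer becomes invalid afterwards.
mlBuffer.unmapFromGpuBuffer();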
Thank you @bbernhar, @RafaelCintron and many others for the proposal and deep thought on this issue. As I mentioned in the WG call this morning, although conceptually this proposal could work with any device buffer, a currently motivated use case is the WebNN/WebGPU interop scenario. The previous proposal (i.e. MLCommandEncoder
) was thought to be non-implementable on the WebGPU side, so I think one way to make this proposal more concrete is to demonstrate through a small working prototype that it can actually be implemented.
@reillyeon Thanks for the feedback. Overall, the comments make sense to me; in particular, the lack of MLQueue
. This shaped my motivation for MLContext.readBuffer
.
@reillyeon wrote
I'd like to see as much symmetry with the WebGPU APIs as possible, which mainly means removing the readBuffer()/readBufferSync() methods and copying WebGPU's mapping concept to support reading buffer contents from script instead. I better understand now the need for writeBuffer().
WebGPU's buffer mapping relies on copyBufferToBuffer (recorded into a command encoder and submitted to a GPUQueue) to work. Since WebNN lacks its own GPUQueue-equivalent type, MLBuffer.mapAsync would need to invoke an internal GPU copy using the MLContext
before/after dispatch begins/ends. This is a different operation than GPUBuffer.mapAsync
which was meant for CPU reads/writes. The WebGPU developer could not interact with WebNN's mapping API the same way they did in WebGPU.
FYI: I just posted https://github.com/webmachinelearning/webnn/pull/541, which continues this discussion in the form of (heavily annotated) sample code for the key use cases we've identified here.
Please take a look when you get a chance!
Thanks, @a-sully. Will do. I agree with you that this issue is quite large; esp. with interop in view.
So, I propose we split #482 into 4 separate sub-issues, where we can more easily discuss and reach consensus on the individual parts but in a logical order:
- creation (MLContext.createBuffer)
- ...

I would be happy to initiate / open these issues if there are no objections.
SGTM. Thanks @bbernhar!
Closing: this issue has been replaced by smaller sub-issues, which I encourage we use for discussion instead. https://github.com/webmachinelearning/webnn/issues?q=is%3Aissue+is%3Aopen+in%3Atitle+MLBuffer+
This issue proposes a new opaque device-specific storage type in WebNN, MLBuffer. MLBuffer is a backend-agnostic storage type (CPU, GPU, NPU, etc.) which can be used in WebNN operations. MLBuffer would be the solution to:
- WebGPU interop (sharing tensor data on-device between WebNN and WebGPU)
- Chained inference (re-using GPU or NPU results between compute calls without copying back to the CPU)

Construction/Destruction
- The size of the MLBuffer is always known (and linear access is assumed).

Upload/Download tensor data
- A copy of srcData is always made, returning control back to the web developer immediately.

Binding to graphs
- Buffer usage is always assumed on first access (ex. passed as outputs assumes output usage).

Edits:
- Use dispatch instead of overloading compute(), per https://www.w3.org/2023/12/14-webmachinelearning-minutes.html