bbernhar closed this issue 5 months ago.
@bbernhar I'm still not quite clear on the problem statement. Can you please clarify on what we think is the problem here?
@wchao1115 Sure.
WebNN (as spec'd) and WebGPU lack a way of sharing tensor data on-device directly with each other: GPUBuffer
is inaccessible to WebNN and WebGPU does not support NPU buffers. I believe a sharable NPU or GPU buffer type that WebGPU could use is the only path forward for us to support WebGPU interop for WebNN (ex. custom ops). The other problem is chained inferences: WebNN has no means to re-use existing GPU or NPU results between compute()
calls without copying everything back to the CPU.
- Give the WebNN developer control of device storage to avoid round-trips to/from the CPU.
I think this feature would be critical for some language models' performance on device (GPU/NPU) where the outputs of the previous inference, e.g., hidden state or KV pairs, will be used as inputs of the next inference. For such use case, frameworks usually allow to allocate on-device tensors and use these tensors as inputs/outputs for model inference, for example ONNXRuntime I/O Binding.
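For illustration, a rough sketch of that pattern against the API shapes proposed below (createBuffer/dispatch/readBuffer; mlContext, decoderGraph, tokenBuffer and the buffer sizes are assumed to exist, and the input/output names are invented) might look like:
// Hypothetical sketch: keep the KV cache on-device across decode steps
// instead of reading it back to the CPU between inferences.
let kvIn = mlContext.createBuffer({size: kvCacheSize});
let kvOut = mlContext.createBuffer({size: kvCacheSize});
const logitsBuffer = mlContext.createBuffer({size: logitsSize});
for (let step = 0; step < numSteps; ++step) {
  mlContext.dispatch(
    decoderGraph,
    /*inputs=*/{tokens: tokenBuffer, kv_in: kvIn},
    /*outputs=*/{logits: logitsBuffer, kv_out: kvOut},
  );
  [kvIn, kvOut] = [kvOut, kvIn];  // reuse device memory, no CPU round-trip
}
const logits = await mlContext.readBuffer(logitsBuffer);  // only the final read-back touches the CPU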
Since an MLBuffer
is created from an MLContext
, I infer from the proposal that the buffer is always bound to the context. Is that correct?
If so, the read/write operations could be simplified by moving them onto the interface rather than needing to pass the buffer.
Nitpick on proposed API shape: MLBuffer? createBuffer(...)
implies null
can be returned; exceptions should be preferred for synchronous error cases.
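In other words (a sketch only; the descriptor contents and error types here are assumptions), the failure mode would be an exception rather than a null check:
try {
  const buffer = mlContext.createBuffer({size: bufferSize});
  // use buffer ...
} catch (e) {
  // e.g. a TypeError or OperationError for an invalid descriptor,
  // instead of having to test `buffer === null`
}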
Thanks @inexorabletash for the feedback.
Read/write ops must occur in the domain of the context because only the context determines device-execution order, not MLBuffer
. We could consider supporting a direct-mapped MLBuffer, which would support that line of thinking.
If so, the read/write operations could be simplified by moving them onto the interface rather than needing to pass the buffer.
My thoughts, too, but MLContext is indeed the main encapsulator here, and IIUC MLBuffer
is a control object used for identification and indirect control of the underlying opaque data. So passing an MLBuffer
as an argument does not necessarily involve any copying, just recording what to do with the given buffer. It is the explicit methods that control the content of the buffer.
Ah, thank you for clarifying. The timeline mentions are subtle; I hadn't internalized that yet. One more thing to specify concretely in the spec. :)
FWIW, it's best to make API proposals by leading with examples of the JS code using the proposed API, and only worry about providing the IDL later.
FWIW, it's best to make API proposals by leading with examples of the JS code using the proposed API, and only worry about providing the IDL later.
Unfortunately, no code example tells you which timeline gets used where; only the WebNN spec can describe this behavior: which state is available to which operations. The WebNN programming model should probably concretely define these "timelines" and then describe the entire API in terms of them.
Hi @bbernhar, thanks for this proposal (and the Chromium prototype)! I agree that specifying a clear WebNN <-> WebGPU interop is needed. I have a comment and a few questions for ya.
WebNN (as spec'd) and WebGPU lack a way of sharing tensor data on-device directly with each other:
GPUBuffer
is inaccessible to WebNN and WebGPU does not support NPU buffers. I believe a sharable NPU or GPU buffer type that WebGPU could use is the only path forward for us to support WebGPU interop for WebNN (ex. custom ops). The other problem is chained inferences: WebNN has no means to re-use existing GPU or NPU results between compute()
calls without copying everything back to the CPU.
This proposal includes a clear way to read back to JS/CPU using readBuffer()
and writeBuffer()
, but doesn't mention how data is expected to be passed to/from WebGPU. Could you elaborate on the plans to interop MLBuffer
with WebGPU?
In particular, I'm curious about:
- MLBuffer creation? e.g. converting a WebGPU buffer to an MLBuffer or importing a GPUExternalTexture (perhaps as a not-yet-created GPUExternalBuffer) are the first things which come to mind. Meanwhile, we should consider how this might interact with e.g. the Wasm Memory Control proposal (@dtig FYI)
- As @inexorabletash mentioned, code snippets showing WebGPU <-> WebNN interop would be extremely helpful here :)
a sharable NPU or GPU buffer type that WebGPU could use
Are we expecting to provide any guarantees about where this buffer resides? Would an MLBuffer
live on CPU if created from an MLContext
with a CPU backend, for example?
We need to also take into consideration other platforms where ML execution is not so closely tied to a single "device" as DirectML is (e.g. Core ML on Mac - this is related to discussions in https://github.com/webmachinelearning/webnn/pull/322 around MLContext
creation). I assume we're not expecting to promise zero-copy buffer transfer between WebNN and WebGPU in all scenarios, right?
For synchronous compute, use the read-back functions for window and workers, async and sync, respectively.
Just a heads up that with JSPI coming soon, I would expect pushback on adding this sync interface, even in a worker :)
Some questions:
I think read/writeBuffer
makes sense for XPU <-> CPU transfer.
Perhaps a bit future looking, how do we support GPU <-> NPU/GPU transfers? (e.g. GPU/NPU cooperation, iGPU <-> dGPU, multi-GPU)
From the current design, it looks like developers need to:
- Read GPUBuffer to CPU, block until completion
- Write the buffer on CPU to NPUBuffer, block until completion
- Use NPUBuffer

Is there a faster path (or do we anticipate one) for inter-device transfer? Can we use Intel GPU <-> NPU transfer as an example?
Should we change compute()
and dispatch()
to accept only MLBuffer (i.e. drop TypedArray and ArrayBuffers)?
Is MLBuffer only permitted for binding input/output buffers to a built-graph during compute?
Can MLBuffer be used where a MLOperand is accepted, like in conv2d(inputNode, /*filters=*/ mlBuffer)
?
Should these be defined on the MLBuffer themselves? Looks like read/write operations are dependent on the context the MLBuffer is associated with.
Defining read/write on MLBuffer removes the need to check MLBuffer.context == thisContext
in MLContext.read/writeBuffer.
When is MLBuffer's memory allocated on device? Is it during writeBuffer?
Should MLBuffer.destroy() return a Promise to tell the caller that the memory has been deallocated?
I also wonder if the continuous memory model is too simplified. What if different devices use different channel ordering or endianness?
Are we expecting developers to perform said conversion on CPU manually?
Thanks, @a-sully for raising these questions.
How do we plan to prevent WebNN and WebGPU from stomping over shared resources
If we allow MLBuffer
to be sharable or GPU-transferable then, unlike GPUBuffer, it can synchronize itself when used in WebGPU operations.
Then I think adding a new WebGPU API, GPUDevice.importExternalBuffer
could convert an MLBuffer
directly to GPUBuffer
, leaving the MLBuffer
"detached" or as if it was destroyed. To restore access [to WebNN], one could re-import it using MLContext.importExternalBuffer
or dispose of the GPUBuffer
(exact names are TBD).
That code could look like this:
// Create sharable buffer in WebNN
ml_context = ML.createContext(wgpuDevice);
ml_buffer = ml_context.createBuffer({size:size, forExport:true});
// Import buffer to WebGPU
gpu_buffer = wgpuDevice.importExternalBuffer(ml_buffer);
pipeline = wgpuDevice.createComputePipeline(/* pipeline with compute shader that updates gpu_buffer */);
bind_group = wgpuDevice.createBindGroup(/* create bind group for gpu_buffer */);
command_encoder = wgpuDevice.createCommandEncoder();
pass = command_encoder.beginComputePass();
pass.setPipeline(pipeline);
pass.setBindGroup(/* buffer index in shader */, bind_group);
pass.dispatchWorkgroups(/* sizes */);
pass.end();
wgpuDevice.queue.submit([command_encoder.finish()]);
// Export buffer from WebGPU
ml_buffer = ml_context.importExternalBuffer(gpu_buffer);
Are we expecting to provide any guarantees about where this buffer resides?
Yes, it will reside on the same device used to create the MLContext. If it is a CPU-only ML context, then WebNN should create a CPU-backed MLBuffer.
I assume we're not expecting to promise zero-copy buffer transfer between WebNN and WebGPU in all scenarios, right?
Right. If you are on the same GPU, the result allows zero-copy. Otherwise, a GPU copy is usually required for NPU-to-GPU or iGPU-to-dGPU or video-to-tensor conversions.
Just a heads up that with JSPI coming soon, I would expect pushback on adding this sync interface, even in a worker :)
Thanks for the heads up.
@wacky6 Great questions, my thoughts below.
Can we use Intel GPU <-> NPU transfer as an example?
The only true "no copy" path I'm aware of is CPU/iGPU. I believe all other scenarios require GPU/NPU copy.
Should we change compute() and dispatch() to accept only MLBuffer (i.e. drop TypedArray and ArrayBuffers)?
I am also in favor of using MLBuffer
everywhere and only having dispatch()
. However, I'm told WebNN developers may prefer to stick with ArrayBuffers or TypedArrays since using MLBuffer everywhere creates an inconvenience.
Is MLBuffer only permitted for binding input/output buffers to a built-graph during compute?
Currently, yes. In the future, MLBuffer
could also be permitted for interop API. See https://github.com/webmachinelearning/webnn/issues/482#issuecomment-1894189126.
Should these be defined on the
MLBuffer
themselves?
Perhaps the earlier response addresses this? https://github.com/webmachinelearning/webnn/issues/482#issuecomment-1856445691.
When is MLBuffer's memory allocated on device? Is it during writeBuffer?
No, it would be on buffer creation. This avoids generating a fatal OOM at a point where the WebNN developer wouldn't expect it.
Should MLBuffer.destroy() return a Promise to tell the caller that the memory has been deallocated?
MLBuffer.destroy()
would not necessarily guarantee de-allocation since it's preferable to re-use memory.
continuous memory model is too simplified
I will follow up with our Intel NPU teams on whether they plan to introduce complex formats.
@wacky6 and @a-sully , thank you for your feedback.
wacky6 wrote:
How to transfer MLBuffer between devices? I think read/writeBuffer makes sense for XPU <-> CPU transfer. Perhaps a bit future looking, how do we support GPU <-> NPU/GPU transfers? (e.g. GPU/NPU cooperation, iGPU <-> dGPU, multi-GPU)
From the current design, it looks like developers need to:
- Read GPUBuffer to CPU, block until completion
- Write the buffer on CPU to NPUBuffer, block until completion
- Use NPUBuffer Is there a faster path (or do we anticipate one) for inter-device transfer? Can we use Intel GPU <-> NPU transfer as an example?
In the current proposal, there is no "block until completion" on the JS side for steps 1 or 2. After developers call writeBuffer
to transfer memory, they're free to use the MLBuffer
in one or more dispatch
calls before calling readBuffer
. For WebGPU interop, readBuffer
may not be necessary depending on the scenario.
The proposal does not talk about WebGPU/WebNN interop but I agree with Bryan about having an importExternalBuffer
API in WebGPU which will turn an MLBuffer
into a GPUBuffer
. The implementation of importExternalBuffer
would handle synchronizing the NPU and GPU device such that WebGPU reads see WebNN writes. While the buffer is imported as a GPUBuffer
, developers would not be able to use MLBuffer
until it is relinquished by WebGPU
using a different API. Similar synchronization would need to happen such that WebNN reads see WebGPU writes.
FYI, WebGPU already has a similar relationship with other web APIs such as video frames. See Importing External Textures.
wacky6 wrote:
MLBuffer usage scope Is MLBuffer only permitted for binding input/output buffers to a built-graph during compute? Can MLBuffer be used where a MLOperand is accepted, like in conv2d(inputNode, /*filters=*/ mlBuffer)?
As Bryan says, MLBuffer
can only be used as an input/output of a graph. If we allow an MLBuffer
to be used as an MLOperand
down the road, we need to make the spec clear (as is already the case for JS arrays) that a copy is made of the contents of the MLBuffer
at compilation time. Since graph compilation uses the contents of the buffer to make optimizations, any writeBuffer
changes made to the MLBuffer
after compilation would be ignored.
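A sketch of the pitfall being described, assuming a hypothetical conv2d overload that accepts an MLBuffer as the filter (not part of the current proposal):
const filterBuffer = mlContext.createBuffer({size: filterSize});
mlContext.writeBuffer(filterBuffer, /*dstOffset=*/0, /*srcData=*/initialWeights);
// Hypothetical: pass an MLBuffer where an MLOperand is accepted.
const output = builder.conv2d(inputOperand, /*filters=*/filterBuffer);
const graph = await builder.build({output});  // filter contents are copied at compilation time
// Ignored by `graph`: compilation already snapshotted the filter contents.
mlContext.writeBuffer(filterBuffer, /*dstOffset=*/0, /*srcData=*/updatedWeights);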
wacky6 wrote:
read/writeBuffer definition Should these be defined on the MLBuffer themselves? Looks like read/write operations are dependent on the context the MLBuffer is associated with.
Defining read/write on MLBuffer removes the need to check MLBuffer.context == thisContext in MLContext.read/writeBuffer.
I would prefer that we keep read/writeBuffer on the context so that it is clearer to web developers that those operations are queued relative to dispatch operations. WebGPU works in a similar manner. See GPUQueue.
wacky6 wrote:
MLBuffer memory management When is MLBuffer's memory allocated on device? Is it during writeBuffer? Should MLBuffer.destroy() return a Promise to tell the caller that the memory has been deallocated?
I agree with Bryan the memory should be allocated when the buffer is created.
Both WebGPU and WebGL have similar destroy
methods. In neither case is a promise returned. When do you expect a WebNN developer would use this?
@bbernhar and @RafaelCintron thanks for the explanations! The code example is very helpful
TLDR I'd like to raise some issues which I think are necessary to resolve before this proposal can move forward. I don't have any concrete proposals since I'm still learning this area, but I would appreciate confirmation that the raised issues do need to be tackled. I'm also very happy to help work these out together :)
The proposal does not talk about WebGPU/WebNN interop
I believe that if we're to go forward with MLBuffer
, WebGPU interop needs to be considered from the start rather than assuming we can patch it on later. If I'm reading between the lines correctly here, the WebNN "timelines" mentioned above will have to be closely integrated with WebGPU timelines. Also, given that we'll be hooking into the internals of WebGPU, we need to play by WebGPU's rules, e.g. around buffer usage. I think we need to at minimum:
(1) define the usage of an MLBuffer at creation, and (2) specify how synchronization with WebGPU will work.

With regards to (1), let's look at a snippet from the example above:
// ...
wgpuDevice.queue.submit([command_encoder.finish()]);
// Export buffer from WebGPU
ml_buffer = ml_context.importExternalBuffer(gpu_buffer);
Presumably this code does not suggest that we are synchronously mapping/copying the GPUBuffer
into the MLBuffer
(or else some synchronization would be required in JS between these statements), but rather that the gpu_buffer
will be mapped/copied to ml_buffer
once gpu_buffer
's contents are ready to be accessed. So a contrived example to read the contents of the GPU buffer via an MLBuffer
might look like:
// `gpuBuffer` is used in some WebGPU work submitted here
wgpuDevice.queue.submit([commandEncoder.finish()]);
// Inform WebGPU to map/copy `gpuBuffer` to `mlBuffer` once
// `gpuBuffer`'s contents are ready to be accessed.
const mlBuffer = mlContext.importExternalBuffer(gpuBuffer);
// Queue this work behind the importExternalBuffer() call on a WebNN timeline.
// This implicitly awaits all WebGPU work involving `gpuBuffer`
const gpuBufferContentsCopiedToJsBuffer = await mlContext.readBuffer(mlBuffer);
Note that readBufferSync()
would be functionally equivalent to the much-discussed GPUBuffer.mapSync()
if the import doesn't require a copy, which is why I expect it to receive pushback :)
What's actually happening here? How can the user agent know whether it can map gpuBuffer
to mlBuffer
or whether it will need to make a copy? This operation should only be valid if:
- the GPUBuffer is read-only,
- the GPUBuffer is invalidated afterwards, or
- ...

Since the usages of a GPUBuffer are known, we may be able to do this. That being said, mapping a GPUBuffer to an MLBuffer will still require abiding by all of WebGPU's constraints - e.g. that these usage flags must be constant within a usage scope, and changing the state of a GPUBuffer will need to be scheduled on a queue timeline.
Let's think about the reverse scenario of WebNN -> WebGPU mapping:
// Inform WebNN to map/copy `mlBuffer` to `gpuBuffer` once
// `mlBuffer`'s contents are ready to be accessed
const gpuBuffer = wgpuDevice.importExternalBuffer(mlBuffer);
How can the user agent know whether it can map mlBuffer
to gpuBuffer
or whether it will need to make a copy?
When importing a GPUExternalTexture
, that texture is a snapshot which may not change. Presumably the imported MLBuffer
must be guaranteed to be read-only to be mapped to a WebGPU buffer, as well?
Buffer usage is always assumed on first access (ex. passed as outputs assumes output usage).
This does not seem feasible - especially if we expect the MLBuffer's memory to be allocated on buffer creation. For example, what's the implied usage here?
mlContext.dispatch(
graph,
/*inputs=*/{buffer: someMlBuffer},
/*outputs=*/{buffer: someMlBuffer},
);
Edge cases aside, let's look at an example of chained inference - the other use case for MLBuffer
:
const inputMlBuffer = mlContext.createBuffer({size: inputSize});
const intermediateMlBuffer = mlContext.createBuffer({size: intermediateSize});
const outputMlBuffer = mlContext.createBuffer({size: outputSize});
mlContext.writeBuffer(
inputMlBuffer,
/*dstOffset=*/0,
/*srcData=*/someJsArrayBuffer,
);
mlContext.dispatch(
graph,
/*inputs=*/{buffer: inputMlBuffer},
/*outputs=*/{buffer: intermediateMlBuffer},
);
// Feed the output of one execution as the input to the next. Chained inference!
mlContext.dispatch(
graph,
/*inputs=*/{buffer: intermediateMlBuffer},
/*outputs=*/{buffer: outputMlBuffer},
);
const resultBuffer = await mlContext.readBuffer(outputMlBuffer);
Seems great! Now, where exactly will these buffers be allocated?
This snippet from the WebGPU explainer gives us a hint that we can't both (1) allocate on creation and (2) not know the usage upfront - at least, not without sacrificing something (e.g. performance, extra copies):
The physical memory location for a GPUBuffer’s underlying buffer depends on whether it should be mappable and whether it is mappable for reading or writing
To make this concrete - the Chromium prototype's DML implementation allocates memory for an MLBuffer
in the GPU process using the D3D12_RESOURCE_STATE_UNORDERED_ACCESS
flag and with D3D12_HEAP_FLAG_NONE
. Depending on the usage (and the device architecture), this may be suboptimal. For example, on discrete GPUs, generally resources that can be mapped to the CPU will not perform well when used as D3D12_RESOURCE_STATE_UNORDERED_ACCESS
on the GPU. And if the MLBuffer
is to be used for mapping to CPU buffers, "upload" or "readback" heaps are more appropriate.
This proposal doesn't use the words "mapping", but what's being proposed here is effectively mapping for MLBuffers:
| Buffer type | From JS to *Buffer | From *Buffer to JS |
| --- | --- | --- |
| GPUBuffer | GPUQueue.writeBuffer() | GPUBuffer.mapAsync() |
| MLBuffer | MLContext.writeBuffer() | MLContext.readBuffer() |
It seems clear to me that we need to define usage of MLBuffer
at creation. How we define this mapping might be different if we're designing exclusively for WebNN <-> CPU (JS) interop vs. if we want to support mapping buffers between WebNN <-> WebGPU, which is why I think we should take WebGPU into account early on.
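For example, a hypothetical usage-at-creation shape (the MLBufferUsage flag names below are invented for illustration, loosely mirroring GPUBufferUsage) could let the implementation pick an appropriate heap up front:
// Hypothetical: usage declared at creation so the implementation can choose the heap.
const inputMlBuffer = mlContext.createBuffer({
  size: inputSize,
  usage: MLBufferUsage.WRITE_TO | MLBufferUsage.INPUT,   // CPU upload + graph input
});
const outputMlBuffer = mlContext.createBuffer({
  size: outputSize,
  usage: MLBufferUsage.READ_FROM | MLBufferUsage.OUTPUT, // graph output + CPU readback
});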
With regards to (2), let's take the real-time video processing use case as another example. Using MLBuffer
, this might look like:
const applyEffectToFrame = () => {
// Get the frame data as a GPU buffer
// Some way to import directly into an MLBuffer would avoid this step
const gpuExternalBuffer = device.importExternalBuffer({source: video});
// Get the frame data into WebNN. The imported buffer is read-only, so this should
// hopefully not require a copy if `mlContext` is tied to the same GPU as `gpuExternalBuffer`
const inputMlBuffer = mlContext.importExternalBuffer(gpuExternalBuffer);
const outputMlBuffer = mlContext.createBuffer({size: inputMlBuffer.size});
// Perform some effects described by `graph` on the frame (e.g. background blur)
const inputs = {buffer: inputMlBuffer};
const outputs = {buffer: outputMlBuffer};
mlContext.dispatch(graph, inputs, outputs);
// Inform WebNN to map/copy `outputMlBuffer` - which contains the resulting
// frame after effects have been applied - to `gpuBufferToRender` once
// `outputMlBuffer`'s contents are ready to be accessed
//
// To avoid a copy, `outputMlBuffer`'s contents must be guaranteed not to change
const gpuBufferToRender = wgpuDevice.importExternalBuffer(outputMlBuffer);
// create a bind group for `gpuBufferToRender`, create a command encoder, etc.
// asking WebGPU to render `gpuBufferToRender`
// ...
// These queued commands must block on completion of the `dispatch()` call above
wgpuDevice.queue.submit([commandEncoder.finish()]);
// Call this method for each frame
video.requestVideoFrameCallback(applyEffectToFrame);
}
Without any additional synchronization, the commands submitted to the GPUQueue
must block on completion of MLContext.dispatch()
. It seems that the GPUQueue
must either:
- wait for a signal from the MLContext that outputMlBuffer is available, or
- have insight into whichever queue/timeline the MLContext is running on?

My understanding is that https://github.com/webmachinelearning/webnn/issues/264 was attempting to specify the latter by describing WebNN execution in an MLCommandEncoder
to be submitted to a GPUQueue
. That would naturally queue the commandEncoder
's commands behind the WebNN workload, which would effectively handle the synchronization. But how would this work if the WebNN workload cannot be expressed in terms of GPU commands?
FYI, WebGPU already has a similar relationship with other web APIs such as video frames. See Importing External Textures.
My (limited) understanding of WebGPU's relationship with video frames is that the former behavior does not exist? Consider a basic rendering loop with WebGPU:
const render = () => {
// Get the frame data as a GPU buffer
const gpuExternalBuffer = device.importExternalBuffer({source: video});
// create a bind group for `gpuExternalBuffer`, create a command encoder,
// beginRenderPass, etc
// ...
// Queue a bunch of commands to the GPUQueue, which will eventually render to
// a WebGPU canvas
wgpuDevice.queue.submit([commandEncoder.finish()]);
// This method registers a callback which will be fired once a new frame is
// sent to the compositor
video.requestVideoFrameCallback(render);
}
The encoded GPU commands will eventually "update the rendering of a WebGPU canvas", which in turn calls these steps in the HTML spec, which in turn (eventually) runs the animation frame or video request frame callbacks... which triggers the render()
function and so forth. There are hooks into other APIs, but (as far as I'm aware) the GPUQueue
does not pause execution while waiting on other APIs.
I think we need more details as to how WebGPU and WebNN synchronization will work :)
I had a couple thoughts while looking through @a-sully's comment:
- MLContext.importExternalBuffer() should be an async method with semantics equivalent to mapAsync() + getMappedRange() + set(). That is, it should wait for WebGPU to be done with the buffer, map it into a context where it can be read by the device the MLContext is associated with, and then either copy it into a new MLBuffer or wrap it with an MLBuffer until it is destroyed or converted back to a GPUBuffer by a call to GPUDevice.importExternalBuffer() (equivalent to unmap()).
- GPUDevice.importExternalBuffer() is easier because, assuming that MLContext.dispatch() has similar semantics to MLContext.compute(), the buffer would be in some kind of "transferred" state and couldn't be passed back to WebGPU until WebNN is done with it.
- Given that an MLBuffer might be in non-CPU memory associated with an NPU (similar to how a GPUBuffer might be in GPU memory), I recommend not using the suggested readBuffer() and writeBuffer() methods but instead copying the mapAsync() and getMappedRange() design from WebGPU so that developers don't have to learn two different ways of interacting with external buffers.

Both WebGPU and WebGL have similar destroy methods. In neither case is a promise returned. When do you expect a WebNN developer would use this?
I'm thinking about scenarios where explicit memory management is needed. Say, for example, a developer wants to ensure the GPU memory used by WebNN has been deallocated before allocating another chunk of memory on the GPU (e.g. call WebGPU/WebGL immediately after they finish WebNN operations).
My question is whether WebNN needs to provide an explicit synchronization point mechanism to the developer.
Or do we expect GPU service / GPU drivers to handle this transparently? Could "queue for webnn memory for deallocation, allocate webgpu memory (could this OOM?), webnn memory deallocated" happen?
@a-sully Appreciate the questions and help, responses below.
@a-sully wrote
It seems clear to me that we need to define usage of MLBuffer at creation.
Good point to clarify. I think it's easiest to spec MLBuffer to have both input and output usage at creation. This is equivalent to GPUBuffer's resource usage bits (storage | storage_read), which gets re-created from MLBuffer's resource upon importExternalBuffer()
. Note: no mapping or copy is required to transfer/import a GPU resource from the same device.
@a-sully wrote
How can the user agent know whether it can map mlBuffer to gpuBuffer or whether it will need to make a copy?
An MLBuffer could be transferred/imported as-is without a copy if the User Agent determines the MLContext and GPUDevice support zero-copy (ex. same adapter).
@a-sully wrote
Seems great! Now, where exactly will these buffers be allocated?
The MLBuffer allocates its "default" device resource persistently from the device/context used to create it. The exact location is opaque to the WebNN developer since it's runtime-managed. Similarly, "upload" and "readback" heaps are allocated/managed from MLContext upon writeBuffer()
and readBuffer()
, respectively. This is what my full proof-of-concept does [1].
But how would this work if the WebNN workload cannot be expressed in terms of GPU commands?
In order to transfer MLBuffer, MLContext must be flushed prior, which occurs on importExternalBuffer()
. So in effect, WebGPU borrows the buffer WebNN rents out: the queue maintains exclusive access. Because this synchronization happens upon explicit resource use, there is no need to teach WebGPU about new commands. This is a similar approach to how video import works.
[1] https://chromium-review.googlesource.com/c/chromium/src/+/5101781
Bryan
But how would this work if the WebNN workload cannot be expressed in terms of GPU commands?
In order to transfer MLBuffer, MLContext must be flushed prior, which occurs on
importExternalBuffer()
. So in effect, WebGPU borrows the buffer WebNN rents out: the queue maintains exclusive access. Because this synchronization happens upon explicit resource use, there is no need to teach WebGPU about new commands. This is a similar approach to how video import works.
Just to clarify, are you suggesting the synchronization strategy is to pause work from one API while the other API is using the buffer? e.g. WebNN must pause all work (from the MLContext
the MLBuffer
was created from, at least) while WebGPU holds the buffer, to ensure it does not change out from under WebGPU (which would violate WebGPU's expectations).
What about the reverse? This is where WebNN is different from the video import example, as far as I can tell. Do we expect WebGPU to block all execution while a buffer is rented out to WebNN?
This is related to the point I was trying to make in this comment (though in that case WebNN is renting to WebGPU):
How can the user agent know whether it can map mlBuffer to gpuBuffer or whether it will need to make a copy?
The user agent needs to know whether mlBuffer
will be modified while WebGPU holds gpuBuffer
; otherwise a copy would have to be made... But of course that's only relevant if WebNN is allowed to do work while WebGPU holds the buffer :)
Just to clarify, are you suggesting the synchronization strategy is to pause work from one API while the other API is using the buffer?
Yup. MLBuffer
cannot be written to from multiple APIs simultaneously. We could allow simultaneous read access for MLBuffer
. WebNN must ensure any enqueued writes are completed prior to import. However, WebNN could continue work on another MLBuffer
that is not imported.
Do we expect WebGPU to block all execution while a buffer is rented out to WebNN?
WebGPU doesn't rent-out MLBuffer
, only WebNN does that. Once WebGPU has finished using the GPUBuffer it created from the import, it could throw it away or export it back. The difference between WebNN and video is that MLBuffer
is restored instead of recycled.
@reillyeon thanks for the comments.
@reillyeon wrote
Given that an MLBuffer might be in non-CPU memory associated with an NPU (similar to how a GPUBuffer might be in GPU memory) I recommend not using the suggested readBuffer() and writeBuffer() methods but instead copying the mapAsync() and getMappedRange() design from WebGPU so that developers don't have to learn two different ways of interacting with external buffers.
MLContext.importExternalBuffer
would be more semantically equivalent to GPUDevice.importExternalTexture
, whereas GPUBuffer.mapAsync
+ GPUBuffer.getMappedRange
is semantically equivalent to MLContext.readBuffer
and MLContext.writeBuffer
. FYI, this is why WebGPU has GPUQueue.writeBuffer.
WebGPU has a concept of CPU-accessible device memory (eg. staging buffer) which relies on copyBufferToBuffer
to get this data into a GPU-accessible buffer. The WebNN context is more like a WebGL context here, which has glReadPixels
, because otherwise it's not obvious when CPU data is applied w/o explicit queue operations.
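To make the comparison concrete: reading back a storage buffer in WebGPU today requires an explicit staging copy on the queue, whereas the proposal folds that staging step into one context method. In this sketch, storageBuffer, size, mlBuffer and mlContext are assumed to exist; the WebGPU calls are the existing API.
// WebGPU: copy into a MAP_READ staging buffer via the queue, then map it.
const staging = wgpuDevice.createBuffer({
  size,
  usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
});
const encoder = wgpuDevice.createCommandEncoder();
encoder.copyBufferToBuffer(storageBuffer, 0, staging, 0, size);
wgpuDevice.queue.submit([encoder.finish()]);
await staging.mapAsync(GPUMapMode.READ);
const gpuResult = staging.getMappedRange();
// Proposed WebNN equivalent: the context performs the staging copy internally.
const mlResult = await mlContext.readBuffer(mlBuffer);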
@wacky6 appreciate the comments, responses below
@wacky6 wrote
My question is that whether WebNN need to provide an explicit synchronization point mechanism to the developer.
If the web developer forgets to synchronize MLBuffer
, we can wind up with undefined behavior. I prefer we synchronize on their behalf to avoid such calamity.
@wacky6 wrote
Or do we expect GPU service / GPU drivers to handle this transparently?
Yup. I would expect WebNN, like WebGPU, to manage resources on the web developer's behalf (or by the GPU service).
@wacky6 wrote
Could "queue for webnn memory for deallocation, allocate webgpu memory (could this OOM?), webnn memory deallocated" happen?
Yes, it could. WebNN memory would get deallocated, say upon calling MLBuffer.destroy(), before WebGPU executes the submitted work using it, after importExternalTexture(). Internally, MLBuffer provides that synchronization point (i.e. a fence) to ensure the GPU wait is satisfied first. If you're curious to see how this works, see https://dawn-review.googlesource.com/c/dawn/+/171160.
[@wacky6] My question is that whether WebNN need to provide an explicit synchronization point mechanism to the developer. [@bbernhar] If the web developer forgets to synchronize MLBuffer, we can wind up with undefined behavior. I prefer we synchronize on their behalf to avoid such calamity.
+1 to @bbernhar's reply.
I think that having an explicit transfer (via import APIs) between WebGPU and WebNN should be enough to satisfy our requirements. What we spec should clearly state that only one API should have access to the buffer at a time. While an MLBuffer is transferred to WebGPU, queueing new work to it via WebNN should be an error. Similarly, while an MLBuffer has been transferred back to WebNN, queuing new work to it via WebGPU should be an error.
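A sketch of that ownership rule (method names follow the discussion above; the exact error behavior is TBD):
// MLBuffer is now owned by WebGPU; using it from WebNN should fail.
const gpuBuffer = wgpuDevice.importExternalBuffer(mlBuffer);
mlContext.dispatch(graph, {input: mlBuffer}, {output: otherMlBuffer}); // error: buffer is transferred
// ... record and submit WebGPU work that uses gpuBuffer ...
// Ownership returns to WebNN; further WebGPU use should fail.
mlContext.importExternalBuffer(gpuBuffer);
wgpuDevice.queue.writeBuffer(gpuBuffer, 0, someData); // error: buffer was transferred back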
Note that for scenarios where developers want to source input from 2D raster-type data (image elements, media, camera, canvas elements, image bitmaps, etc.) there will already need to be a conversion (tensorization) step from the raster image to a buffer for ingestion into WebNN. You can't really "zero copy optimize" this transfer operation. The web developer must tell the browser (via a WebGPU compute shader) how they'd like the raster image data to be arranged in the buffer. The same thing is true if you want to visualize the result of the ML operation via a 2D image. You'll need to import the buffer to WebGPU so that you can convert it to a GPUTexture
via a WebGPU compute shader.
[@wacky6] Or do we expect GPU service / GPU drivers to handle this transparently? Could "queue for webnn memory for deallocation, allocate webgpu memory (could this OOM?), webnn memory deallocated" happen?
In the current proposal, MLBuffer destruction happens when you call the destroy method on the object. WebGPU has a similar setup where GPUBuffer destruction is done via a destroy method on GPUBuffer instead of queuing such an operation. See buffer destruction. Once you destroy a buffer, you can not queue new operations with it. Any inflight operations using the buffer complete before storage for the buffer is released. The browser is responsible for ensuring everything happens in a defined manner with no crashes or use-after-frees.
Similar to how video textures work in WebGPU, calling GPUBuffer.destroy
on the WebGPU side should not release memory for the MLBuffer. The memory is owned by WebNN, same as video textures are owned by the media engine.
[@reillyeon] Given that an MLBuffer might be in non-CPU memory associated with an NPU (similar to how a GPUBuffer might be in GPU memory) I recommend not using the suggested readBuffer() and writeBuffer() methods but instead copying the mapAsync() and getMappedRange() design from WebGPU so that developers don't have to learn two different ways of interacting with external buffers.
As @bbernhar pointed out, WebGPU already has a writeBuffer API which has the same parameters as this proposal. For WebNN, readBuffer
is different, but simpler. If people feel strongly that we replace readBuffer
with exactly the same API shape as WebGPU, I would be OK with that.
[@reillyeon] MLContext.importExternalBuffer() should be an async method with semantics equivalent to mapAsync() + getMappedRange() + set(). That is, it should wait for WebGPU to be done with the buffer, map it into a context where it can be read by the device the MLContext is associated with, and then either copied into a new MLBuffer or wrapped by an MLBuffer until it is destroyed or converted back to a GPUBuffer by a call to GPUDevice.importExternalBuffer() (equivalent to unmap()).
MLContext.importExternalBuffer
is asynchronous in the sense that calling it enqueues work in another process which runs on the device (NPU or GPU) timeline. The implementation will ensure that inflight WebGPU writes to the buffer finish before WebNN reads the buffer. GPUDevice.importExternalBuffer
, likewise, will ensure that all WebNN writes to the buffer finish before any WebGPU reads of the buffer. Unless I am missing something, I do not think the above requires that either import method be a promise-returning one that causes the JS thread to wait.
The only time the JS thread will need to wait is when you call readBuffer
to ask for the contents of the buffer on the CPU. Since the device could be busy with the buffer, readBuffer
must return a promise to avoid jank.
Thank you for your patience while I learn enough about the WebGPU primitives here to be able to appreciate the complexity behind this proposal. Overall I think this is the right direction. At this point my feedback is mainly editorial. I'd like to see as much symmetry with the WebGPU APIs as possible, which mainly means removing the readBuffer()
/readBufferSync()
methods and copying WebGPU's mapping concept to support reading buffer contents from script instead. I better understand now the need for writeBuffer()
.
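Concretely, the symmetry being asked for might look something like this (mapAsync/getMappedRange/unmap on MLBuffer are hypothetical, mirroring GPUBuffer; readBuffer is the shape currently in this proposal):
// Hypothetical mapping-style read, mirroring GPUBuffer:
await mlBuffer.mapAsync(/* read */);
const view = new Float32Array(mlBuffer.getMappedRange());
// ... consume results ...
mlBuffer.unmap();
// Shape currently proposed in this issue:
const resultArrayBuffer = await mlContext.readBuffer(mlBuffer);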
I think the semantics discussed above around serializing WebNN and WebGPU operations which touch the same buffer make sense. To actually specify them I think we need to update the specification to be significantly more specific about timelines and introduce the pattern common to the WebGPU specification of providing "content timeline" and "device timeline" steps. Something I don't think the WebGPU specification has had to deal with yet is the potential for multiple interacting device timelines, as we would see in the case of WebGPU interacting with a non-GPU MLContext
. I think this is all plausible with the appropriate use of fences. It just needs to be specified very explicitly, though we may need an MLQueue
concept to match the GPUQueue
concept. I'm still unsure about that part.
Since the importExternalBuffer()
methods assume that the buffer was originally created as an MLBuffer
, I think it would help if they didn't look so symmetrical. For example, the functionality could be centralized in the MLBuffer
or MLContext
interfaces as a pair of mapAsGpuBuffer()
and unmapFromGpuBuffer()
methods.
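For example, the asymmetric shape suggested here might read as follows (a sketch; whether these methods return promises and whether they live on MLBuffer or MLContext is still open):
// WebNN lends the buffer to WebGPU...
const gpuBuffer = await mlBuffer.mapAsGpuBuffer(wgpuDevice);
// ... record and submit WebGPU work that uses gpuBuffer ...
// ...and takes it back; gpuBuffer becomes invalid afterwards.
mlBuffer.unmapFromGpuBuffer();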
Thank you @bbernhar, @RafaelCintron and many others for the proposal and deep thought on this issue. As I mentioned in the WG call this morning, although conceptually this proposal could work with any device buffer, a currently motivated use case is the WebNN/WebGPU interop scenario. The previous proposal (i.e. MLCommandEncoder
) was thought to be non-implementable on the WebGPU side, so I think one way to make this proposal more concrete is to demonstrate through a small working prototype that it can actually be implemented.
@reillyeon Thanks for the feedback. Overall, the comments make sense to me; in particular, the lack of MLQueue
. This shaped my motivation for MLContext.readBuffer
.
@reillyeon wrote
I'd like to see as much symmetry with the WebGPU APIs as possible, which mainly means removing the readBuffer()/readBufferSync() methods and copying WebGPU's mapping concept to support reading buffer contents from script instead. I better understand now the need for writeBuffer().
WebGPU's buffer mapping relies on copyBufferToBuffer (recorded into a command encoder and submitted to a GPUQueue) to work. Since WebNN lacks its own GPUQueue-equivalent type, MLBuffer.mapAsync would need to invoke an internal GPU copy using the MLContext
before/after dispatch begins/ends. This is a different operation than GPUBuffer.mapAsync
which was meant for CPU reads/writes. The WebGPU developer could not interact with WebNN's mapping API the same way they did in WebGPU.
FYI: I just posted https://github.com/webmachinelearning/webnn/pull/541, which continues this discussion in the form of (heavily annotated) sample code for the key use cases we've identified here.
Please take a look when you get a chance!
Thanks, @a-sully. Will do. I agree with you that this issue is quite large; esp. with interop in view.
So, I propose we split #482 into 4 separate sub-issues, where we can more easily discuss and reach consensus on the individual parts but in a logical order:
- creation (MLContext.createBuffer)
- ...

I would be happy to initiate / open these issues if there are no objections.
SGTM. Thanks @bbernhar!
Closing: this issue has been replaced by smaller sub-issues, which I encourage we use for discussion instead. https://github.com/webmachinelearning/webnn/issues?q=is%3Aissue+is%3Aopen+in%3Atitle+MLBuffer+
This issue proposes a new opaque device-specific storage type in WebNN, MLBuffer. MLBuffer is a backend-agnostic storage type (CPU, GPU, NPU, etc.) which can be used in WebNN operations. MLBuffer would be the solution to:
- WebGPU interop (sharing tensor data on-device between WebNN and WebGPU)
- Chained inference (re-using GPU or NPU results between compute calls without copying back to the CPU)

Construction/Destruction
- The size of the MLBuffer is always known (and linear access is assumed).

Upload/Download tensor data
- A copy of srcData is always made, returning control back to the web developer immediately.

Binding to graphs
- Buffer usage is always assumed on first access (ex. passed as outputs assumes output usage).

Edits:
- Use dispatch instead of overloading compute(), per https://www.w3.org/2023/12/14-webmachinelearning-minutes.html