webmachinelearning / webnn

🧠 Web Neural Network API
https://www.w3.org/TR/webnn/

Should WebNN support async APIs? #230

Closed: huningxin closed this issue 2 years ago

huningxin commented 2 years ago

As mentioned in https://github.com/webmachinelearning/webnn/issues/229, the existing WebNN graph building (MLGraphBuilder.build) and execution (MLGraph.compute) APIs are synchronous. This is required by the backend implementations of Wasm (C++) based ML frameworks. To avoid blocking the main thread, good practice is to call these synchronous APIs in a worker context.

There are JavaScript-based ML frameworks, like TensorFlow.js, that are mainly used on the main thread. Should WebNN support async APIs on the main thread? This would help not only the JS ML frameworks but also broader JS adoption of the API.

/cc @pyu10055
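For illustration, a minimal sketch of that worker pattern, using the same sketch-level (not exact spec) syntax as the other examples in this thread; the operand shapes and names are placeholders:

// inference-worker.js — the sync WebNN calls block only this worker, never the UI thread.
const context = navigator.ml.createContext();
const builder = new MLGraphBuilder(context);
const x = builder.input('x', { type: 'float32', dimensions: [1, 4] });
const graph = builder.build(builder.relu(x));        // sync graph build

onmessage = (e) => {
  const output = new Float32Array(4);
  graph.compute({ x: e.data.x }, { output });        // sync compute; the result is ready on return
  postMessage({ output });
};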

bbernhar commented 2 years ago

@huningxin

I can help fill in the GPU story a bit.

GPUDevice does not require WebNN compute() to be sync. WebGPU schedules work on the GPU/queue timeline, which is async by design. WebNN's behavior here is non-normative; the spec should clarify that it shares ownership of WebGPU's (implicit) queue, so calling compute() puts the work back on the queue timeline, which never blocks the main thread. This will be important for interop, because we cannot have compute() work being scheduled alongside WebGPU work on two different timelines.

huningxin commented 2 years ago

Thanks @bbernhar for clarifying the GPU story. I agree this is important for WebNN-WebGPU interop and we need to improve this part.

GPUDevice does not require WebNN compute() to be sync.

I suppose you mean that if an MLContext is created from a GPUDevice, the sync compute() would not block the main thread, because its actual work is scheduled asynchronously on the GPU/queue timeline. Correct?

I think this aligns with what @RafaelCintron shared in the WebML WG Teleconference – 18 Nov 2021:

RafaelCintron: WebGPU and WebGL are technically async APIs because you submit it on the CPU which then queues it to the GPU … when WebNN is used in the GPU, the commands are executed in parallel from a CPU perspective, so essentially async - we should block on the CPU until the inference is done

This is true if the inputs and outputs are GPU resources, such as GPUBuffer, because developers can use GPUBuffer.mapAsync to read back the results without blocking the main thread.

However, I am not sure about the case where the inputs and outputs are CPU resources. According to WebNN device selection, an MLContext created from a GPUDevice also accepts ArrayBuffer. In that case, it looks like the main thread would block in compute() until the results are read back from the GPU to the CPU. /cc @wchao1115
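For reference, the non-blocking readback path mentioned above looks roughly like this with standard WebGPU calls (outputGpuBuffer and outputByteLength are placeholders for the graph's GPU output):

// Copy the GPU result into a mappable staging buffer, then map it asynchronously.
const staging = gpuDevice.createBuffer({
  size: outputByteLength,
  usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
});
const encoder = gpuDevice.createCommandEncoder();
encoder.copyBufferToBuffer(outputGpuBuffer, 0, staging, 0, outputByteLength);
gpuDevice.queue.submit([encoder.finish()]);

await staging.mapAsync(GPUMapMode.READ);   // resolves once the GPU work completes; main thread stays free
const result = new Float32Array(staging.getMappedRange().slice(0));
staging.unmap();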

bbernhar commented 2 years ago

@huningxin

Correct?

Sounds right to me!

This is true if the inputs and outputs are GPU resources, such as GPUBuffer

A GPUBuffer can also be CPU-visible (for readback) and still be a GPU resource. What matters is that compute() stays async until the CPU reads the GPU resource (e.g. map/copy). You can upload (tensor) data to the GPU using only the CPU timeline, so long as the GPU is not waiting to execute with it (or "in flight" in GPU lingo).

Hope this helps.
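For example, a CPU-timeline upload with standard WebGPU (inputData is a placeholder typed array):

// Queue a copy of tensor data into a GPU buffer; no mapping or blocking readback is involved.
const inputBuffer = gpuDevice.createBuffer({
  size: inputData.byteLength,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
});
gpuDevice.queue.writeBuffer(inputBuffer, 0, inputData);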

pyu10055 commented 2 years ago

@huningxin @bbernhar For most models, sync execution is sufficient. But models with control flow require reading back intermediate tensors to determine the subsequent execution plan, which means the JS main thread would be blocked on GPU resources and computation. Having an async compute would allow those types of models to be executed efficiently.

bbernhar commented 2 years ago

@pyu10055 If we want to use WebGPU, I don't think sync model execution will be sufficient. Even if the CPU and GPU are serialized, that doesn't prevent WebGPU and WebNN work from being executed in the wrong order on the GPU. =(

pyu10055 commented 2 years ago

@bbernhar can you elaborate on that? Why would the GPU execute the graph in the wrong order?

bbernhar commented 2 years ago

If WebNN and WebGPU share the same GPUDevice (= the same GPU queue), then GPU-side synchronization is only guaranteed by submission order. WebNN must submit its work BEFORE WebGPU does if the WebGPU work depends on the graph execution.

wchao1115 commented 2 years ago

@huningxin Yes, WebNN compute is a blocking call the way it is defined today. For the GPU device, compute should record all the dispatches into the command list, execute the queue, and wait until it finishes.

@bbernhar I think the main issue here is that the compute call operates on two different timelines depending on whether the device is a CPU or a GPU. On the GPU device, ideally, you would want to give the caller all the flexibility and control: compute should really accept only the command list, into which WebNN just records all the dispatches without executing them implicitly. This way the compute call would be a sync call, with the eventual execution happening on the GPU timeline, not on the UI thread, and triggered by the caller explicitly. This, I think, would be the friendliest way to interop with WebGPU.

However, on the CPU device there is no such concept as a command buffer, and the execution can happen on any timeline depending on the calling thread.

bbernhar commented 2 years ago

@wchao1115 Yup, that sounds right to me. @huningxin FYI. =)

pyu10055 commented 2 years ago

Thank you @wchao1115 @bbernhar for the explanation. It seems everyone agrees that there will be an async API to access results from the GPU. My point is that the compute API can be just a sync call for most serializable models, given that it does not perform the actual computation; but for models that require reading back intermediate results while the commands are being collected, a sync compute method would not work. This is how TensorFlow.js currently utilizes the GPU: for models without control flow ops, compute() is a sync method; otherwise users need to use an async method.

wchao1115 commented 2 years ago

@pyu10055 I think there is some confusion here. The compute call as defined today is a blocking call that performs the actual computation.

Regarding WebNN's ability to interop well with WebGPU, my suggestion is that compute remain a sync call but that we alter the API contract so that it only records the command dispatches without actually executing the command queue, when the given device is a GPU device shared with the WebGPU context. This way the execution of the command queue and the order of submission can be controlled by the caller of both WebGPU and WebNN.

However, that change would have an impact on the CPU device case, and that is the part I think we should still think through.

pyu10055 commented 2 years ago

@wchao1115 Got it. In TensorFlow.js, we unify the APIs for CPU/GPU devices by separating the compute and data access APIs. Data access is an async call: for the CPU it returns an already-resolved promise, while for the GPU it returns a promise that uses a fence or polling to wait for the GPU. The complexity on the GPU side is that, when constructing the GPU command set depends on intermediate results, compute() cannot be sync.
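For reference, that TensorFlow.js split looks roughly like this (model and inputTensor are placeholders):

// Scheduling is synchronous from the JS point of view; reading the result back is async.
const outputTensor = model.predict(inputTensor);   // returns a tensor handle without waiting for the GPU
const values = await outputTensor.data();          // async readback; resolves immediately on the CPU backend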

huningxin commented 2 years ago

@wchao1115

Regarding WebNN's ability to interop well with WebGPU, my suggestion is that compute remain a sync call but that we alter the API contract so that it only records the command dispatches without actually executing the command queue, when the given device is a GPU device shared with the WebGPU context. This way the execution of the command queue and the order of submission can be controlled by the caller of both WebGPU and WebNN.

+1

However, that change would have an impact on the CPU device case, and that is the part I think we should still think through.

Probably we could define new Context and Graph types, e.g. MLContextForGPU / MLGraphForGPU (or better names), for when the given device is a GPU device, and use that API contract for them. MLGraphForGPU::compute would be sync and only accept GPU resources as inputs and outputs.

The generic MLGraph::compute that accepts ArrayBufferView as inputs and outputs could be changed to async for main-thread usage, with the sync version limited to worker usage as discussed in #229.
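One possible shape of that split, with a hypothetical computeAsync name used purely for illustration (inputArray and outputSize are placeholders):

// Main thread: an async variant keeps the UI responsive.
const outputs = { output: new Float32Array(outputSize) };
await graph.computeAsync({ input: inputArray }, outputs);   // hypothetical async variant

// Worker: the existing sync variant stays available, since it only blocks the worker.
graph.compute({ input: inputArray }, outputs);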

RafaelCintron commented 2 years ago

[huningxin] Probably we could define new Context and Graph types, e.g. MLContextForGPU / MLGraphForGPU (or better names), for when the given device is a GPU device, and use that API contract for them. MLGraphForGPU::compute would be sync and only accept GPU resources as inputs and outputs.

I agree with this proposal; it is what we've discussed in previous meetings.

[wchao1115] .... my suggestion is that compute remain a sync call but that we alter the API contract so that it only records the command dispatches without actually executing the command queue, when the given device is a GPU device shared with the WebGPU context. This way the execution of the command queue and the order of submission can be controlled by the caller of both WebGPU and WebNN.

As currently specced, WebNN's compute API would need to submit work to the WebGPU default queue, which I admit I haven't felt 100% comfortable with. If WebGPU adds multi-queue support in the future, it will be even more confusing when and where work is happening.

To alleviate the confusion, we can either:

wchao1115 commented 2 years ago

I don't think we need to fork the MLContext type. MLContext is already polymorphic depending on how it is created. If it is created via a GPUDevice it is a context for the WebGPU device.

We can also achieve what we want here by just extending the MLGraph interface instead. We can keep the current compute method but limit it to only accept ArrayBufferView. Calling compute would then trigger the whole graph execution on the CPU. This also means that calling compute on a WebGPU context would fail with an unsupported exception.

An additional sync method, record, can be defined on the MLGraph interface, which takes a GPUComputePassEncoder as a param. Likewise, the call will fail unless the graph was previously created from a WebGPU context. By passing in the encoder, the caller asks the graph to record all the necessary dispatches into the command buffer associated with the encoder. In this scenario, WebNN is therefore not involved in the actual execution of the command buffer. The caller is free to select which queue it wants to submit the command buffer to, and to explicitly control the order of execution it wants the WebNN dispatches to follow relative to other payloads within the same queue or in other queues. Obviously, this method should only accept GPUBuffer as input and output buffers.

It might be easier to see this in code. I'll put together a PR that implements this change for reviews.
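Ahead of that PR, a rough sketch of how the proposed record method could sit alongside ordinary WebGPU encoding (record and its buffer-binding arguments reflect the proposal above and are not shipped API; the WebGPU calls are standard):

const encoder = gpuDevice.createCommandEncoder();
const pass = encoder.beginComputePass();
// Proposed: record the graph's dispatches into the pass; nothing executes here.
graph.record(pass, { input: inputGpuBuffer }, { output: outputGpuBuffer });
pass.end();
// The caller chooses when, and on which queue, the recorded work actually runs.
gpuDevice.queue.submit([encoder.finish()]);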

anssiko commented 2 years ago

Looking at this from both a developer-ergonomics and a future-proofing perspective, and setting my chair hat aside, this suggestion from @wchao1115 strongly resonates with me:

I don't think we need to fork the MLContext type.

Forking may be an easy solution now but may bite us back later.

Also, thanks for helping put together a PR. Please consider https://github.com/webmachinelearning/webnn/pull/250, which adds a [[contextType]] internal slot to MLContext. This internal state may be useful in other places where decisions need to be made based on how the context was created.

bbernhar commented 2 years ago

@wchao1115 ML work can't be encoded into compute passes because it cannot be dispatched, only submitted through the shared queue. A new "ML pass" type would be needed to encode ML work, or, more simply, we could pass/get the queue to record into.

wchao1115 commented 2 years ago

@bbernhar The key idea here is to separate the act of recording the ML GPU work into the command buffer from executing the commands in the queue. How would you suggest we define this behavior for the WebGPU context? i.e. what is the right WebGPU "currency" to pass into the proposed record method if the goal is to only record the dispatches and not execute them?

bbernhar commented 2 years ago

@wchao1115

Why not treat MLGraph like a D3D11-style immediate context? MLGraph.record(queue) just records the ML commands into its (internal) command buffer but does not execute them until GPUQueue.submit() is called.

Using passes just allows for finer-grained scheduling (interleaving 3D and ML work), but that requires non-trivial changes to WebGPU.
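A sketch of that immediate-context analogy, with MLGraph.record as the proposed (not shipped) method and webgpuCommands as a placeholder for other command buffers on the shared queue:

// Record the ML commands into the graph's internal command buffer; nothing runs yet.
graph.record(gpuDevice.queue);

// GPU-side ordering on the single shared queue follows submission order, so the
// recorded ML work must reach the queue before any WebGPU work that consumes its output.
gpuDevice.queue.submit([webgpuCommands]);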

wchao1115 commented 2 years ago

Thanks. That works.

huningxin commented 2 years ago

@wchao1115

We can also achieve what we want here by just extending the MLGraph interface instead. We can keep the current compute method but limit it to only accept ArrayBufferView.

+1 to limiting the compute method to only accept ArrayBufferView. Since it performs the actual computation, coming back to this issue, we probably need to introduce an async version that works on the main thread.

This also means that calling compute on a WebGPU context would fail with an unsupported exception. An additional sync method, record, can be defined on the MLGraph interface, which takes a GPUComputePassEncoder as a param. Likewise, the call will fail unless the graph was previously created from a WebGPU context.

If the MLGraph created from a WebGPU context doesn't support compute, it probably makes sense to introduce a separate MLGraphForGPU that only has the record method. This would help developers avoid calling the wrong method on a type of graph that doesn't support it.

If we let the MLGraph created from a WebGPU context still support ArrayBufferView as in the current spec, we can make MLGraphForGPU inherit from MLGraph and add the record method to the sub-interface.

huningxin commented 2 years ago

@bbernhar

Why not treat MLGraph like a D3D11-style immediate context? MLGraph.record(queue) just records the ML commands into its (internal) command buffer but does not execute them until GPUQueue.submit() is called.

It sounds good.

Using passes just allows for finer-grained scheduling (interleaving 3D and ML work), but that requires non-trivial changes to WebGPU.

It looks like we could support them (MLGraph.record(queue) and the ML pass) one by one for different usages.

I am just wondering whether the first one (MLGraph.record(queue)) is capable of supporting the full GPU-only video processing pipeline use case. Any insights?

wchao1115 commented 2 years ago

If the MLGraph created from a WebGPU context doesn't support compute, it probably makes sense to introduce a separate MLGraphForGPU

If you fork the graph type, then both the graph builder type and the context that creates them will have to fork too. This will create two parallel tracks of interface hierarchy that are largely similar but not really the same. It could be very confusing.

One way to reduce the number of type specializations is to decouple the type hierarchy where it matters. In this case we can keep everything else polymorphic, but instead introduce a separate "graph executor" notion (e.g. MLGraphExecutor) that can be specialized into, e.g., MLGraphGPUExecutor and MLGraphCPUExecutor respectively. An accidental mismatch between the kind of graph and the kind of executor used to run it would then result in an unsupported exception. Does that address your concern?

huningxin commented 2 years ago

If you fork the graph type, then both the graph builder type and the context that creates them will have to fork too.

Although forking the graph builder type is not a good solution, we probably should consider whether to specialize MLGraphBuilder::constant, which also takes both ArrayBufferView and GPU resources today. Should we limit it to only accept GPUBuffer when the builder is created from a GPU context? And in that case, should we consider supplying the queue to MLGraphBuilder::build, because it would also record the graph initialization commands?

One way to reduce the number of type specializations is to decouple the type hierarchy where it matters.

It makes sense.

In this case we can keep everything else polymorphic, but instead introduce a separate "graph executor" notion (e.g. MLGraphExecutor) that can be specialized into, e.g., MLGraphGPUExecutor and MLGraphCPUExecutor respectively. An accidental mismatch between the kind of graph and the kind of executor used to run it would then result in an unsupported exception. Does that address your concern?

It does. However, because it essentially depends on the type of the context, should we instead consider specializing the context types and grouping the context-dependent "graph execution / recording" accordingly, e.g., MLContext::compute and MLContextForGPU::record?

bbernhar commented 2 years ago

@huningxin

Resource sharing through a shared queue is the preferred path for GPU interop (zero copy or GPU-only processing). But there is a limitation in WebGPU: we need exportExternalTexture (NOT import) to cross non-WebGPU component boundaries. Still, that's easier to justify than an "ML pass" that cannot be natively implemented by WebGPU. =)

wchao1115 commented 2 years ago

@huningxin, maybe it's easier to show it in the code.

(Not exactly WebNN syntax; some details are omitted here for simplicity)

const context = ml.createContext(gpuDevice);
const builder = new MLGraphBuilder(context);
const conv2d = builder.conv2d(...);
const graph = builder.build(conv2d);
// Record the ML workload to a given queue, so it can be interleaved with the rest of other WebGPU workload
const gpuExecutor = new MLGPUExecutor(context);
gpuExecutor.record(graph, gpuQueue);

The only time an interface specialization is needed here is when a graph executor is needed to execute the payload in the graph. I prefer not to specialize the context type for this, because there may be methods we want to add to the context interface in the future that make sense for all types of context. Specializing it now could potentially fragment the interface over time.

The specialization of the executor is more future-proof, since it is the one point in the API call pattern we know of right now where callers do need to know what they want to do next, what kind of context they operate on, and what threading model they are bound to. This gives them that flexibility without polluting the rest of the API calling pattern.

For example, if a caller wants to use WebNN just to construct a graph out of a given context, regardless of what kind of context they may be given, they can do that without having to know the difference between the context types.

huningxin commented 2 years ago

@wchao1115 , thanks for the code example, it really helps.

The only time an interface specialization is needed here is when a graph executor is needed to execute the payload in the graph.

Should the graph initialization be specialized as well? If the graph builder is created from a GPU context, it may accept GPU resources as constants and initialize the graph with them. Should the graph initialization commands be recorded into the GPU queue? Should this be decoupled into a "graph initializer"?

wchao1115 commented 2 years ago

Graph constants such as weights are normally uploaded from CPU memory even for the GPU context. Initializers actually need to be treated like graph inputs that are bound at execution time. I think the current support for GPU resource view constants is probably not needed.

huningxin commented 2 years ago

Graph constants such as weights are normally uploaded from CPU memory even for the GPU context.

It's true.

However, to work with the WebGPU implementation of a framework, like the TensorFlow.js WebGPU backend, tensor data may already have been uploaded to a GPUBuffer. If WebNN doesn't support using a GPUBuffer as a graph constant, the caller has to read the data back to a CPU buffer and upload it again to the WebNN graph for the GPU context. This is not an ideal situation.

Should we decouple the graph initialization from the graph build? MLGraphBuilder::constant would then be just a placeholder without a bound buffer, and we would add MLGPUExecutor::init to bind GPU buffers to graph constants before execution.

const context = ml.createContext(gpuDevice);
const builder = new MLGraphBuilder(context);
const filter = builder.constant();
const conv2d = builder.conv2d(filter);
const graph = builder.build(conv2d);
// Record the ML workload to a given queue, so it can be interleaved with the rest of other WebGPU workload
const gpuExecutor = new MLGPUExecutor(context);
gpuExecutor.init(graph, constantGpuBuffers, gpuQueue);
gpuExecutor.record(graph, inputGpuBuffers, outputGpuBuffers, gpuQueue);

The downside is that MLGraph now has two states, "uninitialized" and "initialized". Executing an uninitialized graph would throw an exception.

wchao1115 commented 2 years ago

There isn't a need to define an init call here. If the weights are already processed and uploaded to a GPU resource, they should be bound as a named input to the graph. That's the initializer semantic -- the weight resource is fed to the graph just like any other input. From the GPU standpoint, resource binding is part of a command dispatch. There is no need to treat them separately.

huningxin commented 2 years ago

The weight resource might be bound once and owned by the runtime, e.g., by setting the DML_TENSOR_FLAG_OWNED_BY_DML flag for the DirectML backend. The doc says:

When this flag is set on a particular tensor description, the corresponding tensor must be bound to the binding table during operator initialization, and not during execution.

Did I read it correctly?

anssiko commented 2 years ago

I suppose this is fixed by #257.

anssiko commented 2 years ago

We agreed to keep this issue open until https://github.com/webmachinelearning/webnn/issues/263 is addressed to retain context.

wchao1115 commented 2 years ago

#263 has been resolved and closed. @anssiko I believe this issue can now be closed.

anssiko commented 2 years ago

Closing with a note that async context creation is discussed in its own issue #272.