Worker threads for background pipeline compilation

kainino0x commented 1 year ago

Without worker threads, "async" pipeline compilation can't really be async. It's also been discussed that shader compilation and non-async pipeline compilation could take advantage of worker threads as well.

If implementations want worker threads, should they start them internally, or should there be something in the API where the user launches their own thread and then starts the webgpu worker in it?

Having the user launch threads presents a few issues:

In remoting implementations, the threads need to be on the service end, not the client end. So that means you need some library-specific insight into the remoting to do it.
Similarly in WASM, it won't make any sense.

So I think it should be a per-implementation decision (for example Dawn might launch its own workers, with some platform hooks that control how it does that, whereas wgpu might make the user launch them, and Emscripten won't have any such option).

grovesNL commented 1 year ago

In wgpu we've also talked about the possibility of allowing people to use a thread pool they provide, so it might be nice to have an API that accepts a reference to an entire thread pool and can attach to worker threads as needed.

Another interesting use case of worker threads could be to allow implementations to (optionally) automatically poll/tick in the background.

Kangz commented 1 year ago

Side-note: We should start a docs/ subfolder with MD files that describe semantics of the API like this.

kainino0x commented 1 year ago

We talked about this at length in today's webgpu.h meeting. There was no clear solution but what we're leaning toward is the following:

On the device (or maybe the instance), add "task runner" hook, e.g. a pointer to a function that receives a void* (*task)(void*) and executes it wherever it wants. This makes it possible for the application to schedule the work on a thread pool that it's already running, instead of having a dedicated one for webgpu.h.
- It's probably not needed on the instance but could be moved from the device to the instance. Just need to figure out a guarantee about how the application frees its userdata for the hook.
- It could be overridable on a per-call basis but it's unclear if this would be necessary.
Allow implementations to send whatever they want to that hook.
- WASM implementations will send nothing there, probably
- the synchronous work behind createPipelineAsync should be sent there
- the synchronous work behind createPipeline (sync) and createShaderModule may be sent there[1]

[1] this makes the web and native implementations more similar by making createPipeline and createShaderModule return quickly instead of synchronously blocking, and it's also just convenient for a lot of applications IMO so they can get on with their application logic. There could be an extra option to say "actually please do this one synchronously", e.g. locally override the hook with null.

Still need some time to think over this though, and especially seek feedback from folks outside the meeting.

kainino0x commented 1 year ago

meeting: tentatively proposing that solution. @jimblandy seemed OK with it and it turns out Dawn already basically does this.

austinEng commented 1 year ago

A thought I had on this after the meeting

should the task runner hook pass back the WGPUDevice on which the task is getting posted for?

If the hook is a plain function pointer with static lifetime, the application is likely going to need to proxy it over to some task runner implementation which may not have a static lifetime.

cwfitzgerald commented 1 year ago

In the rust api, I interpreted the hook as also having a userdata pointer:

type Task = Box<dyn FnOnce() + Send>;
type TaskCallback = Box<dyn Fn(Task) + Send + Sync>

so both parties could ferry about whatever data they needed opaquely.

kainino0x commented 1 year ago

The PostTask hook may need some degree of extensibility. In particular I'm thinking there may be at least a flag for two different types of posted tasks:

CPU work that should be run in a thread pool no larger than the number of hardware threads
Blocking work that doesn't have to be limited to hardware threads and may block for a while without making progress (in particular: forwarding vkWaitForFences into OS-provided event primitives, for composability with other async things)

Not sure if we will ever need more than a flag, but maybe we want a place to put a struct chain for future extensions?

Another thing we might want is a name string for the task? I don't know if this is a common thing.

kainino0x commented 1 year ago

(in particular: forwarding vkWaitForFences into OS-provided event primitives, for composability with other async things)

Lacking a way to express this, we could probably guarantee something that would work OK, like we'll never have more than one posted task be kernel-blocked at once.

kainino0x commented 1 year ago

never have more than one posted task be kernel-blocked at once

per VkDevice.

Another possibility would be allowing the PostTask API to spin up an indefinitely-lived task, i.e. create a thread, like we had discussed in the past. For Vulkan this could be one thread per VkDevice to sit and wait on vkWaitForFences.

kainino0x commented 1 year ago

webgpu.h meeting:

Add a bitflag or enum, and specify in the API contract that: (1) these flags are always hints (2) the embedder "must" gracefully ignore hints that it doesn't recognize (that is, it's not a semver-major-version-breaking-change if we add a new one).

Also specify that PostTask callback must not just call the callback inline. It's safe but could maybe cause deadlocks (if we use it to do blocking I/O).

kainino0x commented 1 year ago

Marking has-resolution, but someone will still have to figure out exactly what the api surface will be.

webgpu-native / webgpu-headers

Worker threads for background pipeline compilation #187