ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License
34.94k stars 2.55k forks source link

Proposal: Event loop redesign #8224

Open kprotty opened 3 years ago

kprotty commented 3 years ago

The current event loop is not ready yet (relatively slow, windows unfinished, bugs/races) and many wish for it to be. From the discord communities at least, there seems to be enough interest to warrant addressing this, with some interested in helping out. My question then is: what do people really want out of the standard libraries event loop? Toying with some designs locally produced incomplete implementations from lack of a direction. In addition to what people want from an event loop, how do they also want it implemented? Here are some example properties:

I'm open to adding more considerations if others have some. Just want to understand the requirements before making any sort of PR.

kristoff-it commented 3 years ago

Assuming we continue with the idea of having an "event loop" abstraction, I would like to see from it some form of consistency across different OSs, at least when it comes to the big 3.

More specifically, I believe one important goal should be to ensure, as much as possible, that if an application doesn't deadlock on OS Y it also doesn't deadlock on OS X (hah pun kinda intended). I am not sure what is the current state of things, but one example that comes to mind is file I/O which only simulates evented I/O by using a thread pool.

Somewhat related to the above goal, there should be an API (or a clear pattern) for dealing with blocking I/O (and CPU intensive operations). As a concrete example, in bork I used to steal one thread from the event loop when using BearSSL and, depending on the CPU configuration of the host (and other implementation details of the event loop that fortunately in this case didn't pose a problem) it could simply deadlock.

If instead the event loop gets broken down into discrete components, then this question IMO still stands and would need to be re adapted for the new context.

haze commented 3 years ago

I want one that prints "bruh" asynchronously

lithdew commented 3 years ago

I'll put my high-level thoughts down in points here since my thoughts are a little all over the place when it comes to Zig's async story. The perspective I'm coming from with my thoughts here is after spending the last few months implementing a few resource-constrained high-performance distributed systems and async-driven sensor applications in Zig, and spending the last month researching how async works in other programming languages.


  1. It's really, really hard to design for a general-purpose, all-in-one event loop that everyone can be happy with. I don't think we should aim towards such an amalgamation, and that we should instead aim towards creating production-ready modules (I/O reactors, I/O proactors, thread pool executors, synchronization primitives) that a developer may then combine into an event loop design that they are happy with.

I personally have a lot of frustrations over some of the design decisions in event loop amalgamations featured in the libraries and runtimes for Go/Rust/Nim/Pony/etc. At the very least, aiming towards modularizing the event loop is work that can be done with Zig today that would undoubtedly be reusable should the time come that we eventually come to a consensus on an all-in-one event loop design.

  1. There should be a focus in the standard library on supporting the lowest-common-denominator subset of cross-platform features that can be driven asynchronously by an I/O poller. To paint a picture, there should be a single io module in the standard library that presents first-class async support for sockets, timers, pipes, events (i.e. eventfd in Linux), etc.

Given that async file I/O really is something that has only been recently been made available in Linux via. io_uring, the io module should only contain support for blocking file I/O. An experimental module could optionally be included inside the standard library which provides Linux support for async file I/O.

There should also be cross-platform support for async Ctrl+C handling in io, with a separate, optional experimental module for async signal handling support.

  1. I feel that the cross-platform abstraction for I/O pollers in the Zig standard library should follow the proactor pattern, and that all async I/O modules have completion-based APIs (callback-based). This would allow for easy C interoperability (pass a callback), and also only requires minimal boilerplate for being driven by async/await in Zig by wrapping the callback into an async frame. A number of battle-tested cross-platform async I/O libraries like libuv, libevent, and asio follow the proactor pattern as well.

This comes from my experience that abstracting a proactor-based I/O poller (e.g. IOCP) to follow a reactor-based design requires quite a bit of hackery and workarounds vs. abstracting a reactor-based I/O poller to follow a proactor-based design.

Having such a design where either a callback may be provided, or an async frame may be created to wrap under a callback, allows for users to have flexibility in how they want to allocate the memory necessary to asynchronously drive an I/O primitive (i.e. on the heap, or on the stack).

  1. All async I/O and synchronization primitives should support cancellation and composition by design. The memory and CPU overhead for not supporting cancellation or composition is negligible for most applications.

With a production-ready I/O poller and thread pool abstraction, either way, it is most likely the case that you wouldn't settle with general-purpose implementations of async I/O or synchronization primitives that are easy to use if you are in a situation where the smallest amount of overhead present in supporting cancellation and compositional operations matter to you.

  1. Much like the rest of the standard library, I feel that allocations should always be left up to the user. Therefore, where possible, memory should be intrusive.

  2. The thread pool should support being configured with callbacks that are to be run in certain places (e.g. when a thread is spawned). A common use case for this is e.g. initializing thread-local state (tracy encourages users to have all user-spawned threads marked with names via. it's C API), and logging events.

  3. I personally dislike global flags like the io_mode flag, and would rather have individual modules be configured to operate in a blocking/non-blocking manner. Hence, unlike status-quo Zig today, my suggestion is the instantiation of an I/O poller and thread pool and more should be made explicit in a user's program.

  4. Apart from the overhead of supporting cancellation and async chaining operations (i.e. wait all ops to finish, race for either one of many ops to finish, etc.), all modules should be optimized for performance. For something as low-level as asynchronous I/O and thread pooling, performance and a good user-experience in coordinating async operations should be the only two things people really care about when it comes to a standard library.


Regarding a few of the points above, I didn't dig that deep into the specifics as I'd love to see other people's thoughts and feedback first given my one-sided point of view (i.e. how module-level configuration should look like, how async operations are to be facilitated by the standard library to be composed together, how I/O poller-related state should be separated and contained, etc.).

That being said, let me know if you spotted any glaring holes in my thoughts, or if you'd like me to explain more of my rationale on any one of the points above.

kprotty commented 3 years ago

Any further opinions on this?

jorangreef commented 3 years ago

My question then is: what do people really want out of the standard libraries event loop?

Thanks @kprotty for setting this up.

Multi-threaded by default?

No, and yes, depending on the plane: control plane or data plane. Having a clear distinction between these helps to inform the design decisions for each separately because the safety and performance profiles are different.

For example, given Zig's approach to safety and performance, I would really want a single-threaded control plane, with this single-threaded control plane outsourcing to the kernel thread pool for I/O through io_uring (or kqueue or IOCP for other platforms) to reduce the performance cost of syscalls, context switches and locks, and to side-step or at least diminish the gap for data races for the most part. Backwards compatibility for older Linux platforms can be shimmed with an I/O threadpool, but io_uring should drive the design because it reflects where hardware performance is at today.

Influence scheduling (hard/soft priorities, QoS classes, fairness)?

QoS is important. In addition to the single-threaded control plane (with the kernel threadpool for I/O) I really want additional multi-threaded data plane(s) for CPU-intensive work [xor] long-running work, with clear QoS, i.e. one data plane for user-defined CPU-intensive tasks (e.g. bulk crypto like checksumming 256 MiB chunks, or erasure coding, or deduplicating them), and another separate data plane for long-running background tasks (stuff like DNS requests that are not CPU-intensive, that could take minutes and that could be context-switched without a performance hit).

So the data plane for CPU-intensive tasks should be sized according to the number of cores to reduce context switching. On the other hand, the data plane for long-running background tasks would be okay with context switching (being on a different order compared to the task time), and would be sized larger than the number of cores to support concurrency.

Support cancellation?

This could solve alot of pain for people. Cancellation can also be racy and difficult to get right. Then again, it's hard at every layer and solving this lower down can help higher up. My gut feel is it's hard to do well. On the other hand, there could be cancellation for everything, but with a completion event for when the cancellation succeeds, so that this could shim over underlying abstractions that don't support cancellation, while not holding back the abstractions that do.

Are allowed to dynamically allocate (i.e. heap alloc) internally?

No. Although there might be a good argument to made for something like registered buffers, with a user-provided allocator of course. The focus overall though should be on zero-copy into user provided buffers. For TigerBeetle, everything is static allocation (even disk) and zero-copy. We would never use something that allocated dynamically, unless this was at initialization, where we have some kind of control over the parameters, and again for something like registered buffers. Allocation for stuff like internal queues etc. is also okay but it needs to be subject to the user's restrictions and at initialization only.

How async should it be?

Zig's async is awesome of course. However, here's something surprising from our experience. We tried out Zig's async plus io_uring for TigerBeetle (https://github.com/coilhq/tigerbeetle/blob/main/src/io_async.zig) and then actually went back to explicit callbacks plus io_uring in the end (https://github.com/coilhq/tigerbeetle/blob/main/src/io.zig). The reason being that we were doing this for a distributed consensus protocol where we wanted to make sure our message handlers run to completion with the same system state, whereas coroutines increase dimensionality while a function is still running. We wanted clear jumping off points to I/O just because getting the consensus protocol right was hard enough. This is specific to our use-case for TigerBeetle only, it might not be relevant here, but wanted to share the anecdote if it helps.

Should we even have an event loop at all?

The money question. We could start with the primitives. For example, @ifreund and I worked hard to get an event loop primitive going for io_uring that could mimic epoll's "poll for some amount of time": https://github.com/coilhq/tigerbeetle/blob/main/src/io.zig#L52-L83

io_uring

io_uring from the 5.6 kernel on up has almost everything we need for the data plane intensive aspects of storage and network I/O, and even the 5.8 kernel is gaining support. io_uring is a beast and it's only accelerating from here. It's something that's hard to ignore.

While we could plan according to current support for the 5.8 Linux kernel right now, perhaps a better approach would be to plot a course taking Zig's 1.0.0 trajectory into account.

With that in mind, I really want to see the IO component of the event loop taking full advantage of io_uring as a first-class citizen, because that's where the puck is already at, just not evenly distributed.

NVME devices have become so fast that the cost of a context switch is on the same order as an I/O operation, so io_uring makes sense to support the next generation of lockless thread-per-core designs, which is what we're doing for TigerBeetle to support a million financial transactions a second.

It's also a perfect time for Zig to take advantage of this shift in API design. I would really want to start with io_uring and work backwards rather than the other way round. A worker threadpool or traditional approaches to event loops should be last-resort, not something driving our design.

The bottom line: Zig should be "High-Definition" or "Dolby Atmos" where the machinery supports it.

kprotty commented 3 years ago

Thanks @kprotty for setting this up.

Thanks @jorangreef for the detailed response! This is an area where I strongly believe zig should make some good decisions after having interacted with other async/IO environements so any input/ideas are welcome.

control plane or data plane

Could you elaborate on the differences between the two?

In regards to performance, particularly throughput, it sounds difficult to reach its optimum if there's only one thread which handles control flow (if this is what you're implying). Having multiple threads performing application logic and without locking sounds ideal for throughput in the absence of effective static work distribution.

I do agree that io_uring should somewhat drive the design for the I/O subsection of the Event Loop. I say "somewhat" as SQEs are allocated internally on flush (at least currently) and it may not be what is desired for io_uring compatibility layers.

My gut feel is it's hard to do well

This has been my experience as well

Allocation for stuff like internal queues

This was my primary intent when posing the question. While I have working thread pools which only utilize intrusive memory, I have been interested if using dynamically growing but contiguous task queues would help in latency. This sort of memory allocation sounds contrary to your idea though.

We wanted clear jumping off points to I/O

By "clear", would this be a reference to a function call doing I/O without the callers awareness? or some other scenario?

an event loop primitive going for io_uring

This is pretty neat! Like the idea of intrusive SQE/CQE structs and the flushing/polling strategy. This would solve the "io_uring SQE allocation" issue at the abstraction level.

io_uring is a beast and it's only accelerating from here.

This was my initial experience too. Although having micro-benchmarked it against epoll, the decrease in syscalls and use of kernel thread-pool that sounds more efficient on the surface ended up being slower. Do you (or anyone else really) have any ideas why this might be the case?

I really want to see the IO component of the event loop taking full advantage of io_uring as a first-class citizen,

Definitely. Although it's a bit rough trying to envision how this would manifest itself higher-level API-wise. Have personally been looking into libdispatch for scheduling and queuing insight/inspirations recently.

jorangreef commented 3 years ago

Could you elaborate on the differences between the two?

Sure! In the abstract, the control plane is just anything that's not in the critical data path, it's safety critical but not performance critical, whereas the data plane has the inverse profile. For example, it would be fine to have (and one would want) plenty of assertions in the control plane for safety, whereas the data plane is in the critical path with huge volumes flowing through it. Here in the data plane one would want to optimize for performance: cache misses, context switches, branch mispredicts etc. The data plane would be like a water mains or oil pipeline, there's nothing inside to obstruct flow. The control plane would be all the safety checks you do outside the pipeline as an operator, the little control box that adjusts pressure and controls the pipeline. Having this clean split in the design provides both safety and performance without compromising either. It also obviates in large part the need for a borrow checker, although data races are still possible.

As a concrete example of this technique, in TigerBeetle, we have around 10,000 financial transactions in a batch. Our single-threaded control plane is responsible for switching each of these batches through the consensus protocol, and we amortize all runtime bounds checks, assertions, syscalls and I/O across the batch, so that these become almost free, yet we have literally hundreds of assertions, and we're doing O_DSYNC on every write etc.

In regards to performance, particularly throughput, it sounds difficult to reach its optimum if there's only one thread which handles control flow (if this is what you're implying). Having multiple threads performing application logic and without locking sounds ideal for throughput in the absence of effective static work distribution.

I think there's also a third option. For example, in TigerBeetle, if we want to do crypto for a batch (we have a use-case for this for Interledger), we would then drop that into a multi-threaded data plane with QoS for CPU-intensive work, by dividing the batch up across threads. We wouldn't want to solve that kind of performance problem with a multi-threaded control plane, because that would introduce too much complexity into a safety-critical domain, and also be conflating the control plane and data plane. In terms of this approach, the performance-critical work should happen in the multi-threaded data plane because that's been optimized for it. If we do find any scalability limit, then we would simply move to multiple event loop abstractions per domain, in our case across networking, storage, and state machine, all connected by ring buffers, and all the control planes still single-threaded.

This is also not new to TigerBeetle, we based this design on the work done by LMAX, a high-performance trading platform, which has single-threaded control planes for all event loops, and which has multiple event loops connected by ring buffers. They found the single-threaded control plane is also faster because it lets the CPU run unimpeded by context switches, like a sprinter doing the 100m in a straight line without zigzags. Martin Thompson calls this "mechanical sympathy" or "the single writer principle" (great blog post). He's done some awesome talks on this that we've documented here: https://github.com/coilhq/tigerbeetle/blob/main/docs/DESIGN.md#references

Redpanda is another new database, from people who worked on ScyllaDB, also doing thread-per-core.

Allocation for stuff like internal queues This was my primary intent when posing the question

Just speaking from our point of view, it would be totally fine to allocate this at initialization, which coincidentally is also how io_uring does it for the most part. Contiguous task queues would definitely be great for latency to reduce cache misses and avoid pointer-chasing. We would really like fixed-size bounded queues with an error for overflow and back pressure (in our IO code we still need to set our upper bounds for SQ overflow, we're yet to profile some of our upper bounds). Explicit limiting is a good thing, especially for any kind of distributed system. This is probably also an area in event loops where developers have been used to "hidden allocation" and unbounded IO depth but I think we could be more explicit about this.

By "clear", would this be a reference to a function call doing I/O without the callers awareness? or some other scenario?

Yes, this would be where the function call ends up doing I/O that suspends so that when the caller is resumed, the rest of the system has moved on to a new state. For example, in the Viewstamped Replication consensus protocol that we use, it's critical that distributed messages are only processed when a replica process' status is .normal and where the replica process' view number (or term or epoch as with other consensus protocols) is exactly the same. We wouldn't want state like this changing during the lifetime of a function. Again, these examples are specific to TigerBeetle and Zig's async/await still has plenty of good use cases. Our team just decided on the explicit pain of colored function call sites for safety. Here's an example of a footgun, just for us, where the use of async I/O made this kind of bug much more likely (we're still weaning this code off async and onto callbacks): https://github.com/coilhq/tigerbeetle/blob/main/src/vr/journal.zig#L522-L536

This is pretty neat! Like the idea of intrusive SQE/CQE structs and the flushing/polling strategy. This would solve the "io_uring SQE allocation" issue at the abstraction level.

Thanks! We learned alot from the event loop gists you did!

Although having micro-benchmarked it against epoll, the decrease in syscalls and use of kernel thread-pool that sounds more efficient on the surface ended up being slower. Do you (or anyone else really) have any ideas why this might be the case?

Yes, there was a known networking regression in 5.7.16 that has since been patched and @axboe has also been optimizing against epoll significantly since then so that io_uring should be ahead on most benchmarks. In terms of storage, io_uring performance on NVME can be more than double compared to blocking syscalls and more than quadruple compared to an I/O threadpool like libuv. Overall, what excites me is that io_uring offers a single unified interface for networking and storage and potentially more and more syscalls down the line. There are some rough edges, like what we had to do to implement IO.run_for_ns() but there is already better support for that in io_uring_enter() in newer kernels. It's also early days and there are incremental improvements all the time.

Have personally been looking into libdispatch for scheduling and queuing insight/inspirations recently.

Thanks for the link! I'm definitely looking up to you when it comes to anything like this, and look forward to seeing your work on this.

vlada-dudr commented 3 years ago

Just a experience from studying Zig language: when I saw that all memory allocating function actually take an allocator as a parameter, I actually expected that IO functions would take an event loop as a parameter. Or at least that one could switch seemlessly between different implementations or instances of the main loop. Same as with allocators, user might have different needs for it so I'd expect standard library would play in concert with many possible implemetations. Also given how minimal Zig tries to be, I was really surprised by the fact that making a program async suddenly starts number of threads by itself, which clashes with the "no hidden control flow" slogan on the homepage.

kprotty commented 3 years ago

IO functions would take an event loop

@vlada-dudr The goal of the discussion is to shed more light on the "hows" and "whys" of asynchronous computing. This also leads to a position that an "event loop" may not even be the right abstraction. Not all IO can be done asynchronously (or at least efficiently with async) and IO done with and without an "event loop" could have very different semantics. For example async wouldn't work with the virtual dispatch abstraction as seen in std.mem.Allocator and "async" IO may not be able to just operate on raw, non-owned, file descriptors/handles similar to the blocking apis.

I was really surprised by the fact that making a program async suddenly starts number of threads by itself

Zig is minimal, but the stdlib is there to do certain things for you. This includes setting up custom handles, looking up vdso functions, discovering tls, etc. Spawning threads could be considered another thing it does to help do other stuff (simplify IO async frame scheduling) for you.

which clashes with the "no hidden control flow"

I don't think so because the creation of threads doesn't disrupt the control flow of your program. If you use an stdlib function which schedules to said threadpool, it could switch threads between suspends. Wouldn't necessarily classify this as "hidden" given suspend or at least a function call is the explicit control flow jump

vlada-dudr commented 3 years ago

@kprotty:

The goal of the discussion is to shed more light on the "hows" and "whys" of asynchronous computing. My post intentionally didn't touch implemetation. I think one "why" might be "because we want have higher level API look like this". Actually high level API is something which newcomers (like me) will be confronted with. It's readability and coherency with other parts of the standard library will be major arguments to why choose zig or not, at least for most people. That is why I shared my impressions with you, so you know how current state made a random person think about it.

Spawning threads could be considered another thing it does to help do other stuff (simplify IO async frame scheduling) for you. Well, most other languages would do it. But my impression from Zig was, that the programmer should be way more in charge and explicit then usual.

I don't think so because the creation of threads doesn't disrupt the control flow of your program. I guess it depends on the program type. But maybe in case it is a problem for someone, the person would bring his own "event loop" or "async library" for the special usage. I don't know.

Thanks for reaction. Maybe I should have posted it on irc, not here, so this can be free of random gibberish and keep it highly technical. Cheers!

kprotty commented 3 years ago

@vlada-dudr For the "hows" and "whys" part, I meant it as context to transition into why I don't think passing an "event loop" around could be equated to the Allocator approach, rather than a critique of missing impl. I believe your post did touch "how" (explicit loop ref) and "why" (similar to allocator).

It's readability and coherency with other parts of the standard library will be major arguments to why choose zig or not, at least for most people.

I think this is where i'm at a fork. On one hand, I would like to have a super-customizable and near-optimal async building-blocks library since I enjoy polishing concepts. But on the other hand, there seems to be enough prioritizing simplicity/easy-understanding (even with discussions in other zig communities (come join #async!) which comes at a price to the former. Practically, both have optimization opportunities in mind but having issue deciding if Zig (the stdlib rather) should target "efficient hardware utilization" as a high priority or "more efficient hw util than other environments" (or something else?). Have any thoughts on this meta issue?

my impression from Zig was, that the programmer should be way more in charge and explicit then usual.

Im a bit confused in how slippery this slope should be for the stdlib. In regards to the other useful features I noted that happen in start.zig before the traditional main() is called, should those be disabled by default and explicitly enabled? The reason this is slippery is because disabling them by default would make Zig no safer/easier than C for newcomers and could arguably subtract from their first experience which tends to value ease-of-use from my understanding.

But maybe in case it is a problem for someone, the person would bring his own "event loop" or "async library" for the special usage.

Would be helpful to have an example to go over. I've recently played around with a bring your own blocking impl for synchronization primitives which allows the same algs which implement things like Channels and Mutexes to work generically for blocking and asynchronous environments. Would it be something similar to this?

Maybe I should have posted it on irc, not here, so this can be free of random gibberish

I'm of the opinion that gibberish is fine so long as it's on topic to the discussion. An ok-ish but not so great example lol. I could try hopping on IRC later today to continue the discussion if needed.

kvelicka commented 3 years ago

My input requires a bit of context, so I'll start with that. Firstly, I'd like to somewhat echo an earlier comment that said:

Assuming we continue with the idea of having an "event loop" abstraction, I would like to see from it some form of consistency across different OSs, at least when it comes to the big 3. More specifically, I believe one important goal should be to ensure, as much as possible, that if an application doesn't deadlock on OS Y it also doesn't deadlock on OS X (hah pun kinda intended).

Later on, the thread has moved more in the direction of discussing more state of the art approaches, e.g.:

It's also a perfect time for Zig to take advantage of this shift in API design. I would really want to start with io_uring and work backwards rather than the other way round. A worker threadpool or traditional approaches to event loops should be last-resort, not something driving our design. The bottom line: Zig should be "High-Definition" or "Dolby Atmos" where the machinery supports it.

With those in mind, this is what I'd like to say:

I think that there's a difference between a good/optimal/pick-your-word event loop[1] design, and one that is appropriate for a standard library. My experience so far has been that Zig wokrs for a large number of targets, meaning both architectures and OSes (or lack thereof, but that's probably out of scope for stdlib discussions). Given that, opting for a linux-centric event loop design could end up at least somewhat dangerous/counterproductive:

I should say that I agree with the earlier quote The bottom line: Zig should be "High-Definition" or "Dolby Atmos" where the machinery supports it., but I don't think that this should come at the expense of general usability/approachability. I personally don't see a big issue with a lowest common denominator approach for the standard library event loop, and I don't think that goes against the quote - we can still have a modern, well made event loop for Linux[2], it just perhaps shouldn't be the canonical implementation throughout stdlib.

[1] or any other component for that matter [2] maybe even in the standard library somewhere

kprotty commented 3 years ago

@kvelicka

opting for a linux-centric event loop design

FWIW io_uring isn't necessarily linux-centric, but made popular by linux since it's more flexible. The design of "batching" and "asynchronous" syscalls has been seen elsewhere, notably for high-performance I/O in the form of Windows RIO. The idea basically just a queue of I/O operations that you submit and reap the results from, with the ability to wait for results. Given that description, it makes the API a lot simpler - and can be trivially modeled with existing APIs: libuv could be seen as a very close, but higher level, relative.

an advanced event loop is presumably harder to design, implement and [...] maintain

I agree, but am of a different notion on which of the two approaches would be labeled "advanced". I would like to argue is that while an efficient design of the "traditional" approach of a multi-threaded work-stealing event loop could be more familiar to setup/interact with for someone coming from say Go/Rust/Erlang, it would be much harder for an stdlib maintainer to change, understand, and modify. Designing an efficient scheduler for such an event loop is non-trivial, and when you look at advanced work-stealing schedulers like those of Go, tokio, and rayon only a hand-full of contributors make feature/bug changes due to their large scope, multi-purpose algorithms, and sometimes complexity. This has also been my experience as well, although I'm open to be convinced that the ramp-up time may not be as big of an issue in practice.

there may be a shortage of stdlib authors/contributors that have the requisite background

Once again, I think this rings true especially for efficient multi-threaded applications. Vetting them can be difficult and understanding atomics / memory ordering is known to be one of the "background specialization" areas IME.

we'd essentially be committing to a sub-par experience for non-Linux crowd

For some insight, we (@lithdew and I) recently created a prototype of a io_uring emulation layer on linux (using the "traditional" epoll interface) and found it to be as fast and often faster than using io_uring itself (probably due to the older 5.12 kernel @jorangreef mentioned). We also found that the traditional multi-threaded event loop, while providing marginally higher throughput for network requests than its competitors (Go/Tokio), was ultimately worse in throughput and average latency versus the "io_uring" (emulation or not) approach. This is all to say that it would be close to a universally sub-par experience (not only Linux) given kqueue works similar and Windows can be made the same as well with the use of AFD although i'm yet to test this perf claim for windows.

I don't think that this should come at the expense of general usability/approachability

I agree as well. The thing is that the higher level API of socket.read() being done asynchornously doesn't have to change between the two approaches. What changes is mostly how they decide to multi-thread the IO layer if they please, without having multi-threaded overhead by default:

I know i've been shilling the io_uring abstraction recently, but I have reasons in doing so:

Even though i've changed my stance on the "traditional" approach recently, I'm still very much open to going back if the supporting arguments are technically supportive or if such a design is just not worth it for the standard library. I just know that either abstraction has a ton of room for optimization opportunities.

kvelicka commented 3 years ago

Firstly, I really appreciate the extensive and thoughtful response! And to be clear, I am not opposed to a io-uring[like] interface - I'm primarily Linux user myself and I don't currently have plans to develop anything for non-Linux targets, so I'm certainly not worried about incompatibility for my own sake. My worries are captured by what you said in the last paragraph - ...if such a design is just not worth it for the standard library. I'm happy to hear that you have thought about this topic a lot and I only have a couple of things to add:

Does the Zig stdlib event loop need to be particularly high performance? I think there's value in an implementation that tries moderately hard to be correct (thus understandable) and consistent across platforms (to a feasible extent), but putting performance as a second/third priority. One of things I find appealing about Zig is that a lot of things that'd often be implemented as language features in other langs are just code which can be plugged and reused as any other. So, in principle, it shouldn't be difficult to plug in a high performance event loop instead of the std one[1]. And if that is difficult, perhaps looking how to make that easier is a worthwhile avenue to explore?

[1] Having infra like this outside of stdlib of course comes with its own sets of issues and advantages, and I think that in medium-long term those might be just as (if not more) important than the technical merits! C++'s asio might be an interesting case study - it's a async I/O library/framework that has been used in production for years, and is now becoming the basis for C++ standard library networking - I think having significant battle experience outside of stdlib can be rather liberating as you're not as constrained by backwards compatibility, release schedule etc. Rust's async history/story is likely also relevant here but I've only watched that from afar so I don't have any useful insights there.

kprotty commented 3 years ago

I don't think you mentioned OS X in your response, not sure what the situation with a io_uring-like approach is there

Darwin (macos, ios, etc.) uses kqueue, an event notification scheme similar to linux epoll. The primary differences are that it supports batched "epoll_ctl"s via kevent(), it doesn't coalesce read & write events on the same fd, and it supports a higher variety of notifications without requiring an fd (timerfd, eventfd, inotify, etc.). It's integration to an io_uring emulation layer would be similar to epoll's in a lot of aspects.

Does the Zig stdlib event loop need to be particularly high performance?

It's one of those things that are implicitly in the Zig zen, particularly Avoid local maximums. The standard library tends to focus on being memory efficient as opposed to necessarily highly performant. You can see this in code by the heavy use of explicit allocators and intrusive memory. You can see this in the community with many zig libraries avoiding heap allocations when they can.

With this, however, is also implied somewhat performant code. This focus on memory/operational efficiency in many aspects of the stdlib (unicode, MultiArrayList, HashMap.Entry, Allocators) helps design APIs that are performant but also not too difficult to use. I think that is the direction the event loop should head towards as well, regardless of the abstraction-class chosen. This means avoiding interface restrictions like having to dynamically allocating memory to schedule a callback/anyframe, avoiding contention where possible, and reducing required syscalls.

there's value in an implementation that tries moderately hard to be correct (thus understandable) and consistent across platforms

There's a difference between "correct" and "understandable": Goroutines and channels are "correct" and arguably "understandable" while Zig async is still "correct" but, from my experience, doesn't seem to be immediately "understandable". The good thing is that when Zig async is understood, it tends to be one of those "it's obvious in hindsight" things like std.mem.Allocator which designed for efficiently (thus performant) without being too complicated (hopefully there's a pattern here — something idk if i'm explaining correctly).

As for platform consistency, this looks like one of the criteria that I considered essential by default for a standard library abstraction so I didn't give it much attention in my response: I also think it should be of higher priority than "performance".

it shouldn't be difficult to plug in a high performance event loop instead of the std one

This is the hard question that needs to be answered: what should such an API look like which allows a high performance implementation to be plugged into? I'm of the opinion that the API choice (in this case "traditional multi-threaded" vs "io_uring single-threaded") is what affects the performance ceiling of the implementation[1] and that a poor API effects the efficiency of the implementation [2].

C++'s asio might be an interesting case study - it's a async I/O library/framework that has been used in production for years, and is now becoming the basis for C++ standard library networking

Boost ASIO appears to be effectively the "traditional multi-threaded event loop" approach coupled with an MPSC channel called "strand" and no cross-platform support for file I/O (unless i'm missing something here). Would categorize it similarly to other "traditional" approaches (i.e. go, rust tokio)

[1] POSIX synchronous file I/O interface limits ssd/m.2 drive performance [2] Rust async's Waker and poll() APIs force Futures (coroutines) and I/O buffers used in completion based APIs like io_uring, posix aio, and windows iocp to be dynamically tracked (via ref counting usually) and heap allocated (via a global allocator usually).

kvelicka commented 3 years ago

I see, thank you again for the detailed responses - I've learned tons and my mind is at ease, you have certainly thought about this a lot!

iacore commented 2 years ago

I support message passing as the only communication method between tasks, like Erlang/OTP. Functions should take a event loop argument if they want to use event loop.

kprotty commented 2 years ago

@locriacyber Message passing is an easy to grok, unified communication scheme. But it's not mechanically efficient given memory or serialization overhead it implies. Many times, shared memory is either more obvious or more resource efficient (e.g. Mutex) and I believe Zig should be explicit about those costs like it is with memory allocation and not provide only one high level abstraction.

As for event loop by function argument (instead of global), it sounds interesting. Would things like File/Socket/Thread take an event loop?

iacore commented 2 years ago

There is no cost if you only pass pointers around. The language user need to make sure not to use a pointer after it's being sent. This calls for event loop scoped memory allocator.

OS Thread is irrelevant to event loop. File and Socket operations generally don't need a event loop since they don't spawn new parallel tasks.

The current async/suspend feature in Zig is a language-level effect. A more elegant approach is to add an effect tracking system like the one in Koka allowing user-space defined suspend/resume. (That's a huge language feature.)

iacore commented 2 years ago

As for the event loop itself, copying Node JS (single threaded task queue) is probably good enough.

kristoff-it commented 2 years ago

As for the event loop itself, copying Node JS (single threaded task queue) is probably good enough.

@locriacyber that's not how it works in Zig. The implementation details matter just as much as the high level user facing API. If you want to contribute to this discussion you need to familiarize yourself not only with the usage semantics of concurrent operations, but also with the tradeoffs involved with specific OS APIs and implementation designs. On top of that, we can't just trivially copy NodeJS because Zig is not garbage collected.

kyle-github commented 2 years ago

@kprotty quoted @locriacyber:

As for event loop by function argument (instead of global), it sounds interesting. Would things like File/Socket/Thread take an event loop?

I really like the API/pattern for memory allocation. Makes everything nice and explicit. That said, what would such an API look like for event loops?

Would you have a special event loop type for when you want blocking calls? Or would you have two versions of each call, one for blocking and one for using an event loop?

kprotty commented 2 years ago

There is no cost if you only pass pointers around.

@locriacyber The cost is that the pointers have to be escaped as they no longer have a static/linear lifetime which generally means heap allocation. This often isn't necessary when using other shared memory primitives like (async) Mutexes. There's also synchronization/serialization overhead in the channel implementations over the shared memory mutexes depending if they're multi-consumer or multi-producer.

This calls for event loop scoped memory allocator.

This would be the referenced overhead from earlier

OS Thread is irrelevant to event loop.

Erlang/BEAM enables SMP by default, similar to Golang. Is the event loop being referred to here single threaded?

File and Socket operations generally don't need a event loop since they don't spawn new parallel tasks.

Files and Socket APIs are connected to and depend on the event loop implementation, as the loop needs to be aware of them in order to poll IO asynchronously.

As for the event loop itself, copying Node JS (single threaded task queue) is probably good enough.

There's some discussion of removing the current multi-threaded event loop and replacing it with 3 things:

iacore commented 2 years ago

On top of that, we can't just trivially copy NodeJS because Zig is not garbage collected.

@kristoff-it We want an event loop mostly because we want to spawn tasks that will clean itself up after completion (non-linear, non-deterministic (when many tasks are queued) and delayed execution). The event loop will only manage memory necessary to store the task's frame.

@kprotty Single-threaded event loop is good enough for I/O tasks. Even for video games, a single-threaded event loop is good enough. I need to read about io_uring.

kprotty commented 2 years ago

It's been a while, but wanted to check in on this again. Would the following be a good consensus?

gajanak commented 2 years ago

In Windows based OS, loop.accept could use ws32 AcceptEx which support overlappedIO Result which works with current 0.10 loop design. Is someone working on this already ? Current Build only throw "Os not Supported" - Compiler Error ?

Sashiri commented 2 years ago

@gajanak I tried to add it but there're a few limitations, mainly with the way current event loop tries to handle errors and windows worker calling finishOneEvent after pulling, making the task code harder to read and maintain. For more details see #10115

tauoverpi commented 2 years ago

I don't really like io_mode as it changes the language from underneath along with many of the abstractions within the stdlib without a way to use File and similar outside of the event loop. It would be nice if IO could be passed as a parameter (similar to Reader maybe?) as then the choice of which to use remains in the hands of the user of the event loop without having to implement their own variant of File and similar.

kprotty commented 2 years ago

@tauoverpi To better understand, would this make things like File, Stream, StreamServer type generic or would their methods take in the IO type?

Other food for thought: should this pattern also extend to other things that could possibly block like synchronization primitives and timed sleeps/waits? Or should it be strictly for IO, or just File IO?

tauoverpi commented 2 years ago

@kprotty Methods taking the IO type would possibly be cleaner as the underlying type (e.g File) doesn't change despite a different IO mode and allows it's inclusion within containers without comptime generics at the container type level.

I've used the type generic approach with timers and serial IO to ease testing timeouts and various read failures which has worked well (with hindsight, on methods would have been easier to work with) so going beyond IO to sync primitives and more could work well for testing and alternate loop implementations.

As a side note, this does start to feel like a capability/effects system.

kprotty commented 2 years ago

@tauoverpi One issue with IO types on the methods is that it makes the implementation less efficient:

Things like epoll and kqueue 1) wouldn't know the socket is non-blocking so its a syscall to set it as such (and possibly reset it back) 2) would need to use oneshot event registration, which is a syscall on every wait, as they don't have any stable association for the fd's lifetime due to IO being method based instead of type based. This prevents the usage of the more efficient edge triggering mode which requires extra state per handle/fd.

For windows, there wouldn't be a way to know that the handle supports overlapped IO without first tryring the syscall and observing the error. This would reduce the performance for non-overlapped/sync access.

Another potential issue is that one could use different IO types on the same file/socket/fd. And without a way to synchronize their accesses, one event registration or async routine could override the other. This is actually what happens when you try to waitUntilFdReadable() and waitUntilFdWritable() concurrently with the same fd on linux using the current event loop.

aep commented 2 years ago

I don't really like io_mode [...] . It would be nice if IO could be passed as a parameter (similar to Reader maybe?)

hopefully it's not too late to reverse the decision on making it implicit and global.

  1. no_std split the ecosystem in half since its off by default, so now we have a no_std fork of everything
  2. the hidden async state machine requires thread local storage, which is unusable on some cheap silicon.

async is extremely hard to get perfect for all use cases, so it would be very useful if competing async impls could exist. for example by simply making the underlying fd available in std without implementing the actual eventloop.

kprotty commented 2 years ago

@aep After some thought, I'm up for removing io_mode as well. Regarding the other points though, there might be some disconnect in terms of context:

  1. By no_std, do you mean the concept of Rust? Zig's std module can be imported even in freestanding targets IIRC.
  2. Requiring TLS to implement async was a thing only in older versions of Rust which was fixed around 2 years ago. This is not an issue in Zig (async does not require an OS/thread locals)

Could you elaborate on "async impls"? It's a language feature with currently one impl in zig stage1 that isn't tied to concepts like file descriptors or necessitating an event loop.

endel commented 2 years ago

For research purposes, this experiment from Nick Banks / Microsoft tries to explore a unified API for multiple IO models: https://github.com/nibanks/eventq

This is possibly relevant research material. I'm no expert on this matter. (Don't wanna sidetrack the discussion, feel free to ignore 😆)

billzez commented 2 years ago

I would like to see a symmetrical API similar to std.Thread.* that can be used interchangeably between threads and coroutines (maybe call it std.Strand).

Then you could launch a thread using std.Strand.spawn(.{}, entry, args) or coroutine using std.Strand.spawn(.{.modifier = .async}, entry, args). Then the entry point is compiled using the appropriate calling convention (eg .Async), and I can call read() or std.Strand.Lock/Condition/Semaphore.

So functions would still be colored, but it would be bottom up instead of top down. (eg - what does the consumer want). So one function can be instantiated with multiple calling conventions, and io_mode goes away.

kprotty commented 2 years ago

asynchronous versions of std.Thread require an implicit async runtime (i.e. where would Thread.spawn() / Mutex.unlock() schedule the frame to be resumed?). This is what io_mode is intended to setup.

I still believe something pluggable like this would be a good idea, although an std.Thread API requires an implicit allocator (for the OS version, its provided by the libc equiv or mmap for linux) while async frames can be executed concurrently without dynamic allocation.

billzez commented 2 years ago

There would still be an event loop to handle async scheduling. The event loop (with its allocator) could be an option to spawn(). What I'm proposing:

  1. A standard API between for threads and coroutines.
  2. Bottom up function coloring (the nosuspend and io_mode hacks can go away)

I've had a scenario where I wanted a single executable to have both blocking and evented IO. I would like to be able to write a function once and then instantiate it as both blocking and evented and the same executable. This can be accomplished by replacing io_mode with a compile time selected calling convention. And if the calling convention is .Async the event loop pointer could be part of the stack frame.

kprotty commented 2 years ago

For 1. there can be a standard API for threads and event-loop specific coroutines, but the latter has more scheduling freedom than the former and requiring OS thread APIs can limit what's most efficient (e.g. concurrency without alloc, tail scheduling a woken up task).

For 2. nosuspend is required to extract out an async frame's result in a sync/normal setting. It's also used to call a function that has an async path in a sync setting when you know that path won't be hit (e.g. writing to a file synchronously for error reporting). It's a necessary aspect of async semantics to bridge the sync/async realm without the latter infecting the former.

io_mode on the other hand is not a language primitive. It's meant to be a standard library decision to make switching to async APIs easier. Currently, to use blocking IO when evented, you need to first make sure that the IO won't block the event loop (i.e. by running it on another thread). Then, you can either use std.os APIs for IO directly or switch modes on the wrapper you're using if its supported (std.fs.File.intended_io_mode).

Since io_mode and the event loop aren't apart of the language, the event loop pointer can't be passed into @frame() (with @Frame() not having a stack to start - they're really state machines). If you store the @Frame() inside a Task struct (in a custom runtime for example), then you could maybe do @fieldParentPtr(Task, "frame", @frame()) to get it's event loop context. But this would only work for the top-level frame. Calling an async function implicitly creates a new @Frame() and that wouldn't be the one in the Task struct making this pattern a footgun.

billzez commented 2 years ago
  1. Obviously there are differences. But if for example you look at python asyncio, you will find it has Lock, Condition, and Semaphore. I'm simply advocated for a unified API for synchronization primitives that are common to both.

  2. nosuspend is a result of top down function coloring. If function A calls function B, and B does something that makes is colored async, then A gets colored async. But if we know A never blocks, we can use nosuspend. But if we had bottom up function coloring, then we can call A in the sync realm, then B gets called in the sync realm, and if B does a read() call, it uses the std.os blocking variant. But if we called A with .Async, then B can be colored .Async, and read() will use the event loop variant.

std.os.getrandom() is an example of this. It gets colored Async with io_mode = .evented even though on linux it never blocks. This breaks the std.rand.Random API, making it unusable when io_mode == .evented.

Bottom up function coloring eliminates the need for nosuspend and io_mode, and hopefully std.rand.Random.fillFn would be colored by the caller's desire. It would also eliminate the need for File.intended_io_mode.

vlada-dudr commented 2 years ago

The unified std.Strand.spawn idea seems very implicit. I'd prefer to explicitly choose how the task is going to be executed, if as own thread, or as async in current thread. Especially in language which tries to be low level and explicit.

And if thinking about locking/synchronization primitives, aren't they specific for type of runtime? I mean that if the async scheduler never moves task to other thread it doesn't need atomic operations, for dealing with other tasks on same os thread.

billzez commented 2 years ago

It would be similar to the status quo. Currently File.read() calls the event loop read or io.read depending on intended_io_mode. With this proposal, File.read would decide using @typeInfo(@TypeOf(read)).Fn.calling_convention.

kprotty commented 2 years ago

if B does a read() call, it uses the std.os blocking variant

This would mean that either the IO portions like read are comptime provided, or the callee is able to comptime inspect the calling convention used by the caller (of which idk how it would play out with comptime evaluation).

But if we called A with .Async

Wouldn't this syntax be similar to calling A with nosuspend but inversed? ATM, colorless async works because it's infectious upwards.

This breaks the std.rand.Random API, making it unusable when io_mode == .evented.

This is a good catch. You'd solve this by checking for io_mode = .evented, then calling getrandom() with `nosuspend.

Bottom up function coloring eliminates the need for nosuspend and io_mode

nosuspend here seems like it's replaced with comptime provided I/O functions which feels more verbose. io_mode could be removed but this is also true with "top-down coloring". The caller just now provides the functions that could be async / interact with a global event loop (correct me if i'm wrong here, still don't understand the proposed model fully).

The unified std.Strand.spawn idea seems very implicit

@vlada-dudr I think the idea was just to have a upper-level API that chooses between std.Thread and std.event. Andrew threw this idea at some point for something like this:

std.Mutex = if (io_mode == .evented) std.event.Mutex else std.Thread.Mutex
billzez commented 2 years ago

I would expect that the callee is able to just comptime inspect itself. In my example above, File.read can be compiled as both evented and blocking in the same executable. Then it just inspects its own calling convention to see which one it is.

Yes, std.crypto.random can be fixed by adding a nosuspend into tlcsprng.zig by inspected io_mode, but again, nosuspend feels like a hack that is being sprinkled around. Also, this would only work for linux, because other platforms, it still might get scheduled in the event loop.

dee0xeed commented 2 years ago

take a look at https://github.com/dee0xeed/xjiss and at https://github.com/dee0xeed/edsm-in-d-demo and at https://github.com/dee0xeed/edsm-in-c-demo

"everything in Unix is a file", right? ,)

jecolon commented 2 years ago

My 2 cents here aren't much on the technical side, but rather on the general Zig philosophy of being explicit, reducing magic, and making code easier to read over writing. A default, global, behind-the-scenes event loop seems to me a lot like the default, global, behind-the-scenes allocator many languages have and that Zig is better off for not having. If technically possible, I would prefer a user-selectable event loop(s) in the standard library versus the global one we have now. To me, that is just the more Zig-like way to do it.

PS- This is quite the long thread so I apologize if I'm repeating someone else's prior ideas here.

TwoClocks commented 2 years ago

Late to the game, but my 2 cents:

TLDR: I think IO models are going to undergo a large shift in the next few years. io_uring is only the first attempt. Whatever you choose for your IO model now, will be wrong in 5 years. Leave it out of std (or make it really basic)

Background: Hardware with CXL is starting to ship now. This allows NIC to move data into main system-memory and participate in the CPU's cache coherency protocol. The access times for a NIC to cross the bus to main memory are a smidge slower than CPU->RAM access times. The v2 of CXL is faster. So in this world, you can see zero-copy APIs becoming the norm. Put a small chip on the NIC to impl the stack, and move packets directly into the memory of the process that wants them. HPC apps would have a pin'd thread-per-core and busy-wait on memory. Maybe Intel's software interrupts will be fast enough to get rid of the busy-wait.

There are some exotic fabrics (InfiniBand, etc) where it's faster to move a packet to another machine than to take a context switch on x86. This is only going to become more common on less exotic hardware. CPU clock speeds aren't going up much anymore, but networking clock speeds are.

This type of zero-copy IO model is common in HPC environments and uses DPDK/efvi/vma/XDP etc. I think a standard zero-copy API will emerge (or maybe zig should define one?). Implement that in std when it happens. You can implement epoll under a zero-copy networking API w/ a single copy (same as epoll is today). But the other way around requires two copies.

If you are going to put an event loop in std I'd make it out of simple blocks that can be chosen by the user. Like rust's mio library or zig's io_uring. If you want a thread-pool to dispatch events, or single-threaded callbacks, that's another thing on-top of the poll library. If you want async/await you can add that on-top of callbacks.

Extra points : Define a way for multiple event-loops to coexist. Rust tried this but largly failed (IMHO). You need extra threads to combine loops in rust, and I think having multiple event-loops for a single-threaded app is not intended to be possible ( it can be done, but it's using APIs in ways they aren't intended to be used).

Counter argument: People who really care about IO models won't use whatever is in std anyway. So might as well do something "decent".

edit: CXL, not CLX.

dee0xeed commented 2 years ago

People who really care about IO models won't use whatever is in std anyway

Exactly! Having things like hash-maps or json parsing (i.e purely CPU+RAM stuff) in a library (does not matter std or not std) is definitely nice, but i/o is completely another story. Look at all those libpq, libhiredis etc - they all contain connect, read, write, but why? It might be good for a simple program that runs during some relatively short period of time, but for a complex program that interacts with many other programs on different hosts and works 24/7 it is not good at all.

Personally I usually do not use things like libpq - instead I implement (partly) the protocol myself and do i/o the way I want it using OS API directly (in a single event loop), without multi-level superstructures/wrappers found in libraries.

jumpnbrownweasel commented 2 years ago

Hardware with CLX is starting to ship now.

Do you mean CXL (Compute Express Link)? If not can you give a reference to CLX.

TwoClocks commented 2 years ago

Do you mean CXL (Compute Express Link)?

Yes. CXL. I'm dyslexic AF. Updated comment. Thanks for pointing that out.

ethindp commented 1 year ago

I'm a bit late to this thread, but I'll add my input:

I don't think an event loop should be used. This is mainly because from what I've seen, in other languages, when event loops are used they oftentimes get so deeply embedded into the language that they're difficult to replace when you need to do so. So, if an event loop is a genuine thing we need to have, it should be more of an "extra" than anything else; async/await should be minimal, like the rest of the language, and should not require one to function. (It also shouldn't be too complicated! I've seen how languages like Rust do it, and I'm definitely not impressed; in fact, I found the way Rust did async/await runtimes, in particular, such as async executors, incredibly confusing and difficult to actually understand, particularly when documentation I read claimed that Rust auto-inserted "yield points", whatever those are, but I had no way of detecting or interacting with these "yield points" or otherwise controlling when I awoke an async task, even in the executor, or that's how it seemed, which was really confusing and annoying for me.)

My reasoning for simplicity and control (e.g., like Zig does it at present) is because I'd like to take advantage of this in an embedded system I'm considering writing that relies heavily on this paradigm: I'd use it to suspend functions to wait for things like interrupts, instead of needing to do something like setting atomic flags, or anything like that. This way, the interrupt could just "resume" the frame and remove it from the suspended frame stack, for instance. I could see something similar being done in many other use-cases.

So, in sum, if you really do need an event loop, try to make it optional, and document it that maximizes understandability.

I wouldn't focus too heavily on efficiency. Use io_uring, iocp, etc., but don't try to make efficiency a hill to die on. (And you can always optimize later.) If someone desires something that's more efficient than what the stdlib offers, they can and should always write their own.