kristoff-it opened this issue 4 years ago
> Can anybody give me an example of why this is considered an important situation to support?
Consider pread, preadv, pwrite, pwritev - these specify the position in the syscall, so they are safe to do concurrently. One could imagine writing a database application that writes multiple records to a backing file at the same time.
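For example, a minimal sketch (assuming evented I/O is enabled and a `File.pwrite(bytes, offset)` method with this shape; the helper names are hypothetical):

```zig
const std = @import("std");

// pwrite takes the offset explicitly and leaves the fd's cursor
// untouched, so concurrent writers never race on a shared position.
fn writeRecord(file: std.fs.File, offset: u64, record: []const u8) !void {
    _ = try file.pwrite(record, offset);
}

fn writeBoth(file: std.fs.File) !void {
    // Both writes can be in flight at the same time; each targets
    // its own region of the backing file.
    var a = async writeRecord(file, 0, "record A");
    var b = async writeRecord(file, 512, "record B");
    try await a;
    try await b;
}
```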
> one list for frames waiting for the file descriptor to become readable, one list for writable, and one list for readable or writable.
Relating to the last question, what would be some example use cases where an operation is waiting for readable or writable, but not only readable or only writable?
@andrewrk thanks, I was focusing only on sockets and forgot about seekable streams.
@kprotty good point; ultimately, I'm not sure it's a thing. This is what I know (and how I got there): `os.connect` would call `loop.waitUntilReadableOrWritable`, which did not work with kevent (#5286). `connect` returns `EINPROGRESS` and you need to somehow poll the socket again to know whether the connection was established successfully or not.

In any case, since it was there originally, I wanted to make sure it would be possible to support the case if needed. If anybody who knows more can confirm it's not a thing, I'd be super happy to remove that list.
You've already put more thought into it than I originally did when writing `waitUntilReadableOrWritable`. I think trying to maintain these semantics is a red herring.
@kristoff-it AFAIK `connect()` and `accept()` emit writable and readable events respectively when ready on most POSIX systems, where `EINPROGRESS` is similar to `EAGAIN` for `connect()`.

Another thing that doesn't seem to be currently handled is errors that arrive after an event comes in. The current implementation re-performs the operation to catch the error, but this could trigger a SIGPIPE if trying to `write()` to a peer-closed socket on Linux without the `MSG_NOSIGNAL` flag, for example (as seen on stream). The correct way to handle these looks to be checking the socket's pending error with `getsockopt(SOL_SOCKET, SO_ERROR)`.
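A minimal sketch of that check (assuming `std.os.getsockoptError`, which wraps `getsockopt(SOL_SOCKET, SO_ERROR)`; the surrounding helper is hypothetical):

```zig
const std = @import("std");

// After epoll/kqueue reports the socket writable following a connect()
// that returned EINPROGRESS, read the pending error out of the socket
// instead of re-performing the operation (which could raise SIGPIPE on
// a peer-closed socket without MSG_NOSIGNAL).
fn onConnectWritable(sockfd: std.os.fd_t) !void {
    // Returns an error if getsockopt(SOL_SOCKET, SO_ERROR) reports a
    // nonzero pending error; otherwise the connection succeeded.
    try std.os.getsockoptError(sockfd);
}
```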
At the very least, I think we can say that multiple simultaneous awaiters on one file is a fairly specialty case, and that most Zig programs using evented I/O won't do this, or will do it only in highly specialized contexts. If there's any chance that making this work will cause a performance pessimization in the more common cases (which putting a linked list in `std.fs.File` would definitely do), it should be something the programmer opts into within a scope. Maybe something more like this:
```zig
// register the event with an object that contains the linked list
const file_multi_awaiter = FileWithMultiAwait(&underlying_file, event_loop_with_multi_await_support, allocator);

// do stuff that may suspend and await the event
var a = async writeAtPosition(&file_multi_awaiter, event_loop_with_multi_await_support, pos1, data1);
var b = async writeAtPosition(&file_multi_awaiter, event_loop_with_multi_await_support, pos2, data2);

// wait for it to finish
await a;
await b;

// deinit the multi-awaiter
file_multi_awaiter.deinit();
```
This form makes it clear that you can't relocate the `FileWithMultiAwait`, since a pointer is passed into the async functions. It also lets the user control when the multi-await event is active, to avoid extra queue events for files that won't be awaited. And it forces the user to choose the allocator that will be used for the linked list.
> the allocator that will be used for the linked list.
One of the beautiful things about linked lists is that they are "bring your own memory" data structures. Each node in a linked list can be allocated with a different allocator, or with no allocator, and operations on linked lists do not need a reference to an allocator. So no allocator would be needed here. The linked list node can be inside the `pwrite` function's frame.
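A minimal sketch of what that could look like (hypothetical names, not actual std code): the node is a local of the suspended function, so the frame itself is the backing memory.

```zig
// The node lives in the awaiting function's frame, so linking a frame
// into a file's wait list allocates nothing.
const Waiter = struct {
    frame: anyframe,
    next: ?*Waiter = null,
};

const WaitList = struct {
    head: ?*Waiter = null,

    fn push(self: *WaitList, node: *Waiter) void {
        node.next = self.head;
        self.head = node;
    }

    // Called by the event loop when the fd becomes actionable.
    fn resumeAll(self: *WaitList) void {
        while (self.head) |node| {
            self.head = node.next;
            resume node.frame;
        }
    }
};

fn waitUntilWritable(list: *WaitList) void {
    // `node` is backed by this frame, which stays alive for as long as
    // the frame is suspended, so no allocator is involved.
    var node = Waiter{ .frame = @frame() };
    suspend {
        list.push(&node);
    }
}
```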
Small status update: event loop integration was successfully moved from `std.os` to `std.net`/`std.fs.File`, with the exception of `sendfile`, which would need to be broken up into two parts (maybe even 3).
Other than `sendfile`, there doesn't seem to be any other problem arising from the change, but it might make sense to explicitly separate files from sockets, be it via two different structs, a union field in `std.fs.File`, or something else; we discussed this on stream with @SpexGuy and @kprotty.
This change is probably even more compelling if we want to set ourselves up for io_uring, since it breaks the assumption that evented I/O must always be implemented as a failed syscall + a retry notification in a loop.
After thinking about this for a while (and having had a short break from it because of other engagements), I'm back at it and I'd like to discuss some practical actions with anybody who's interested.

While I started with the idea of changing the internal architecture of the event loop to support a different way of interfacing with epoll/kqueue, I think this change should be postponed; instead I recommend proceeding with a different change first, as mentioned in my last comment.
To reiterate what I mean: I think we should first make sure to pull all integration with the event loop up from `os.zig` to `file.zig`/`net.zig`. I've already done so in my fork and, as I mentioned before, it's easy to do, with the exception of `sendfile`.
Additionally, I would recommend pulling the current "fail + subscribe + suspend + retry" logic inside `std.event.Loop`, so that an implementation of Loop that uses completion-based APIs (e.g. io_uring) can override it. In practical terms, when an event loop is available, `File` should directly call `Loop.instance.read()`, for example. Currently we call directly into the event loop only for the special file I/O case. What I'm proposing is basically the inverse: delegate to Loop by default, and call into `std.os` only when not in async mode or when blocking I/O is expressly requested (see the current status here: https://github.com/ziglang/zig/blob/master/lib/std/fs/file.zig#L312-L320).
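A minimal sketch of the proposed dispatch (a hypothetical shape; it assumes a `Loop.read` method, which is not what std exposes today):

```zig
const std = @import("std");

// Sketch of what std.fs.File.read could become: delegate to the Loop
// whenever one is available, fall back to the blocking syscall
// otherwise.
fn read(file: std.fs.File, buffer: []u8) !usize {
    if (std.io.is_async) {
        if (std.event.Loop.instance) |loop| {
            // A readiness-based Loop keeps the "fail + subscribe +
            // suspend + retry" dance internal; a completion-based one
            // (e.g. io_uring) submits the read directly instead.
            return loop.read(file.handle, buffer);
        }
    }
    return std.os.read(file.handle, buffer);
}
```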
I think at that point it will become much easier to experiment with different architectures and integrate with more system APIs.
@andrewrk thoughts? If the plan sounds fine I could proceed with the changes and file a PR, but I wouldn't mind some guidance on how to test that I didn't break I/O for other platforms, etc.
I agree on both points! Thanks for the overview, it helps to understand the context when looking at PRs.
As far as testing goes, it's about time to add CI testing for async I/O modes, but we don't have it set up yet. That would probably be worth investing in directly before or after these changes. Async I/O is still "experimental", so it's OK to break the master branch by accident. Personally I would move forward with your improvements and look into the additional testing infrastructure afterwards.
Current architecture
Currently `std.event.Loop` relies on the following general process to orchestrate async frames (to my understanding):

1. Async functions interact with the OS normally via `listen`/`accept`/`connect`/`read`/`write`. If the syscall response is `EAGAIN` (or equivalents), the function adds a oneshot event in epoll/kqueue/... for the specific action it needs (i.e. read or write). The event contains a pointer to the async frame itself. Once the event is added, the frame suspends. The expectation is that the event loop will resume it once the socket becomes actionable again. This is used to retry some syscalls (e.g. `read`, `write`) and to know when one has completed (e.g. `connect`). If a syscall completes without `EAGAIN`, no event is slotted in epoll/kqueue/..., and the async frame does not suspend. (A code sketch of this pattern follows the diagram note below.)

2. Event loop threads run a `while(true)` loop and block on the polling request. The request unblocks when an event slotted in (1) triggers. The thread then resumes the frame pointer present in the event just obtained and, once the frame is done executing (the top-level `return` in the frame "chain" is hit) or suspends again, the thread returns to its main `while(true)` polling loop.

Event diagram
This diagram shows a possible set of events slotted in epoll/kqueue/... during a program execution.
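For concreteness, a minimal sketch of the readiness-retry pattern from step (1). `Loop.waitUntilFdWritable` exists in today's std; the surrounding helper and its error handling are hypothetical:

```zig
const std = @import("std");

// Hypothetical helper: perform the syscall, and on EAGAIN slot a
// oneshot "writable" event holding @frame(), suspend, and retry once
// the event loop resumes us.
fn asyncWrite(loop: *std.event.Loop, fd: std.os.fd_t, bytes: []const u8) !usize {
    while (true) {
        const rc = std.os.system.write(fd, bytes.ptr, bytes.len);
        switch (std.os.errno(rc)) {
            0 => return rc,
            std.os.EAGAIN => loop.waitUntilFdWritable(fd),
            else => |err| return std.os.unexpectedErrno(err),
        }
    }
}
```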
Pros of this system:
Cons:

- kevent doesn't support a `writable OR readable` event when used in this fashion, which is what originally made me look into the event loop code (#5286).

Alternative architecture
An example of an alternative architecture is one where, when a new file descriptor is opened, we arm epoll/kqueue/... with a "multishot" event (i.e. with the oneshot flag disabled) that triggers every time the file descriptor goes from non-actionable to actionable (e.g. from non-readable to readable).
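As a Linux-only sketch (assuming the epoll wrappers and constants in `std.os`; the helper itself is hypothetical), the registration could look like:

```zig
const std = @import("std");
const os = std.os;

// Register one persistent, edge-triggered event per fd at open time,
// instead of re-arming a oneshot event after every EAGAIN. The event
// carries a pointer back to the application's std.fs.File.
fn register(epoll_fd: i32, fd: os.fd_t, file: *std.fs.File) !void {
    var ev = os.epoll_event{
        // No EPOLLONESHOT: the event stays armed and fires whenever
        // the fd goes from non-actionable to actionable.
        .events = os.EPOLLIN | os.EPOLLOUT | os.EPOLLET,
        .data = os.epoll_data{ .ptr = @ptrToInt(file) },
    };
    try os.epoll_ctl(epoll_fd, os.EPOLL_CTL_ADD, fd, &ev);
}
```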
At this point the question becomes: if we only slot one event per file descriptor, then what information should we include in the event struct, so that we can then resume work once the event comes back from epoll/kqueue/...?
One reasonable way is to have the event point to the corresponding `std.fs.File` struct in the application, and add a field to the struct so that it points to the head of a linked list of suspended frames. See the following diagram.

Event diagram
Things not shown in the above graph:

- Each event contains a pointer back to the corresponding `std.fs.File` struct.
- `std.fs.File` structs don't have a single list of suspended frames, but rather 3: one list for frames waiting for the file descriptor to become readable, one list for writable, and one list for readable or writable. The event loop inspects which condition caused the event to fire and resumes frames from the corresponding list(s).

Exploration work
A couple of days ago Andrew came on stream to briefly explain some details of the current implementation and provide some advice on how to proceed. See this VOD; he shows up twice, once near the beginning and once near the end of the stream.
Yesterday I worked on the new implementation. Since it's my first time trying this kind of work, I just aimed to get the thing working in the most straightforward way possible, ignoring error conditions, platforms other than the one I'm developing on (macOS), and necessary cleanup.
The result is the first commit of this branch: https://github.com/kristoff-it/zig/tree/loop-b It's able to run some test code (see this gist: the zag version compiles against my branch, while the zig version works with current zig). It's also currently single-threaded.
I will report any progress & findings. For now I have two.
Findings
- Handing `std.fs.File` pointers to epoll/kqueue/... makes those structs non-copiable. This is an annoying and pervasive change.

Questions
One big question that I'm left with is the following: all these systems expect multiple frames to be concurrently waiting to operate on the same stream, but this is not something you would want in any reasonable use case I can think of, since it could mess up read/write order. For example, Python even throws an exception when this situation occurs (here you can see me demonstrate this case).
Can anybody give me an example of why this is considered an important situation to support?