kristoff-it opened this issue 4 years ago
> Can anybody give me an example of why this is considered an important situation to support?
Consider pread, preadv, pwrite, pwritev - these specify the position in the syscall, so they are safe to do concurrently. One could imagine writing a database application that writes multiple records to a backing file at the same time.
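For example, a minimal sketch (assuming evented I/O is enabled and a `File.pwrite(bytes, offset)` method with this shape; the helper names are hypothetical):

```zig
const std = @import("std");

// pwrite takes the offset explicitly and leaves the fd's cursor
// untouched, so concurrent writers never race on a shared position.
fn writeRecord(file: std.fs.File, offset: u64, record: []const u8) !void {
    _ = try file.pwrite(record, offset);
}

fn writeBoth(file: std.fs.File) !void {
    // Both writes can be in flight at the same time; each targets
    // its own region of the backing file.
    var a = async writeRecord(file, 0, "record A");
    var b = async writeRecord(file, 512, "record B");
    try await a;
    try await b;
}
```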
> one list for frames waiting for the file descriptor to become readable, one list for writable, and one list for readable or writable.
Relating to the last question, what would be some example use cases where an operation is waiting for readable or writable, but not only readable or only writable?
@andrewrk thanks, I was focusing only on sockets and forgot about seekable streams.
@kprotty good point; ultimately, I'm not sure it's a thing. This is what I know (and how I got there): `os.connect` would call `loop.waitUntilReadableOrWritable`, which did not work with kevent (#5286). `connect` returns `EINPROGRESS` and you need to somehow poll the socket again to know whether the connection was established successfully or not.

In any case, since it was there originally, I wanted to make sure it would be possible to support the case if needed. If anybody who knows more can confirm it's not a thing, I'd be super happy to remove that list.
You've already put more thought into it than I originally did when writing `waitUntilReadableOrWritable`. I think trying to maintain these semantics is a red herring.
@kristoff-it AFAIK `connect()` and `accept()` emit writable and readable events respectively when ready on most POSIX systems, where `EINPROGRESS` is similar to `EAGAIN` for `connect()`.

Another thing that doesn't seem to be currently handled is errors that arrive after an event comes in. The current implementation re-performs the operation to catch the error, but this could trigger a SIGPIPE if trying to `write()` to a peer-closed socket on Linux without the `MSG_NOSIGNAL` flag, for example (as seen on stream). The correct way to handle these looks to be checking the socket's pending error with `getsockopt(SOL_SOCKET, SO_ERROR)`.
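A minimal sketch of that check (assuming `std.os.getsockoptError`, which wraps `getsockopt(SOL_SOCKET, SO_ERROR)`; the surrounding helper is hypothetical):

```zig
const std = @import("std");

// After epoll/kqueue reports the socket writable following a connect()
// that returned EINPROGRESS, read the pending error out of the socket
// instead of re-performing the operation (which could raise SIGPIPE on
// a peer-closed socket without MSG_NOSIGNAL).
fn onConnectWritable(sockfd: std.os.fd_t) !void {
    // Returns an error if getsockopt(SOL_SOCKET, SO_ERROR) reports a
    // nonzero pending error; otherwise the connection succeeded.
    try std.os.getsockoptError(sockfd);
}
```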
At the very least, I think we can say that multiple simultaneous awaiters on one file is a fairly specialty case, and that most Zig programs using evented I/O won't do this, or will do it only in highly specialized contexts. If there's any chance that making this work will cause a performance pessimization in the more common cases (which putting a linked list in `std.fs.File` would definitely do), it should be something the programmer opts into within a scope. Maybe something more like this:
```zig
// register the event with an object that contains the linked list
const file_multi_awaiter = FileWithMultiAwait(&underlying_file, event_loop_with_multi_await_support, allocator);

// do stuff that may suspend and await the event
var a = async writeAtPosition(&file_multi_awaiter, event_loop_with_multi_await_support, pos1, data1);
var b = async writeAtPosition(&file_multi_awaiter, event_loop_with_multi_await_support, pos2, data2);

// wait for it to finish
await a;
await b;

// deinit the multi-awaiter
file_multi_awaiter.deinit();
```
This form makes it clear that you can't relocate the `FileWithMultiAwait`, since a pointer is passed into the async functions. It also lets the user control when the multi-await event is active, to avoid extra queue events for files that won't be awaited. And it forces the user to choose the allocator that will be used for the linked list.
> the allocator that will be used for the linked list.
One of the beautiful things about linked lists is that they are "bring your own memory" data structures. Each node in a linked list can be allocated with a different allocator, or with no allocator, and operations on linked lists do not need a reference to an allocator. So no allocator would be needed here. The linked list node can be inside the `pwrite` function's frame.
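A minimal sketch of what that could look like (hypothetical names, not actual std code): the node is a local of the suspended function, so the frame itself is the backing memory.

```zig
// The node lives in the awaiting function's frame, so linking a frame
// into a file's wait list allocates nothing.
const Waiter = struct {
    frame: anyframe,
    next: ?*Waiter = null,
};

const WaitList = struct {
    head: ?*Waiter = null,

    fn push(self: *WaitList, node: *Waiter) void {
        node.next = self.head;
        self.head = node;
    }

    // Called by the event loop when the fd becomes actionable.
    fn resumeAll(self: *WaitList) void {
        while (self.head) |node| {
            self.head = node.next;
            resume node.frame;
        }
    }
};

fn waitUntilWritable(list: *WaitList) void {
    // `node` is backed by this frame, which stays alive for as long as
    // the frame is suspended, so no allocator is involved.
    var node = Waiter{ .frame = @frame() };
    suspend {
        list.push(&node);
    }
}
```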
Small status update: event loop integration was successfully moved from `std.os` to `std.net`/`std.fs.File`, with the exception of `sendfile`, which would need to be broken up into two parts (maybe even 3).
Other than `sendfile`, there doesn't seem to be any other problem arising from the change, but it might make sense to explicitly separate files from sockets, be it via two different structs, a union field in `std.fs.File`, or something else; we discussed this on stream with @SpexGuy and @kprotty.
This change is probably even more compelling if we want to set ourselves up for io_uring, since it breaks the assumption that evented I/O must always be implemented as a failed syscall + a retry notification in a loop.
After thinking about this for a while (and having had a short break from it because of other engagements), I'm back at it and I'd like to discuss some practical actions with anybody who's interested.

While I started with the idea of changing the internal architecture of the event loop to support a different way of interfacing with epoll/kqueue, I think this change should be postponed; instead I recommend proceeding with a different change first, as mentioned in my last comment.
To reiterate what I mean: I think we should first make sure to pull all integration with the event loop up from `os.zig` to `file.zig`/`net.zig`. I've already done so in my fork and, as I mentioned before, it's easy to do, with the exception of `sendfile`.
Additionally, I would recommend pulling the current "fail + subscribe + suspend + retry" logic inside `std.event.Loop`, so that an implementation of Loop that uses completion-based APIs (e.g. io_uring) can override it. In practical terms, when an event loop is available, `File` should directly call `Loop.instance.read()`, for example. Currently we call directly into the event loop only for the special file I/O case. What I'm proposing is basically the inverse: delegate to Loop by default, and call into `std.os` only when not in async mode or when blocking I/O is expressly requested (see the current status here: https://github.com/ziglang/zig/blob/master/lib/std/fs/file.zig#L312-L320).
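A minimal sketch of the proposed dispatch (a hypothetical shape; it assumes a `Loop.read` method, which is not what std exposes today):

```zig
const std = @import("std");

// Sketch of what std.fs.File.read could become: delegate to the Loop
// whenever one is available, fall back to the blocking syscall
// otherwise.
fn read(file: std.fs.File, buffer: []u8) !usize {
    if (std.io.is_async) {
        if (std.event.Loop.instance) |loop| {
            // A readiness-based Loop keeps the "fail + subscribe +
            // suspend + retry" dance internal; a completion-based one
            // (e.g. io_uring) submits the read directly instead.
            return loop.read(file.handle, buffer);
        }
    }
    return std.os.read(file.handle, buffer);
}
```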
I think at that point it will become much easier to experiment with different architectures and integrate with more system APIs.
@andrewrk thoughts? If the plan sounds fine I could proceed with the changes and file a PR, but I wouldn't mind some guidance on how to test that I didn't break I/O for other platforms, etc.
I agree on both points! Thanks for the overview, it helps to understand the context when looking at PRs.
As far as testing goes, it's about time to add CI testing for async I/O modes, but we don't have it set up yet. That would probably be worth investing in directly before or after these changes. Async I/O is still "experimental", so it's OK to break the master branch by accident. Personally I would move forward with your improvements and look into the additional testing infrastructure afterwards.
Current architecture
Currently `std.event.Loop` relies on the following general process to orchestrate async frames (to my understanding):

1. Async functions interact with the OS normally via `listen`/`accept`/`connect`/`read`/`write`. If the syscall response is `EAGAIN` (or equivalents), the function adds a oneshot event in epoll/kqueue/... for the specific action it needs (i.e. read or write). The event contains a pointer to the async frame itself. Once the event is added, the frame suspends. The expectation is that the event loop will resume it once the socket becomes actionable again. This is used to retry some syscalls (e.g. `read`, `write`) and to know when one has completed (e.g. `connect`). If a syscall completes without `EAGAIN`, no event is slotted in epoll/kqueue/..., and the async frame does not suspend. (A code sketch of this pattern follows the diagram note below.)

2. Event loop threads run a `while(true)` loop and block on the polling request. The request unblocks when an event slotted in (1) triggers. The thread then resumes the frame pointer present in the event just obtained and, once the frame is done executing (the top-level `return` in the frame "chain" is hit) or suspends again, the thread returns to its main `while(true)` polling loop.

Event diagram
This diagram shows a possible set of events slotted in epoll/kqueue/... during a program execution.
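For concreteness, a minimal sketch of the readiness-retry pattern from step (1). `Loop.waitUntilFdWritable` exists in today's std; the surrounding helper and its error handling are hypothetical:

```zig
const std = @import("std");

// Hypothetical helper: perform the syscall, and on EAGAIN slot a
// oneshot "writable" event holding @frame(), suspend, and retry once
// the event loop resumes us.
fn asyncWrite(loop: *std.event.Loop, fd: std.os.fd_t, bytes: []const u8) !usize {
    while (true) {
        const rc = std.os.system.write(fd, bytes.ptr, bytes.len);
        switch (std.os.errno(rc)) {
            0 => return rc,
            std.os.EAGAIN => loop.waitUntilFdWritable(fd),
            else => |err| return std.os.unexpectedErrno(err),
        }
    }
}
```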
Pros of this system:
Cons:

- kevent doesn't support a `writable OR readable` event when used in this fashion, which is what originally made me look into the event loop code (#5286).

Alternative architecture
An example of an alternative architecture is one where, when a new file descriptor is opened, we arm epoll/kqueue/... with a "multishot" event (i.e. with the oneshot flag disabled) that triggers every time the file descriptor goes from non-actionable to actionable (e.g. from non-readable to readable).
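As a Linux-only sketch (assuming the epoll wrappers and constants in `std.os`; the helper itself is hypothetical), the registration could look like:

```zig
const std = @import("std");
const os = std.os;

// Register one persistent, edge-triggered event per fd at open time,
// instead of re-arming a oneshot event after every EAGAIN. The event
// carries a pointer back to the application's std.fs.File.
fn register(epoll_fd: i32, fd: os.fd_t, file: *std.fs.File) !void {
    var ev = os.epoll_event{
        // No EPOLLONESHOT: the event stays armed and fires whenever
        // the fd goes from non-actionable to actionable.
        .events = os.EPOLLIN | os.EPOLLOUT | os.EPOLLET,
        .data = os.epoll_data{ .ptr = @ptrToInt(file) },
    };
    try os.epoll_ctl(epoll_fd, os.EPOLL_CTL_ADD, fd, &ev);
}
```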
At this point the question becomes: if we only slot one event per file descriptor, then what information should we include in the event struct, so that we can then resume work once the event comes back from epoll/kqueue/...?
One reasonable way is to have the event point to the corresponding `std.fs.File` struct in the application, and add a field to the struct so that it points to the head of a linked list of suspended frames. See the following diagram.

Event diagram
Things not shown in the above graph:

- Each event contains a pointer back to the corresponding `std.fs.File` struct.
- `std.fs.File` structs don't have a single list of suspended frames, but rather 3: one list for frames waiting for the file descriptor to become readable, one list for writable, and one list for readable or writable. The event loop inspects which condition caused the event to fire and resumes frames from the corresponding list(s).

Exploration work
A couple of days ago Andrew came on stream to briefly explain some details of the current implementation and provide some advice on how to proceed. See this VOD; he shows up twice, once near the beginning and once near the end of the stream.
Yesterday I worked on the new implementation. Since it's my first time trying this kind of work, I just aimed to get the thing working in the most straightforward way possible, ignoring error conditions, platforms other than the one I'm developing on (macOS), and necessary cleanup.
The result is the first commit of this branch: https://github.com/kristoff-it/zig/tree/loop-b It's able to run some test code (see this gist: the zag version compiles against my branch, while the zig version works with current zig). It's also currently single-threaded.
I will report any progress & findings. For now I have two.
Findings
- Handing `std.fs.File` pointers to epoll/kqueue/... makes those structs non-copiable. This is an annoying and pervasive change.

Questions
One big question that I'm left with is the following: all these systems expect multiple frames to be concurrently waiting to operate on the same stream, but this is not something you would want in any reasonable use case I can think of, since it could mess up read/write order. For example, Python even throws an exception when this situation occurs (here you can see me demonstrate this case).
Can anybody give me an example of why this is considered an important situation to support?