andrewrk opened 4 years ago
I really don't like the idea of an implicit suspend point. It smells an awful lot like hidden control flow. Perhaps we should require async functions to retrieve their result location explicitly? Wait, no, then that's function colouring. Hmm.
(Also, is there a specific reason that the keyword can't just be `cancel`? `cancelawait` is a bit unwieldy.)
> I really don't like the idea of an implicit suspend point.
I should clarify: there is already a suspend point at a `return` statement in an async function. Also, the fact that it is at `return` makes it explicit, I suppose. Anyway, the point is that this doesn't add a suspend point; it moves it a little bit earlier, so that the return expression will have the await result pointer before being evaluated, and so that the defers have not been executed yet.
Sounds like a nice proposal, with moving execution into await instead of async for non-suspend functions being quite the change. I was left with a few questions after reading it over:
What would cancellation look like for normal calls to async functions (e.g. `_ = someAsyncFn()`)? Does it introduce an implicit `try`, do `catch unreachable`, etc.?
Also, is execution deferred until `await` instead of `async` for functions that don't suspend, based on compile-time analysis, or is this change a global property? If the latter, does this mean that `async` no longer runs until the first suspend point? That sounds like it would remove the ability to start async functions concurrently, which is why I feel like I'm misunderstanding it here.
> What and how would cancellation look like for normal calls to async functions?
No cancellation is possible for these. The result location and the awaiter's resume handle are both available from the very beginning of the call. When it gets to `return`, no suspend occurs; it writes the return value to the result location, runs the (non-error) defers, and then tail-resumes the awaiter.
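To make the rule above concrete, here is a minimal sketch (identifiers like `makeThing`/`Thing` are illustrative, not from the proposal):

```zig
// A call without the `async` keyword behaves like `await async f()`:
// the result location and the awaiter's resume handle are known from
// the very start of the call.
fn caller() Thing {
    // `thing` itself is the result location; an async `makeThing` writes
    // its return value straight into it at `return` -- no extra suspend,
    // no intermediate copy.
    const thing = makeThing(); // equivalent to `await async makeThing()`
    return thing;
}
```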
> Also, is execution deferred until `await` instead of `async` for functions that don't suspend based on compile time analysis
At the end of the compilation process, every function is assigned a calling convention. Async functions have an async calling convention. So the compiler does have to "color" functions internally for code generation purposes. So it's based on compile time analysis. (That's status quo already)
For the last part: as I understand it now, doing `var frame = async someAsyncFn()` runs someAsyncFn() up until its suspend point, if any. If the result location is already available at the beginning of the async fn, does that mean that the execution of someAsyncFn() now begins at its `frame`'s `await` point (since that is where the result location is specified)?
The reason I find this significant is because, if that is true, then it changes the current assumptions on what the `async` keyword currently does. It would now just "set up the frame", instead of "set up the frame and run until first suspend".
If the "run" step is now only possible at `await`, what does this mean for trying to run other code while an async function is suspended? Originally, after the call to `async someAsyncFn()`, the async fn would then be running concurrently. Now that only `await` can start running the async fn, there no longer seems to be a way to express concurrency, given that `await` effectively serializes the async procedure.
First, I want to note that result location semantics can already be (and may already be) supported for calls to async functions that do not use the `async` keyword. This gives us the rule: "the `async` keyword does not support result location semantics". Any call that does not use the `async` keyword can retain result location semantics, which means that two-coloring is not a problem. I think this rule is fine. It's simple, easy to explain and understand, and easy to see in code. I also think that passing the result location into `@asyncCall` is a decent solution for cases where the async keyword and result locations are both required. If you're returning a large value from an async function, something is going to be slow. Our choice is whether to make that slowness obvious (copying the value a couple times) or hidden (performing indirect jumps to do computation at the await site, which involves writing large amounts of memory).
That said, I see two fatal inconsistencies between blocking and async functions with this proposal. I think they are much more subtle and hard to catch than problems with result location semantics, so IMO it would be better for the language not to support result location semantics for async calls than to take on these new problems. These two examples are related but subtly different. Fixing one will not fix the other.
**`cancelawait` with side effects**

If a function has side effects, this definition of `cancelawait` behaves very differently for async functions vs blocking functions. With async functions, the side effects will have happened by the time the `cancelawait` completes. But with this definition, for blocking functions, they will not have. This is especially problematic if the side effect is to free memory, as in this example:
```zig
// x is consumed by b.
fn b(x: *Thing) void {
    defer Thing.free(x);
    // do other stuff
}

fn a() void {
    var x: *Thing = Thing.alloc();
    // ownership of x is passed to b, it will clean up
    var frame = async b(x);
    // if b is async, this will clean up x.
    // if b is blocking, this will not clean up x.
    errdefer cancelawait frame;
    // ...
    await frame;
}
```
This proposal can cause undesirable behavior when nested. Consider this async function:
```zig
pub fn fetchUrl(allocator: *Allocator, url: []const u8) callconv(.Async) !FetchResult {
    const urlInfo = nosuspend parseUrl(url);
    return try fetchUrlInternal(allocator, urlInfo);
}
```
Assume for a moment that `fetchUrlInternal` is blocking. According to the semantics above, it cannot run until the function is `await`ed, because if the function is `cancelawait`ed its side effects will not happen. For consistency, this rule should also hold for `async` functions.
But that means that when `fetchUrlInternal` is async, the meat of this function cannot begin executing until the `await` happens. This means that if a user spawns 5 frames and then awaits each of them, each will not begin fetching its url until the previous one has completely finished. Essentially the async code has been "linearized", forced to run in order by this constraint.
The alternative is to allow async function calls in the return expression to begin executing asynchronously, and have the `await` or `cancelawait` in the parent be passed on to the child. But this causes a significant semantic difference between blocking and async functions, because side effects in async functions will execute but side effects in blocking functions will not.
The proposal addresses this a bit:
> A function can introduce an intentional copy of the result data, if it wishes to run the logic in the return expression before an await result pointer is available.
But this is an extremely subtle difference in code for something so dramatically different in execution. I don't think this is a good idea.
It's not explicitly stated in the proposal, but `cancelawait` must be allowed on completed async functions. Otherwise the example given is buggy:
```zig
fn asyncAwaitTypicalUsage(allocator: *Allocator) !void {
    var download_frame = async fetchUrl(allocator, "https://example.com/");
    errdefer cancelawait download_frame;
    var file_frame = async readFile(allocator, "something.txt");
    errdefer cancelawait file_frame;

    // if this returns error, download_frame is awaited twice
    const download_text = try await download_frame;
    defer allocator.free(download_text);

    // if this returns error, download_frame and file_frame are both awaited twice
    const file_text = try await file_frame;
    defer allocator.free(file_text);
}
```
Fixing this is actually the only useful thing `cancelawait` does in this example. The calling code already needs to know how to clean up the return value, so that knowledge is not abstracted. And the returned values are slices, which are trivially fast to copy. In fact, this form incurs a significant new performance problem, because the processor now needs to make an indirect jump into the return stubs of `fetchUrl` and `readFile`, which contain the code to copy the slice into the result location, instead of just copying 16 bytes out of the frame. In theory a sufficiently smart compiler could recognize that the stub is known in this case and inline it, but this is more work that has to happen at every async function in the program, and could have a negative impact on build times and debug performance.
I think this use is important, but it can be accomplished more directly. Here's my counterproposal:
Keep `cancelawait`, but don't have it run defers or errdefers. For a function that returns `T`, `cancelawait` returns `?T`. If the function has already been `await`ed or `cancelawait`ed, `cancelawait` returns null. Otherwise it returns the return value.
This would allow the above example to be written as follows:
```zig
fn asyncAwaitTypicalUsage(allocator: *Allocator) !void {
    var download_frame = async fetchUrl(allocator, "https://example.com/");
    errdefer if (cancelawait download_frame) |text| allocator.free(text);
    var file_frame = async readFile(allocator, "something.txt");
    errdefer if (cancelawait file_frame) |text| allocator.free(text);

    const download_text = try await download_frame;
    defer allocator.free(download_text);
    const file_text = try await file_frame;
    defer allocator.free(file_text);
}
```
This is still much less efficient than avoiding `defer`/`errdefer`/`try` and putting the cleanup code at each return statement, because there are now atomic checks that must be made to implement `cancelawait`. And the optimizer will never be able to get to that level of efficiency, because it can't prove that `download_frame` will not trigger something that causes `file_frame` to be awaited elsewhere and then return an error. But at least the code is a bit cleaner than including bools alongside each frame to prevent double-awaits.
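For comparison, the bool-per-frame workaround alluded to above might look something like this (a sketch only; the error-handling around the discarded `await` results is simplified):

```zig
fn asyncAwaitWithBools(allocator: *Allocator) !void {
    var download_frame = async fetchUrl(allocator, "https://example.com/");
    var download_done = false;
    // On error, await the frame ourselves -- but only if the happy path
    // hasn't already done so, preventing a double-await.
    errdefer if (!download_done) {
        if (await download_frame) |text| allocator.free(text) else |_| {}
    };

    var file_frame = async readFile(allocator, "something.txt");
    var file_done = false;
    errdefer if (!file_done) {
        if (await file_frame) |text| allocator.free(text) else |_| {}
    };

    download_done = true;
    const download_text = try await download_frame;
    defer allocator.free(download_text);

    file_done = true;
    const file_text = try await file_frame;
    defer allocator.free(file_text);
}
```

Note the flags are set *before* each `try await`, since `try` consumes the frame even on the error path.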
> For the last part, as I understand it now, doing `var frame = async someAsyncFn()` runs someAsyncFn() up until its suspend point if any. If the result location is already available at the beginning of the async fn, does that mean that the execution of someAsyncFn() now begins at its `frame`'s `await` point (since that is where the result location is specified)?
In the example `var frame = async someAsyncFn()`, the result location for `return` is not available yet -- not until `await` happens. However, it would still set up the frame and run until the first suspend, just like status quo. Here's an example that highlights the difference between status quo and this proposal:
```zig
fn main() void {
    seq('a');
    var frame1 = async foo();
    seq('c');
    var frame2 = async bar();
    seq('e');
    const x = await frame1;
    seq('k');
    const y = await frame2;
    seq('m');
}

fn foo() i32 {
    defer seq('j');
    seq('b');
    operationThatSuspends();
    seq('f');
    return util();
}

fn util() i32 {
    seq('g');
    operationThatSuspends();
    seq('i');
    return 1234;
}

fn bar() i32 {
    defer seq('l');
    seq('d');
    operationThatSuspends();
    seq('h');
    return 1234;
}
```
> it would still setup the frame and run until first suspend, just like status quo
Ah ok, I think that was where my misunderstanding was. My last point of confusion was related to how non-suspending async fns are handled:
> If we move the function call of non-suspending functions used with async/await to happen at the await site instead of the async site
Is this change in semantics something applied by compile-time analysis, or through some other observation? If it's compile-time defined, what happens to the result values of `async f()`-started functions, before they're await'ed, which conditionally suspend at runtime? Running until suspend at `async` would discard the result value, as there's not yet a provided result location. Running at `await` would serialize the async function, as explained earlier.
Instead of a new keyword, why couldn't a frame just have a `cancel` function?

```zig
errdefer download_task.cancel();
```
@frmdstryr Nice idea. Would it make sense to extend this to other frame functionality? `suspend` probably wouldn't be feasible as a frame method, since it needs to support block execution.

```zig
download_task.resume();
download_task.await();
```

EDIT: removed `async`, since it's a calling convention and is invoked on the function rather than the frame.
These aren't methods though -- they're built-in functionality. Writing them as methods is misleading, and breaks the principle of all control flow as keywords.
@EleanorNB All control flow isn't currently keywords, as function calls themselves are a form of control flow and can have control flow inside them as well. If I understand correctly, `resume` currently updates some atomic state and tail-calls into the frame's func, while `await` updates some atomic state and possibly `suspend`s. Given that both don't require source-level control like async/suspend do, them being methods instead of keywords seems pretty fitting. One example of this is Rust, where `await` is a field-property keyword of Futures/Frames and the `resume` equivalent is a `poll()` method on the Future/Frame as well.
Thought: if `cancel` runs `errdefer`s, and `errdefer`s can capture values, then `cancel` will also need to take an error to propagate up the function. How would we specify that? We could just do `cancel frame, error.Something`, but there's no precedent in the language for bare comma-separated lists... we could make `cancel` a builtin rather than a keyword, but that breaks symmetry with the rest of the async machinery... hmm.
Another option to maybe consider: `suspend` could now return an `error.Cancelled`; `cancel frame` would then resume the frame while making the suspend return that error. One would handle, and possibly return, that error after noticing a cancelled suspend, which would then bubble up the normal expected route of running `errdefer`s and such.
No good -- not all suspend points are marked with `suspend`. Then we have to mark every direct async function call and `return` statement with an error, or return an error union from every async function -- that's function colouring, all over again.
I was under the assumption that there are only two ways to introduce a suspend point: `suspend` and `await`.
The former could return the error as noted earlier, and to mimic current semantics one would ignore the error: `suspend { ... } catch unreachable`. This effectively means that the frame cannot handle cancellation at that suspension point.
The latter AFAICT has two choices: have `await` return an error union with the frame's return type (along with `nosuspend` catching another error). You could also ignore the error here via `catch unreachable` in order to keep current semantics. In both cases, the marking is at the suspension point rather than at the `return` or `async` invocation.
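A sketch of what the `suspend`-returns-an-error idea might look like in source (hypothetical semantics -- `suspend` yielding `error{Cancelled}!void` is not status quo, and `event_loop.register` is an illustrative API, not a real one):

```zig
fn pollable() !u32 {
    // Under this idea, `cancel frame` while we're parked here resumes us
    // with error.Cancelled, letting errdefers run via the normal error path.
    suspend {
        event_loop.register(@frame()); // hypothetical registration call
    } catch |err| switch (err) {
        error.Cancelled => return err, // bubble up through errdefers
    };
    return 42;
}
```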
A blocking async function call is an implicit `await`, so it also counts as a suspend point. For example:
```zig
fn foo() u32 {
    var x: u32 = 4;
    callThatMaySuspend(); // x must be saved to the frame; this call is a suspend point
    // equivalent to `await async callThatMaySuspend();`
    return x;
}
```
For cancellation to work, any function that may suspend or await (and supports cancellation) needs to return an error union which includes cancelled. This is the "function colouring, all over again" that Eleanor is describing.
Hm, forgot about compiler-inserted awaits. The first bullet point sounds like the way to go there (the compiler adding `catch {}` to the inserted await's suspend point), which makes `await` ignore cancellations.
At first glance, this makes sense, as code which expects a result (e.g. using await) isn't written in a way to handle cancellation. You would then only be able to meaningfully cancel frames which are at suspends that explicitly support/handle cancellation (e.g. suspended in an async socket/channel which has more `suspend` control), while `cancel frame` on those that don't simply has no effect. Is there a hole I'm missing here, though?
Implicit `catch {}` or `catch unreachable` is a horrible idea. Explicit `catch` is not much better.
Since we want to localise any explicitly `async` behaviour to the callsite, I do believe it's `cancel` that has to specify the error. Since we don't actually use the returned error, I think it's OK not to include it in the function signature.
In line with #5277, this should be consistent if we only allow `cancel` on awaitable handles (`anyframe->T`, `*@Frame(...)`).
@EleanorNB why would implicit `catch {}` be a bad idea? I feel like running defers/errdefers on cancellation without any explicit returns or scope ending sounds much more error-prone.
Discarding all errors from an operation, only if the enclosing function happens to be async, which is nowhere explicitly marked? No thank you.
In my eyes, the `cancel` keyword is the explication of scope end. Yes, it's at the caller, which is unfortunate -- however, cancellation is literally an externally-mandated exit; this is the price we pay for having it at all.
@EleanorNB

> Discarding all errors from an operation
I think there was a miscommunication on my part. The await would return something like `error{Cancelled}!ReturnType` instead, where ReturnType could be whatever -- `error{Overflow}!T` for example (making it `error{Cancelled}!(error{Overflow}!T)`). I'm not actually sure if you can nest error sets like that, but that was what I was implying. Given that, the `catch {}` would only apply to the cancel error, meaning it would ignore a `cancel frame` request and act similarly to a `nosuspend` on resume (by asserting that there is a runtime value from the awaited frame and that it wasn't cancelled).
> In my eyes, the cancel keyword is the explication of scope end.
I was under the assumption that `cancel frame` would run the defers inside the frame rather than inside the caller. If it did so for the caller, then that sounds like only the current frame can cancel itself, which sounds more limiting than what I imagined from the original proposal.
To my knowledge, nesting error sets is impossible. Even if it weren't, in my eyes it should be. That way lies madness.
`cancel` does run the defers inside the callee frame, not the caller frame, and I never proposed it should be otherwise. In the very next sentence I expressed my disappointment that it had to be separated from the scope in which it had an effect. However, this is the price we pay for cancelable functions -- cancellation needs to be possible at any suspend point for consistency, and while it may be possible (but very cumbersome) to mark every `suspend` and `await`, marking every point is flatly impossible. Thus, any function call may be an implicit exit point, and the programmer must be prepared for that. It's at least bearable, since every function call looks like one, but it's unfortunate.
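A small illustration of the "implicit exit point" concern, assuming the hypothetical cancellation-runs-errdefers semantics under discussion (`mayBlock` is illustrative):

```zig
fn worker(allocator: *Allocator) !void {
    const buf = try allocator.alloc(u8, 1024);
    errdefer allocator.free(buf);
    // A plain call -- but if `mayBlock` is async, this is an implicit await
    // and therefore a suspend point. Cancelling `worker`'s frame while it is
    // parked here exits the function right here, running the errdefer above.
    mayBlock();
    // ... use buf ...
}
```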
> To my knowledge, nesting error sets is impossible
The "is cancelled" state could then be switched to a bit in the frame state, instead of an error set provided at await. `await` itself can simply not be cancellable (panicking when it observes the bit to be set).
> cancellation needs to be possible at any suspend point for consistency
The issue with this is: how would it behave for operations that aren't cancellable, or that wish to perform asynchronous cancellation? Detecting a cancel at the suspend point gives those operations a chance to see and reject a cancellation request (if they cannot support it). A good example of this is completion-based IO via io_uring, where some IO operations on certain file descriptors just cannot be cancelled even when you send an `IORING_OP_ASYNC_CANCEL`, so you either have to block or heap-allocate.
> marking every point is flatly impossible
Only `suspend`s would need to be marked here, not `await`s. Under that scope, that's pretty reasonable, considering suspends happen internally for data structures which talk with frames directly and can be abstracted upon.
> any function call may be an implicit exit point
I actually think this, in a weird way, reintroduces colored functions, as defers could be executed at different times depending on whether the function is synchronous or async.
> all errors from an operation

Does Zig have something akin to AggregateError?
> `await` itself can simply not be cancellable

Then whether a frame is cancelable or not depends on its current suspend point, which is otherwise completely invisible and unpredictable to the caller. What you get then is people saying that, for safety, you should never try to cancel a frame. That's a C problem; Zig is better than that.
> how would it behave for operations that aren't cancellable

This would typically be known by the programmer, so we would trust them not to attempt this. In such functions, the `errdefer`s should clean up the state anyway, and if that has to involve blocking, then so be it. (That might mean `cancel` could itself be a suspend point, but I don't think this is necessarily a problem -- we have `nosuspend`, after all.)
> a chance to see and reject a cancellation request

It's not a request. We don't ask nicely. When we say `cancel`, we mean cancel, not "if you'd be so kind as to cancel".
> Only `suspend`s would need to be marked here

An `await` or blocking call is still a suspend point. Under your model, if we cancel an awaiting frame, the `defer`s in the awaited frame would run, but not in the cancelled frame. (Unless we have some idea of an error set reserved for cancellation that does not function as an ordinary error set -- because, if a blocking call is not to a coroutine, then your nested error set idea reduces to a single error set, and there's no way to distinguish that from an ordinary returned error.)
> defers could be executed at different times

The semantics of `defer` don't change -- any exit point runs the `defer`s above it, sync or async.
> I actually think this, in a weird way, reintroduces colored functions

There is always going to be some semantic difference between synchronous and asynchronous code. That's the whole point. However, the programmer's model doesn't change, and no code needs to be rewritten -- we're still colourblind. Under your proposal, colouring would be a lot worse: asynchronous calls would have to have a special second error set, and synchronous calls couldn't have that, lest it be confused with an ordinary error set.
> What you get then is people saying that for safety, you should never try to cancel a frame.

I don't really follow. `resume` depends on the state of the frame (it is "invisible and unpredictable to the caller") and will panic if it's completed or being resumed by another thread (even in ReleaseFast, it seems). People aren't saying "you should never try to resume a frame". Almost all async keywords/operations, excluding `suspend`, imply that you are aware of the state of the frame without any explicit notion in code, so I think this type of cancellation is still valuable.
> In such functions, the errdefers should clean up the state anyway, and if that has to involve blocking, then so be it.

This has actually been a pain point in Rust futures as well. It requires implementing cancellation in the destructor of the Future/Frame, but that is only synchronous. People want asynchronous cancellation (e.g. `AsyncDrop`), but that wouldn't fit well into the ecosystem, so they resort to heap-allocating the async resources that cannot be synchronously cancelled in a non-blocking manner, so that they outlive the async context and can be cancelled in the future.
The latter option (not heap-allocating, i.e. blocking on cancellation) can actually be both an inefficiency and a logic error:

- ... `suspend`), then all worker threads could block waiting for the resource to complete without letting it, producing a deadlock.

> It's not a request. We don't ask nicely. When we say cancel, we mean cancel,

Again, not everything can be cancelled. So you end up introducing runtime overhead, as stated above, in order to accommodate a language semantic. It would be great if we don't end up like Rust in that regard, as it sacrifices customizability for simplicity without a way to opt out, since it's at the language level.
> Under your model, if we cancel an awaiting frame, the defers in the awaited frame would run, but not in the cancelled frame.

I think there has been another misunderstanding. My idea of cancellation doesn't include defers or how to run them any differently. It only introduces `cancel frame` and `suspend { .. } catch |err| { ... }`. Cancelling an awaiting frame would either cause a panic in the cancelling frame or in the awaiting frame.
The latter was what I was suggesting before. Here, `await` wouldn't introduce a magical new error to the return type. The cancellation state would be handled internally; await inserts an implicit suspend point when the frame result isn't ready. This internal one would just go from `suspend { ... }` to `suspend { ... } catch panic("await not cancellable")`.
The former is also an option (one I just thought of), which could be made more forgiving by having `cancel frame` return an error depending on whether it succeeded in cancelling the frame or not. This moves the decision of "is this a cancellation" from the suspend point to the effective resume point. I'm not too big a fan of this approach, as it tries to make Zig async/await more readiness-based instead of completion-based, which goes against its original model and introduces a mandatory synchronization overhead at resume points that, at the moment, could otherwise be removed in the future.
> The semantics of defer don't change -- any exit point runs the defers above it, sync or async.

The issue here is that suspends and normal function calls that aren't at the end of the scope, and don't use `try`, are now exit points. This makes using defer trickier, as it's no longer explicit where an exit point really is in sync vs. async. In async, your defer/errdefer could run earlier than it possibly ever could in sync, if a middle function suspended and was cancelled.
> Under your proposal, colouring would be a lot worse: asynchronous calls have to have a special second error set

Again, this is not the case. `await` would handle the cancelled error/state internally.
Without even looking at the called function, standard coding practice is enough to ensure exactly one suspension is paired with one resumption, and one invocation with one completion -- so, if the programmer has done their job well, they should not encounter language-enforced crashes. However, there is no way of inspecting the internal suspension state of a function, so the invoker can't know whether it's suspended directly or awaiting. Thus, any attempt at cancellation, no matter how careful the programmer, has a possibility of crashing the program. (Even worse, the common pattern of calling a function to register the frame with the event loop is guaranteed to crash.) Call me crazy, but if the programmer has done their due diligence, they shouldn't have to worry about language-enforced crashes.
As you've pointed out though, my model (actually Andrew's model as well in the relevant places) isn't perfect either -- cancellation would then itself be an asynchronous process, which means it would need its own frame, and that frame would itself need to be cancelable, and how the hell would that work? It seems to me that no implementation of cancellation can ever be guaranteed to succeed, which in my eyes contradicts point 11b of the Zen.
In light of this, @andrewrk, I don't believe that cancellation should be implemented at the language level. We may provide a cancel token implementation in the standard library (which is a much better and more flexible solution anyway), but async frames themselves must be awaited to complete. I do believe however that the proposed asynchronous RLS is a worthwhile idea.
We may implement one language-level feature to make userspace cancellation easier: rather than `anyframe`, a resumable handle could have type `anyframe<-T` -- that is, `suspend` has a value, and `resume` takes a value of that type to pass to the function, indicating a procession or cancellation:
```zig
// In the suspending function
const action = suspend {
    event_loop.registerContinuationAndCancellation(@frame(), continuation_condition, cancellation_condition);
};
switch (action) {
    .go => {},
    .stop => return error.functionXCancelled,
}

// In the event loop (some details missing)
if (frame.continuation and @atomicRmw(bool, &frame.suspended, .Xchg, false, .Weak)) {
    resume frame.ptr, .go;
    frame.* = null;
}
if (frame.cancellation and @atomicRmw(bool, &frame.suspended, .Xchg, false, .Weak)) {
    resume frame.ptr, .stop;
    frame.* = null;
}
```
Since `@frame()` may be called anywhere within the function, and the resumer needs to know the type before analysing the frame, the suspend type (`T` in `anyframe<-T`) must be part of the function's signature. I propose we reuse `while` loop continuation syntax:
```zig
const suspendingFunction = fn (arg: Arg) ReturnType : ContinuationType {
    // ...
};
```
Any function that uses the `suspend` keyword must have a suspend type. This is not function colouring, as any function with explicit `suspend` is necessarily asynchronous anyway (functions that only `await` cannot be keyword-`resume`d, so do not need a suspend type). The suspend type may be `void` or `error!void` (no error set inference), in which case the handle type is `anyframe<-void` or `anyframe<-error!void` (not `anyframe` -- we require strongly typed handles for type checking, which is one drawback), and `resume` does not necessarily take a second argument, as in status quo.
This not only permits flexible evented userspace cancellation, but also more specialised continuation conditions: a function waiting for multiple files to become available could receive a handle to the first one that does, and combined with a mechanism to check whether a frame has completed, #5263 could be implemented in userspace in the same manner.
At first blush, this may appear to be hostile to inlining async functions -- however, allowing that would already require semantic changes (#5277) that actually complement this quite nicely: `@frame()` would return `anyframe<-T` of the syntactically enclosing function's suspend type, regardless of the suspend type of the underlying frame, and there is now a strict delineation between resumable and awaitable handles.
This is, of course, a separate proposal -- I'll write up a proper one later.
What if async fns could return a user-defined `Future`, specified with the callconv, that holds a reference to the result, the frame, and any state? Then, if you can access the result location from within the async fn and have a `cancelawait` keyword as a second await location as described in the original post, cancellation should work at the user level.
```zig
pub fn Future(comptime Frame: type, comptime ReturnType: type) type {
    return struct {
        frame: Frame,
        state: enum { Running, Cancelled, Finished } = .Running,
        result: ?ReturnType = null,
    };
}

pub fn fetchUrl(allocator: *Allocator, url: []const u8) callconv(.Async = Future) ![]const u8 {
    // Do stuff
    while (@result().state != .Cancelled) {
        // Keep working
    }
    // Handle however you want; this can clean up your allocated resources
    if (@result().state == .Cancelled) return error.Cancelled;
    @result().state = .Finished;
}
```
Using `async fetchUrl` would then wrap the call in the given Future type, which can be used by both the caller and the callee to properly handle cancellation on both sides.
```zig
var download_future = async fetchUrl(allocator, "https://example.com/");
errdefer switch (download_future.state) {
    .Running => {
        download_future.state = .Cancelled; // Should use atomics
        cancelawait download_future.frame;
    },
    .Cancelled => {},
    .Finished => allocator.free(download_future.result.?),
};

var file_future = async readFile(allocator, "something.txt");
errdefer switch (file_future.state) {
    .Running => {
        file_future.state = .Cancelled; // Should use atomics
        cancelawait file_future.frame;
    },
    .Cancelled => {},
    .Finished => allocator.free(file_future.result.?),
};

const download_text = try await download_future.frame;
defer allocator.free(download_text);
const file_text = try await file_future.frame;
defer allocator.free(file_text);
```
I don't see how a cancel without being able to ignore it is a good idea. Some functions may need to be able to ignore the cancel request if something else fails (e.g. say a `doBankTransfer` and a `logRequest`; the bank transfer couldn't care less if the log fn fails).
Edit: I guess just adding a state flag to the existing frame would work too. Edit 2: Updated to handle case if async fn finished already
The main point of having a state flag that can be referenced from within the async function is so that it can handle cleaning up it's own resources which avoids the problem of "side effects".
@frmdstryr Adding a state flag to the frame would be reimplementing the state flags that are already inside the frame. Exposing the state to the user like this specifically means it can't do optimizations like `await async ...`, so `async` cannot return anything but a bare frame. This is a language-level feature -- we cannot complicate it with user-level implicit detail.
This also doesn't take into account multi-threaded access to the frame. The state load/check/store there would need to be a CAS, and being able to hide that from the user may allow the compiler to utilize more efficient atomic ops for interacting with the state like atomic swap.
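To make that concrete, a hedged sketch of the user-level compare-and-swap such an exposed state field would force (the `State` enum and the function are made up for illustration):

```zig
const State = enum(u8) { Running, Cancelled, Finished };

/// Attempt the Running -> Cancelled transition; if the frame already
/// finished, leave it alone and report failure to the caller.
fn tryCancel(state: *State) bool {
    return @cmpxchgStrong(State, state, .Running, .Cancelled, .SeqCst, .SeqCst) == null;
}
```

Keeping the state internal would let the compiler pick cheaper atomic operations (e.g. a plain swap) where the full compare-and-swap isn't needed.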
See point 11b of the Zen: "resource deallocation must succeed".
@EleanorNB It must succeed, but there's no requirement on when it does so or how it reports success. Arena allocators are a good example, as their `.free()`/`.destroy()` functions succeed even though they don't actually deallocate the resource. It assumes that the resource will be deallocated in the future in another manner (particularly the allocator's `deinit()`). `cancel` can succeed without internally deallocating the frame.
Ah, so scratch the idea of adding it to the frame itself.
I guess I'm just making more noise here... since this is roughly a worse version of https://github.com/ziglang/zig/issues/5263#issuecomment-624880004, except the Future/CancelToken is returned by using `async someFn()` instead of needing to create it and pass a ref.
somehow wound up thinking about this. I like @EleanorNB's suggestion about introducing a cancellation token scheme in stdlib, in part, because, that's what I did, with beam.yield.
Wanted to vouch for this idea of having a cancellation token scheme in the standard library over requiring new syntax and logic in place for canceling async frames.
I've been using cancellation tokens for canceling I/O and arbitrary tasks in my code using a `Context` (which is essentially a simplified version of Go's `context.Context` or folly's `CancellationToken`), and it has 1) made cancellation points and hierarchies clear, 2) made it obvious whether a function in my codebase has the possibility of suspending, and 3) allowed for easier debugging of what functions have been canceled/are bound to be canceled by isolating and keeping track of stack traces/debug information within a single `Context`.
Here are some links to some code I'm working on which contains and makes heavy use of a `Context` (cancellation token).
- A single-threaded `Context` (cancellation token) implementation: https://github.com/lithdew/rheia/blob/dde13020d069b6819a5ad8bd0980863009a17195/runtime.zig#L129-L163
- send(), recv(), read(), write(), accept(), connect(), timeout syscalls that are driven by io_uring, which take in a `Context` and are cancellable: https://github.com/lithdew/rheia/blob/dde13020d069b6819a5ad8bd0980863009a17195/runtime.zig#L356-L780
- A set of single-threaded synchronization primitives which take in a `Context` and thus are cancellable: https://github.com/lithdew/rheia/blob/dde13020d069b6819a5ad8bd0980863009a17195/sync.zig
- A cancellable worker loop function which takes in a `Context`, sleeps for N milliseconds, and then performs some CPU-bound work in an infinite loop: https://github.com/lithdew/rheia/blob/dde13020d069b6819a5ad8bd0980863009a17195/main.zig#L464-L482
- An async TCP client pool and TCP server with backpressure support which supports cancellation: https://github.com/lithdew/rheia/blob/dde13020d069b6819a5ad8bd0980863009a17195/net.zig
- A multi-threaded `Context` (cancellation token) implementation: https://gist.github.com/lithdew/2802fa5cb398ccca7d77a899a4b4441f
Isn't it adding user data to `@Frame`? It could be used for other stuff as well.
callee:
```zig
fn foo() void {
    suspend {}
    if (@frame().user_data.suspend) {}
}
```
caller:
```zig
var frame = async foo();
frame.user_data.suspend = true;
```
Counter argument:
I don't think there should be a way to cancel async functions. This shouldn't be a language feature. This is user-space stuff.
Rationale:
There are two main mental models for coroutines/CPS/async-await.
One: They are like threads, w/o using OS threads (e.g. cooperative multitasking). You can't cancel a thread from the outside. You shouldn't be able to cancel an async call from the outside.
Two: They are just "hiding" callbacks, and auto-magically creating your callback "context" for you. You can only cancel callbacks from the outside.
Since neither "model" has the concept of a generic way to cancel, neither should suspend/resume. Doing the correct cleanup code is so case-specific, this shouldn't be a language feature. Maybe sometimes you want to run the waiting code w/ a flag telling it to exit (e.g. most I/O); sometimes the event loop can just delete the frame and go on its way (most timer callbacks).
Anecdotally, all of the horrible nastiness in other languages' coroutine impls surrounds cancellation and error propagation when canceling. Just don't do it.
Also anecdotally, I've used coroutines on a number of large projects. The only times I've ever wanted to cancel one were when my code sucked, and I was too lazy to refactor it correctly.
It's also a solved problem. How did you "cancel" I/O when using threads for the past 20 years? Just do that.
Oh, that's brilliant! The reason why cancellation seems necessary is that there are two fundamental concurrent operations. Given two "futures" / concurrent operations a and b, you might want to run them concurrently and:

- `join` --- wait for both to complete
- `race` --- wait for the first one to complete

So:

```zig
const a = async update_db();
const b = async update_cache();
await @join(a, b); // Want to update _both_ db and cache
```

```zig
const a = async read_db();
const b = async read_cache();
await @race(a, b); // Wait for _one of_ db and cache, whichever is faster
```

But `race` is a special case of `join`! You can implement `race` in terms of `join` if the task that finishes first cancels the other.
So, the second example can be re-written roughly as

```zig
const ct: CancelationToken = .{};
const a = async read_db(&ct);
const b = async read_cache(&ct);
await @join(a, b);
```

where both functions cooperatively check the token.
I bet this scales to fully-general `select` as well.
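A hedged sketch of how that cooperation might look inside one of the tasks (the token API, `pollDbOnce`, and `Data` are hypothetical names for illustration):

```zig
fn read_db(ct: *CancelationToken) !Data {
    while (true) {
        // Cooperatively check the shared token at each suspend point.
        if (ct.isCancelled()) return error.Cancelled;
        if (try pollDbOnce()) |data| {
            // The first task to finish cancels the other,
            // which turns @join into race.
            ct.cancel();
            return data;
        }
    }
}
```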
I don't think there should be a way to cancel async functions. This shouldn't be a language feature. This is user-space stuff.
I believe we do need the ability to cancel async functions. There are many examples, for details: Timeouts and cancellation for humans.
There are two main mental models for coroutines/CPS/async-await.
One: They are like threads, w/o using OS threads (e.g. cooperative multitasking). You can't cancel a thread from the outside. You shouldn't be able to cancel an async call from the outside.
Coroutines aren't like threads. We definitely can "cancel" a process by sending a signal. Threads can't be killed from the outside because they share everything in a process and they are not cooperative. Even so, we can cancel threads if we make them "cooperative" somehow (e.g. each thread has a main loop which checks cancellation requests and handles them). The details can be wrapped by languages or libraries so that it looks like we are "cancelling" threads. There's no technical reason that we can't cancel coroutines.
Since neither "model" has the concept of a generic way to cancel, neither should suspend/resume. Doing the correct cleanup code is so case-specific, this shouldn't be a language feature. Maybe sometimes you want to run the waiting code w/ a flag telling it to exit (e.g. most I/O); sometimes the event loop can just delete the frame and go on its way (most timer callbacks).
That's not too difficult: a cancellation is just like a specific error. If the cleanup code works on regular errors, it works on cancellations.
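For instance, a sketch of that framing (here `error.Cancelled` and `fetchUrl` stand in for whatever error set and function are actually in play):

```zig
const text = fetchUrl(allocator, url) catch |err| switch (err) {
    // Cancellation takes the same cleanup path as any other error:
    // the defers and errdefers inside fetchUrl have already run by now.
    error.Cancelled => return,
    else => return err,
};
defer allocator.free(text);
```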
Anecdotally, all of the horrible nastiness in other languages' coroutine impls surrounds cancellation and error propagation when canceling. Just don't do it.
That's because there are few languages/libraries designed with structured concurrency, another reference: Notes on structured concurrency.
Oh, that's brilliant! The reason why cancellation seems necessary is that there are two fundamental concurrent operations. Given two "futures" / concurrent operations a and b, you might want to run them concurrently and: `join` --- wait for both to complete; `race` --- wait for the first one to complete. But `race` is a special case of `join`! You can implement `race` in terms of `join` if the task that finishes first cancels the other.
That's right! We call them task groups in structured concurrency; it's a kind of primitive for concurrency (except that if you can't pass a task group as an argument, it's not easy to spawn background tasks when needed).
I believe we do need the ability to cancel async functions. [...] There's no technical reason that we can't cancel coroutines.
The issue is that not all async functions are cancellable. Certain operations are atomic to the caller (or stateful) but still use asynchronous operations. This is the idea of cancellation safety.
For threads, you can send a signal to either request cancellation (e.g. SIGTERM) or force it regardless of the thread's decision (e.g. SIGKILL). Regarding semantics, the latter is most likely undesirable as you can't recover (or in Zig speak, "run defers"). Cancellation requests should then be the solution, IMO.
But since it's only a request, the thread has the opportunity to ignore it for various reasons (e.g. it's not cancel-safe). This means you must wait for the thread to complete regardless before you relinquish its resources. If not, you risk leaks (unstructured concurrency) or UAFs (structured concurrency).
Cancellation Tokens are a great solution here as they're 1) opt-in for tasks which are cancel-safe and 2) require joining the task anyways to account for those that aren't cancel-safe. That they're shared between tasks in @matklad's proposed API is a composability nicety (each task could as well just have their own Token and a separate construct shared between tasks could cancel each Token separately).
The issue is that not all async functions are cancellable
Here are two specific, simple examples which are useful as an intuition pump and a litmus test for any cancellation mechanism.
Example 1: an asynchronous task submits a read request to io_uring and then gets canceled. To actually cancel the task, what is needed is submitting another cancel request to io_uring (so, another syscall) and then waiting for it to complete. If you don't do this, then the read might still be executing in the kernel while your task is already "canceled", effectively writing to some now-deallocated memory.
Example 2: without anything exotic, an async task offloads some CPU-heavy work (like computing a checksum) to a thread pool. To cancel this task, we also must cancel the thread-pool job, but that doesn't have cancellation built in, as it is deep in some SIMD loop. So the async task just has to wait until the CPU side finishes. If you cancel only the async task and let the CPU part run its course, you are violating structured concurrency (and potentially memory safety, if the CPU part uses any resources owned by the async part).
That is, cancellation is only superficially similar to error handling: error handling is unilateral and synchronous. General cancellation is an asynchronous communication protocol: first, you request cancellation, then you wait for the job to actually get canceled.
A more useful framing is that cancellation is serendipitous success
In zig-aio I do cancelation by making async io functions and yield in coroutines return `error.Canceled`. This still won't prevent a person from catching the error and endlessly looping the coroutine, but it works pretty okay in practice. For blocking tasks through a thread pool, the programmer has to opt in to cancelation by taking a `CancellationToken` as the first argument and actively cancel the task if the token is marked canceled; otherwise the coroutine (and thus the caller who wishes to collect the result) has to wait until the blocking task is complete.
I've spent many hours in the past trying to solve this, and never quite tied up all the loose ends, but I think I've done it this time.
Related Proposals:
Problem 1: Error Handling & Resource Management
Typical async await usage when multiple async functions are "in-flight", written naively, looks like this:
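Roughly like this (a sketch of the naive pattern; the identifiers match those referenced in the surrounding discussion):

```zig
const download_frame = async fetchUrl(allocator, "https://example.com/");
const file_frame = async readFile(allocator, "something.txt");

const download_text = try await download_frame; // an error here abandons file_frame
defer allocator.free(download_text);

const file_text = try await file_frame;
defer allocator.free(file_text);
```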
Spot the problem? If the first `try` returns an error, the in-flight `file_frame` becomes invalid memory while the `readFile` function is still using the memory. This is nasty undefined behavior. It's too easy to do this on accident.
Problem 2: The Await Result Location
Function calls directly write their return values into the result locations. This is important for pinned memory, and will become more noticeable when result location semantics for the `return` statement (#2765) are implemented.
However this breaks when using `async` and `await`. It is possible to use the advanced builtin `@asyncCall` and pass a result location pointer to `async`, but there is no way to do it with `await`. The duality is messy, and a function that relies on pinning its return value will have its guarantees broken when it becomes an async function.
Solution
I've tried a bunch of other ideas before, but nothing could quite give us good enough semantics. But now I've got something that solves both problems. The key insight was making obtaining a result location pointer for the `return` statement of an `async` function implicitly a suspend point. This suspends the async function at the `return` statement, to be resumed by the `await` site, which will pass it a result location pointer. The crucial point here is that it also provides a suspension point that `cancelawait` can activate. If an async function is cancelled, then it resumes, but instead of returning a value, it runs the `errdefer` and `defer` expressions that are in scope. So - async functions will simply have to retain the property that idiomatic code already has, which is that all the cleanup that possibly needs to be done is in scope in a defer at a `return` statement.
I think this is the best of both worlds, between automatically running a function up to the first suspend point, and what e.g. Rust does, not running a function until `await` is called. A function can introduce an intentional copy of the result data, if it wishes to run the logic in the return expression before an `await` result pointer is available. It means async function frames can get smaller, because they no longer need the return value in the frame.
Now this leaves the problem of blocking functions which are used with `async`/`await`, and what `cancelawait` does to them. The proposal #782 is open for that purpose, but it has a lot of flaws. Again, here, the key insight of `await` working properly with result location pointers was the answer. If we move the function call of non-suspending functions used with async/await to happen at the await site instead of the async site, then `cancelawait` becomes a no-op. `async` will simply copy the parameters into the frame, and `await` would do the actual function call. Note that function parameters must be copied anyway for all function calls, so this comes at no penalty, and in fact should be better all around, because instead of "undoing" allocated resources we simply don't do the extra work in the first place.
Example code:
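A sketch of what that usage could look like under this proposal (`cancelawait` is the proposed keyword; the awaited-flag bookkeeping is illustrative, not prescribed):

```zig
fn amain(allocator: *Allocator) !void {
    var download_frame = async fetchUrl(allocator, "https://example.com/");
    var download_awaited = false;
    // If anything below fails before the await, cancel the in-flight call
    // so its defers run and its resources are released.
    errdefer if (!download_awaited) cancelawait download_frame;

    var file_frame = async readFile(allocator, "something.txt");
    var file_awaited = false;
    errdefer if (!file_awaited) cancelawait file_frame;

    download_awaited = true;
    const download_text = try await download_frame;
    defer allocator.free(download_text);

    file_awaited = true;
    const file_text = try await file_frame;
    defer allocator.free(file_text);
}
```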
Now, calling an async function looks like any resource allocation that needs to be cleaned up when returning an error. It works like `await` in that it is a suspend point; however, it discards the return value, and it atomically sets a flag in the function's frame which is observable from within.
Cancellation tokens, and propagating whether an async function has been cancelled, can I think be out of scope of this proposal. It's possible to build higher-level cancellation abstractions on top of this primitive. For example, https://github.com/ziglang/zig/issues/5263#issuecomment-624880004 could be improved with the availability of `cancelawait`. But more importantly, `cancelawait` makes it possible to casually use `async`/`await` on arbitrary functions in a maintainable and correct way.