cancelawait keyword to abort an async function call #5913

Open
andrewrk opened this issue Jul 23, 2020 · 44 comments
Labels
proposal This issue suggests modifications. If it also has the "accepted" label then it is planned.
@andrewrk
Member

I've spent many hours in the past trying to solve this, and never quite tied up all the loose ends, but I think I've done it this time.

Related Proposals:

Problem 1: Error Handling & Resource Management

Typical async await usage when multiple async functions are "in-flight", written naively, looks like this:

fn asyncAwaitTypicalUsage(allocator: *Allocator) !void {
    var download_frame = async fetchUrl(allocator, "https://example.com/");
    var file_frame = async readFile(allocator, "something.txt");

    const download_text = try await download_frame; // NO GOOD!!!
    defer allocator.free(download_text);

    const file_text = try await file_frame;
    defer allocator.free(file_text);
}

Spot the problem? If the first try returns an error, the frame of asyncAwaitTypicalUsage is torn down, so the in-flight file_frame becomes invalid memory while readFile is still using it. This is nasty undefined behavior, and it is too easy to do by accident.

Problem 2: The Await Result Location

Function calls directly write their return values into the result locations. This is important for pinned memory, and will become more noticeable when these are implemented:

However this breaks when using async and await. It is possible to use the advanced builtin @asyncCall and pass a result location pointer to async, but there is not a way to do it with await. The duality is messy, and a function that relies on pinning its return value will have its guarantees broken when it becomes an async function.

Solution

I've tried a bunch of other ideas before, but none of them gave us good enough semantics. Now I've got something that solves both problems. The key insight is to make obtaining the result location pointer for the return statement of an async function an implicit suspend point. This suspends the async function at the return statement, to be resumed by the await site, which passes it a result location pointer. Crucially, this also provides a suspension point where cancelawait can take effect. If an async function is cancelled, it resumes, but instead of returning a value, it runs the errdefer and defer expressions that are in scope. So async functions simply have to retain a property that idiomatic code already has: all the cleanup that could possibly be needed is in scope, in a defer, at each return statement.

I think this is the best of both worlds, between automatically running a function up to the first suspend point, and what e.g. Rust does, not running a function until await is called. A function can introduce an intentional copy of the result data, if it wishes to run the logic in the return expression before an await result pointer is available. It means async function frames can get smaller, because they no longer need the return value in the frame.

Now this leaves the problem of blocking functions which are used with async/await, and what cancelawait does to them. The proposal #782 is open for that purpose, but it has a lot of flaws. Again, here, the key insight of await working properly with result location pointers was the answer. If we move the function call of non-suspending functions used with async/await to happen at the await site instead of the async site, then cancelawait becomes a no-op. async will simply copy the parameters into the frame, and await would do the actual function call. Note that function parameters must be copied anyway for all function calls, so this comes at no penalty, and in fact should be better all around because we don't have "undoing" of allocated resources but we have simply not doing extra work in the first place.
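A rough model of that last idea, written in Python because the proposed keywords are not runnable in any compiler. For a function that never suspends, the async site only records the arguments, the await site performs the call, and cancelawait has nothing to undo. DeferredCall, await_ and cancelawait are hypothetical names used only for illustration.

```python
class DeferredCall:
    """Models `async f(args)` for a function that never suspends."""

    def __init__(self, fn, *args):
        # `async` site: only copy the parameters into the "frame".
        self.fn = fn
        self.args = args

    def await_(self):
        # `await` site: the call actually happens here.
        return self.fn(*self.args)

    def cancelawait(self):
        # Nothing has run yet, so there is nothing to undo.
        pass


calls = []

def read_file(name):
    calls.append(name)
    return "contents of " + name

frame = DeferredCall(read_file, "something.txt")
assert calls == []                   # no work happened at the `async` site
text = frame.await_()
assert calls == ["something.txt"]    # the call ran at the `await` site

cancelled = DeferredCall(read_file, "other.txt")
cancelled.cancelawait()
assert calls == ["something.txt"]    # the cancelled call never ran
```

The cost of this design is visible in the model: any side effect of the function is skipped entirely when the frame is cancelled.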

Example code:

fn asyncAwaitTypicalUsage(allocator: *Allocator) !void {
    var download_frame = async fetchUrl(allocator, "https://example.com/");
    errdefer cancelawait download_frame;

    var file_frame = async readFile(allocator, "something.txt");
    errdefer cancelawait file_frame;

    const download_text = try await download_frame;
    defer allocator.free(download_text);

    const file_text = try await file_frame;
    defer allocator.free(file_text);
}

Now, calling an async function looks like any resource allocation that needs to be cleaned up when returning an error. cancelawait works like await in that it is a suspend point; however, it discards the return value, and it atomically sets a flag in the function's frame which is observable from within the function.
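For readers who know other async runtimes, the pattern is loosely analogous to cancelling and then awaiting a task in Python's asyncio. The sketch below is an analogy under that assumption, not the proposed Zig semantics: the try/finally in read_file plays the role of a defer, and the except block in typical_usage plays the role of the errdefer cancelawait lines. All function names are stand-ins.

```python
import asyncio

events = []

async def fetch_url(url):
    raise OSError("connection refused")      # stand-in for a failed download

async def read_file(name):
    buf = "buffer for " + name               # stand-in for an allocation
    try:
        await asyncio.sleep(3600)            # still suspended when the caller gives up
        return buf
    finally:
        events.append("read_file cleanup")   # plays the role of a defer

async def typical_usage():
    download = asyncio.create_task(fetch_url("https://example.com/"))
    file = asyncio.create_task(read_file("something.txt"))
    try:
        download_text = await download
        file_text = await file
        return download_text, file_text
    except BaseException:
        # Plays the role of `errdefer cancelawait frame` for both frames.
        for task in (download, file):
            task.cancel()
        # Wait until the cancelled tasks have finished unwinding.
        await asyncio.gather(download, file, return_exceptions=True)
        raise

async def main():
    try:
        await typical_usage()
    except OSError:
        events.append("caller saw error")

asyncio.run(main())
assert events == ["read_file cleanup", "caller saw error"]
```

The cleanup in read_file runs before the error reaches the caller, so nothing in flight outlives the frame that owns it.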

I think cancellation tokens, and propagating whether an async function has been cancelled, can be out of scope for this proposal. It's possible to build higher level cancellation abstractions on top of this primitive. For example, #5263 (comment) could be improved with the availability of cancelawait. But more importantly, cancelawait makes it possible to casually use async/await on arbitrary functions in a maintainable and correct way.

@andrewrk added the proposal label Jul 23, 2020
@andrewrk added this to the 0.7.0 milestone Jul 23, 2020
@ghost

ghost commented Jul 23, 2020

I really don't like the idea of an implicit suspend point. It smells an awful lot like hidden control flow. Perhaps we should require async functions to retrieve their result location explicitly? Wait, no, then that's function colouring. Hmm.

(Also, is there a specific reason that the keyword can't just be cancel? cancelawait is a bit unwieldy.)

@andrewrk
Member Author

I really don't like the idea of an implicit suspend point.

I should clarify, there is already a suspend point at a return statement in an async function. Also the fact that it is at return makes it explicit I suppose. Anyway, the point is this doesn't add a suspend point, it moves it a little bit earlier so that the return expression will have the await result pointer before being evaluated, and so that the defers have not been executed yet.

@kprotty
Member

kprotty commented Jul 23, 2020

Sounds like a nice proposal, with moving execution into await instead of async for non-suspend functions being quite the change. Was left with a few questions after reading it over:

What and how would cancellation look like for normal calls to async functions? (e.g. _ = someAsyncFn()). Does it introduce an implicit try, do catch unreachable, etc.?

Also, is execution deferred until await instead of async for functions that don't suspend based on compile time analysis, or is this change a global property? If the latter, does this mean that async no longer runs until the first suspend point? That sounds like it would remove the ability to start async functions concurrently, which is why I feel like I'm misunderstanding it here.

@andrewrk
Member Author

What and how would cancellation look like for normal calls to async functions?

No cancellation possible for these. The result location and the awaiter resume handle are both available from the very beginning of the call. When it gets to return, no suspend occurs; it writes the return value to the result location, runs the (non-error) defers, and then tail-resumes the caller.

Also, is execution deferred until await instead of async for functions that don't suspend based on compile time analysis

At the end of the compilation process, every function is assigned a calling convention. Async functions have an async calling convention. So the compiler does have to "color" functions internally for code generation purposes. So it's based on compile time analysis. (That's status quo already)

@kprotty
Member

kprotty commented Jul 23, 2020

For the last part, as I understand it now, doing var frame = async someAsyncFn() runs someAsyncFn() up until its suspend point if any. If the result location is already available at the beginning of the async fn, does that mean that the execution of someAsyncFn() now begins at its frame's await point (since that's where the result location is specified)?

The reason I find this significant is because, if that is true, then it changes the current assumptions on what the async keyword currently does. It would now just "setup the frame", instead of "setup the frame and run until first suspend".
If the "run" step is now only possible at await, what does this mean for trying to run other code while an async function is suspended? Originally, after the call to async someAsyncFn(), the async fn would then be running concurrently. Now that only await can start running the async fn, there no longer seems to be a way to express concurrency given await effectively serializes the async procedure.

@SpexGuy
Contributor

SpexGuy commented Jul 23, 2020

First, I want to note that result location semantics can already be (and may already be) supported for calls to async functions that do not use the async keyword. This gives us the rule: "the async keyword does not support result location semantics". Any call that does not use the async keyword can retain result location semantics, which means that two-coloring is not a problem. I think this rule is fine. It's simple, easy to explain and understand, and easy to see in code. I also think that passing the result location into @asyncCall is a decent solution for cases where the async keyword and result locations are both required. If you're returning a large value from an async function, something is going to be slow. Our choice is whether to make that slowness obvious (copying the value a couple times) or hidden (performing indirect jumps to do computation at the await site which involves writing large amounts of memory).

That said, I see two fatal inconsistencies between blocking and async functions with this proposal. I think they are much more subtle and hard to catch than problems with result location semantics, so IMO it would be better for the language not to support result location semantics for async calls than to take on these new problems. These two examples are related but subtly different. Fixing one will not fix the other.

cancelawait with side effects

If a function has side effects, this definition of cancelawait behaves very differently for async functions vs blocking functions. With async functions, the side effects will have happened when the cancelawait completes. But with this definition for blocking functions, it will not have triggered. This is especially problematic if the side effect is to free memory, as in this example:

// x is consumed by b.
fn b(x: *Thing) void {
    defer Thing.free(x);
    // do other stuff
}

fn a() void {
    var x: *Thing = Thing.alloc();
    
    // ownership of x is passed to b, it will clean up
    var frame = async b(x);

    // if b is async, this will clean up x.
    // if b is blocking, this will not clean up x.
    errdefer cancelawait frame;
    
    // ...
    
    await frame;
}
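The discrepancy can be modelled in a few lines of Python. This is a sketch under stated assumptions: a blocking b is modelled as stored arguments that are never called when cancelled, and an async b is modelled as a generator that has already run to its first suspend point. The names b_blocking, b_async and freed are hypothetical.

```python
freed = []

def b_blocking(x):
    try:
        pass                    # do other stuff
    finally:
        freed.append(x)         # plays the role of `defer Thing.free(x)`

def b_async(x):
    try:
        yield                   # a suspend point somewhere in the body
    finally:
        freed.append(x)         # plays the role of `defer Thing.free(x)`

# Blocking b: the `async` site only stores the arguments,
# so cancelling the frame runs nothing at all.
frame = (b_blocking, ("x1",))
del frame                       # cancelawait is a no-op, b never ran
assert freed == []              # x1 is leaked

# Async b: the body already ran to its first suspend point,
# so cancelling resumes it and its cleanup runs.
frame = b_async("x2")
next(frame)                     # `async` site: run until the first suspend
frame.close()                   # cancelawait: unwind, running the finally block
assert freed == ["x2"]
```

The same call site frees the resource in one case and leaks it in the other, depending only on whether b happens to suspend.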

Return statements with side effects

This proposal can cause undesirable behavior when nested. Consider this async function:

pub fn fetchUrl(allocator: *Allocator, url: []const u8) callconv(.Async) !FetchResult {
    const urlInfo = nosuspend parseUrl(url);
    return try fetchUrlInternal(allocator, urlInfo);
}

Assume for a moment that fetchUrlInternal is blocking. According to the semantics above, it cannot run until the function is awaited, because if the function is cancelawaited its side effects will not happen. For consistency, this rule should also hold for async functions.

But that means that when fetchUrlInternal is async, the meat of this function cannot begin executing until the await happens. This means that if a user spawns 5 frames and then awaits each of them, each will not begin fetching its url until the previous one has completely finished. Essentially the async code has been "linearized", forced to run in order by this constraint.
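Python's asyncio has the same distinction between lazily started and eagerly started work, so it can illustrate the linearization. This is an analogy only: bare coroutine objects stand in for frames whose body cannot start until awaited, and tasks stand in for frames that start running at the async site. The sketch uses three frames rather than five, and fetch is a hypothetical stand-in.

```python
import asyncio

async def fetch(i, log):
    log.append(f"start {i}")
    await asyncio.sleep(0)      # stand-in for waiting on the network
    log.append(f"done {i}")

async def linearized(log):
    # Bodies do not start until awaited, so the fetches run one after another.
    frames = [fetch(i, log) for i in range(3)]
    for f in frames:
        await f

async def concurrent(log):
    # Bodies start as soon as the event loop gets control, so they overlap.
    frames = [asyncio.create_task(fetch(i, log)) for i in range(3)]
    for f in frames:
        await f

serial_log, overlap_log = [], []
asyncio.run(linearized(serial_log))
asyncio.run(concurrent(overlap_log))

assert serial_log == ["start 0", "done 0", "start 1", "done 1", "start 2", "done 2"]
assert overlap_log[:3] == ["start 0", "start 1", "start 2"]
```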

The alternative is to allow async function calls in the return expression to begin executing asynchronously, and have the await or cancelawait in the parent be passed on to the child. But this causes a significant semantic difference between blocking and async functions, because side effects in async functions will execute but side effects in blocking functions will not.

The proposal addresses this a bit:

A function can introduce an intentional copy of the result data, if it wishes to run the logic in the return expression before an await result pointer is available.

But this is an extremely subtle difference in code for something so dramatically different in execution. I don't think this is a good idea.


It's not explicitly stated in the proposal, but cancelawait must be allowed on completed async functions. Otherwise the example given is buggy:

fn asyncAwaitTypicalUsage(allocator: *Allocator) !void {
    var download_frame = async fetchUrl(allocator, "https://example.com/");
    errdefer cancelawait download_frame;

    var file_frame = async readFile(allocator, "something.txt");
    errdefer cancelawait file_frame;

    // if this returns error, download_frame is awaited twice
    const download_text = try await download_frame;
    defer allocator.free(download_text);
    
    // if this returns error, download_frame and file_frame are both awaited twice
    const file_text = try await file_frame;
    defer allocator.free(file_text);
}

Fixing this is actually the only useful thing cancelawait does in this example. The calling code already needs to know how to clean up the return value, so that knowledge is not abstracted. And the returned values are slices, which are trivially fast to copy. In fact, this form incurs a significant new performance problem, because the processor now needs to make an indirect jump into the return stubs of fetchUrl and readFile which contain the code to copy the slice into the result location, instead of just copying 16 bytes out of the frame. In theory a sufficiently smart compiler could recognize that the stub is known in this case and inline it, but this is more work that has to happen at every async function in the program, and could have a negative impact on build times and debug performance.

I think this use is important, but it can be accomplished more directly. Here's my counterproposal:

Keep cancelawait, but don't have it run defers or errdefers. For a function that returns T, cancelawait returns ?T. If the function has been awaited or cancelawaited, cancelawait returns null. Otherwise it returns the return value.

This would allow the above example to be written as follows:

fn asyncAwaitTypicalUsage(allocator: *Allocator) !void {
    var download_frame = async fetchUrl(allocator, "https://example.com/");
    errdefer if (cancelawait download_frame) |text| allocator.free(text);

    var file_frame = async readFile(allocator, "something.txt");
    errdefer if (cancelawait file_frame) |text| allocator.free(text);

    const download_text = try await download_frame;
    defer allocator.free(download_text);
    
    const file_text = try await file_frame;
    defer allocator.free(file_text);
}

This is still much less efficient than avoiding defer/errdefer/try and putting the cleanup code at each return statement, because there are now atomic checks that must be made to implement cancelawait. And the optimizer will never be able to get to that level of efficiency, because it can't prove that download_frame will not trigger something that will cause file_frame to be awaited elsewhere and then return an error. But at least the code is a bit cleaner than including bools alongside each frame to prevent double-awaits.
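A small Python model of the counterproposal's bookkeeping, assuming for simplicity that the frames have already completed and ignoring the atomics a real implementation would need. Frame, await_ and cancelawait are hypothetical names, and None stands in for Zig's null.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")

class Frame(Generic[T]):
    def __init__(self, compute: Callable[[], T]):
        self._result = compute()      # pretend the function already completed
        self._consumed = False

    def await_(self) -> T:
        assert not self._consumed, "frame awaited twice"
        self._consumed = True
        return self._result

    def cancelawait(self) -> Optional[T]:
        # Returns None if the result was already taken, else hands it over.
        if self._consumed:
            return None
        self._consumed = True
        return self._result


freed = []
download = Frame(lambda: "download text")
file = Frame(lambda: "file text")

text = download.await_()              # succeeded, the caller now owns this text
# Suppose an error occurs here, before `file` is awaited.
# The errdefers then run in reverse order of declaration:
for frame in (file, download):
    leftover = frame.cancelawait()
    if leftover is not None:
        freed.append(leftover)

assert freed == ["file text"]         # only the never-awaited result is freed here
```

The already-awaited download result is not freed by the errdefer, which matches the example above where a separate defer owns it.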

@andrewrk
Member Author

For the last part, as I understand it now, doing var frame = async someAsyncFn() runs someAsyncFn() up until its suspend point if any. If the result location is already available at the beginning of the async fn, does that mean that the execution of someAsyncFn() now begins at its frame's await point (since that's where the result location is specified)?

In the example var frame = async someAsyncFn(), the result location for return is not available until await happens. However, it would still setup the frame and run until first suspend, just like status quo. Here's an example that highlights the difference between status quo and this proposal:

This Proposal

fn main() void {
    seq('a');
    var frame1 = async foo();
    seq('c');
    var frame2 = async bar();
    seq('e');
    const x = await frame1;
    seq('k');
    const y = await frame2;
    seq('m');
}

fn foo() i32 {
    defer seq('j');
    seq('b');
    operationThatSuspends();
    seq('f');
    return util();
}

fn util() i32 {
    seq('g');
    operationThatSuspends();
    seq('i');
    return 1234;
}

fn bar() i32 {
    defer seq('l');
    seq('d');
    operationThatSuspends();
    seq('h');
    return 1234;
}

@kprotty
Member

kprotty commented Jul 23, 2020

it would still setup the frame and run until first suspend, just like status quo

Ah ok, think that was where my misunderstanding was. My last point of confusion was related to how non-suspending async fns are handled:

If we move the function call of non-suspending functions used with async/await to happen at the await site instead of the async site

Is this change in semantics something applied by compile time analysis or through some other observation? If it's compile time defined, what happens to the result values of async f() started functions before they're await'ed which conditionally suspend at runtime? Running until suspend at async would discard the result value as there's not yet a provided result location. Running at await would serialize the async function as explained earlier.

@frmdstryr
Contributor

Instead of a new keyword, why couldn't a frame just have a cancel function?

errdefer download_task.cancel();

@kprotty
Member

kprotty commented Jul 29, 2020

@frmdstryr Nice idea. Would it make sense to extend this to other frame functionality? suspend probably wouldn't be feasible to be a frame method since it needs to support block execution.

download_task.resume();
download_task.await();

EDIT: removed async since it's a calling convention and is invoked on the function rather than the frame

@ghost

ghost commented Jul 30, 2020

These aren't methods though -- they're built-in functionality. Writing them as methods is misleading, and breaks the principle of all control flow as keywords.

@kprotty
Member

kprotty commented Jul 30, 2020

@EleanorNB All control flow isn't currently keywords as function calls themselves are a form of control flow and can have control flow inside them as well. If I understand correctly, resume currently updates some atomic state and tail-calls into the frame's func, while await updates some atomic state and possibly suspends. Given both don't require source level control like async/suspend do, them being methods instead of keywords seems to be pretty fitting. One example of this is Rust where await is a field-property keyword of Futures/Frames and the resume equivalent is a poll() method on the Future/Frame as well.

@ghost

ghost commented Nov 27, 2020

Thought: if cancel runs errdefers, and errdefers can capture values, then cancel will also need to take an error to propagate up the function. How would we specify that? We could just do cancel frame, error.Something, but there's no precedent in the language for bare comma-separated lists... we could make cancel a builtin rather than a keyword, but that breaks symmetry with the rest of async machinery... hmm.

@kprotty
Member

kprotty commented Nov 27, 2020

Another option to consider: suspend could return error.Cancelled, and cancel frame would then resume the frame while making the suspend return that error. The frame would handle, and possibly return, that error after noticing a cancelled suspend, and it would then bubble up the normal route, running errdefers and so on.
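Python generators offer a close analogy for this idea: throw() makes the paused yield raise inside the generator, much like a suspend that returns an error. The sketch below is an illustration only, with Cancelled standing in for error.Cancelled and worker as a hypothetical suspended function.

```python
class Cancelled(Exception):
    """Stands in for error.Cancelled."""

log = []

def worker():
    log.append("acquire")
    try:
        yield "suspended"                  # the suspend point
        log.append("resumed normally")
    except Cancelled:
        log.append("errdefer cleanup")     # the ordinary error path runs
        raise                              # and the error bubbles up as usual

frame = worker()
next(frame)                                # run until the suspend point
try:
    frame.throw(Cancelled())               # `cancel frame`: the suspend "returns" the error
except Cancelled:
    log.append("cancel observed by caller")

assert log == ["acquire", "errdefer cleanup", "cancel observed by caller"]
```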

@ghost

ghost commented Nov 28, 2020

No good -- not all suspend points are marked with suspend. Then we have to mark every direct async function call and return statement with an error, or return an error union from every async function -- that's function colouring, all over again.

@kprotty
Member

kprotty commented Nov 30, 2020

I was under the assumption that there are only two ways to introduce a suspend point: suspend and await.

The former could return the error as noted earlier, and to mimic current semantics would be to ignore the error: suspend { ... } catch unreachable. This effectively means that the frame cannot handle cancellation at that suspension point.

The latter AFAICK has two choices:

  • keep current semantics by ignoring a cancellation error (see above)
  • have await return an error union with the frame's return type (along with nosuspend catching another error). You could also ignore the error here via catch unreachable in order to keep current semantics.

In both cases, the marking is at the suspension point rather than at return or async invocation.

@SpexGuy
Contributor

SpexGuy commented Nov 30, 2020

A blocking async function call is an implicit await, so it also counts as a suspend point. For example:

fn foo() u32 {
    var x: u32 = 4;
    callThatMaySuspend(); // x must be saved to the frame, this call is a suspend point
    // equivalent to `await async callThatMaySuspend();`
    return x;
}

For cancellation to work, any function that may suspend or await (and supports cancellation) needs to return an error union which includes cancelled. This is the "function colouring, all over again" that Eleanor is describing.

@kprotty
Member

kprotty commented Nov 30, 2020

Hm, forgot about compiler-inserted awaits. The first bullet point sounds like the way to go there (the compiler adding catch {} to the inserted await's suspend point), which makes await ignore cancellations.

At first glance, this makes sense, as code which expects a result (e.g. using await) isn't written in a way to handle cancellation. You would then only be able to meaningfully cancel frames which are at suspends that explicitly support/handle cancellation (e.g. suspended in an async socket/channel which has more suspend control), while cancel frame on those that don't simply has no effect. Is there a hole I'm missing here though?

@ghost

ghost commented Dec 1, 2020

Implicit catch {} or catch unreachable is a horrible idea. Explicit catch is not much better.

Since we want to localise any explicitly async behaviour to the callsite, I do believe it's cancel that has to specify the error. Since we don't actually use the returned error, I think it's ok not to include it in the function signature.

@ghost

ghost commented Dec 1, 2020

In line with #5277, this should be consistent if we only allow cancel on awaitable handles (anyframe->T, *@Frame(...)).

@kprotty
Member

kprotty commented Dec 1, 2020

@EleanorNB why would implicit catch {} be a bad idea? I feel like running defers/errdefers on cancellation without any explicit returns or scope ending sounds much more error prone.

@ghost

ghost commented Dec 2, 2020

Discarding all errors from an operation, only if the enclosing function happens to be async, which is nowhere explicitly marked? No thank you.

In my eyes, the cancel keyword is the explication of scope end. Yes, it's at the caller, which is unfortunate -- however, cancellation is literally an externally-mandated exit; this is the price we pay for having it at all.

@Mouvedia

Mouvedia commented Dec 2, 2020

all errors from an operation

Does zig have something akin to AggregateError?

@ghost

ghost commented Dec 3, 2020

await itself can simply not be cancellable

Then whether a frame is cancelable or not depends on its current suspend point, which otherwise is completely invisible and unpredictable to the caller. What you get then is people saying that for safety, you should never try to cancel a frame. That's a C problem; Zig is better than that.

how would it behave for operations that aren't cancellable

This would typically be known by the programmer, so we would trust them not to attempt this. In such functions, the errdefers should clean up the state anyway, and if that has to involve blocking, then so be it. (That might mean cancel could itself be a suspend point, but I don't think this is necessarily a problem -- we have nosuspend, after all.)

a chance to see and reject a cancellation request

It's not a request. We don't ask nicely. When we say cancel, we mean cancel, not "if you'd be so kind as to cancel".

Only suspends would need to be marked here

An await or blocking call is still a suspend point. Under your model, if we cancel an awaiting frame, the defers in the awaited frame would run, but not in the cancelled frame. (Unless we have some idea of an error set reserved for cancellation, that does not function as an ordinary error set -- because, if a blocking call is not to a coroutine, then your nested error set idea reduces to a single error set, and there's no way to distinguish that from an ordinary returned error.)

defers could be executed at different times

The semantics of defer don't change -- any exit point runs the defers above it, sync or async.

I actually think this, in a weird way, reintroduces colored functions

There is always going to be some semantic difference between synchronous and asynchronous code. That's the whole point. However, the programmer's model doesn't change, and no code needs to be rewritten -- we're still colourblind. Under your proposal, colouring would be a lot worse: asynchronous calls have to have a special second error set, and synchronous calls *cannot* have that lest it be confused with an ordinary error set.

@kprotty
Member

kprotty commented Dec 3, 2020

What you get then is people saying that for safety, you should never try to cancel a frame.

I don't really follow. resume depends on the state of the frame (is "invisible and unpredictable to the caller") and will panic if it's completed or being resumed by another thread (even in ReleaseFast it seems). People aren't saying "you should never try to resume a frame". Almost all async keywords/operations excluding suspend imply that you are aware of the state of the frame without any explicit notion in code, so I think this type of cancellation is still valuable.

In such functions, the errdefers should clean up the state anyway, and if that has to involve blocking, then so be it.

This has actually been a pain point in Rust futures as well. It requires implementing cancellation at the destructor of the Future/Frame but that is only synchronous. People want asynchronous cancellation (e.g. AsyncDrop) but that wouldn't fit well into the ecosystem so they resort to heap allocating the async resources that cannot be synchronously cancelled in a non-blocking manner so that it outlives the async context to be cancelled in the future.

The alternative to heap allocating, which is blocking on cancellation, can actually be both an inefficiency and a logic error:

  • You monopolize a worker thread in a multi-threaded event loop where other tasks could have been running while you're waiting for your resource to complete to free it in a non-deterministic amount of time.
  • If the resource can only be satisfied at the event loop scheduling points (e.g. from a suspend) then all worker threads could block waiting for the resource to complete without letting it, producing a deadlock.

It's not a request. We don't ask nicely. When we say cancel, we mean cancel,

Again, not everything can be cancelled. So you end up introducing runtime overhead as stated above in order to accommodate a language semantic. It would be great if we don't end up like Rust in that regard, as it's sacrificing customizability for simplicity without a way to opt out, since it's at the language level.

Under your model, if we cancel an awaiting frame, the defers in the awaited frame would run, but not in the cancelled frame.

I think there has been another misunderstanding. My idea of cancellation doesn't include defers or how to run them any differently. It only introduces cancel frame and suspend { .. } catch |err| { ... }. Cancelling an awaiting frame would either cause a panic to the cancelling frame or the awaiting frame.

The latter was what I was suggesting before. Here, await wouldn't introduce a magical new error to the return type. The cancellation state would be handled internally; await inserts an implicit suspend point when the frame result isn't ready. This internal one would just go from suspend { ... } to suspend { ... } catch panic("await not cancellable").

The former is also an option (that I just thought of), which could be made more forgiving by cancel frame returning an error indicating whether or not it succeeded in cancelling the frame. This moves the decision of "is this cancellation" from the suspend point to the effective resume point. I'm not too big a fan of this approach, as it tries to make Zig async/await more readiness-based instead of completion-based, which goes against its original model and introduces a mandatory synchronization overhead to resume points that, atm, could be removed in the future.

The semantics of defer don't change -- any exit point runs the defers above it, sync or async.

The issue here is that suspend + normal function calls that aren't at the end of the scope or use try are now exit points. This makes using defer trickier, as it's no longer explicit where an exit point really is in sync vs. async. In async, your defer/errdefer could run earlier than it possibly ever could in sync if a middle function suspended and was cancelled.

Under your proposal, colouring would be a lot worse: asynchronous calls have to have special second error set,

Again, this is not the case. await would handle the cancelled error/state internally.

@ghost

ghost commented Dec 3, 2020

Without even looking at the called function, standard coding practice is enough to ensure exactly one suspension is paired with one resumption, and one invocation with one completion -- so, if the programmer has done their job well, they should not encounter language-enforced crashes. However, there is no way of inspecting the internal suspension state of a function, so the invoker can't know whether it's suspended directly or awaiting. Thus, any attempt at cancellation, no matter how careful the programmer, has a possibility of crashing the program. (Even worse, the common pattern of calling a function to register the frame with the event loop is guaranteed to crash.) Call me crazy, but if the programmer has done their due diligence, they shouldn't have to worry about language-enforced crashes.

As you've pointed out though, my model (actually Andrew's model as well in the relevant places) isn't perfect either -- cancellation would then itself be an asynchronous process, which means it would need its own frame, and that frame would itself need to be cancelable, and how the hell would that work? It seems to me that no implementation of cancellation can ever be guaranteed to succeed, which in my eyes contradicts point 11b of the Zen.

In light of this, @andrewrk, I don't believe that cancellation should be implemented at the language level. We may provide a cancel token implementation in the standard library (which is a much better and more flexible solution anyway), but async frames themselves must be awaited to complete. I do believe however that the proposed asynchronous RLS is a worthwhile idea.

@ghost

ghost commented Dec 3, 2020

We may implement one language-level feature to make userspace cancellation easier: rather than anyframe, a resumable handle could have type anyframe<-T -- that is, suspend has a value, and resume takes a value of that type to pass to the function, indicating a procession or cancellation:

// In the suspending function
const action = suspend {
    event_loop.registerContinuationAndCancellation(@frame(), continuation_condition, cancellation_condition);
};

switch (action) {
    .go => {},
    .stop => return error.functionXCancelled,
}


// In the event loop (some details missing)
if (frame.continuation and @atomicRmw(bool, &frame.suspended, .Xchg, false, .SeqCst)) {
    resume frame.ptr, .go;
    frame.* = null;
}

if (frame.cancellation and @atomicRmw(bool, &frame.suspended, .Xchg, false, .SeqCst)) {
    resume frame.ptr, .stop;
    frame.* = null;
}

Since @frame() may be called anywhere within the function, and the resumer needs to know the type before analysing the frame, the suspend type (T in anyframe<-T) must be part of the function's signature. I propose we reuse while loop continuation syntax:

const suspendingFunction = fn (arg: Arg) ReturnType : ContinuationType {
    // ...
};

Any function that uses the suspend keyword must have a suspend type. This is not function colouring, as any function with explicit suspend is necessarily asynchronous anyway (functions that only await cannot be keyword-resumed, so do not need a suspend type). The suspend type may be void or error!void (no error set inference), in which case the handle type is anyframe<-void or anyframe<-error!void (not anyframe -- we require strongly typed handles for type checking, which is one drawback), and resume does not necessarily take a second argument, as in status quo.

This not only permits flexible evented userspace cancellation, but also more specialised continuation conditions: a function waiting for multiple files to become available could receive a handle to the first one that does, and combined with a mechanism to check whether a frame has completed, #5263 could be implemented in userspace in the same manner.

At first blush, this may appear to be hostile to inlining async functions -- however, allowing that would already require semantic changes (#5277) that actually complement this quite nicely: @frame() would return anyframe<-T of the syntactically enclosing function's suspend type, regardless of the suspend type of the underlying frame, and there is now a strict delineation between resumable and awaitable handles.

This is, of course, a separate proposal -- I'll write up a proper one later.

@ghost ghost mentioned this issue Dec 3, 2020
@frmdstryr
Contributor

frmdstryr commented Dec 3, 2020

What if async fns could return a user-defined Future, specified via the callconv, that holds a reference to the result, the frame, and any state?

Then, if you can access the result location from within the async fn and have a cancelawait keyword as a second await location as described in the original post, cancellation should work at the user level.

pub fn Future(comptime Frame: type, comptime ReturnType: type) type {
    return struct {
         frame: Frame,
         state: enum{Running, Cancelled, Finished} = .Running,
         result: ?ReturnType = null,
    };
}

pub fn fetchUrl(allocator: *Allocator, url: []const u8) .callconv(.Async=Future) ![]const u8 {
     // Do stuff
     while (@result().state != .Cancelled ) {
           // Keep working
     } 
     // Handle however you want, this can cleanup your allocated resources
     if (@result().state == .Cancelled) return error.Cancelled; 
     @result().state = .Finished;
}

Using async fetchUrl would then wrap the call in the Future type given which can be used by both the caller and callee to properly handle cancellation on both sides.

var download_future = async fetchUrl(allocator, "https://example.com/");
errdefer switch (download_future.state) {
    .Running => {
        download_future.state = .Cancelled; // Should use atomics
        cancelawait download_future.frame;
    },
    .Finished => allocator.free(download_future.result.?),
    .Cancelled => {}, // already cancelled, nothing left to release here
}

var file_future = async readFile(allocator, "something.txt");
errdefer switch (file_future.state) {
    .Running => {
        file_future.state = .Cancelled; // Should use atomics
        cancelawait file_future.frame;
    },
    .Finished => allocator.free(file_future.result.?),
    .Cancelled => {}, // already cancelled, nothing left to release here
}

const download_text = try await download_future.frame;
defer allocator.free(download_text);

const file_text = try await file_future.frame;
defer allocator.free(file_text);

I don't see how a cancel that cannot be ignored is a good idea. Some functions may need to be able to ignore the cancel request if something else fails (e.g. given a doBankTransfer and a logRequest, the bank transfer couldn't care less if the log fn fails).

Edit: I guess just adding a state flag to the existing frame would work too.
Edit 2: Updated to handle case if async fn finished already

@frmdstryr
Contributor

The main point of having a state flag that can be referenced from within the async function is so that it can handle cleaning up its own resources, which avoids the problem of "side effects".

@ghost

ghost commented Dec 3, 2020

  • A blocking call is semantically equivalent to await async, so async cannot return anything but a bare frame. This is a language level feature -- we cannot complicate it with user-level implicit detail.
  • Storing cancellation state in the frame does not change the fact that only the topmost frame can access it; so the awaiter, with only a handle to the bottommost frame, cannot.
  • An ignorable cancellation request has no place in the language base. See point 11b of the Zen: "resource deallocation must succeed".

@kprotty
Member

kprotty commented Dec 3, 2020

@frmdstryr Adding a state flag to the frame would be reimplementing the state flags that are already inside the frame. Exposing the state to the user like this specifically means the compiler can't do optimizations like

  • storing the awaiter/frame-state in the same atomic usize to save on memory and atomic accesses
  • using the suspend context union(enum) memory to also store the return type instead of a separate field

This also doesn't take into account multi-threaded access to the frame. The state load/check/store there would need to be a CAS, and being able to hide that from the user may allow the compiler to utilize more efficient atomic ops for interacting with the state like atomic swap.
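
To make the CAS point concrete, here is a minimal sketch of a state field where "cancel" and "finish" race for the same transition and exactly one wins. `Task`, `State`, and `transition` are hypothetical names, not part of any proposal here, and the code assumes the `std.atomic.Value` API from Zig 0.12 or later.

```zig
const std = @import("std");

// Hypothetical states, mirroring the Running/Cancelled/Finished enum discussed above.
const State = enum(u8) { running, cancelled, finished };

const Task = struct {
    state: std.atomic.Value(State) = std.atomic.Value(State).init(.running),

    /// Try to move running -> `to`. Returns true only for the caller that won.
    fn transition(self: *Task, to: State) bool {
        // cmpxchgStrong returns null on success and the observed value on failure.
        return self.state.cmpxchgStrong(.running, to, .acq_rel, .acquire) == null;
    }
};

test "only one of cancel and finish wins" {
    var task = Task{};
    try std.testing.expect(task.transition(.cancelled)); // the canceller gets there first
    try std.testing.expect(!task.transition(.finished)); // the late completion loses
    try std.testing.expectEqual(State.cancelled, task.state.load(.acquire));
}
```

A plain load, check, and store in user code would leave a window between the check and the store, which is the multi-threaded hazard described above.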

See point 11b of the Zen: "resource deallocation must succeed".

@EleanorNB It must succeed but there's no requirement on when it does so or how it reports success. Arena allocators are a good example as their .free()/.destroy() functions succeed even though they don't actually deallocate the resource. It assumes that the resource will be deallocated in the future by other means (particularly the allocator's deinit()). cancel can succeed without internally deallocating the frame.

@frmdstryr
Contributor

Ah, so scratch the idea of adding it to the frame itself.

I guess I'm just making more noise here... since this is roughly a worse version of #5263 (comment) except the Future/CancelToken is returned by using async someFn() instead of needing to create it and pass a ref.

@andrewrk andrewrk modified the milestones: 0.8.0, 0.9.0 May 19, 2021
@ityonemo
Contributor

somehow wound up thinking about this. I like @EleanorNB's suggestion about introducing a cancellation token scheme in stdlib, in part, because, that's what I did, with beam.yield.

@lithdew
Contributor

lithdew commented Jul 30, 2021

somehow wound up thinking about this. I like @EleanorNB's suggestion about introducing a cancellation token scheme in stdlib, in part, because, that's what I did, with beam.yield.

Wanted to vouch for this idea of having a cancellation token scheme in the standard library over requiring new syntax and logic in place for canceling async frames.

I've been using cancellation tokens for canceling I/O and arbitrary tasks in my code using a Context (which is essentially a simplified version of Go's context.Context or folly's CancellationToken) and it has 1) made cancellation points and hierarchies clear, 2) made it obvious whether a function in my codebase has the possibility of suspending, and 3) allowed for easier debugging of what functions have been canceled/are bound to be canceled by isolating and keeping track of stack traces/debug information within a single Context.
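
For readers who have not seen the pattern, a rough sketch of a hierarchical cancellation token follows. It is not the `Context` from the linked code: `CancelToken`, `isCancelled`, and `sumUntilCancelled` are hypothetical names chosen for illustration, and the `std.atomic.Value` API from Zig 0.12 or later is assumed.

```zig
const std = @import("std");

/// Hypothetical token: cancelling a parent implicitly cancels every descendant.
const CancelToken = struct {
    parent: ?*const CancelToken = null,
    cancelled: std.atomic.Value(bool) = std.atomic.Value(bool).init(false),

    fn cancel(self: *CancelToken) void {
        self.cancelled.store(true, .release);
    }

    /// A token counts as cancelled if it or any ancestor was cancelled.
    fn isCancelled(self: *const CancelToken) bool {
        var current: ?*const CancelToken = self;
        while (current) |token| : (current = token.parent) {
            if (token.cancelled.load(.acquire)) return true;
        }
        return false;
    }
};

/// Each loop iteration is an explicit cancellation point.
fn sumUntilCancelled(token: *const CancelToken, items: []const u32) error{Cancelled}!u32 {
    var total: u32 = 0;
    for (items) |item| {
        if (token.isCancelled()) return error.Cancelled;
        total += item;
    }
    return total;
}

test "cancelling a parent cancels its children" {
    var root = CancelToken{};
    const child = CancelToken{ .parent = &root };
    const items = [_]u32{ 1, 2, 3 };

    try std.testing.expectEqual(@as(u32, 6), try sumUntilCancelled(&child, &items));
    root.cancel();
    try std.testing.expect(child.isCancelled());
    try std.testing.expectError(error.Cancelled, sumUntilCancelled(&child, &items));
}
```

Because the token is an ordinary parameter, the function signature shows which calls can be cancelled and where the cancellation points are.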

Here are some links to some code I'm working on which contains and makes heavy use of a Context (cancellation token).

A single-threaded Context (cancellation token) implementation: https://github.com/lithdew/rheia/blob/dde13020d069b6819a5ad8bd0980863009a17195/runtime.zig#L129-L163

send(), recv(), read(), write(), accept(), connect(), timeout syscalls that are driven by io_uring which take in a Context and are cancellable: https://github.com/lithdew/rheia/blob/dde13020d069b6819a5ad8bd0980863009a17195/runtime.zig#L356-L780

A set of single-threaded synchronization primitives which take in a Context and thus are cancellable: https://github.com/lithdew/rheia/blob/dde13020d069b6819a5ad8bd0980863009a17195/sync.zig

A cancellable worker loop function which takes in a Context that sleeps for N milliseconds and then performs some CPU-bound work in an infinite loop: https://github.com/lithdew/rheia/blob/dde13020d069b6819a5ad8bd0980863009a17195/main.zig#L464-L482

An async TCP client pool and TCP server with backpressure support which supports cancellation: https://github.com/lithdew/rheia/blob/dde13020d069b6819a5ad8bd0980863009a17195/net.zig

A multi-threaded Context (cancellation token) implementation: https://gist.github.com/lithdew/2802fa5cb398ccca7d77a899a4b4441f

@andrewrk andrewrk modified the milestones: 0.9.0, 0.10.0 Nov 23, 2021
@andrewrk andrewrk modified the milestones: 0.10.0, 0.11.0 Apr 16, 2022
@andrewrk andrewrk modified the milestones: 0.11.0, 0.12.0 Apr 9, 2023
@iacore
Contributor

iacore commented May 26, 2023

Isn't it adding user data to @Frame? It could be used for other stuff as well.

callee:

fn foo() void {
  suspend {}
  if (@frame().user_data.suspend) {}
}

caller:

var frame = async foo();
frame.user_data.suspend = true;

@andrewrk andrewrk modified the milestones: 0.13.0, 0.12.0 Jul 9, 2023
@TwoClocks
Contributor

Counter argument :

I don't think there should be a way to cancel async functions. This shouldn't be a language feature. This is user-space stuff.

Rationale:

There are two main mental models for coroutines/CPS/async-await.

One : They are like threads, w/o using OS threads (e.g. cooperative multitasking). You can't cancel a thread from the outside. You shouldn't be able to cancel an async call from the outside.

Two : They are just "hiding" call-backs, and auto-magically creating your callback "context" for you. You can only cancel callbacks from the outside.

Since neither "model" has the concept of a generic way to cancel, neither should suspend/resume. Doing the correct cleanup code is so case specific, this shouldn't be a language feature. Maybe sometimes you want to run the waiting code w/ a flag telling it to exit (e.g. most I/O), sometimes the eventloop can just delete the frame and go on its way (most timer callbacks).

Anecdotally, all of the horrible nastiness in other languages' coroutine implementations surrounds cancellation and error propagation when canceling. Just don't do it.

Also anecdotally, I've used coroutines on a number of large projects. The only times I've ever wanted to cancel one is when my code sucked, and I was too lazy to refactor it correctly.

It's also a solved problem. How did you "cancel" I/O when using threads for the past 20 years? Just do that.

@matklad
Contributor

matklad commented Apr 24, 2024

Oh, that's brilliant! The reason why cancellation seems necessary is that there are two fundamental concurrent operations. Given two "futures" / concurrent operations a and b, you might want to run them concurrently and

  • join --- wait for both to complete
  • race --- to wait for first one to complete

So:

const a = async update_db();
const b = async update_cache();
await @join(a, b); // Want to update _both_ db and cache

const a = async read_db();
const b = async read_cache();
await @race(a, b); // Wait for _one of_ db and cache, whichever is faster

But race is a special case of join! You can implement race in terms of join if the task that finishes first cancels the other.

So, the second example can be re-written roughly as

var ct: CancelationToken = .{};
const a = async read_db(&ct);
const b = async read_cache(&ct);
await @join(a, b);

where both functions:

  • request cancellation once they are done
  • short-circuit internally, if the token is canceled by something else
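
The two rules above can be sketched with OS threads standing in for async frames. This is only an illustration under stated assumptions: `Race`, `worker`, and `raceTwo` are hypothetical names, `steps` is a stand-in for how slow each source is, and the `std.atomic.Value` API from Zig 0.12 or later is assumed.

```zig
const std = @import("std");

/// Hypothetical shared state: one token plus a slot for the winner's id (0 = none yet).
const Race = struct {
    cancelled: std.atomic.Value(bool) = std.atomic.Value(bool).init(false),
    winner: std.atomic.Value(u32) = std.atomic.Value(u32).init(0),
};

fn worker(race: *Race, id: u32, steps: u32) void {
    var i: u32 = 0;
    while (i < steps) : (i += 1) {
        // Short-circuit internally if something else cancelled the token.
        if (race.cancelled.load(.acquire)) return;
        std.Thread.yield() catch {};
    }
    // Publish the result only if nobody else has, then request cancellation.
    if (race.winner.cmpxchgStrong(0, id, .acq_rel, .acquire) == null) {
        race.cancelled.store(true, .release);
    }
}

/// "race" built from "join": both tasks are always waited for.
pub fn raceTwo(steps_a: u32, steps_b: u32) !u32 {
    var race = Race{};
    const a = try std.Thread.spawn(.{}, worker, .{ &race, @as(u32, 1), steps_a });
    const b = std.Thread.spawn(.{}, worker, .{ &race, @as(u32, 2), steps_b }) catch |err| {
        // `a` still points at `race`, so stop it and join it before returning.
        race.cancelled.store(true, .release);
        a.join();
        return err;
    };
    a.join();
    b.join();
    return race.winner.load(.acquire);
}

test "exactly one task wins and both are joined" {
    const winner = try raceTwo(1, 1000);
    try std.testing.expect(winner == 1 or winner == 2);
}
```

Which task wins depends on scheduling, so the test only checks that there is exactly one winner. The error path in `raceTwo` is the thread-based analogue of Problem 1 in the original post: the first task must not outlive the stack memory it was given.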

I bet this scales to fully-general select as well.

@GalaxySnail

I don't think there should be a way to cancel async functions. This shouldn't be a language feature. This is user-space stuff.

I believe we do need the ability to cancel async functions. There are many examples, for details: Timeouts and cancellation for humans.

There are two main mental models for coroutines/CPS/async-await.

One : They are like threads, w/o using OS threads (e.g. cooperative multitasking). You can't cancel a thread from the outside. You shouldn't be able to cancel an async call from the outside.

Coroutines aren't like threads. We definitely can "cancel" a process by sending a signal. Threads can't be killed from the outside because they share everything in a process and they are not cooperative. Even so, we can cancel threads if we make them "cooperative" somehow (e.g. with a main loop in each thread which checks cancellation requests and handles them). The details can be wrapped by languages or libraries so that it looks like we are "cancelling" threads. There's no technical reason that we can't cancel coroutines.

Since neither "model" has the concept of a generic way to cancel, neither should suspend/resume. Doing the correct cleanup code is so case specific, this shouldn't be a language feature. Maybe sometimes you want to run the waiting code w/ a flag telling it to exit (e.g. most I/O), sometimes the eventloop can just delete the frame and go on its way (most timer callbacks).

That's not too difficult, a cancellation is just like a specific error. If the cleanup code works on some regular errors, it works on cancellations.

Anecdotally, all of the horrible nastiness in other languages' coroutine implementations surrounds cancellation and error propagation when canceling. Just don't do it.

That's because there are few languages/libraries designed with structured concurrency, another reference: Notes on structured concurrency.

Oh, that's brilliant! The reason why cancellation seems necessary is that there are two fundamental concurrent operations. Given two "futures" / concurrent operations a and b, you might want to run them concurrently and

  • join --- wait for both to complete
  • race --- to wait for first one to complete

But race is a special case of join! You can implement race in terms of join if the task that finishes first cancels the other.

That's right! We call them task groups in structured concurrency; they're a kind of primitive for concurrency (except that if you can't pass a task group as an argument, it's not easy to spawn background tasks when needed).

@kprotty
Member

kprotty commented Apr 24, 2024

I believe we do need the ability to cancel async functions. [...] There's no technical reason that we can't cancel coroutines.

The issue is that not all async functions are cancellable. Certain operations are atomic to the caller (or stateful) but still use asynchronous operations. This is the idea of cancellation safety.

For threads, you can send a signal to either request a cancellation (i.e. SIGTERM) or force it regardless of the thread's decision (i.e. SIGKILL). Regarding semantics, the latter is most likely undesirable as you can't recover (or in zig speak, "run defers"). Cancellation requests should be the solution then IMO.

But since it's only a request, the thread has the opportunity to ignore it for various reasons (i.e. it's not cancel-safe). This means you must wait for the thread to complete regardless before you relinquish its resources. If not, you risk leaks (unstructured concurrency) or UAFs (structured concurrency).

Cancellation Tokens are a great solution here as they're 1) opt-in for tasks which are cancel-safe and 2) require joining the task anyways to account for those that aren't cancel-safe. That they're shared between tasks in @matklad's proposed API is a composability nicety (each task could as well just have their own Token and a separate construct shared between tasks could cancel each Token separately).

@matklad
Contributor

matklad commented Apr 26, 2024

The issue is that not all async functions are cancellable

Here are two specific, simple examples which are useful as an intuition pump and a litmus test for any cancellation mechanism.

Example 1: an asynchronous task submits a read request to io_uring and then gets canceled. To actually cancel the task, what is needed is submitting another cancel request to io_uring (so, another syscall) and then waiting for it to complete. If you don't do this, then the read might still be executing in the kernel while your task is already "canceled", effectively writing to some now-deallocated memory.

Example 2: without anything exotic, an async task offloads some CPU-heavy work (like computing a checksum) to a thread pool. To cancel this task, we also must cancel the thread-pool job, but that doesn't have cancellation built-in, as it is deep in some SIMD loop. So the async task just has to wait until the CPU side finishes. If you cancel only the async task, and let the CPU part run its course, you are violating structured concurrency (and potentially memory safety, if the CPU part uses any resources owned by the async part).


That is, cancellation is only superficially similar to error handling: error handling is unilateral and synchronous. General cancellation is an asynchronous communication protocol: first, you request cancellation, then you wait for the job to actually get canceled.
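
The request-then-wait shape can be sketched as follows, with an OS thread standing in for the kernel or thread-pool side that keeps writing into caller-owned memory. `Job`, `fill`, and `cancelAndWait` are hypothetical names, and the `std.atomic.Value` API from Zig 0.12 or later is assumed.

```zig
const std = @import("std");

/// Hypothetical job: the worker writes into a buffer owned by the caller.
const Job = struct {
    stop_requested: std.atomic.Value(bool) = std.atomic.Value(bool).init(false),
    bytes_written: usize = 0,
    buffer: []u8,
};

fn fill(job: *Job) void {
    for (job.buffer, 0..) |*byte, index| {
        // The request is only observed between steps, never in the middle of one.
        if (job.stop_requested.load(.acquire)) break;
        byte.* = 0xAA;
        job.bytes_written = index + 1;
    }
}

/// Phase 1 requests cancellation; phase 2 waits until the worker has really stopped.
fn cancelAndWait(job: *Job, thread: std.Thread) usize {
    job.stop_requested.store(true, .release);
    thread.join();
    // Only now is it safe to read the progress counter or free the buffer.
    return job.bytes_written;
}

test "the buffer is released only after the worker has stopped" {
    const allocator = std.testing.allocator;
    const buffer = try allocator.alloc(u8, 1 << 20);
    defer allocator.free(buffer); // runs after cancelAndWait has joined the worker

    var job = Job{ .buffer = buffer };
    const thread = try std.Thread.spawn(.{}, fill, .{&job});

    const written = cancelAndWait(&job, thread);
    try std.testing.expect(written <= buffer.len);
}
```

Freeing `buffer` between the two phases would reproduce the hazard in Example 1: the worker could still be writing into memory that has already been released.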

A more useful framing is that cancellation is serendipitous success

@Cloudef
Contributor

Cloudef commented Jul 2, 2024

In zig-aio I do cancelation by making async io functions and yield in coroutines return error.Canceled. This still won't prevent a person from catching the error and looping the coroutine endlessly in a while loop, but it works pretty okay in practice. For blocking tasks through a thread pool, the programmer has to opt in to cancelation by taking a CancellationToken as the first argument and actively cancel the task if the token is marked canceled; otherwise the coroutine (and thus the caller who wishes to collect the result) has to wait until the blocking task is complete.
