bpo-42130: Fix for explicit suppressing of cancellations in wait_for() #28149
Conversation
That bot probably shouldn't add the awaiting review tag when it's a draft PR...
This PR is stale because it has been open for 30 days with no activity.
@asvetlov Updated this PR, so should be ready to merge now.
Looks good in general, but please address small improvements if you agree with them.
```python
task = loop.create_task(asyncio.wait_for(fut, timeout=1))
loop.call_later(0.1, task.cancel)
res = loop.run_until_complete(task)
self.assertEqual(res, "ok")
```
Putting this test back in, with a comment. It does manage to trigger the race condition aaliddel mentioned.
```python
fut = self.new_future(loop)
loop.call_later(0.1, fut.set_result, "ok")
res = loop.run_until_complete(asyncio.wait_for(inner(), timeout=0.1))
self.assertEqual(res, "ok")
```
Converted my last test to use new_test_loop(), which allows reproducing the issue without suppressing cancellation.
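For context, new_test_loop() is the helper from CPython's test.test_asyncio utilities that drives a loop with a scripted clock, so timer ordering is deterministic rather than timing-dependent. A minimal illustrative sketch of the usual pattern (the specific times here are made up, not the PR's actual test):

```python
def test_deterministic_timing(self):
    def gen():
        # Receives the deadline of each scheduled timer; yielding a value
        # advances the loop's fake clock by that much.
        when = yield
        self.assertAlmostEqual(0.1, when)
        yield 0.1

    loop = self.new_test_loop(gen)
    fut = self.new_future(loop)
    loop.call_later(0.1, fut.set_result, "ok")
    res = loop.run_until_complete(fut)
    self.assertEqual(res, "ok")
```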
```diff
@@ -439,7 +439,7 @@ async def wait_for(fut, timeout):
                 # after wait_for() returns.
                 # See https://bugs.python.org/issue32751
                 await _cancel_and_wait(fut, loop=loop)
-                raise
+                return fut.result()
```
This ensures that a cancellation suppressed in the inner future also suppresses it in wait_for(), as one would expect (see test_wait_for_suppress_cancellation).
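A minimal sketch of the behaviour in question (the inner coroutine here is hypothetical, not taken from the PR's test):

```python
import asyncio

async def inner():
    try:
        await asyncio.sleep(10)
    except asyncio.CancelledError:
        return "ok"  # the inner coroutine suppresses the cancellation

async def main():
    task = asyncio.ensure_future(asyncio.wait_for(inner(), timeout=1))
    await asyncio.sleep(0.1)  # let both wait_for() and inner() start
    task.cancel()
    # With the change above, the suppressed cancellation propagates as a
    # normal return: wait_for() yields "ok" instead of raising CancelledError.
    print(await task)

asyncio.run(main())
```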
The changes hardly improve the behaviour of wait_for(). I've also tested the current version from this PR and it ignores the cancellation in edge cases. At this point I'm not sure what the goal is here, but just tweaking the existing implementation seems to be a fruitless endeavour.
See my comments above, which explain the change to the PR and a proposal for how the race condition can actually be fixed: #28149 (comment)

I believe you said that wait-for2 still does not fix the problem, so I don't see how that's going to help either. To actually fix these race conditions requires changes outside of the wait_for() function.
The idea to fix the order in which different coroutines are scheduled is interesting.

Obsolete/incorrect points:

Although I can easily come up with an example that would not be solved by it:
Here the
To make a scheduling-based solution work, the return-state of a coroutine would need to be protected, so that while a return or raise is in progress, a cancellation will block or fail. And that is still not enough, because as the second snippet demonstrates, a return can be aborted, which means a task may fail cancellation because it is
It provides a way to actually resolve the race condition, so for now we can write well-behaving applications.
Yes, when a waiting construct is cancelled explicitly, that cancellation is produced in a (usually) different execution context than the one producing the awaited result. Unless you can absolutely prevent these two execution contexts (coroutine/task/thread, depending on the implementation) from reaching the point of cancellation/result-production simultaneously, you can't fix wait_for() with any changes outside of it. I'd argue (though I could be wrong) that such prevention is not possible, since by definition the two execution contexts run in parallel, independently. So to reach an ideal wait_for() behaviour I agree we need more changes outside of it, but I doubt that they are possible to make. This problem appears deceptively easy at a glance, but it is very complex to get right.

I've made an alternative wait-for implementation that does not adhere to the infeasible design goals of the original as it stands right now. That is why it can work better.

At this point I've already pointed out that ignoring the cancellation or the result is the problem, and we're not any closer to fixing that. This PR in its current state is more likely to ignore cancellations than before. I can see it without any testing: there are two returns in the explicit cancellation handling, while none should be there.
If anyone has doubts that the current wait_for() is much more broken than it appears to be, see the new behaviour comparison tables I've put into the readme. They are compiled from the tests in my repository: https://github.com/Traktormaster/wait-for2

I'm not sure what else to say. I'd like to use wait-for from the builtin library, but as it stands now it is unusably broken, and this PR does not change that.
Isn't that a race condition in the user's code, not a problem with wait_for()?
At which point it is obvious that the task can be cancelled after creating the resource, but before returning it. i.e. the same issue would still be present if not using wait_for().
Why? I don't see that. The 2nd
I don't think these are representative examples of the race condition; rather, they are badly formed coroutines that don't take into account that cancellation is possible at any point of a coroutine's execution until it completes. A fixed

The race condition issue is cancellation happening after the coroutine completes, but before wait_for() can return the result.
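For what it's worth, the window being described is visible in the shape of wait_for() in the affected releases. The excerpt below is reconstructed from the diff above plus the surrounding comments, so treat it as approximate rather than an exact copy of tasks.py:

```python
# Approximate excerpt of asyncio.wait_for() (3.8/3.9-era), not runnable alone.
try:
    await waiter  # wakes when fut completes OR when wait_for() is cancelled
except exceptions.CancelledError:
    if fut.done():
        # The inner future finished before the cancellation was delivered:
        # the explicit cancellation is swallowed and the result returned.
        return fut.result()
    else:
        fut.remove_done_callback(cb)
        # We must ensure that the task is not running
        # after wait_for() returns.
        # See https://bugs.python.org/issue32751
        await _cancel_and_wait(fut, loop=loop)
        raise  # this PR changes this line to `return fut.result()`
```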
@Dreamsorcerer: Is this solvable by amending the event loop interface to allow saying "when coroutine A completes (the wrapped task), switch straight to this other coroutine B (the return to wait_for) rather than moving to any ready task"? This prevents explicitly having to scan a dependency graph and would prevent execution of another coroutine that could inject the external cancellation. e.g. a very dumbed-down event loop pseudocode:

```python
ready_tasks = []
while True:
    # The usual asyncio behaviour
    task = ready_tasks.pop(0)
    task.run()
    # Run down the dependency chain until we hit a task with next_task == None.
    # In this case, next_task would be the return to wait_for.
    while task.next_task:
        task = task.next_task
        task.run()
    # Cleanup of the tasks list needed here
```
I concede that the example was faulty for the scheduling proposal. I've already started to amend it while your comments arrived. I guess the question would be whether it would burn too many resources handling rare edge cases; it sounds promising otherwise.
The problem there is that the inner task may have finished and not even received the cancellation. It is different if the inner task receives the cancellation and ignores it, versus finishing despite the explicit cancellation from the outer code. This was the original issue why I made wait-for2.
In the example if the
Hmm, I'm not sure. Maybe I'm misunderstanding, but I don't think that works, because you're saying you want to run

So, the situation we are at is that

So, we could declare that coro() has a dependency on wait_for(); then when the loop picks up coro(), it could check

Alternatively, we could declare the relationship in reverse, from wait_for() to coro(). Then when we cancel wait_for() and it is getting scheduled into
This example explicitly hangs on the first iteration for me every time:
I'd argue it must never hang by ignoring the cancellation, as there is no code here that would indicate such is possible.

Update: This demonstration seemed to be very stable and simple, so I integrated it into wait-for2.
I'll try to rewrite it as an actual test and add it to this PR. Probably, we'll just have to look at fixing the race condition before merging any other changes.
Would the cancellation scheduling fix this case? Would this not certainly produce a race condition (when an explicit CancelledError also has a result)?
Edit: I was looking through some code when I realized this is a prominent use-case. I have several of these (where a
Hmm, using a
I'm a bit late to the party, but I think there's an underlying problem that

For example, the following code doesn't use wait_for():
The output from running this is:
See in particular that there's a difference between

The "leak" happens in tasks.py:236, where

Another problem is that a coroutine doesn't really have any way of knowing if it's being called within a Task or not, and so if it gets a
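A minimal sketch of the kind of return-value leak being described, assuming cancellation arrives while the coroutine is already finishing (this is an illustration, not the commenter's lost example):

```python
import asyncio

async def inner():
    # Cancellation arrives while the coroutine is already on its way out...
    asyncio.current_task().cancel()
    return "important value"  # ...so this return value is silently dropped

async def main():
    task = asyncio.ensure_future(inner())
    try:
        print("got:", await task)
    except asyncio.CancelledError:
        # Task.__step sees the pending cancellation after the coroutine has
        # returned, and discards the StopIteration value in favour of
        # cancelling the task.
        print("cancelled: the return value leaked")

asyncio.run(main())
```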
This is related to the problem in
I'm not sure I have a good solution to the problem. Even though there's a bit more control with coroutines, it feels very similar to the problem of terminating threads, in that it's hard to make sure you do it when it's safe. One idea I tried was this, which effectively makes it so that only a Future can raise CancelledError when you await on it; the Task code won't spuriously generate new CancelledError exceptions in places where a value might get leaked. It does, however, also make it so that once you cancel() a Task, it will keep cancelling any Futures that you await on until the Task completes (or lets itself be cancelled). It might be possible to add some mechanism like
Just making a note to myself for another test based on that. We could have

Again, if we can create some kind of dependency that ensures that
I made the inner task cancel the outer one just to ensure the race condition happened, but it's not limited to happening in that case. It could easily have been done from another Task that ran after
Now if you had

Furthermore, I don't think there's anything in the documentation saying that a task is prevented from cancelling itself or something awaiting on it. In fact, that code in tasks.py only makes sense if it's possible for a task to become cancelled while it's running. If
Yep, I understood, just clarifying that my proposal wouldn't fix it, but that specific case probably doesn't matter, as it doesn't make much sense to cancel the current task right before returning.
Maybe. I was thinking of making it an explicit dependency, maybe something like

i.e. the code would run in the same order with both of these examples:

whereas this could run in a different order (another task might be run after

So, the objective is just to be able to run a task as if it were just a nested coro. This approach does add a burden to the user to recognise such risks and use this parameter, but I've not come up with any better ideas yet.
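A sketch of the intended equivalence; the original examples didn't survive, so these are illustrative and coro is a placeholder:

```python
import asyncio

async def coro():
    return "value"

async def main():
    # The proposal's goal: the wrapped form should interleave with other
    # tasks exactly like the plain nested await does, so no foreign task can
    # run between coro() finishing and wait_for() returning its result.
    assert await coro() == "value"                        # plain nested await
    assert await asyncio.wait_for(coro(), 10) == "value"  # wrapped form

asyncio.run(main())
```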
I have just recently found out that one of my libraries had always been affected by this bug. Here's a script that reliably reproduces the bug on Python 3.9 - 3.11, without relying on timing coincidence:
Actual output:
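A minimal sketch of a reproducer of this class, assuming the result-is-set-and-wait_for-is-cancelled-in-the-same-iteration pattern (a hypothetical stand-in, not the commenter's original script):

```python
import asyncio

async def main():
    loop = asyncio.get_running_loop()
    fut = loop.create_future()
    task = asyncio.ensure_future(asyncio.wait_for(fut, timeout=10))
    await asyncio.sleep(0)          # let wait_for() start awaiting the future
    fut.set_result("inner result")  # the inner future completes...
    task.cancel()                   # ...and wait_for() is cancelled in the
                                    # same event loop iteration
    try:
        print("returned:", await task)  # 3.9-3.11: cancellation swallowed
    except asyncio.CancelledError:
        print("cancelled (expected behaviour)")

asyncio.run(main())
```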
Superseded by #96764
And fixed in 3.12 by gh-96764. (We're not 100.00% sure that the fix doesn't disturb some workaround, so we're not backporting the fix to 3.11 even though that has the same bug.)
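For reference, the 3.12 fix reimplemented wait_for() on top of asyncio.timeout(); roughly equivalent spellings from the caller's point of view:

```python
import asyncio

async def fetch():
    await asyncio.sleep(0.1)
    return "data"

async def main():
    # In 3.12+, wait_for() uses asyncio.timeout() as its underlying
    # implementation, which avoids the swallowed-cancellation race.
    result = await asyncio.wait_for(fetch(), timeout=1)
    async with asyncio.timeout(1):
        result = await fetch()
    print(result)

asyncio.run(main())
```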
Fix wait_for() cancelling when the inner task suppresses a cancellation.