Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement a strategy for handling (async) generator cleanup #265

Closed
njsmith opened this issue Aug 4, 2017 · 18 comments · Fixed by #1564
Closed

Implement a strategy for handling (async) generator cleanup #265

njsmith opened this issue Aug 4, 2017 · 18 comments · Fixed by #1564

Comments

@njsmith
Copy link
Member

njsmith commented Aug 4, 2017

There's a problem with generators and async generators: they don't necessarily iterate to the end. Sometimes they get suspended in the middle, and then garbage collected. When this happens, their __del__ method throws in a GeneratorExit exception, so that finally blocks and __exit__ methods get a chance to run. BUT, __del__ is very weird: it can be run in the middle of any arbitrary python bytecode instruction. It's like a signal handler but even more so.

So for example, this code:

def generator_fn():
    try:
        yield
    finally:
        print(trio.current_task())

async def caller():
    for _ in generator_fn():
        break

Here the generator's finally block gets run at some totally arbitrary time, so it could in principle print any task, or crash because no task is currently active, or because it doesn't even run in the trio thread. And similar problems arise for pretty much any access to trio APIs. (Note that even without trio, the same thing happens for threading.current_thread(), accessing thread-local storage, etc. But it trio the use of this kind of implicit thread-bound state is much more common, since the run loop itself is accessed via thread-local storage.)

PEP 533 would be the real fix, but that seems to be stalled for now.

For regular generators, we just need to document this; I can't see what else we can do. There are some related issues in #264 that also need documentation; probably they should go in the same section of the manual.

For async generators, this problem is both better and worse. The thing about async generators is that the gc can't actually clean them up directly: it would like to do the same thing as it does for regular generators and have __del__ throw in GeneratorExit, but this doesn't really work, because for an async generator you should use athrow and let the coroutine runner iterate it, but __del__ doesn't have access to a coroutine runner. So PEP 525 defines some hooks that trio can set, to get (a) notification that an async generator has been started, and (b) notification when an async generator is ready for garbage collection.

The intention was that these hooks are used to let the event loop take care of doing gen.athrow(GeneratorExit) to emulate how regular generators work. But as we saw above, regular generators are already broken so... we don't want to emulate them. Being able to re-enter trio's run loop doesn't help, because we need a task context. And this is even more true for async generators, since the whole point of an async generator is that it can do async-y things, and if you want to await something you need to make sure you have a timeout set, which means you really really really need to make sure that this happens in the original task context where the user was expecting it to happen. Letting an await sneak outside of its with fail_after is definitely not OK.

So we aren't going to use these hooks in the way they were originally intended. But, until PEP 533 becomes a thing, maybe we can at least use these hooks for harm reduction.

Some options:

  • Don't define the hooks at all. In this case I'm not 100% sure what Python does, but I think it does something like: throw in GeneratorExit and iterate once, and if it yields again (e.g. because there's an async with or a finally: await ...) then it gives up and prints an error message. This is basically like synchronous generators, but maybe a little less friendly. It's not awful, but maybe we can do better.

  • Use the firstiter hook to forbid the use of async generators, unless we have some explicit evidence that someone is taking responsibility for cleaning them up. For example, we could make it so that our @acontextmanager marks its async generators as being cleaned up (though 3.7's contextlib.asynccontextmanager is going to be a problem...), and provide an aclosing context manager that does the same. This approach was pioneered by Curio, which goes a step further and introspects the generator's bytecode, and if it doesn't see any dangerous operations (yield or yield from inside an except, finally, with, or async with), then it allows those async generators as well. I really don't want to have this kind of bytecode introspection in trio. OTOH without it then async comprehensions like (2 * i async for i in aiter) become essentially unusable.

  • Use the finalizer hook to catch when an async generator is cleaned up, and run it in a context that (a) is carefully set up so that any attempt to access trio state generates a clear error message, and (b) any attempt to yield causes a clear error / crashes the program (as opposed to just being printed to stderr, like regular __del__ does). (If doing this we'd probably also want a firstiter hook like asyncio's, so we can force all async generators to be finalized when the run loop exits. This is just a minor refinement, though.)

The good thing about the finalizer approach is that it has no false negatives: all async generators that can be run safely are allowed to run. The downside is that it has false positives: if an async generator is written poorly, then it's possible that it will be allowed to run and everything will look fine in tests, but then in production it blows up (e.g. because in tests it's always iterated to the end, and in production someone breaks out of a loop so the generator gets collected while only half-finished). The firstiter approach is the opposite: it only allows generators to run if it can be sure that they're safe in all circumstances.

(If PEP 533 is too much for 3.7, then maybe a smaller ask would be set_gen_hooks() that work the same way but for synchronous generators?)

@oremanj
Copy link
Member

oremanj commented Mar 5, 2018

Here's an idea: what if we scope async generators to the task that first iterated them? That is:

  • each task contains an _asyncgens attribute containing a weakref list (can be initially None to avoid having a bunch of empty sets for the tasks that don't use asyncgens)
  • we install a firstiter hook that adds a weakref to the new async generator to the _asyncgens list on the current task
  • when a child task of nursery exits, we start a task in the nursery that does the equivalent of await ag.aclose() for each of its still-alive asyncgens, in reverse order of firstiter

There's some subtlety here as far as treating the nested child sorta-task like a task for these purposes, and scoping asyncgens created there to the direct-parent nursery rather than to the lifetime of the overarching task that opened that nursery, but I think that's mostly details.

This is only a partial solution; it still doesn't deal correctly with asyncgens that create nurseries/cancel scopes.

async def agen():
    with trio.move_on_after(6):
        for i in range(100):
            yield i
            await trio.sleep(i * 0.1)

async def caller():
    ag = agen
    last = None
    while True:
        with trio.move_on_after(0.8) as scope:
            last = await ag.anext()
        await trio.sleep(1)
        if scope.cancelled_caught:
            break

From reading the discussion in #264, it seems like the desired behavior here would be to limit each turn of agen() to 0.8 seconds, and the whole thing (including time spent by caller() when agen() is suspended) to 6 seconds. That gives you a shorter-lived cancel scope (in caller()) more deeply in the cancel scope stack than a longer-lived one (in agen()). I guess context variables are supposed to solve this? I feel vaguely like it should be solvable without them, but can't currently think of how. I guess cancel scopes could detect that they were created inside an asyncgen (via... stack introspection? ugh) and mark themselves accordingly, and cancel scope __exit__ could know it might need to reorder such scopes and delivery of cancels would skip them if the asyncgen is suspended due to a yield. This gets really complicated though. Thoughts?

@oremanj
Copy link
Member

oremanj commented Mar 5, 2018

Thinking about this more:

  • The fundamental problem for tracking of cancel scopes and other contexts within an asyncgen is knowing whether it's currently suspended by a yield (to its async caller) or by a trap (to the event loop) yielded by some other async function it awaited. An asyncgen can't directly yield to the event loop, because "yield" in the body of an async function syntactically means yield to asyncgen caller. So any event loop trap must be behind an await.
  • We know when asyncgens are created and destroyed, if we install the hooks appropriately. We can scope the lifetime of an asyncgen to the lifetime of the task that created it, as explained above. So, whenever a task traps to the event loop, we know what asyncgens might be running at the time of the trap.
  • We can iterate over each of these asyncgens and check its ag_await member. If this is None, the asyncgen is suspended by a yield to its caller and we shouldn't consider its contexts to be in effect at the time of the trap. If it's not None, the asyncgen is suspended by awaiting someone else and we should consider its contexts to be in effect at the time of the trap.
  • The missing piece is figuring out which contexts (pertaining to some task-local state like the cancel scope stack) "belong" to the asyncgen and which "belong" to the code that's calling it. I'll think about this some more.

@oremanj
Copy link
Member

oremanj commented Mar 5, 2018

Expanding on that last bullet point: we want to distinguish

async def agen():
    with trio.move_on_after(1):
        yield 42
async def caller():
    async for val in agen:
        await trio.sleep(val)

from

async def agen():
    yield 42
async def caller():
    with trio.move_on_after(1):
        async for val in agen:
            await trio.sleep(val)

(the second should interrupt the sleep while the first should not). Luckily, the firstiter hook is called when the coroutine returned by asend() is constructed, before the send(None) is performed and thus before any of the async generator code is executed. So if we add asyncgens to Task._asyncgens in firstiter, we can have CancelScope.__enter__ (and any other relevantly context-y things) check each asyncgen's ag_running (works despite bpo-32526 since this is a sync call) and do whatever is necessary to remember that this cancel scope belongs to this asyncgen. I just checked and ag_running appears to be True from the perspective of an async function awaited from the generator; it just flips back to False when the call stack actually traps to the event loop. So I don't think the async-ness of open_nursery() will get in the way either.

I'm sure the devil will be in the details, but can you see any missing pieces in the overall idea?

None of this helps with regular non-async generators that create a cancel scope. But since there's not much point to making a cancel scope that you can't put any awaits in, I don't think there's any use for such a generator besides decorating it with @contextmanager, and we've established we like the existing behavior in that case.

@oremanj
Copy link
Member

oremanj commented Mar 5, 2018

Even more wrinkles: asyncgens can iterate over other asyncgens! I don't have time to think through the implications of this right now; mostly leaving a note here for my future self. I can't immediately come up with any reasons this would invalidate the approach I'm proposing, but it probably does at least mean we need to be careful about the order in which we apply the asyncgen cancel scopes to the cancel scope stack.

Also, ag_await is None is not enough to know an asyncgen is suspended at a yield -- it could just be exhausted. The condition we want for is-currently-suspended-via-yield-to-caller is ag_await is None and ag_frame is not None.

@njsmith
Copy link
Member Author

njsmith commented Apr 3, 2018

Random technical notes

I was definitely going to ask about bpo-32526; that's a pretty obscure issue, so nice catch :-).

It's also legal to create an async generator in one task and then iterate it in another task. Actually I've even used this in an example... in the live-coded happy eyeballs, I cheat a little bit, because I claim that it's doing the same thing as the twisted happy eyeballs code, but twisted's version actually supports incremental getaddrinfo. (Which isn't supported by any real OS, but never mind.) But it's easy to adapt the example to handle the full twisted semantics: suppose we have an incremental getaddrinfo that returns an async iterator:

async def fake_incremental_getaddrinfo(*args, **kwargs):
    for target in await getaddrinfo(*args, **kwargs):
        yield target

Then in the happy eyeballs code, each child task instead of doing target = targets[i] instead does:

try:
    target = await targets_aiter.__anext__()
except StopAsyncIteration:
    return

You also need to do a bit of fiddling with the failed_attempts variable because you don't know ahead of time how many attempts there will be, but never mind -- the point is that if anyone ever does ask about the twisted semantics, then it's also quite easy to do and only adds a few lines. But... it's also a realistic case where iterating a generator in different tasks would be useful.

I suppose one could keep a global list of all living async generators, and scan them whenever a cancel scope is created or a task traps, but this would be way too expensive. (It might even be too expensive if we could restrict to a single task's generators... these are pretty performance-sensitive operations.)

Another potential way to map cancel scopes to async generators would be to use the firstiter hook to keep a global set of living async generator frame objects (!), and then when a cancel scope is created, check whether any of them are on the stack. On CPython at least this would be pretty quick (because now it's O(stack depth), not O(total number of generators in the program)). On PyPy it might be OK, I'm not sure – forcing the JIT to materialize frame objects isn't great, but if we're not looking at the contents of the objects, and it's only for async function frames in practice, then maybe it's not too bad?

But let's step back a bit

The bigger question is what semantics we're even going for here. I think in this issue we should try to focus on just the GC issues, since they're simpler. (I added a link back to here from #264 though so the comments don't get forgotten.)

If you want to work on this, then probably the first thing that would be useful would be to figure out exactly what Python is doing right now, with no hooks defined :-). I wrote down a guess in the first post, but I haven't checked the details (which may require reading CPython's source and stuff).

There's also a simple thing we could do that might be better than what we do now, not sure: guarantee that all async generators will get await trio.aclose_forcefully(agen) called on them.

@njsmith
Copy link
Member Author

njsmith commented Apr 3, 2018

Oh, I should also note that if we do decide to start using the async gen hooks, then as part of that we'll probably want to implement support for those hooks in async_generator. There's some discussion in python-trio/async_generator#13

@oremanj
Copy link
Member

oremanj commented Apr 5, 2018

Thanks for the happy-eyeballs example; that does seem like something we shouldn't gratuitously forbid.

python-trio/async_generator#14 has the GC hooks support, and AFAICT behaves identically to native async generators whether or not a finalizer hook is installed (and your hypothesis in the first post on this thread matches the results of my investigation).

GC: I like the idea of doing aclose_forcefully() to finalize a generator, if no one put an aclosing() block around it to make sure the potentially-blocking operations in its cleanup don't leak out of the cancel scope intended for them. As far as where to do this (for purposes of exception propagation), how do you feel about "the nearest enclosing task that's still around"? That is, we'd install a firstiter hook that adds a weakref to the async generator object to a list in the current Task, as above, but instead of closing async generators when their task completes, we'd just roll them up to be async generators of the parent task. And the finalizer hook would make it so aclose_forcefully() gets called in the context of the task in whose asyncgens list this generator currently lives (at the time of its GC, whatever that might be).

Cancel scopes: Unfortunately, you can't weakly reference a frame object. I guess we could still keep a set of id(frame) values, and remove from the set when the async generator is destroyed. With the work on async_generator showing how fungible yield and await yield_() are, I came up with another idea: what if we provided an API like

@trio.scoped_async_generator
async def agen(count, *, yield_):
    with trio.move_on_after(6):
        for i in range(count):
            await yield_(i)
            await trio.sleep(i * 0.1)

which might be implemented something like

@attr.s
class ScopedYielder:
    scope = attr.ib()
    def __call__(self, value=None):
        with scope.suspend():
            return async_generator.yield_(value)
    def from_(self, delegate):
        with scope.suspend():
            return async_generator.yield_from_(delegate)

def scoped_async_generator(afn):
    @async_generator.async_generator
    async def wrapper(*args, **kw):
        with trio.open_cancel_scope() as scope:
            await afn(*args, **kw, yield_=ScopedYielder(scope))
    return wrapper

And then we just have to implement CancelScope.suspend().

@oremanj
Copy link
Member

oremanj commented Apr 5, 2018

Er, nix the yield_from_ bit; clearly we do want to keep our cancel scopes active until the delegate actually yields something. But we could make async_generator.yield_from_ take an optional yield_ parameter, and say e.g.

    def from_(self, delegate):
        return async_generator.yield_from_(delegate, yield_=self)

Another idea: still assuming we make async_generator.yield_from_ support a user-specified yield_ function, we could move the cancel scope "stack switching" to the side doing the iteration rather than the side doing the yielding, and then people can write native-syntax async generators if they don't care about 3.5 compatibility:

def scoped_async_generator(agen=None, **agen_options):
    if agen is None:
        return partial(scoped_async_generator, **agen_options)
    if not async_generator.isasyncgenfunction(agen):
        agen = async_generator.async_generator(agen, **agen_options)
    @async_generator.async_generator
    async def wrapper(*a, **kw):
        with trio.open_cancel_scope() as scope:
            async def yield_(value=None):
                with scope.suspend():
                    return await async_generator.yield_(value)
            await async_generator.yield_from_(agen(*a, **kw), yield_=yield_)
    return wrapper

@thehesiod
Copy link

so interestingly in aiohttp we were getting a very strange issue where a task was abandoned, followed by a GeneratorExit and a finally getting entered twice from an impossible callstack. I have a feeling it's related to what's being discussed in this thread. Following the discussion from the first comment I think I have a full test case that demonstrates the issue discussed: https://gist.github.com/thehesiod/c2189c41dabdde133a980c62cb33da40

Note how the generator except/finally doesn't have the same task as when it was entered, and further that it's a completely different task that I did not create.

@njsmith
Copy link
Member Author

njsmith commented Aug 1, 2018

@thehesiod yes, that gist is showing the asyncgen gc hooks working: the generator's __del__ method basically does a loop.call_soon to re-enter the loop, and then spawns a new task to call await agen.aclose(). See PEP 525 for the gory details.

@njsmith njsmith removed the todo soon label Sep 6, 2018
@devxpy
Copy link

devxpy commented Nov 17, 2018

@njsmith

This is unrelated, but do we have a workaround for PEP 533 currently?

Specifically, the open() and nested for-loop issue you have illustrated.

The proposal sounds great BTW -- how it should've been from the beginning. Thank you!

@njsmith
Copy link
Member Author

njsmith commented Nov 24, 2018

@devxpy We don't have a general workaround, and I don't think there's any prospect of 533 making it into 3.8. But I think the discussion in #638 may be getting closer to an acceptable workaround – in particular starting here: #638 (comment)

The idea is instead of having a generator that you start and stop and then have to figure out how to handle it being abandoned in the "stop" state, you run the "generator" as a concurrent task that sends values to the loop.

@njsmith
Copy link
Member Author

njsmith commented Feb 9, 2019

With another year+ of experience, I think we can see better what's going on here. There are two problems: (1) yield inside of a cancel scope/nursery is just... not really workable. See #264, #638 for details. The best hope right now is that we might be able to disallow them entirely. (2) For garden-variety async generators, we currently don't clean them up properly at all. There isn't any prospect on the horizon for handling them better than regular generators. But we could at least handle them as well as regular generators. This would mean:

  • Use the async generator hooks to capture when an async generator is GCed
  • Spawn a system task to call aclose_forcefully on the async generator
  • If this raises an exception, dump it to stderr (or use logging to dump it to stderr?), and discard it

The whole "discard the exception" thing makes me wince, but that's how Python handles exceptions in __del__ in general, so this would just be making it so async generator's __del__ is no worse than every other object's __del__.

Implementing this is straightforward, though there are some fiddly details that have to be taken care of. In particular: with the async generator hooks you have to handle both the case where the objects are GCed while the run loop is running, and the case where they outlive the run loop (in which case you forcibly "GC" them when the run loop exits). And you have to be prepared for the hooks to be called from any thread, so you need to use a TrioToken to re-enter Trio before doing anything. (But keeping in mind that the Trio loop may be gone already...)

@njsmith
Copy link
Member Author

njsmith commented Feb 9, 2019

Some discussion of this starting here: https://gitter.im/python-trio/general?at=5c5e2cb5d1e3093ab5eb34ad

@njsmith
Copy link
Member Author

njsmith commented Aug 5, 2019

Example of a user getting very confused by async generator/async with/trio/gc interactions: https://gitter.im/python-trio/general?at=5d475c7be939ab2447f895aa

The generator code was pretty simple:

async def read_next_line_from_file_async(self, file_path):
    async with await trio.open(file_path) as f:
        async for line in f:
            yield line

When it got garbage collected, the GC injected GeneratorExit, which caused the file's __aexit__ to run, which then blew up because it was running outside of Trio context, and then the blowup caused a RuntimeError, and then the GC dumped that on the console, and this was all happening in the context of a Trio cancellation, and the user didn't have the secret decoder ring to tell them that Exception ignored in is actually a crucial clue that the garbage collector is involved, so they didn't mention this in the initial report...

If Trio had used the async gen GC hooks to move the cleanup into a pre-cancelled system task, then that would have solved their issue completely.

@gbgduh
Copy link

gbgduh commented Feb 1, 2020

I wonder if it can be of any help - https://bugs.python.org/issue35409 (Async generator might re-throw GeneratorExit on aclose()) and I came across this in https://github.com/vxgmichel/aiostream/blob/master/aiostream/aiter_utils.py#L179:L185.

@oremanj
Copy link
Member

oremanj commented May 23, 2020

I'm working on implementing this. My current plan works like you've suggested by default, but also allows users to create "walled gardens" in which async generator cleanup behaves in a more structured-concurrency-ish way:

async with trio.manage_asyncgens():
    # All async generators first iterated in here (including child tasks)
    # will be cleaned up before exiting the manage_asyncgens block,
    # and any exceptions they raise during cleanup will propagate out of it;
    # the aclose() tasks don't get cancelled until the code inside the
    # block is done

async with trio.manage_asyncgens(cleanup_timeout=5):
    # These get 5 seconds after the async with body is done before
    # cancelling their aclose() tasks

async with trio.manage_asyncgens(ambient=False) as manager:
    # This one gets the global cleanup treatment:
    async for x in some_agen():
        ...

    # But this one is bound to the local scope
    async for x in manager.manage(some_agen()):
        ...

I think this offers a nice middle ground between totally acausal/"afterthought" cleanup on the one hand, and strictly writing async with aclosing every time you create an asyncgen on the other. Thoughts?

@njsmith
Copy link
Member Author

njsmith commented May 23, 2020

I dunno. The basic cleanup feature is obviously necessary. The "walled garden" idea is interesting, but seems way more tentative/experimental. And that's a lot of different APIs you're suggesting there :-). (I'm particularly dubious of the timeout stuff... there's a lot of room for obscure issues if folks start expecting that objects can keep running after being GC'ed.) So I think I'd like to see the basic feature land first and then think about extensions (possibly in a new issue).

(NB it should also be possible to experiment with cleanup managers in tricycle; just overwrite trio's default asyncgen hooks with extended versions.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants