
Rethink KeyboardInterrupt protection APIs #733

Open
njsmith opened this issue Oct 13, 2018 · 12 comments

@njsmith
Member

njsmith commented Oct 13, 2018

Sometime before 1.0, we should go through the KeyboardInterrupt protection APIs and think them through again.

There seem to be a limited number of patterns: decorating functions/methods that should always have KI protection enabled; setting the default protection in a new task (can be overridden if the function has a preference); temporarily disabling and then reverting to what it was before (especially, reverting inside a call to outcome.capture or acapture). Some of these are kind of awkward with the current API, or impossible (in some places I think we do disable/enable, rather than disable/revert-to-previous-state).
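
For reference, a minimal sketch of how those patterns map onto the existing public API, using today's trio.lowlevel spelling (enable_ki_protection, disable_ki_protection, currently_ki_protected); the example functions and comments are illustrative, not anything in Trio itself:

```python
import trio

# Pattern: a function/method that should always run with KI protection on.
@trio.lowlevel.enable_ki_protection
def delicate_bookkeeping():
    ...

# Pattern: calling back into user code with KI allowed. Note this forces a
# fixed state rather than "disable, then revert to whatever the caller had",
# which is the awkward/impossible case mentioned above.
@trio.lowlevel.disable_ki_protection
def run_user_callback(cb):
    return cb()

# The current state can at least be queried:
def log_ki_state():
    print("KI protected?", trio.lowlevel.currently_ki_protected())
```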

It would be nice to fix #550. This might be doable on 3.7+ by exploiting the atomicity of Context.run?

We'll also want to look again at the issues around atomicity when exiting context managers, I guess.

Should tasks inherit their default KI protection state from the parent nursery?

@oremanj
Member

oremanj commented May 10, 2020

I've been looking at this some yesterday and today. I have a few possible approaches that I'd like feedback on.

The evolutionary one

I'd like to see us move away from the magic locals key if we can. It adds some runtime overhead at most protection status changes, adds extra stack frames in tracebacks, makes @enable_ki_protection-decorated async functions return False from inspect.iscoroutinefunction(), and is noisy when inspecting locals in a debugger. Instead, I think we should use some sort of static mechanism (based on what code you're executing right now), plus a dynamic/policy-level mechanism (so the same function/context can allow KI in some cases and not in others). KI is only delivered immediately if both mechanisms say it's fine.
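
For context, a rough sketch of what a locals-key scheme like the one being criticized can look like. The key name and helpers are made up, not Trio's actual internals, and the trick relies on a CPython detail (pre-PEP 667): extra keys written into locals() stick around in the frame's f_locals.

```python
import functools
import sys

# Hypothetical sentinel; Trio's real key name is a private implementation detail.
KI_PROTECTION_KEY = "@ki_protection_enabled"

def ki_protected(fn):
    """Leave a marker in the wrapper's frame so a stack walk can find it."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # The "magic locals key": extra keys written into locals() persist in
        # this frame's f_locals, where the signal handler's stack walk sees them.
        locals()[KI_PROTECTION_KEY] = True
        return fn(*args, **kwargs)
    return wrapper

def currently_protected() -> bool:
    """Walk outward from the caller; the innermost marker wins."""
    frame = sys._getframe(1)
    while frame is not None:
        if KI_PROTECTION_KEY in frame.f_locals:
            return bool(frame.f_locals[KI_PROTECTION_KEY])
        frame = frame.f_back
    return False
```

The wrapper frame is where most of the costs listed above come from: per-call overhead, extra traceback entries, and confusion for inspect.iscoroutinefunction() in the async variants.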

The dynamic mechanism can just be a contextvar as long as the static mechanism protects all the contextvar machinery. We'll use this to provide "system tasks start out KI-protected", and provide context managers for toggling it back and forth. Then roundtripping through a thread (the only place in Trio where we explicitly disable KI protection) will automatically preserve KI status once it preserves other contextvars (#178).
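
A minimal sketch of what that dynamic/policy-level piece could look like, assuming a single hypothetical contextvar (the names are made up):

```python
from contextlib import contextmanager
from contextvars import ContextVar

# Hypothetical policy-level flag: True means "only deliver KI at checkpoints".
ki_protected_cvar: ContextVar[bool] = ContextVar("ki_protected", default=False)

@contextmanager
def ki_protection(enabled: bool):
    """Toggle the dynamic KI policy for a block, restoring it on exit."""
    token = ki_protected_cvar.set(enabled)
    try:
        yield
    finally:
        ki_protected_cvar.reset(token)
```

Because it is an ordinary contextvar, a system task can simply start with it set to True, and once contextvars are carried across thread round-trips (#178) the KI status rides along for free.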

I've prototyped this with the static mechanism being a dictionary from code objects to protection state changes. The KI protection decorators just add the decorated function's code object to the dictionary. This works pretty well; it's a little more expensive at import time but effectively free when going from unprotected to protected or vice versa at runtime. It also lets us KI-protect code we don't control, such as the plumbing of contextvars.Context.run, outcome.acapture, contextlib._GeneratorContextDecorator, and so on.
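
A sketch of that static mechanism under the same assumptions; the registry and helper names are made up, and outcome.acapture stands in for "pure-Python code we don't control":

```python
import sys

import outcome

# Hypothetical registry: code object -> the protection state it switches to.
_ki_protection_table = {}

def enable_ki_protection(fn):
    """Register fn's code object; no wrapper, no extra stack frame."""
    _ki_protection_table[fn.__code__] = True
    return fn

def disable_ki_protection(fn):
    _ki_protection_table[fn.__code__] = False
    return fn

def stack_says_protected(default=False):
    """The innermost frame on the stack with a registered code object decides."""
    frame = sys._getframe(1)
    while frame is not None:
        state = _ki_protection_table.get(frame.f_code)
        if state is not None:
            return state
        frame = frame.f_back
    return default

# Code we don't control can be registered the same way, as long as it's pure Python:
_ki_protection_table[outcome.acapture.__code__] = True
```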

The revolutionary one

The only reason to ever not be KI-protected is so that you can interrupt an infinite loop that's not executing any checkpoints. If we can come up with a better solution for that, can we ditch the KI-protection state-changing machinery entirely and always be "KI protected" (i.e., KI delivered only at checkpoints)?

The current KI protection code does a great job at letting you make protection granular per function. But in my experience that's somewhat too granular for typical use. In practice all of my Trio apps use restrict_keyboard_interrupt_to_checkpoints=True and my libraries use @enable_ki_protection on every public-facing function (assuming I remember it!), because thinking about KI being potentially raised at every point is way too much mental overhead. If our goal is to make it easy for our users to get things right, it doesn't seem well-advised to give them, by default, a bomb that can go off at any individual bytecode position anywhere in their code. We've taken careful precautions that it won't mess up Trio's state, but it's just as likely to mess up theirs.

The easiest "better solution for infinite loops" that I can think of is an easy user-accessible "dump task tree" feature. If we think that having Ctrl+C not work in that case is too surprising, it's probably possible to have a SIGINT handler that spawns a watchdog thread using the principles discussed in #591, but the details seem likely to be super finicky. But it's possible to wind up with Ctrl+C not working on current Trio regardless (using shielding or an improperly written abort_func), for reasons that have nothing to do with KI protection or infinite loops, and it would be nice to have one recipe that works for all such cases. Maybe we distinguish between an initial and a subsequent Ctrl+C by default.
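
A minimal version of the "dump task tree" idea, using only documented trio.lowlevel attributes; it has to run from inside trio.run, and wiring it up to Ctrl+C from outside the run loop (via a watchdog thread or similar) is exactly the finicky part described above:

```python
import trio

def print_task_tree() -> None:
    """Rough 'dump task tree' helper; call it from inside trio.run."""
    def visit(task, indent=""):
        print(indent + task.name)
        for nursery in task.child_nurseries:
            for child in nursery.child_tasks:
                visit(child, indent + "    ")

    visit(trio.lowlevel.current_root_task())
```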

A medium-aggressiveness compromise

What if we tracked KI protection per package? If the innermost frame is in Trio, a Trio dependency that we trust not to introduce infinite loops (outcome, async_generator, etc), or a library written against Trio that has opted in to KI protection, then defer the KI. If the innermost frame is in another non-stdlib package, deliver the KI immediately. If it's in the stdlib or in a Trio dependency that might be implicated in a non-Trio infinite loop (such as sortedcontainers), walk outward until you get to a package that fits one of the other cases, then do what it says.
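
One way that walk could be written, as a sketch; the package sets, the stdlib check, and should_defer_ki itself are all placeholders, and a library opting in to KI protection would simply be added to the trusted set:

```python
import sys

TRUSTED = {"trio", "outcome", "async_generator"}   # defer KI to a checkpoint
WALK_PAST = {"sortedcontainers"}                    # plus the stdlib: keep walking

def _package_of(frame):
    return frame.f_globals.get("__name__", "").partition(".")[0]

def _looks_like_stdlib(package):
    # Crude stand-in; on 3.10+ one could consult sys.stdlib_module_names.
    return package in getattr(sys, "stdlib_module_names", ())

def should_defer_ki(frame):
    """Classify the interrupted frame by package, walking outward when ambiguous."""
    while frame is not None:
        pkg = _package_of(frame)
        if pkg in TRUSTED:
            return True        # deliver only at the next checkpoint
        if pkg in WALK_PAST or _looks_like_stdlib(pkg):
            frame = frame.f_back
            continue
        return False           # other non-stdlib code: deliver KI immediately
    return True
```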

We could combine this with the existing locals-key-based per-function KI protection to allow granularity less than package-level where needed. But the package-level protection would catch almost all cases, so the locals key would see actual use only rarely, which minimizes most of the above-described drawbacks of using it. (The only use we'd need that I can think of is the one at the top of each task.)

@njsmith
Member Author

njsmith commented May 12, 2020

Hmm. So I guess there are three somewhat-but-not-entirely distinct topics here:

  • What mechanisms should we use to store and track KI protection state?
  • What API should we use to manage that state?
  • What program-wide policies should we expose to users?

Starting with the last one first:

In practice all of my Trio apps use restrict_keyboard_interrupt_to_checkpoints=True [...] If our goal is to make it easy for our users to get things right, it doesn't seem well-advised to give them, by default, a bomb that can go off at any individual bytecode position anywhere in their code. We've taken careful precautions that it won't mess up Trio's state, but it's just as likely to mess up theirs.

I hear you. Python's standard KeyboardInterrupt system makes me itch too, because it's just not possible to know whether you have all the edge cases correct.

But... there's a kind of "when in Rome" principle for API designs. If we were designing our own language, maybe we'd do it differently, but if you're working within an existing platform, then a suboptimal-but-consistent design is often better overall than a design that's theoretically more optimal but requires a constant fight with the underlying platform and user expectations. (See also: trying to shove functional programming into an imperative language or vice-versa, using snake_case in a language where camelCase is the convention, etc.)

And in this case... whatever the theoretical problems with KeyboardInterrupt, the objective fact is that 99.99% of Python programs are using it, and it works out just fine for them.

So it makes me very reluctant to give up the default user-friendliness of "it probably just works, no matter how broken your code is", in return for improving the handling of theoretical edge cases that 99% of users don't even understand and rarely encounter.

What we can definitely do though, is make it easier for folks who care to opt-in to the more rigorous/unforgiving approach.

In theory that's what restrict_keyboard_interrupt_to_checkpoints=True is supposed to do, but I kind of hate it. The alternative is to have a task that listens for KeyboardInterrupt and then raises an exception, which:

  • doesn't require special-case hacks inside the run loop
  • is more flexible (pick your favorite exception, print a message, whatever)
  • the semantics are way more obvious
  • and the semantics are probably more correct, too

In particular, with restrict_keyboard_interrupt_to_checkpoints, you can still get KeyboardInterrupt anywhere, including inside cancel-shielded code (#151). With a task that converts KeyboardInterrupt into a regular exception, then trio's normal exception-handling logic will convert that exception into a cancellation, and cancellation is even easier to reason about than KeyboardInterrupt-at-checkpoints.
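
A sketch of what such a listener task could look like with today's API; trio.open_signal_receiver is real, while do_the_actual_work is a hypothetical stand-in for the application. Installing a receiver for SIGINT replaces Trio's default Ctrl+C handler for as long as the with block is active, so the KeyboardInterrupt is raised at one well-defined place:

```python
import signal

import trio

async def ki_listener():
    # While this receiver is installed, SIGINT is delivered here instead of
    # being injected at an arbitrary bytecode boundary somewhere in the program.
    with trio.open_signal_receiver(signal.SIGINT) as sigint_receiver:
        async for _ in sigint_receiver:
            raise KeyboardInterrupt

async def main():
    async with trio.open_nursery() as nursery:
        nursery.start_soon(ki_listener)
        await do_the_actual_work()     # hypothetical application code
        nursery.cancel_scope.cancel()  # the shutdown chore mentioned below
```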

(Note, this is also an issue with @enable_ki_protection – it should really be named @restrict_ki_to_checkpoints. Speaking of which, I think right now restrict_keyboard_interrupt_to_checkpoints is basically equivalent to putting @enable_ki_protection on your main function, so we have two independent ways to do the same thing? This whole API smells funny.)

The big downside of a custom task, of course, is that writing it, making sure it gets cancelled correctly on shutdown, etc., is pretty onerous compared to setting a single flag in run.

However! There's the independent suggestion that maybe when a KeyboardInterrupt arrives at a delicate moment, then instead of injecting it through a checkpoint, we should cancel the whole program and then re-raise it at the top. If we switched to this system, then:

  • @enable_ki_protection would actually mean you never see KeyboardInterrupt
  • Putting @enable_ki_protection on your main function would be just like setting up a listener task that does async for _ in sigint_receiver: raise KeyboardInterrupt

In this approach, maybe we should rename it to @interrupt_as_cancel or something? And probably get rid of @disable_ki_protection entirely; replace it with some kind of capture-state/restore-state API.

The easiest "better solution for infinite loops" that I can think of is an easy user-accessible "dump task tree" feature. If we think that having Ctrl+C not work in that case is too surprising, it's probably possible to have a SIGINT handler that spawns a watchdog thread using the principles discussed in #591, but the details seem likely to be super finicky. But it's possible to wind up with Ctrl+C not working on current Trio regardless (using shielding or an improperly written abort_func), for reasons that have nothing to do with KI protection or infinite loops, and it would be nice to have one recipe that works for all such cases. Maybe we distinguish between an initial and a subsequent Ctrl+C by default.

These all seem like valuable things to do regardless of whether we get revolutionary or not :-). I think we're coming to the conclusion that if the user hits control-C, and doesn't have their own SIGINT handler registered, then the only possible correct outcome is that trio.run should exit soon. (Rationale: The fact that Trio has multiple tasks means that no except KeyboardInterrupt inside Trio can possibly work reliably, because the exception might be delivered to some other task. So therefore no correct program can rely on the default handler + catching KeyboardInterrupt inside Trio. Therefore any program where the default handler has fired needs to exit trio.run soon.)

And note that this is true regardless of whether they have some kind of restrict_keyboard_interrupt_to_checkpoints setting enabled -- either way KeyboardInterrupt is expected to tear down the whole trio.run, just the mechanism for doing so is slightly different.

So I think it'd be totally legit to add watchdogs, task-dumping, etc., regardless of what else we decide here.

my libraries use @enable_ki_protection on every public-facing function (assuming I remember it!)

With a library author hat on, I do think this might be overkill. Of course there's some taste involved, but inasmuch as we have an official guideline, I think it should be something like: libraries should use @enable_ki_protection on functions where a surprise exception could cause other tasks to deadlock or corrupt the scheduler state. (So, mostly only code that's messing with wait_task_rescheduled.)

If a KeyboardInterrupt can cause some weird effects but still kills the program... well, that's how every library on PyPI has always worked, that's what most users want/expect. And if you're a user who doesn't want that then you can configure that once for your whole program; it's not something you'll leave up to individual libraries.

OK, so in conclusion, I do think we should continue to distinguish between special protected code and regular unprotected code, rather than always treating the whole program as protected. So we still need some mechanism for doing that:

I'd like to see us move away from the magic locals key if we can.

Ugh yes I would love to kill that code.

I think you're right that contextvars and tracking code objects are probably a better system. And I like your point about propagating the context across threads automatically carrying the protection status with it ... though I think we'd need a trio.lowlevel.copy_context() that's the same as contextvars.copy_context() except that it bakes-in the current protection state (including the effects of any marked code objects currently on the stack).

One thing to watch out for with code objects is that they can potentially be shared by logically distinct functions:

from trio.lowlevel import disable_ki_protection, enable_ki_protection

def f():
    ...

# Two logically distinct functions that share a single code object:
f1 = enable_ki_protection(f)
f2 = disable_ki_protection(f)

Currently, f1 and f2 are distinct objects with distinct KI protection state. But they share the same code object, so in a code-object world this wouldn't work.

I don't think we actually care about this, though.

It would also be fine to have some hard-coded special cases, like: if code_obj.co_globals["__name__"] in ["contextvars", "contextlib"] -> protection enabled. From a quick skim it doesn't look like either of those modules uses any utility functions from other modules on the critical path.
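
Phrased at the frame level (frames, unlike code objects, can reach their module name through f_globals), such a special case might look like this hedged sketch:

```python
def frame_is_special_cased(frame):
    # Code objects don't carry their globals, but the frame at check time does,
    # so the module name is reachable as frame.f_globals["__name__"].
    return frame.f_globals.get("__name__") in ("contextvars", "contextlib")
```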

@oremanj
Member

oremanj commented May 12, 2020

Thanks so much for the very detailed & thoughtful response!

But... there's a kind of "when in Rome" principle for API designs. If we were designing our own language, maybe we'd do it differently, but if you're working within an existing platform, then a suboptimal-but-consistent design is often better overall than a design that's theoretically more optimal but requires a constant fight with the underlying platform and user expectations.

OK, I'm sufficiently convinced that letting KI be raised everywhere in user code is a reasonable default. :-)

However! There's the independent suggestion that maybe when a KeyboardInterrupt arrives at a delicate moment, then instead of injecting it through a checkpoint, we should cancel the whole program and then re-raise it at the top.

Yep, this is still my goal too! I didn't want to raise too many issues at once, but "KI at a delicate moment should cancel the whole run and then reraise KI just as we're exiting" is my next plan on the KI thread after we figure out the API design questions.

In this approach, maybe we should rename it to @interrupt_as_cancel or something? And probably get rid of @disable_ki_protection entirely; replace it with some kind of capture-state/restore-state API.

I've been thinking of this in the context of two different ways I imagine people will want to think about the question of "can KI be raised here?", which might want to use two different mechanisms.

  • As a policy question: do I want to think about KI-being-raised-potentially-anywhere for this {task, library, program} or not? Once you set this policy, you probably want it to be inherited across tasks and thread-roundtrips until it's changed, which argues for a contextvar. This is the one where marking main() would be equivalent to the current restrict_keyboard_interrupt_to_checkpoints feature. Implementing package-based KI protection (my third section above) unlocks applying this policy at the per-library level.

  • As a local safeguard: "I'm doing some delicate meddling here and I need to not be raising KI at unexpected points, no matter what context I'm called from". This one would probably be a dictionary keyed off code objects. In theory it would be just as reasonable to implement it using the contextvar, as long as we special-case Context.run() and a few other things as always being protected. In practice I worry about the overhead of adding that contextvar-fiddling to a great many low-level frequently-executed paths in Trio.

It's easy to capture and restore a contextvar; it's harder (more expensive, at the very least) to capture and restore the state implied by the code objects of the frames that are currently on the stack. I'd propose that if we have the "local safeguard" mechanism at all, we just make it not get inherited by child tasks nor passed through threads. That way we don't need trio.lowlevel.copy_context().

Speaking of which, I think right now restrict_keyboard_interrupt_to_checkpoints is basically equivalent to putting @enable_ki_protection on your main function, so we have two independent ways to do the same thing?

This is not currently true, because the enable_ki_protection state is not inherited by child tasks (since child tasks don't hook up to their parent's stack at the Python interpreter level, although boy would it be nice if they did). I think it's a desirable end state though.

With a library author hat on, I do think this might be overkill. Of course there's some taste involved, but inasmuch as we have an official guideline, I think it should be something like: libraries should use @enable_ki_protection on functions where a surprise exception could cause other tasks to deadlock or corrupt the scheduler state.

You're right, and on further examination I think I don't actually wind up protecting the whole library in practice. I would do so if it were easy though, just so there's one fewer thing to think about. (I think it's more likely that I'll write something KI-unsafe than something that infinite-loops without executing any checkpoints.)

One thing to watch out for with code objects is that they can potentially be shared by logically distinct functions:

Yep, that would be a semantic difference: under the new system you're not allowed to do things like that. (You can fall back to the contextvar-level approach if you need to.)


Anyway, synthesizing all of this into a concrete API proposal...

  • Deprecate enable_ki_protection and disable_ki_protection. Until they're actually removed, let them continue to use the magic locals key, and have the state of that locals key override all other considerations if it's present. Once they're removed, the magic locals key goes away too.

  • Add take_ki_immediately and take_ki_as_cancel, which affect the state of the contextvar. I think these should work as both (synchronous) context managers and as decorators. The context manager evaluates to a token that can be used to restore the previous state for an inner scope. System tasks start out as take_ki_as_cancel, while the main task starts out as take_ki_immediately.

  • Deprecate restrict_keyboard_interrupt_to_checkpoints. In its place add a run parameter on_keyboard_interrupt which can take string values "raise" (the default), "cancel_then_raise" (acts like the main function was decorated with take_ki_as_cancel), or "cancel_silently" (on KI trio.run() returns None, useful for servers).

  • Add ki_critical_section, which is a decorator that remembers the decorated code object. Apart from working at a code object level rather than a function level, it works exactly like the current enable_ki_protection: KI will not be raised at arbitrary points in this function or any of its callees, regardless of the state of the contextvar. There is no corresponding disable; if you need that capability, use the take_ki_* functions. The critical-section-ness is not inherited by child tasks or propagated across threads.

@njsmith
Member Author

njsmith commented May 13, 2020

Deprecate enable_ki_protection and disable_ki_protection. Until they're actually removed, let them continue to use the magic locals key, and have the state of that locals key override all other considerations if it's present. Once they're removed, the magic locals key goes away too.

By "override all other considerations", I guess you mean: if the stack walk hits a frame with the magic locals key, then it exits then with that as its answer?

Add take_ki_immediately and take_ki_as_cancel, which affect the state of the contextvar. I think these should work as both (synchronous) context managers and as decorators. The context manager evaluates to a token that can be used to restore the previous state for an inner scope. System tasks start out as take_ki_as_cancel, while the main task starts out as take_ki_immediately.

Setting the contextvar at the top of a task works well, but modifying it within the task is pretty tricky. E.g., suppose you have:

async def top():
    await middle()

@enable_ki_protection
async def middle():
    await bottom()

async def bottom():
    with take_ki_immediately():
        ...  # <-- signal handler runs here

Here, the contextvar has been explicitly set to take the KI immediately... but we don't know which stack level that was set at. So here the stack walker will encounter the @enable_ki_protection before it reaches the top of the stack and consults the contextvar, so the with take_ki_immediately() is actually a no-op.

Of course, it does work for cases where there are no decorated functions on the stack, and for the one case we have in mind (trio.from_thread) that's sufficient. But that's still a kind of long-range implicit coupling that makes me nervous...

Deprecate restrict_keyboard_interrupt_to_checkpoints. In its place add a run parameter on_keyboard_interrupt which can take string values "raise" (the default), "cancel_then_raise" (acts like the main function was decorated with take_ki_as_cancel), or "cancel_silently" (on KI trio.run() returns None, useful for servers).

Quick thought: I'm not a big fan of cancel_silently. Server shutdown mostly involves SIGTERM, not SIGINT. And if someone wants to customize how KeyboardInterrupt is handled/printed, they can just do:

import sys
import trio

try:
    trio.run(...)
except KeyboardInterrupt:
    print("Exiting due to control-C")
    sys.exit(1)

...or whatever. Plus silently exiting is often not the best default behavior anyway :-).

@oremanj
Member

oremanj commented May 13, 2020

By "override all other considerations", I guess you mean: if the stack walk hits a frame with the magic locals key, then it exits then with that as its answer?

Yep. This allows @disable_ki_protection to override the contextvar, which is important because that was the previously documented way to enable KI delivery in a system task.

Setting the contextvar at the top of a task works well, but modifying it within the task is pretty tricky. E.g., suppose you have: [...]

@enable_ki_protection is on its way out under this proposal. Old code that still uses it won't be modifying the contextvar; it will be using @disable_ki_protection if it's calling back into user code. The non-disable-able replacement for @enable_ki_protection (what I was calling @ki_critical_section) does indeed apply to the whole callstack underneath it, no matter what the contextvar says. That's because, being non-disableable, it's not useful in cases where you need to call back into user code. If you need to toggle back and forth, you use the contextvar.

I'm not a big fan of cancel_silently. Server shutdown mostly involves SIGTERM, not SIGINT.

Sure, happy to remove it! SIGINT-for-clean-exit was the convention at my previous employer, and is also useful for running stuff from the command line, but I don't feel strongly about it.

@njsmith
Member Author

njsmith commented May 13, 2020

On further thought, I wonder if contextvars are maybe a red herring. One thing that's making me wonder is that the simplest semantics for "KI disabled" would be "you never get a KeyboardInterrupt", but that means that the "KI disabled" propagation should match exception propagation... and that's not how contextvar propagation works, because new tasks get contextvars from whoever called start_soon, not from the enclosing nursery.

So I guess the cases we actually care are:

  • We need to figure out whether the signal handler is being called inside a task or not (very tricky b/c atomicity requirements)
  • If it's inside a task, we need to figure out whether this task as a whole has KI enabled or not
  • We want the blocked-or-not state to be propagated through to_thread/from_thread chains
  • We want to be able to disable KI for specific critical sections, and we want this to be cheap
  • We want the blocked-or-not state to be propagated through nurseries, I guess? and we want this to be cheap

Is there anything else we actually care about?

  • To figure out whether a signal handler is being called inside a task or not, we could keep a mapping from task root frame object → Task (sketched below). That's pretty cheap to maintain, and it's atomic: the signal handler is inside a task iff one of the frames it sees on the stack is the root of some task. So no need for locals or contextvars or anything there.

  • Once we find the Task, there are lots of mechanisms we could use to figure out whether this Task is KI enabled or not, e.g. a field on the Task object.

  • One of the other changes we want to make to to_thread/from_thread is to re-enter the original task, instead of entering a new system task. The original motivation was to make cancellation work (see Propagating cancellation through threads #606, and also Better usability for threads #810, Proposal: make checkpoint_if_cancelled() sync-colored and rename it accordingly #961, etc.), but this would also automatically make KI state propagation work... as long as the user didn't change it inside the thread.

  • For critical sections, using code objects makes calling them cheap and easy (and makes it much easier to handle some of the gross cases like generators).

    • OTOH it makes checking whether a given frame is currently protected much more expensive (have to walk the stack), which might be bad if nurseries need to propagate this...

...for a bit I thought maybe we could handle it by making the signal stack-walking code smart enough to walk through nurseries, so nurseries could propagate the state without having to actually calculate it ahead of time. But that doesn't work. (Also I dunno how signal-safe the nursery bookkeeping data structures are.)

I guess we could measure how expensive it is for nurseries to capture the whole stack when created (up to the task root, or the first sync function, which might be cheaper since you can read it directly off frame.f_code.co_flags). Maybe we're just being over-paranoid and it's actually cheap enough to be worth it. It would certainly be convenient for debugging.
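
For concreteness, a sketch of the root-frame → Task lookup from the first bullet above; the registry and its maintenance are hypothetical, and a task's root frame is roughly task.coro.cr_frame:

```python
import sys

# Hypothetical bookkeeping, updated by the run loop as tasks start and finish;
# entries must be removed when a task exits.
_root_frame_id_to_task = {}

def task_being_interrupted():
    """From the SIGINT handler: which Task, if any, did the signal land in?"""
    frame = sys._getframe(1)
    while frame is not None:
        task = _root_frame_id_to_task.get(id(frame))
        if task is not None:
            return task
        frame = frame.f_back
    return None  # not inside any task (e.g. the run loop's own machinery)
```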

Hmm, also, I guess in the to_thread/from_thread case, we actually do need to temporarily block KI and then re-enable it even if we're using the same task, because the actual to_thread stuff needs to block KI.

I thought I was on the trail of a nice simple solution but it's not quite there yet is it :-)

@njsmith
Member Author

njsmith commented May 13, 2020

Maybe we can exploit somehow that we don't really want a "enable KI delivery" option, only the option to disable KI delivery in particular spans of code? So it's a strict "AND": if the task vetoes KI, then KI doesn't happen. If the stack vetoes KI, then KI doesn't happen. That makes the anomalous case I pointed out (with the top/middle/bottom) much less of an issue.

@smurfix
Contributor

smurfix commented May 13, 2020

So it's a strict "AND"

I think you're right. This is the only sensible way out of this mess IMHO.

@oremanj
Member

oremanj commented May 13, 2020

So it's a strict "AND": if the task vetoes KI, then KI doesn't happen. If the stack vetoes KI, then KI doesn't happen.

Yes, that was my intent, which I probably clouded by talking about the backward-compatibility behavior of the @enable / @disable_ki_protection decorators after this change. The equivalent of the "task-level vetoing" in my proposal is the contextvar, and of the "stack-level vetoing" is the code object metadata. (More below on why I still think a contextvar is a reasonable choice.)

The reason we need the weird exception about @disable_ki_protection being more powerful than a task-level veto is to support old code that uses @disable_ki_protection to allow receiving KI in a system task (system tasks would be vetoing KI at the task level in the new model). It's OK in my mind if the old decorators have some slightly strange interactions with the new approach, as long as we document them and do our best to make patterns that might already be in use continue to work. We'll be rid of the old decorators in a couple months anyway.

I don't think inheriting the code-object-based state (@ki_critical_section) across tasks is worth the trouble. The current @enable_ki_protection doesn't get inherited across tasks either. I know Trio really likes the model of "choosing to spawn child tasks is an implementation detail as long as they don't outlive you", and I agree with it in general, but I'm arguing for making a somewhat unprincipled exception in this case. All the uses for @ki_critical_section that I think are important from a performance or correctness perspective obviously don't spawn any child task; there's another mechanism available (the contextvar-based one) which does allow child tasks; and it's going to be in lowlevel and can bend the rules like some other things in lowlevel do, as long as we document it.

To be clear, I think we should document the contextvar mutator as the primary interface, and the code object-based decorator as an optimization that is applicable under specific circumstances and limitations only.

To figure out whether a signal handler is being called inside a task or not, then we could keep a mapping from task root frame object → Task.

Oh, that's a nice trick! I don't know if it's necessary though; even if we don't want to store the KI state in a contextvar, we could have a "you're in a task, bool yes/no" contextvar, and if it's true then current_task() works fine to see which one.

the simplest semantics for "KI disabled" would be "you never get a KeyboardInterrupt", but that means that the "KI disabled" propagation should match exception propagation

I don't think I agree. If you call a function which reenables KI, then that function can get a KeyboardInterrupt, which might escape from your call to it and which you thus might need to handle. I think "starting a task which reenables KI can cause you to need to handle KI yourself" is morally equivalent. And at that point it doesn't really matter whether the reenablement is at top level in the new task, or around the start_soon call that creates it. I think it will ultimately be less confusing if this behaves like any other contextvar. However, that point is separable from the rest of the design -- we could replace "contextvar" with "task attribute that is captured by NurseryManager and for which new tasks inherit the parent nursery's captured value" (the same as the behavior for cancel status, basically) and the remaining aspects of the design would be unchanged.

Separately, it would be nice to have an extensible facility for "per-task attribute that is inherited via the nursery, not the spawner". Like a ScopedContextVar or something. Let the nursery do copy_context() (cheap by design!) and say ScopedContextVar wraps ContextVar with some logic to look itself up in the current task's parent nursery's context. (Honestly I sometimes wish all contextvars worked this way, but that ship has long since sailed.)

@njsmith
Member Author

njsmith commented May 15, 2020

Oh, that's a nice trick! I don't know if it's necessary though; even if we don't want to store the KI state in a contextvar, we could have a "you're in a task, bool yes/no" contextvar, and if it's true then current_task() works fine to see which one.

Yeah, I guess there are a bunch of ways to do this. Using a contextvar would mean we have to jump through some special hoops to make sure the contextvars backport is atomic, but I guess that's harmless. Or I guess if we really wanted to shove all the costs into the signal handler, we could even have it iterate over all Tasks to build the root frame → Task mapping on the fly.

I feel like all these options are a little bit awkward but basically workable, so ¯\_(ツ)_/¯

I know Trio really likes the model of "choosing to spawn child tasks is an implementation detail as long as they don't outlive you", and I agree with it in general, but I'm arguing for making a somewhat unprincipled exception in this case. All the uses for @ki_critical_section that I think are important from a performance or correctness perspective obviously don't spawn any child task; there's another mechanism available (the contextvar-based one) which does allow child tasks; and it's going to be in lowlevel and can bend the rules like some other things in lowlevel do, as long as we document it.

Yeah, that seems fairly reasonable.

If you call a function which reenables KI, then that function can get a KeyboardInterrupt, which might escape from your call to it and which you thus might need to handle.

Okay, but this is the idea of the "strict AND" – if we simply don't offer any API for enabling KI, then this case is inexpressible :-).

Of course the one flaw in this elegant design is that it doesn't work, because trio.from_thread does want to re-enable KI :-). But! This is an extremely narrow and specific case, so maybe we can custom-tailor our API to expose just the minimal functionality needed for trio.from_thread. How about this:

  • There's a per-task/per-nursery flag that tracks whether KI is enabled for a task as a whole. The only way to mutate it from within a task is a with no_keyboard_interrupt(): ... context manager, which saves the current state and enables protection on entry, and then restores the original state on exit.

  • You can also query the current KI protection state.

  • We provide versions of outcome.capture and acapture that take a boolean for whether keyboard-interrupt should be enabled while running the function. (This does override the per-task state temporarily, but it always restores it again afterwards, and if a KeyboardInterrupt is raised then it gets captured, so we keep the property that if you had keyboard-interrupt disabled then you don't have to worry about the code you call raising it.)
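
A rough sketch of those pieces, with the per-task flag faked as a module global (in the real thing it would live on the Task object) and outcome.capture doing the exception-swallowing:

```python
from contextlib import contextmanager

import outcome

# Stand-in for the per-task/per-nursery flag described above.
_ki_enabled = True

@contextmanager
def no_keyboard_interrupt():
    """Enable protection for the body, then restore the previous state."""
    global _ki_enabled
    saved, _ki_enabled = _ki_enabled, False
    try:
        yield
    finally:
        _ki_enabled = saved

def currently_ki_enabled():
    return _ki_enabled

def capture_with_ki(ki_enabled, sync_fn, *args, **kwargs):
    """outcome.capture variant: force a KI state around the call, always
    restore it, and let the returned Outcome swallow any KeyboardInterrupt."""
    global _ki_enabled
    saved, _ki_enabled = _ki_enabled, ki_enabled
    try:
        return outcome.capture(sync_fn, *args, **kwargs)
    finally:
        _ki_enabled = saved
```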

Separately, it would be nice to have an extensible facility for "per-task attribute that is inherited via the nursery, not the spawner". Like a ScopedContextVar or something. Let the nursery do copy_context() (cheap by design!) and say ScopedContextVar wraps ContextVar with some logic to look itself up in the current task's parent nursery's context. (Honestly I sometimes wish all contextvars worked this way, but that ship has long since sailed.)

Sounds like something that should maybe be split off into another issue?

@oremanj
Member

oremanj commented May 15, 2020

Another issue with remembering the root frame of each task is that it breaks when you insert your own gizmo into Task.coro that wraps the original one. I know that's Evil(TM) but it's also an incredibly useful tool when writing monitoring and debugging tools, so I don't want to break it gratuitously. Scanning all tasks from the signal handler doesn't have that problem, but I think protecting the contextvars backport will be less hassle than that.

if we simply don't offer any API for enabling KI, then this case is inexpressible

Oh, I see what you were going for now! The "special version of capture/acapture" was the missing piece for me.

Sounds like something that should maybe be split off into another issue?

Yep, will do :-)

I think I have enough to take another stab at implementing this now - thanks for your help!

@njsmith
Member Author

njsmith commented May 15, 2020

@oremanj maybe the first step should be to stop injecting KeyboardInterrupt through checkpoints? That's a self-contained change and I think it might be a precondition for the semantics of the new KI control APIs to make sense?

njsmith added a commit to njsmith/trio that referenced this issue Jun 14, 2020
This is a bit of a kluge, and will hopefully be cleaned up in the
future when we overhaul KeyboardInterrupt (python-triogh-733) and/or the
cancellation checking APIs (python-triogh-961). But 'checkpoint()' is a common
operation, and this speeds it up by ~7x, so... not bad for a ~4 line
change.

Before this change:

- `await checkpoint()`: ~24k/second
- `await cancel_shielded_checkpoint()`: ~180k/second

After this change:

- `await checkpoint()`: ~170k/second
- `await cancel_shielded_checkpoint()`: ~180k/second

Benchmark script:

```python
import time
import trio

LOOP_SIZE = 1_000_000

async def main():
    start = time.monotonic()
    for _ in range(LOOP_SIZE):
        await trio.lowlevel.checkpoint()
        #await trio.lowlevel.cancel_shielded_checkpoint()
    end = time.monotonic()
    print(f"{LOOP_SIZE / (end - start):.2f} schedules/second")

trio.run(main)
```