Add a blocked task watchdog #591

njsmith · 2018-08-04T00:27:24Z

In this stackoverflow question, someone gets very confused because they're using time.sleep and never using await, so one of their tasks seems to run correctly but the other doesn't. There's no indication of what's going on. Easy to fix once you know how, but if you don't, then how are you supposed to figure it out?

We could, pretty cheaply, add a watchdog to catch such situations. What I'm imagining is that when entering trio.run, we'd spawn a system task that spins up a thread and watches for the main loop to become stuck. In the main run loop, we'd send it messages occasionally, whenever we (a) start processing a batch of tasks, (b) stop processing a bunch of tasks. If the watchdog thread detects that the time between (a) and (b) exceeds, say, 5 seconds, then it should print an informative message and use the faulthandler module to dump everyone's stack traces. I think this might be cheap enough it could be enabled-by-default.

The text was updated successfully, but these errors were encountered:

njsmith · 2019-02-13T09:45:48Z

There was an attempt to do this in #596, which I think is abandoned. One of the reasons is that we had trouble figuring out how to keep the watchdog overhead low and get the cross-thread communication right.

I just had a thought for a pretty simple approach, that doesn't require a Condition or other fancy stuff:

@attr.s
class Watchdog:
    lock = attr.ib(factory=threading.Lock)
    shutdown_event = attr.ib(factory=threading.Event)
    handle_io_count = attr.ib(default=0)
    
    def watchdog_thread(self):
        while True:
            with self.lock:
                hic1 = self.handle_io_count
            if self.shutdown_event.wait(timeout=timeout):
                return
            with self.lock:
                hic2 = self.handle_io_count:
            if hic1 == hic2:
                # ... watchdog fires ...

# trio.run
while True:
    timeout = ...
    with watchdog.lock:
        watchdog.handle_io_count += 1
        runner.io_manager.handle_io(timeout)
    ...

So obviously this is deadlock-free, since there's only one lock. But does it work?

Every time the main thread takes the lock, it increments the counter.

The watchdog takes the lock twice per pass through the loop. Case 1: the main thread takes the lock in between these. That means it incremented the counter, and thus the two reads of self.handle_io_count will be different, and the watchdog won't fire. Case 2: the main thread doesn't take the lock in between these. That means that timeout seconds passed during which trio was never inside handle_io, so the watchdog should fire. And indeed, we'll find hic1 == hic2, so the watchdog does fire.

The thing that's clever/tricky about this what happens if handle_io blocks for a long time (if there's nothing to do). Since we hold the lock while calling it, this will cause the watchdog thread to block for just as long! So it doesn't keep polling useless while the main thread is sleeping. And then when the watchdog thread is finally released, it may have actually slept for a very long time indeed, way longer than its normal timeout... but it shouldn't fire, and it doesn't.

Also, the watchdog thread's overhead is very small, and bounded. In particular, if the main loop is very busy and generating lots and lots of calls to handle_io, then that doesn't cause any extra work for the watchdog thread, except for slightly higher mutex contention. But the contention should still be very low, since the watchdog thread only wakes up and and takes the lock for a very very brief time, very rarely (assuming a reasonable timeout).

We don't need to worry about handle_io_count overflowing, because Python ints don't overflow. Do we need to worry about it not overflowing, and getting so huge that it causes slowdowns? On my laptop, incrementing 0 takes ~26 ns, and incrementing 2**60 takes ~38 ns. So the slowdown is pretty miniscule. And if we somehow managed to pass through the run loop in 1 ns (which is optimistic by a factor of >1000x), then 2**60 loops would take ~36 years. So I don't think we need to worry about the counter getting too large.

Basically this seems... super simpler and like it should just work.

njsmith · 2019-06-02T11:02:57Z

Hmm, while trying to fall asleep tonight my brain just pointed out what might be a serious problem for this idea. There is a case where it's quite normal for trio to block and take arbitrarily long between scheduler loops: when the user hits control-Z or sends a SIGSTOP to pause the program.

Suppose this signal arrives just after the watchdog goes to sleep. When the program wakes up again, the sleep will immediately complete. So the watchdog's effective deadline might get reduced arbitrarily close to zero.

Can we detect this somehow? It might be tricky. Technically we could install a SIGCONT handler, I think, but that sounds really finicky.

We could check how long we slept, and if it's longer than we expected to sleep then it's good evidence that something weird is going on and maybe we should start over. Of course, it's possible for someone to un-suspend us exactly at the moment we would have woken up anyway, so this can't be totally reliable, but at least it catches lots of obvious cases.

It would be really nice if there were some to find out when you mysteriously missed time, but I don't know any. Maybe something buried on getrusage or timerfd or similar?

I guess debugging is another case where the watchdog would go off spuriously. Though there at least you hopefully know that you're sitting and looking at a debugger; it's not a message being delivered to some random end user.

njsmith · 2019-06-06T02:19:05Z

Weird fact that's probably not quite enough to be useful: On linux, if you use select, and the signal has SA_RESTART set so that the kernel/libc handle restarting syscalls automatically, then the select timeout's countdown will pause while the process is sleeping. So e.g. if you request a sleep of 10 seconds, and in the middle you put the process to sleep for 5 seconds, then the process sleeps for 15 seconds.

This doesn't work on macOS, and it doesn't work with any other syscalls on Linux AFAICT. (In particular it definitely doesn't work with poll, epoll_wait, or whichever condition-related syscall that threading.Event.wait uses. It does work with Python's time.sleep, but that's because time.sleep calls select internally.)

Also, it could break if somehow the signal doesn't have SA_RESTART behavior. I guess SIGSTOP effectively always does, and the other go-to-sleep signals like SIGTSTP, SIGTTIN, SIGTTOU do by default, but if someone registers a handler on one of these three then they might or might not use SA_RESTART.

Oh but wait, in our case the select would be running in a child thread, so it's unlikely that we'd get asked to run the signal handler. And we could even use pthread_sigmask to make sure.

So.... I guess that actually does solve this on Linux? We'd need to use a socket to send shutdown messages from the main thread to the watchdog thread, so the watchdog thread can use select to look for them, but that's not a huge deal.

Unfortunately this doesn't help with macOS... or Windows, for that matter, but on Windows it's very rare to suspend processes, so it's not as big a deal.

njsmith · 2019-06-06T02:24:44Z

Maybe it would be sufficient to just like... reduce the timeout threshold some, and then require it to fail N times in a row before triggering a warning.

njsmith · 2019-06-06T03:50:10Z

Oh but on macOS we have kqueue, so this is easy – the watchdog could use EVFILT_SIGNAL to watch for SIGCONT. At least according to the man page, this is reliable and shouldn't interfere with the rest of the program:

     EVFILT_SIGNAL    Takes the signal number to monitor as the identifier and
                      returns when the given signal is generated for the
                      process.  This coexists with the signal() and
                      sigaction() facilities, and has a lower precedence.
                      Only signals sent to the process, not to a particular
                      thread, will trigger the filter. The filter will record
                      all attempts to deliver a signal to a process, even if
                      the signal has been marked as SIG_IGN.  Event notifica-
                      tion happens before normal signal delivery processing.

njsmith added design discussion polish user happiness labels Aug 4, 2018

Fuyukai mentioned this issue Aug 12, 2018

Add task blocked watchdog. #596

Closed

8 tasks

njsmith mentioned this issue Aug 29, 2018

Should we deprecate trio.Event.clear? #637

Closed

njsmith mentioned this issue Feb 13, 2019

Provide a helper to dump the task tree / stack tree #927

Open

komuw mentioned this issue Feb 23, 2019

validate that code inside Task.async_run is not blocking komuw/wiji#4

Closed

oremanj added the debugging label May 4, 2019

njsmith added the todo soon label May 12, 2019

oremanj mentioned this issue May 10, 2020

Rethink KeyboardInterrupt protection APIs #733

Open

njsmith mentioned this issue Apr 28, 2023

In-process detection njsmith/perpetuo#3

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a blocked task watchdog #591

Add a blocked task watchdog #591

njsmith commented Aug 4, 2018

njsmith commented Feb 13, 2019

njsmith commented Jun 2, 2019

njsmith commented Jun 6, 2019

njsmith commented Jun 6, 2019

njsmith commented Jun 6, 2019

Add a blocked task watchdog #591

Add a blocked task watchdog #591

Comments

njsmith commented Aug 4, 2018

njsmith commented Feb 13, 2019

njsmith commented Jun 2, 2019

njsmith commented Jun 6, 2019

njsmith commented Jun 6, 2019

njsmith commented Jun 6, 2019