Add a blocked task watchdog #591
Comments
There was an attempt to do this in #596, which I think is abandoned. One of the reasons is that we had trouble figuring out how to keep the watchdog overhead low and get the cross-thread communication right. I just had a thought for a pretty simple approach, that doesn't require a [...]

    import threading
    import attr

    @attr.s
    class Watchdog:
        lock = attr.ib(factory=threading.Lock)
        shutdown_event = attr.ib(factory=threading.Event)
        handle_io_count = attr.ib(default=0)

        def watchdog_thread(self):
            while True:
                with self.lock:
                    hic1 = self.handle_io_count
                # `timeout` is the watchdog interval; it is not defined in this sketch
                if self.shutdown_event.wait(timeout=timeout):
                    return
                with self.lock:
                    hic2 = self.handle_io_count
                if hic1 == hic2:
                    # ... watchdog fires ...
                    pass

    # trio.run
    while True:
        timeout = ...
        with watchdog.lock:
            watchdog.handle_io_count += 1
            runner.io_manager.handle_io(timeout)
        ...

So obviously this is deadlock-free, since there's only one lock. But does it work?

Every time the main thread takes the lock, it increments the counter. The watchdog takes the lock twice per pass through the loop.

Case 1: the main thread takes the lock in between these. That means it incremented the counter, and thus the two reads of `handle_io_count` differ, so the watchdog doesn't fire. [...]

The thing that's clever/tricky about this is what happens if [...]

Also, the watchdog thread's overhead is very small, and bounded. In particular, if the main loop is very busy and generating lots and lots of calls to [...]

We don't need to worry about [...]

Basically this seems... super simple and like it should just work.
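To make the idle-versus-stuck distinction concrete, here is a minimal, self-contained simulation of the counter scheme. It assumes the main loop holds the lock for the whole `handle_io` call, so the watchdog's second read simply waits out any idle period. `time.sleep` stands in for both running tasks and `handle_io`, and `simulate`, `fired`, and the durations are hypothetical names and values picked for illustration, not anything in trio.

```python
import threading
import time


class Watchdog:
    """Counter-based watchdog; counts firings instead of printing a report."""

    def __init__(self, timeout):
        self.timeout = timeout  # assumed: seconds between the two counter reads
        self.lock = threading.Lock()
        self.shutdown_event = threading.Event()
        self.handle_io_count = 0
        self.fired = 0  # hypothetical stand-in for "watchdog fires"

    def watchdog_thread(self):
        while True:
            with self.lock:
                before = self.handle_io_count
            if self.shutdown_event.wait(timeout=self.timeout):
                return
            # If the main loop is idle in I/O it holds the lock, so this
            # acquire blocks until the I/O wait is over.
            with self.lock:
                after = self.handle_io_count
            if before == after:
                # The main loop never reached its I/O step during the interval.
                self.fired += 1


def simulate(task_time, io_time, passes, timeout):
    """Run a fake scheduler loop and return how often the watchdog fired."""
    watchdog = Watchdog(timeout)
    thread = threading.Thread(target=watchdog.watchdog_thread, daemon=True)
    thread.start()
    for _ in range(passes):
        time.sleep(task_time)  # stands in for running a batch of tasks
        with watchdog.lock:
            watchdog.handle_io_count += 1
            time.sleep(io_time)  # stands in for handle_io(timeout)
    watchdog.shutdown_event.set()
    thread.join()
    return watchdog.fired
```

With a 0.1 second watchdog interval, a single 0.5 second "task" phase should make the watchdog fire, while long idle waits inside the lock should not, since the counter has always moved by the time the watchdog gets the lock back.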
Hmm, while trying to fall asleep tonight my brain just pointed out what might be a serious problem for this idea. There is a case where it's quite normal for trio to block and take arbitrarily long between scheduler loops: when the user hits control-Z or sends a SIGSTOP to pause the program. Suppose this signal arrives just after the watchdog goes to sleep. When the program wakes up again, the sleep will immediately complete. So the watchdog's effective deadline might get reduced arbitrarily close to zero.

Can we detect this somehow? It might be tricky. Technically we could install a SIGCONT handler, I think, but that sounds really finicky. We could check how long we slept, and if it's longer than we expected to sleep then it's good evidence that something weird is going on and maybe we should start over. Of course, it's possible for someone to un-suspend us exactly at the moment we would have woken up anyway, so this can't be totally reliable, but at least it catches lots of obvious cases. It would be really nice if there were some way to find out when you mysteriously missed time, but I don't know any. Maybe something buried on [...]

I guess debugging is another case where the watchdog would go off spuriously. Though there at least you hopefully know that you're sitting and looking at a debugger; it's not a message being delivered to some random end user.
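The "check how long we slept" idea could look roughly like the sketch below. It assumes the clock (here `time.monotonic`) keeps advancing while the process is stopped, so a suspended process shows up as a wait that ran far longer than requested. The function name, the `slack` threshold, and the three return strings are hypothetical.

```python
import threading
import time


def wait_checking_for_lost_time(event, timeout, slack=0.5, clock=time.monotonic):
    """Wait on `event` for up to `timeout` seconds and classify the outcome.

    Returns "shutdown" if the event was set, "suspect" if the wait took much
    longer than requested (more than timeout + slack, an assumed threshold),
    and "ok" otherwise.
    """
    start = clock()
    if event.wait(timeout=timeout):
        return "shutdown"
    elapsed = clock() - start
    if elapsed > timeout + slack:
        # We overslept by a lot: maybe the process was suspended, so the
        # caller should probably discard this interval and start over.
        return "suspect"
    return "ok"
```

A watchdog loop would only compare counters after an "ok" result. As noted above, this can't catch a resume that happens right when the sleep would have ended anyway.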
Weird fact that's probably not quite enough to be useful: On Linux, if you use [...]

This doesn't work on macOS, and it doesn't work with any other syscalls on Linux AFAICT. (In particular it definitely doesn't work with [...])

Also, it could break if somehow the signal doesn't have [...]

Oh but wait, in our case the [...]

So.... I guess that actually does solve this on Linux? We'd need to use a socket to send shutdown messages from the main thread to the watchdog thread, so the watchdog thread can use [...]

Unfortunately this doesn't help with macOS... or Windows, for that matter, but on Windows it's very rare to suspend processes, so it's not as big a deal.
Maybe it would be sufficient to just like... reduce the timeout threshold some, and then require it to fail N times in a row before triggering a warning.
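A rough sketch of the "N times in a row" rule, assuming the watchdog reports one boolean per interval (stalled or not). The class and method names are hypothetical, and it warns once per streak rather than on every interval after the threshold.

```python
class ConsecutiveStallCounter:
    """Signal a warning only after `required` stalled intervals in a row."""

    def __init__(self, required):
        self.required = required
        self.streak = 0

    def record(self, stalled):
        """Record one watchdog interval; return True when a warning is due."""
        if stalled:
            self.streak += 1
        else:
            # Any interval with progress resets the streak.
            self.streak = 0
        # Fire exactly once, at the moment the streak reaches the threshold.
        return self.streak == self.required
```

With a shorter per-interval timeout, a single interval cut short by a suspend/resume would then have to be followed by N-1 more stalled intervals before anything is printed.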
Oh but on macOS we have [...]
In this stackoverflow question, someone gets very confused because they're using `time.sleep` and never using `await`, so one of their tasks seems to run correctly but the other doesn't. There's no indication of what's going on. Easy to fix once you know how, but if you don't, then how are you supposed to figure it out?

We could, pretty cheaply, add a watchdog to catch such situations. What I'm imagining is that when entering `trio.run`, we'd spawn a system task that spins up a thread and watches for the main loop to become stuck. In the main run loop, we'd send it messages occasionally, whenever we (a) start processing a batch of tasks, (b) stop processing a bunch of tasks. If the watchdog thread detects that the time between (a) and (b) exceeds, say, 5 seconds, then it should print an informative message and use the `faulthandler` module to dump everyone's stack traces. I think this might be cheap enough it could be enabled-by-default.
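For the reporting step, something like the following might be enough. It uses the standard library's `faulthandler.dump_traceback`, which writes the stacks of all threads to a file that has a real file descriptor. The function name and the wording of the message are hypothetical, not an existing trio API.

```python
import faulthandler
import sys


def report_stuck_main_loop(seconds_stuck, file=sys.stderr):
    """Print a hint, then dump every thread's stack trace to `file`.

    `file` needs a working fileno(), because faulthandler writes straight
    to the underlying file descriptor.
    """
    print(
        f"Main loop has not yielded for {seconds_stuck:.1f} seconds; "
        "a task may be blocking without using await.",
        file=file,
        flush=True,  # flush first so the hint appears before the raw dump
    )
    faulthandler.dump_traceback(file=file, all_threads=True)
```

The watchdog thread would call this when the gap between (a) and (b) exceeds the threshold. The dump includes the main thread, so the blocking `time.sleep` call would show up in its stack.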