drain/hasPendingOperations seems broken: tests/async/testmanyasyncevents.nim recently very flaky on windows #14885

timotheecour · 2020-07-02T20:08:27Z

this could be a recent regression (present since #14849 at least, though I don't see how that PR would be the root cause), or the problem may be older; according to https://dev.azure.com/nim-lang/Nim/_test/analytics?definitionId=1&contextType=build the failures started on 2020-06-30.

drain is still broken on windows, in particular tests/async/testmanyasyncevents.nim is very flaky on windows.
as you can see in WIP #14853 (to make async joinable), a lot of async tests would fail on windows if joined through megatest, ie, hasPendingOperations would return true instead of false, at least at module scope (haven't tried in proc scope or block scope); this seems related.

Example

I keep seeing these failures in a lot of my PRs recently:
https://dev.azure.com/nim-lang/255dfe86-e590-40bb-a8a2-3c0295ebdeb1/_apis/build/builds/6605/logs/94

2020-07-02T14:50:23.9151912Z FAIL: tests/async/testmanyasyncevents.nim C
2020-07-02T14:50:23.9153015Z Test "tests\async\testmanyasyncevents.nim" in category "async"
2020-07-02T14:50:23.9153827Z Failure: reOutputsDiffer
2020-07-02T14:50:23.9155038Z Expected:
2020-07-02T14:50:23.9727753Z hasPendingOperations: false
2020-07-02T14:50:24.3853674Z triggerCount: 100
2020-07-02T14:50:24.4513555Z 
2020-07-02T14:50:24.5775716Z Gotten:
2020-07-02T14:50:24.6880714Z hasPendingOperations: true
2020-07-02T14:50:24.6882296Z triggerCount: 55
2020-07-02T14:50:24.6882759Z

Additional Information

devel 1.3.5 a6cbe58
history: https://dev.azure.com/nim-lang/Nim/_test/analytics?definitionId=1&contextType=build
Fix asyncdispatch drain behavior (#14820) #14838 says:

> Drain was totally broken
> it's because the Azure VM's are slow and can easily stall async. Slowing down the test by increasing the timeouts solved the problem.
> I'm not sure I'm completely satisfied by this solution, but it's good enough for now, and hopefully good enough to merge.

so maybe there is a more robust fix? /cc @rayman22201

see also this related closed issue: timeout of asyncdispatch.drain is not respected and behaviour doesn't match poll #14820

note

this is the most flaky test according to https://dev.azure.com/nim-lang/Nim/_test/analytics?definitionId=1&contextType=build as you can see by filtering by branch and selecting refs/heads/devel


rayman22201 · 2020-07-02T22:15:27Z

> drain is still broken on windows, in particular tests/async/testmanyasyncevents.nim is very flaky on windows.

I'm not sure if the problem is drain, or if the problem is the test.
The test may need to be adjusted to take into account the (lack of) performance of the Azure VMs.

> as you can see in #14853 (to make async joinable), a lot of async tests would fail on windows if joined through megatest, ie, hasPendingOperations would return true instead of false, at least at module scope (haven't tried in proc scope or block scope); this seems related.

This makes me wonder if there is a more fundamental (and subtle) bug deeper in the async implementation. Specifically related to environments that may stall or have slow IO performance (such as a highly loaded multitenant VM system like Azure).

I have some theories but it's difficult to debug.
It's hard to debug because I don't have an environment that reproduces the issue reliably outside of Azure.
At this point I'm left with print statement debugging via Pull Request (far from optimal).
I suppose I could try to set up a particularly crippled VM? I'm not sure. 🤷‍♂️

It seems like the underlying selector events may stall in such a way that it causes the async event loop to miss timeout deadlines (maybe it's stuck in the OS callback?) or possibly the OS does not close a finished selector right away (zombie selector). Either situation could cause false positives from hasPendingOperations. It's also possible that there is some bug in hasPendingOperations, but that function is just a reference count of open OS resources basically, so I expect the issue is more likely in the underlying accounting.
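
To make the "stalled loop gives a false positive" theory concrete, here is a toy model in Python. It is only a sketch under stated assumptions: it is not Nim's asyncdispatch, every name in it (`ToyDispatcher`, `run_fixed_budget`, `run_until_drained`, `events_per_poll`) is hypothetical, and a slow host is modelled simply as fewer events handled per poll. It shows how a test that polls for a fixed budget can report pending operations and a low trigger count on a slow host, while a loop that keeps polling until nothing is pending does not.

```python
# Toy model only: this is NOT Nim's asyncdispatch. All names are hypothetical.
from dataclasses import dataclass


@dataclass
class ToyDispatcher:
    """Counts outstanding events, loosely mirroring a pending-operations check."""
    pending: int = 0
    trigger_count: int = 0

    def register(self, n: int) -> None:
        self.pending += n

    def has_pending_operations(self) -> bool:
        return self.pending > 0

    def poll(self, events_per_poll: int) -> None:
        # Assumption: a slow or stalled host handles fewer events per poll.
        handled = min(events_per_poll, self.pending)
        self.pending -= handled
        self.trigger_count += handled


def run_fixed_budget(events: int, polls: int, events_per_poll: int) -> ToyDispatcher:
    """Poll a fixed number of times, then stop, whatever is still outstanding."""
    d = ToyDispatcher()
    d.register(events)
    for _ in range(polls):
        d.poll(events_per_poll)
    return d


def run_until_drained(events: int, events_per_poll: int) -> ToyDispatcher:
    """Keep polling until nothing is pending."""
    if events_per_poll <= 0:
        raise ValueError("events_per_poll must be positive or the loop never ends")
    d = ToyDispatcher()
    d.register(events)
    while d.has_pending_operations():
        d.poll(events_per_poll)
    return d


fast = run_fixed_budget(events=100, polls=10, events_per_poll=10)
slow = run_fixed_budget(events=100, polls=10, events_per_poll=5)
drained = run_until_drained(events=100, events_per_poll=5)
print(fast.has_pending_operations(), fast.trigger_count)        # False 100
print(slow.has_pending_operations(), slow.trigger_count)        # True 50
print(drained.has_pending_operations(), drained.trigger_count)  # False 100
```

The model says nothing about where the real accounting goes wrong; it only illustrates why a fixed time budget plus a slow host is enough to produce output of the same shape as the CI failure.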

timotheecour · 2020-07-02T23:07:23Z

> It seems like the underlying selector events may stall in such a way that it causes the async event loop to miss timeout deadlines

could this be related? #14634

looks like there is a lag involved, not quite sure why (I thought timers at least on OSX had high resolution). note that running locally, I can reproduce the flake but only with smaller values close to the t0=100 set in var timer = selector.registerTimer(100, false, 0)

(note, this is for OSX but running in azure)
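
One way to check the lag theory on a given machine is to measure how far a short wait overshoots its requested duration. The Python sketch below is an assumption-laden stand-in: it times `time.sleep`, not Nim's selector timers, and the function name `measure_overshoot_ms` is hypothetical. It only shows the measurement idea, comparing requested and observed wait times with a high-resolution clock.

```python
import time


def measure_overshoot_ms(requested_ms, samples=5):
    """Return, per sample, how much longer a sleep took than requested (ms)."""
    overshoots = []
    for _ in range(samples):
        start = time.perf_counter()
        time.sleep(requested_ms / 1000.0)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        overshoots.append(elapsed_ms - requested_ms)
    return overshoots


if __name__ == "__main__":
    # Values vary by machine and load, so no expected output is shown.
    print(measure_overshoot_ms(100))
```

Running this on an idle local machine and on a loaded CI worker would give a rough idea of whether scheduling lag alone is large enough to matter for a 100 ms timer.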

things to try

create a test that fails all the time
see whether it still fails when we just run that test or whether it only fails in the context of other tests running;
see why the test always fails in some PRs (that have an unrelated change), but not in other PRs
try this on other CI: github actions/travis/appveyor just to debug this issue
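
For the first two items, a small harness that reruns one command many times and counts output mismatches can help quantify the flake rate. The sketch below is in Python and makes assumptions: the helper name `count_mismatches` is hypothetical, and the command used to run the Nim test (shown only in a comment) is a guess at a local invocation, not how the CI runs it. The expected output is the one from the failure log above.

```python
import subprocess
import sys


def count_mismatches(cmd, expected_stdout, runs):
    """Run cmd `runs` times; count runs whose stdout differs from expected_stdout."""
    mismatches = 0
    for _ in range(runs):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.stdout.strip() != expected_stdout.strip():
            mismatches += 1
    return mismatches


# Self-check with a trivial command, so the sketch runs without Nim installed.
demo_cmd = [sys.executable, "-c", "print('ok')"]
print(count_mismatches(demo_cmd, "ok", runs=2))    # 0
print(count_mismatches(demo_cmd, "nope", runs=2))  # 2

# Assumed local invocation for the flaky test (not verified against the CI setup):
#   cmd = ["nim", "c", "-r", "tests/async/testmanyasyncevents.nim"]
#   expected = "hasPendingOperations: false\ntriggerCount: 100"
#   print(count_mismatches(cmd, expected, runs=50))
```

Comparing the mismatch count when the test runs alone against the count when the machine is under load would separate "fails in isolation" from "fails only alongside other tests".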

my hunch is there's a bug in our code, but it only gets triggered when running in azure for some reason

rayman22201 · 2020-07-03T00:20:32Z

> (note, this is for OSX but running in azure)
> my hunch is there's a bug in our code, but it only gets triggered when running in azure for some reason

I suspect the issue is specific to OSX and triggered by a slow or overloaded VM.
I believe the flakiness is correlated with the amount of load Azure is under at the time the test runs.
I agree, the bug could be in our code. It's just hard to debug :-(

rayman22201 · 2020-07-03T22:23:49Z

I have used this tool in the past to simulate load on a system, to help me track down bugs that only happen during high system stress.
I wonder if it will be helpful here:
https://github.com/ColinIanKing/stress-ng

bung87 · 2022-10-14T13:42:36Z

works on current devel be18f4e on my windows 11

timotheecour added the labels OS/Architecture Specific, Regression, Async (Everything related to Nim's async) Jul 2, 2020

timotheecour changed the title ~~regression? drain/hasPendingOperations seems broken: tests/async/testmanyasyncevents.nim very flaky on windows~~ drain/hasPendingOperations seems broken: tests/async/testmanyasyncevents.nim recently very flaky on windows Jul 2, 2020

This was referenced Jul 2, 2020

investigate whether tests/async/testmanyasyncevents.nim timotheecour/Nim#334

Open

deprecate existsDir; use dirExists instead #14884

Merged

timotheecour mentioned this issue Jul 4, 2020

tests/async/testmanyasyncevents.nim windows flaky test timotheecour/Nim#333

Open
