Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

drain/hasPendingOperations seems broken: tests/async/testmanyasyncevents.nim recently very flaky on windows #14885

Open
timotheecour opened this issue Jul 2, 2020 · 5 comments
Labels
Async Everything related to Nim's async OS/Architecture Specific Regression

Comments

@timotheecour
Copy link
Member

timotheecour commented Jul 2, 2020

this could be a recent regression (since #14849 at least but I don't see how this PR would be the root cause) or maybe the problem is from earlier on; but according to https://dev.azure.com/nim-lang/Nim/_test/analytics?definitionId=1&contextType=build it's since 2020-06-30.

  • drain is still broken on windows, in particular tests/async/testmanyasyncevents.nim is very flaky on windows.
  • as you can see in WIP #14853 (to make async joinable), a lot of async tests would fail on windows if joined through megatest, ie, hasPendingOperations would return true instead of false, at least at module scope (haven't tried in proc scope or block scope); this seems related.

Example

I keep seeing these failures in a lot of my PR's recently:
https://dev.azure.com/nim-lang/255dfe86-e590-40bb-a8a2-3c0295ebdeb1/_apis/build/builds/6605/logs/94

2020-07-02T14:50:23.9151912Z FAIL: tests/async/testmanyasyncevents.nim C
2020-07-02T14:50:23.9153015Z Test "tests\async\testmanyasyncevents.nim" in category "async"
2020-07-02T14:50:23.9153827Z Failure: reOutputsDiffer
2020-07-02T14:50:23.9155038Z Expected:
2020-07-02T14:50:23.9727753Z hasPendingOperations: false
2020-07-02T14:50:24.3853674Z triggerCount: 100
2020-07-02T14:50:24.4513555Z 
2020-07-02T14:50:24.5775716Z Gotten:
2020-07-02T14:50:24.6880714Z hasPendingOperations: true
2020-07-02T14:50:24.6882296Z triggerCount: 55
2020-07-02T14:50:24.6882759Z

Additional Information

Drain was totally broken
it's because the Azure VM's are slow and can easily stall async. Slowing down the test by increasing the timeouts solved the problem.
I'm not sure I'm completely satisfied by this solution, but it's good enough for now, and hopefully good enough to merge.

so maybe there is a more robust fix? /cc @rayman22201

note

this is the most flaky test according to https://dev.azure.com/nim-lang/Nim/_test/analytics?definitionId=1&contextType=build as you can see by filtering by branch and selecting refs/heads/devel

@timotheecour timotheecour added OS/Architecture Specific Regression Async Everything related to Nim's async labels Jul 2, 2020
@timotheecour timotheecour changed the title regression? drain/hasPendingOperations seems broken: tests/async/testmanyasyncevents.nim very flaky on windows drain/hasPendingOperations seems broken: tests/async/testmanyasyncevents.nim recently very flaky on windows Jul 2, 2020
@rayman22201
Copy link
Contributor

drain is still broken on windows, in particular tests/async/testmanyasyncevents.nim is very flaky on windows.

I'm not sure if the problem is drain, or if the problem is the test.
The test may need to be adjusted to take into account the (lack of) performance of the Azure VMs.

as you can see in #14853 (to make async joinable), a lot of async tests would fail on windows if joined through megatest, ie, hasPendingOperations would return true instead of false, at least at module scope (haven't tried in proc scope or block scope); this seems related.

This makes me wonder if there is a more fundamental (and subtle) bug deeper in the async implementation. Specifically related to environments that may stall or have slow IO performance (such as a highly loaded multitenant VM system like Azure).

I have some theories but it's difficult to debug.
It's hard to debug because I don't have an environment that reproduces the issue reliably outside of Azure.
At this point I'm left with print statement debugging via Pull Request (far from optimal).
I suppose I could try to set up a particularly crippled VM? I'm not sure. 🤷‍♂️

It seems like the underlying selector events may stall in such a way that it causes the async event loop to miss timeout deadlines (maybe it's stuck in the OS callback?) or possibly the OS does not close a finished selector right away (zombie selector). Either situation could cause false positives from hasPendingOperations. It's also possible that there is some bug in hasPendingOperations, but that function is just a reference count of open OS resources basically, so I expect the issue is more likely in the underlying accounting.

@timotheecour
Copy link
Member Author

timotheecour commented Jul 2, 2020

It seems like the underlying selector events may stall in such a way that it causes the async event loop to miss timeout deadlines

could this be related? #14634

looks like there is a lag involved, not quite sure why (I thought timers at least on OSX had high resolution). note that running locally, I can reproduce the flake but only with smaller values close to the t0=100 set in var timer = selector.registerTimer(100, false, 0)

(note, this is for OSX but running in azure)

things to try

  • create a test that fails all the time
  • see whether it still fails when we just run that test or whether it only fails in the context of other tests running;
  • see why test always fails in some PR's (that have an unrelated change), but not in other PR's
  • try this on other CI: github actions/travis/appveyor just to debug this issue

my hunch is there's a bug in our code, but it only gets triggered when running in azure for some reason

@rayman22201
Copy link
Contributor

(note, this is for OSX but running in azure)
my hunch is there's a bug in our code, but it only gets triggered when running in azure for some reason

I suspect the issue is specific to OSX and triggered by a slow or overloaded VM.
I believe the flakeyness is correlated with the amount of load Azure is under at the time the test runs.
I agree, the bug could be in our code. It's just hard to debug :-(

@rayman22201
Copy link
Contributor

I have used this tool in the past to simulate load on a system to help me track down bugs that only happen on during high system stress.
I wonder if it will be helpful here:
https://github.com/ColinIanKing/stress-ng

@bung87
Copy link
Collaborator

bung87 commented Oct 14, 2022

works on current devel be18f4e on my windows 11

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Async Everything related to Nim's async OS/Architecture Specific Regression
Projects
None yet
Development

No branches or pull requests

3 participants