Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate flaky test-dns #5554

Closed
Trott opened this issue Mar 3, 2016 · 14 comments
Closed

Investigate flaky test-dns #5554

Trott opened this issue Mar 3, 2016 · 14 comments
Labels
arm Issues and PRs related to the ARM platform. dns Issues and PRs related to the dns subsystem. test Issues and PRs related to the tests.

Comments

@Trott
Copy link
Member

Trott commented Mar 3, 2016

Example failure:

@Trott Trott added test Issues and PRs related to the tests. arm Issues and PRs related to the ARM platform. labels Mar 3, 2016
@mscdex mscdex added the dns Issues and PRs related to the dns subsystem. label Mar 4, 2016
@Trott
Copy link
Member Author

Trott commented Mar 8, 2016

Another failure:

@Trott
Copy link
Member Author

Trott commented Mar 13, 2016

And another:

@Trott
Copy link
Member Author

Trott commented Mar 14, 2016

It appears that this test is flaky due to timing out when the callback never fires on one of the two test cases:

assert.doesNotThrow(function() {
  dns.lookup('www.google.com', {}, noop);
});

assert.doesNotThrow(function() {
  dns.lookup('www.google.com', {
    family: 4,
    hints: 0
  }, noop);
});

I'm not sure what can be done (from within Node) to make sure those dns.lookup() calls always return. dns.getServers() returns an empty array so I'd imagine that these tests don't try to actually do the lookup. But run locally, if I replace noop with something that logs the found address, it is in fact finding the address. (Maybe c-ares is looking at my machine's DNS cache or something?)

Anyway, if anyone has ideas how to make those tests more robust, I'd love to hear it. /cc @nodejs/testing @nodejs/build

@Trott
Copy link
Member Author

Trott commented Mar 14, 2016

Oh, I see now that I had it backwards in my mind: dns.lookup() always tries to resolve. So dns.getServers() is irrelevant. (It's relevant for dns.resolve(). I thought it was the other way around.)

OK, so I'm thinking there's might be no foolproof way around this problem and maybe these test cases should be moved from parallel to internet. But I'd welcome a more informed opinion on this.

@Trott
Copy link
Member Author

Trott commented Mar 17, 2016

Maybe we can move those two tests into their own test file and find a way to replace cares.getaddrinfo() in lib/dns.js with a test double?

@Trott
Copy link
Member Author

Trott commented Mar 25, 2016

@Trott
Copy link
Member Author

Trott commented Mar 31, 2016

Still happening: https://ci.nodejs.org/job/node-test-binary-arm/1540/RUN_SUBSET=1,nodes=pi2-raspbian-wheezy/console

Any ideas, anyone? Pretty thin roster using git shortlog but maybe @bnoordhuis or @evanlucas might have some ideas what's going wrong here?

@Trott
Copy link
Member Author

Trott commented Apr 1, 2016

Specific part of the test that's being problematic was added in 43067864. /cc @cjihrig although I'm really just kind of throwing darts at a wall to pick out names to cc at this point...

@cjihrig
Copy link
Contributor

cjihrig commented Apr 1, 2016

Do those requests come back, given enough time? I don't think 4306786 would cause the flakey behavior.

@Trott
Copy link
Member Author

Trott commented Apr 1, 2016

They timeout after 120 seconds, so I think the answer is "no, they don't come back".

I imagine this is related to the general uptick in pi2-raspbian-wheezy flakiness we've seen lately but I don't know what the ultimate source is. It does seem to be primarily or exclusively in tests that involve network connections (to localhost in most/all cases, but that's the nature of our CI tests). I haven't dug into c-ares so I'm not sure if this test might actually touch the network for certain configurations or whatever. It passes without any network connectivity, so it doesn't need the network.

Yeah, and I agree that it seems unlikely that 4306786 is the problem. It's more likely the victim here than the perpetrator.

@Trott
Copy link
Member Author

Trott commented Apr 1, 2016

Good news. I tried putting all those hints tests in their own file to see if that fixed the flakiness on the theory that the tests were either interacting with other tests or else cumulatively hitting some sort of threshold that was triggering Pi2 to sometimes drop a ... packet or something.

Anyway, looks like that fixes it, and I'm all for splitting out the really big honkin' test files into smaller honkin' test files. So, win-win, I guess.

Stress test on current master confirming flakiness of current test: https://ci.nodejs.org/job/node-stress-single-test/577/nodes=pi2-raspbian-wheezy/console

Stress test on my branch confirming non-flakiness of the test-dns stuff split across two files: https://ci.nodejs.org/job/node-stress-single-test/576/nodes=pi2-raspbian-wheezy/console (215 tests and counting... still time for it to go sideways, but I'm optimistic...) UPDATE: 1 failure in 500+ runs, which is a whole better than the 10 failures in 100+ runs that we get on current master, but still, there's an issue...

Trott added a commit to Trott/io.js that referenced this issue Apr 1, 2016
A few of the hints tests were flaky in CI on pi2-raspbian-wheezy. Moving
them to their own file fixes it. It could be that there is an unnoticed
interaction with other tests in the file, or it could be that there is a
cumulative threshold on some resource that is reached that causes Pi2 to
sometimes stall out. (The test was timing out.)

Fixes: nodejs#5554
@Trott
Copy link
Member Author

Trott commented Apr 1, 2016

Stress test for further splitting up the test more in the hops of increased reliability: https://ci.nodejs.org/job/node-stress-single-test/578/nodes=pi2-raspbian-wheezy/console

@Trott
Copy link
Member Author

Trott commented Apr 1, 2016

OK, so it does use the network despite setServers([]) (which I guess I should have known). But since all we're testing is throws vs. not throws, we can change these to look up an empty string rather than actual hostname, since all the throwing stuff happens during parameter validation. That will avoid the network lookup and probably result in a completely reliable test. PR coming soon...

Trott added a commit to Trott/io.js that referenced this issue Apr 1, 2016
Use empty string instead of `www.google.com` for tests where we are just
doing parameter evaluation. This will avoid DNS lookups which appear to
be causing flakiness on Raspberry Pi devices in CI.

Fixes: nodejs#5554
@Trott
Copy link
Member Author

Trott commented Apr 1, 2016

New PR: #5996

@Trott Trott closed this as completed in 8d96300 Apr 4, 2016
MylesBorins pushed a commit that referenced this issue Apr 5, 2016
Use empty string instead of `www.google.com` for tests where we are just
doing parameter evaluation. This will avoid DNS lookups which appear to
be causing flakiness on Raspberry Pi devices in CI.

PR-URL: #5996
Fixes: #5554
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
MylesBorins pushed a commit that referenced this issue Apr 11, 2016
Use empty string instead of `www.google.com` for tests where we are just
doing parameter evaluation. This will avoid DNS lookups which appear to
be causing flakiness on Raspberry Pi devices in CI.

PR-URL: #5996
Fixes: #5554
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arm Issues and PRs related to the ARM platform. dns Issues and PRs related to the dns subsystem. test Issues and PRs related to the tests.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants