Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate flaky test-http-agent #5184

Closed
Trott opened this issue Feb 10, 2016 · 20 comments
Closed

Investigate flaky test-http-agent #5184

Trott opened this issue Feb 10, 2016 · 20 comments
Labels
arm Issues and PRs related to the ARM platform. http Issues or PRs related to the http subsystem. test Issues and PRs related to the tests.

Comments

@Trott
Copy link
Member

Trott commented Feb 10, 2016

Failure:

@Trott Trott added http Issues or PRs related to the http subsystem. test Issues and PRs related to the tests. arm Issues and PRs related to the ARM platform. labels Feb 10, 2016
@Trott
Copy link
Member Author

Trott commented Feb 12, 2016

And another:

Trott added a commit to Trott/io.js that referenced this issue Feb 12, 2016
@santigimeno
Copy link
Member

Can anyone copy the test output? Thanks

@r-52
Copy link
Contributor

r-52 commented Feb 13, 2016

@santigimeno I've excluded some # X 200 output to keep it short...

not ok 50 test-http-agent.js
# TIMEOUT
# 0 200
# 1 200
# 2 200
# 3 200
# 4 200
# 5 200
# 6 200
# 7 200
# 8 200
# 9 200
....
# 95 200
# 96 200
# 97 200
# 98 200
  ---

Trott added a commit to Trott/io.js that referenced this issue Feb 16, 2016
Ref: nodejs#5184
PR-URL: nodejs#5209
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Roman Klauke <romaaan.git@gmail.com>
@Trott
Copy link
Member Author

Trott commented Feb 16, 2016

Another one:

MylesBorins pushed a commit that referenced this issue Feb 18, 2016
Ref: #5184
PR-URL: #5209
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Roman Klauke <romaaan.git@gmail.com>
rvagg pushed a commit that referenced this issue Feb 18, 2016
Ref: #5184
PR-URL: #5209
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Roman Klauke <romaaan.git@gmail.com>
MylesBorins pushed a commit that referenced this issue Feb 18, 2016
Ref: #5184
PR-URL: #5209
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Roman Klauke <romaaan.git@gmail.com>
@Trott
Copy link
Member Author

Trott commented Feb 20, 2016

Another one:

@Trott
Copy link
Member Author

Trott commented Feb 20, 2016

Stress test confirming failure on current master branch. (Mostly useful to show that it's flaky in stress tests. It's possible for a test to be flaky in CI but not in stress test.) https://ci.nodejs.org/job/node-stress-single-test/510/nodes=pi2-raspbian-wheezy/console

@Trott
Copy link
Member Author

Trott commented Feb 20, 2016

Stress test showing that adding a common.platformTimeout() to the setTimeout() call doesn't fix it. This isn't surprising because the timeout gets incremented by 1 millisecond each time in the loop. Incrementing by 1 millisecond or 7 milliseconds on slow hardware, you might as well not increment at all.

https://ci.nodejs.org/job/node-stress-single-test/509/nodes=pi2-raspbian-wheezy/console

@Trott
Copy link
Member Author

Trott commented Feb 20, 2016

Stress test showing that switching to setImmediate() seems to improve things but does not solve the issue entirely: https://ci.nodejs.org/job/node-stress-single-test/511/nodes=pi2-raspbian-wheezy/console

@Trott
Copy link
Member Author

Trott commented Feb 20, 2016

Stress test showing similar results for process.nextTick() as for setImmediate(): https://ci.nodejs.org/job/node-stress-single-test/512/nodes=pi2-raspbian-wheezy/console

@Trott
Copy link
Member Author

Trott commented Feb 20, 2016

Lastly (for now), test showing that eliminating the timer altogether doesn't fix things either: https://ci.nodejs.org/job/node-stress-single-test/513/nodes=pi2-raspbian-wheezy/console

@Trott
Copy link
Member Author

Trott commented Feb 20, 2016

For good measure, here's a stress test (results pending as of this writing) that changes the timeout value from i (which is between 0 and 9 inclusive) to common.platform(i * 100)).

https://ci.nodejs.org/job/node-stress-single-test/514/nodes=pi2-raspbian-wheezy/console

EDIT: That one failed too.

@santigimeno
Copy link
Member

@Trott do you think reducing M and/or N would help?

@Trott
Copy link
Member Author

Trott commented Feb 21, 2016

@santigimeno That is a very good question. I've reduced both from 10 to 5. Stress test: https://ci.nodejs.org/job/node-stress-single-test/515/nodes=pi2-raspbian-wheezy/console

UPDATE: Fails. Same thing, it loses one connection somewhere.

@Trott
Copy link
Member Author

Trott commented Feb 21, 2016

So this is interesting: I added some logging to keep track of the value of i and j that is missed. And it always misses when i is 2 and j is one of 2, 3, or (rarely) 4. In other words, 23rd, 24th, or (rarely) 25th HTTP request sent. Out of 200 100 requests. When a request is dropped, it's always the 23rd, 24th, or 25th request. I'm not sure what that could be. Some ideas of varying levels of seeming plausibility are:

  • the Raspberry Pi is doing some request throttling somewhere
  • TCP congestion window (but across connections? wha?)
  • gremlins

EDIT: Here it is (currently in progress, but what I write above is valid for the first 11 failures at least): https://ci.nodejs.org/job/node-stress-single-test/516/nodes=pi2-raspbian-wheezy/console

@Trott
Copy link
Member Author

Trott commented Feb 21, 2016

Now that there's at least an interesting/curious observation: /cc @nodejs/testing

@Trott
Copy link
Member Author

Trott commented Feb 21, 2016

So, given the above observation, here's a stress test with M and N both reduced to 4, so there will be 16 connections rather than 100. https://ci.nodejs.org/job/node-stress-single-test/517/nodes=pi2-raspbian-wheezy/console

124 runs so far and no failures, so we have a winner, I think. If it makes it to 999 tests, I'll submit a PR. It's not perfect, but it's better than a flaky test, in my opinion at least. And there's nothing magical about the 100 clients in this case. It was just an arbitrary number.

Trott added a commit to Trott/io.js that referenced this issue Feb 21, 2016
Reduce number of clients from 100 to 16 as Raspberry Pi in CI starts to
exhibit flakiness around 22 or so clients.

Fixes: nodejs#5184
@Trott Trott closed this as completed in bbf4621 Feb 22, 2016
stefanmb pushed a commit to stefanmb/node that referenced this issue Feb 23, 2016
Ref: nodejs#5184
PR-URL: nodejs#5209
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Roman Klauke <romaaan.git@gmail.com>
@mhdawson
Copy link
Member

@Trott
Copy link
Member Author

Trott commented Feb 25, 2016

@mhdawson That does not appear to be output from the current version of the test. That looks like output from the previous version of the test.

The current version should only show 16 lines of output. The old version showed 100 lines of output. The relevant commit you want to make sure has landed in the ref/branch/whatever you're testing is 76657c5.

@Trott
Copy link
Member Author

Trott commented Feb 25, 2016

@mhdawson Confirmed that the branch being tested (https://github.com/stefanmb/node/commits/test-runner-fix) does not have the fix committed.

I guess node-test-pull-request does not rebase against current master the way node-accept-pull-request does/did. I wonder if it ought to? /cc @nodejs/build @nodejs/testing

Closing this for now. Feel free to re-open if you think that's not right. But I don't believe the test is currently flaky on master.

@Trott Trott closed this as completed Feb 25, 2016
@orangemocha
Copy link
Contributor

I guess node-test-pull-request does not rebase against current master the way node-accept-pull-request does/did. I wonder if it ought to?

It's supposed to, but it's a parameter and rebasing can be skipped. Let me know if you think it's not working as expected.

rvagg pushed a commit that referenced this issue Feb 27, 2016
Reduce number of clients from 100 to 16 as Raspberry Pi in CI starts to
exhibit flakiness around 22 or so clients.

Fixes: #5184
PR-URL: #5346
Reviewed-By: James M Snell <jasnell@gmail.com>
MylesBorins pushed a commit that referenced this issue Mar 2, 2016
Ref: #5184
PR-URL: #5209
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Roman Klauke <romaaan.git@gmail.com>
MylesBorins pushed a commit that referenced this issue Mar 17, 2016
Reduce number of clients from 100 to 16 as Raspberry Pi in CI starts to
exhibit flakiness around 22 or so clients.

Fixes: #5184
PR-URL: #5346
Reviewed-By: James M Snell <jasnell@gmail.com>
MylesBorins pushed a commit that referenced this issue Mar 21, 2016
Reduce number of clients from 100 to 16 as Raspberry Pi in CI starts to
exhibit flakiness around 22 or so clients.

Fixes: #5184
PR-URL: #5346
Reviewed-By: James M Snell <jasnell@gmail.com>
Trott added a commit to Trott/io.js that referenced this issue Mar 28, 2016
Reduce number of clients from 100 to 16 as Raspberry Pi in CI starts to
exhibit flakiness around 22 or so clients.

Fixes: nodejs#5938
Refs: nodejs#5184
Trott added a commit to Trott/io.js that referenced this issue Mar 30, 2016
Reduce number of clients from 100 to 16 as Raspberry Pi in CI starts to
exhibit flakiness around 22 or so clients.

PR-URL: nodejs#5939
Fixes: nodejs#5938
Refs: nodejs#5184
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com>
evanlucas pushed a commit that referenced this issue Mar 31, 2016
Reduce number of clients from 100 to 16 as Raspberry Pi in CI starts to
exhibit flakiness around 22 or so clients.

PR-URL: #5939
Fixes: #5938
Refs: #5184
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com>
evanlucas pushed a commit that referenced this issue Mar 31, 2016
Reduce number of clients from 100 to 16 as Raspberry Pi in CI starts to
exhibit flakiness around 22 or so clients.

PR-URL: #5939
Fixes: #5938
Refs: #5184
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com>
MylesBorins pushed a commit that referenced this issue Apr 11, 2016
Reduce number of clients from 100 to 16 as Raspberry Pi in CI starts to
exhibit flakiness around 22 or so clients.

PR-URL: #5939
Fixes: #5938
Refs: #5184
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arm Issues and PRs related to the ARM platform. http Issues or PRs related to the http subsystem. test Issues and PRs related to the tests.
Projects
None yet
Development

No branches or pull requests

5 participants