Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: test-buffer-creation-regression intermittent failure on smartos #10166

Closed
mhdawson opened this issue Dec 7, 2016 · 13 comments
Closed

test: test-buffer-creation-regression intermittent failure on smartos #10166

mhdawson opened this issue Dec 7, 2016 · 13 comments
Labels
buffer Issues and PRs related to the buffer subsystem. smartos Issues and PRs related to the SmartOS platform. test Issues and PRs related to the tests.

Comments

@mhdawson
Copy link
Member

mhdawson commented Dec 7, 2016

  • Version: master
  • Platform: smartos
  • Subsystem: buffer

Failure on smartos15-64 during regression run for change not related to this test:

https://ci.nodejs.org/job/node-test-commit-smartos/5631/nodes=smartos15-64/console

not ok 96 parallel/test-buffer-creation-regression
  ---
  duration_ms: 61.210
  severity: fail
  stack: |-
    timeout
  ...

@mhdawson
Copy link
Member Author

mhdawson commented Dec 7, 2016

@nodejs/platform-smartos

@cjihrig
Copy link
Contributor

cjihrig commented Dec 7, 2016

Related: #10161

@mscdex mscdex added buffer Issues and PRs related to the buffer subsystem. smartos Issues and PRs related to the SmartOS platform. test Issues and PRs related to the tests. labels Dec 7, 2016
@Trott
Copy link
Member

Trott commented Dec 7, 2016

Fixed in #10161.

@Trott Trott closed this as completed Dec 7, 2016
@mhdawson
Copy link
Member Author

Still failed in sequential run here: https://ci.nodejs.org/job/node-test-commit-smartos/5739/nodes=smartos15-64/console

not ok 1271 sequential/test-buffer-creation-regression
  ---
  duration_ms: 369.121
  severity: fail
  stack: |-
    timeout
  ...

@bnoordhuis
Copy link
Member

Suggestion: mark the test as flaky on smartos and move on? I suspect this is caused by the infamously terrible virtual memory management that smartos/illumos inherited from solaris. It's not the first time we've hit this.

@bcantrill
Copy link
Contributor

Counter-suggestion: don't debug by superstition or prejudice. Could we please get a little more detail on what it is you're talking about? If there's a SmartOS or illumos issue here, we'd like to understand and fix it. And if you are unable or unwilling to debug it please let us know, and we will find someone to do it.

@misterdjules
Copy link

In the description of the second failure, duration_ms: 369.121 means that it took 369 seconds (not milliseconds) for tools/test.py to notice that this specific test did not complete.

However, the default timeout is 60 seconds, and tools/test.py actively polls its child processes' status every at least as frequently as 1/10th of a second.

It seems to suggest that the system was heavily loaded and that the test harness process couldn't be scheduled as frequently as usual, thus making this failure possibly caused by the environment more than by the test itself or a bug in node on SmartOS.

Given that this test doesn't seem to fail that frequently, and that I haven't been able to reproduce that failure manually, I would suggest the following way forward:

  1. Pass --abort-on-timeout to tools/test.py when running tests on SmartOS as described in Enable --abort-on-timeout for node-test-commit-smartos Jenkins job build#613.

  2. Continue adding instrumentation features to the test harness so that we can get processes' and system stats when a test times out.

  3. Close this issue without marking this test as flaky, and reopen it when the next failure happens. The reason for not marking this test as flaky is that we currently don't know if the test itself is actually flaky, and that it is possible that failures were due to the environment. Marking it as flaky would hide future actual failures. Hopefully, when the next failure occurs, we'll be able to use the instrumentation described in 1) and 2) to have a better idea of what the problem is.

How does that sound?

@Trott
Copy link
Member

Trott commented Feb 3, 2017

SGTM

@misterdjules
Copy link

thus making this failure possibly caused by the environment more than by the test itself or a bug in node on SmartOS.

After further testing, it seems that the performance of this test on SmartOS could be related to the version of the platform on which the VM running the test is provisioned.

To make a long story short, after looking at various kstats (notably the n_pf_throttle_usec in the memory_cap module that represents the time spent by the process being throttled when page faulting) and microstate counters with prstat -m, I couldn't come to a definite conclusion about what's going on.

However, I did provision two VMs using the same image as the current test VMs used to run test Jenkins jobs, using a 8GBs RAM package. One was provisioned on a server using a 6 months old platform, the other on a server using a more recent platform. The VM provisioned on the more recent platform was able to run this test in ~1 second consistently, whereas the other was running the same test in ~10 seconds consistently, with some peaks ~50 seconds.

So instead of moving forward with what I suggested previously, I'll investigate further and update this issue as soon as I have more details.

@thefourtheye
Copy link
Contributor

If needed, we can reduce the size of that test

const size = 8589934592;

to

const size = 4294968296;

I'll submit a PR soon.

@misterdjules
Copy link

If needed, we can reduce the size of that test

It certainly won't hurt to perform computation and I/O that doesn't need to happen. However, just to make sure everyone is on the same page, reducing the size of the buffer created to 4294968296 won't systematically lower the rate of failures for this test, and fixing this test on SmartOS will require understanding why it performs poorly on that platform compared to others.

@thefourtheye
Copy link
Contributor

@misterdjules I agree. I submitted the PR because we are over-allocating anyway. 4294968296 (or anything more than 4294967296) should be enough to prove that the patch is working.

thefourtheye added a commit to thefourtheye/io.js that referenced this issue Mar 3, 2017
This test is allocating much more memory than necessary to actually
reproduce the original problem. Lowering the amount of memory allocated
increases performance at least in some cases and makes this test less
likely to time out on SmartOS.

Ref: nodejs#10166
thefourtheye added a commit that referenced this issue Apr 4, 2017
This test is allocating much more memory than necessary to actually
reproduce the original problem. Lowering the amount of memory allocated
increases performance at least in some cases and makes this test less
likely to time out on SmartOS.

PR-URL: #11177
Ref: #10166

Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Julien Gilli <jgilli@nodejs.org>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
italoacasas pushed a commit to italoacasas/node that referenced this issue Apr 10, 2017
This test is allocating much more memory than necessary to actually
reproduce the original problem. Lowering the amount of memory allocated
increases performance at least in some cases and makes this test less
likely to time out on SmartOS.

PR-URL: nodejs#11177
Ref: nodejs#10166

Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Julien Gilli <jgilli@nodejs.org>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
MylesBorins pushed a commit that referenced this issue Apr 18, 2017
This test is allocating much more memory than necessary to actually
reproduce the original problem. Lowering the amount of memory allocated
increases performance at least in some cases and makes this test less
likely to time out on SmartOS.

PR-URL: #11177
Ref: #10166

Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Julien Gilli <jgilli@nodejs.org>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
MylesBorins pushed a commit that referenced this issue Apr 19, 2017
This test is allocating much more memory than necessary to actually
reproduce the original problem. Lowering the amount of memory allocated
increases performance at least in some cases and makes this test less
likely to time out on SmartOS.

PR-URL: #11177
Ref: #10166

Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Julien Gilli <jgilli@nodejs.org>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
andrew749 pushed a commit to michielbaird/node that referenced this issue Jul 19, 2017
This test is allocating much more memory than necessary to actually
reproduce the original problem. Lowering the amount of memory allocated
increases performance at least in some cases and makes this test less
likely to time out on SmartOS.

PR-URL: nodejs/node#11177
Ref: nodejs/node#10166

Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Julien Gilli <jgilli@nodejs.org>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
@Trott
Copy link
Member

Trott commented Jul 26, 2017

It seems that we may have not seen this failure for a few months now. I'm going to close this but anyone should feel free to re-open (or comment requesting it be re-opened if GitHub does not allow you to re-open) if they feel that's not the right way to go.

@Trott Trott closed this as completed Jul 26, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
buffer Issues and PRs related to the buffer subsystem. smartos Issues and PRs related to the SmartOS platform. test Issues and PRs related to the tests.
Projects
None yet
Development

No branches or pull requests

8 participants