test: test-buffer-creation-regression intermittent failure on smartos #10166

mhdawson · 2016-12-07T15:48:12Z

Version: master
Platform: smartos
Subsystem: buffer

Failure on smartos15-64 during regression run for change not related to this test:

https://ci.nodejs.org/job/node-test-commit-smartos/5631/nodes=smartos15-64/console

not ok 96 parallel/test-buffer-creation-regression
  ---
  duration_ms: 61.210
  severity: fail
  stack: |-
    timeout
  ...

The text was updated successfully, but these errors were encountered:

mhdawson · 2016-12-07T15:48:31Z

@nodejs/platform-smartos

cjihrig · 2016-12-07T15:51:52Z

Related: #10161

Trott · 2016-12-07T20:44:57Z

Fixed in #10161.

mhdawson · 2016-12-13T21:57:14Z

Still failed in sequential run here: https://ci.nodejs.org/job/node-test-commit-smartos/5739/nodes=smartos15-64/console

not ok 1271 sequential/test-buffer-creation-regression
  ---
  duration_ms: 369.121
  severity: fail
  stack: |-
    timeout
  ...

bnoordhuis · 2016-12-13T22:07:11Z

Suggestion: mark the test as flaky on smartos and move on? I suspect this is caused by the infamously terrible virtual memory management that smartos/illumos inherited from solaris. It's not the first time we've hit this.

bcantrill · 2016-12-14T20:07:54Z

Counter-suggestion: don't debug by superstition or prejudice. Could we please get a little more detail on what it is you're talking about? If there's a SmartOS or illumos issue here, we'd like to understand and fix it. And if you are unable or unwilling to debug it please let us know, and we will find someone to do it.

misterdjules · 2017-02-03T20:46:34Z

In the description of the second failure, duration_ms: 369.121 means that it took 369 seconds (not milliseconds) for tools/test.py to notice that this specific test did not complete.

However, the default timeout is 60 seconds, and tools/test.py actively polls its child processes' status every at least as frequently as 1/10th of a second.

It seems to suggest that the system was heavily loaded and that the test harness process couldn't be scheduled as frequently as usual, thus making this failure possibly caused by the environment more than by the test itself or a bug in node on SmartOS.

Given that this test doesn't seem to fail that frequently, and that I haven't been able to reproduce that failure manually, I would suggest the following way forward:

Pass --abort-on-timeout to tools/test.py when running tests on SmartOS as described in Enable --abort-on-timeout for node-test-commit-smartos Jenkins job build#613.
Continue adding instrumentation features to the test harness so that we can get processes' and system stats when a test times out.
Close this issue without marking this test as flaky, and reopen it when the next failure happens. The reason for not marking this test as flaky is that we currently don't know if the test itself is actually flaky, and that it is possible that failures were due to the environment. Marking it as flaky would hide future actual failures. Hopefully, when the next failure occurs, we'll be able to use the instrumentation described in 1) and 2) to have a better idea of what the problem is.

How does that sound?

Trott · 2017-02-03T21:47:13Z

SGTM

misterdjules · 2017-02-04T00:58:51Z

thus making this failure possibly caused by the environment more than by the test itself or a bug in node on SmartOS.

After further testing, it seems that the performance of this test on SmartOS could be related to the version of the platform on which the VM running the test is provisioned.

To make a long story short, after looking at various kstats (notably the n_pf_throttle_usec in the memory_cap module that represents the time spent by the process being throttled when page faulting) and microstate counters with prstat -m, I couldn't come to a definite conclusion about what's going on.

However, I did provision two VMs using the same image as the current test VMs used to run test Jenkins jobs, using a 8GBs RAM package. One was provisioned on a server using a 6 months old platform, the other on a server using a more recent platform. The VM provisioned on the more recent platform was able to run this test in ~1 second consistently, whereas the other was running the same test in ~10 seconds consistently, with some peaks ~50 seconds.

So instead of moving forward with what I suggested previously, I'll investigate further and update this issue as soon as I have more details.

thefourtheye · 2017-02-05T13:53:49Z

If needed, we can reduce the size of that test

const size = 8589934592;

to

const size = 4294968296;

I'll submit a PR soon.

misterdjules · 2017-02-07T00:41:29Z

If needed, we can reduce the size of that test

It certainly won't hurt to perform computation and I/O that doesn't need to happen. However, just to make sure everyone is on the same page, reducing the size of the buffer created to 4294968296 won't systematically lower the rate of failures for this test, and fixing this test on SmartOS will require understanding why it performs poorly on that platform compared to others.

thefourtheye · 2017-02-08T08:48:19Z

@misterdjules I agree. I submitted the PR because we are over-allocating anyway. 4294968296 (or anything more than 4294967296) should be enough to prove that the patch is working.

This test is allocating much more memory than necessary to actually reproduce the original problem. Lowering the amount of memory allocated increases performance at least in some cases and makes this test less likely to time out on SmartOS. Ref: nodejs#10166

This test is allocating much more memory than necessary to actually reproduce the original problem. Lowering the amount of memory allocated increases performance at least in some cases and makes this test less likely to time out on SmartOS. PR-URL: #11177 Ref: #10166 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Julien Gilli <jgilli@nodejs.org> Reviewed-By: Colin Ihrig <cjihrig@gmail.com>

This test is allocating much more memory than necessary to actually reproduce the original problem. Lowering the amount of memory allocated increases performance at least in some cases and makes this test less likely to time out on SmartOS. PR-URL: nodejs#11177 Ref: nodejs#10166 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Julien Gilli <jgilli@nodejs.org> Reviewed-By: Colin Ihrig <cjihrig@gmail.com>

This test is allocating much more memory than necessary to actually reproduce the original problem. Lowering the amount of memory allocated increases performance at least in some cases and makes this test less likely to time out on SmartOS. PR-URL: #11177 Ref: #10166 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Julien Gilli <jgilli@nodejs.org> Reviewed-By: Colin Ihrig <cjihrig@gmail.com>

This test is allocating much more memory than necessary to actually reproduce the original problem. Lowering the amount of memory allocated increases performance at least in some cases and makes this test less likely to time out on SmartOS. PR-URL: nodejs/node#11177 Ref: nodejs/node#10166 Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Julien Gilli <jgilli@nodejs.org> Reviewed-By: Colin Ihrig <cjihrig@gmail.com>

Trott · 2017-07-26T05:04:51Z

It seems that we may have not seen this failure for a few months now. I'm going to close this but anyone should feel free to re-open (or comment requesting it be re-opened if GitHub does not allow you to re-open) if they feel that's not the right way to go.

mhdawson mentioned this issue Dec 7, 2016

test: refactor test-beforeexit-event #10121

Closed

3 tasks

mscdex added buffer Issues and PRs related to the buffer subsystem. smartos Issues and PRs related to the SmartOS platform. test Issues and PRs related to the tests. labels Dec 7, 2016

Trott closed this as completed Dec 7, 2016

mhdawson reopened this Dec 13, 2016

mhdawson mentioned this issue Dec 13, 2016

Add regex to text-crypto-random #10020

Closed

4 tasks

thefourtheye mentioned this issue Feb 5, 2017

test: reduce buffer size in buffer-creation test #11177

Closed

2 tasks

Trott closed this as completed Jul 26, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: test-buffer-creation-regression intermittent failure on smartos #10166

test: test-buffer-creation-regression intermittent failure on smartos #10166

mhdawson commented Dec 7, 2016

mhdawson commented Dec 7, 2016

cjihrig commented Dec 7, 2016

Trott commented Dec 7, 2016

mhdawson commented Dec 13, 2016

bnoordhuis commented Dec 13, 2016

bcantrill commented Dec 14, 2016

misterdjules commented Feb 3, 2017

Trott commented Feb 3, 2017

misterdjules commented Feb 4, 2017

thefourtheye commented Feb 5, 2017

misterdjules commented Feb 7, 2017

thefourtheye commented Feb 8, 2017

Trott commented Jul 26, 2017

test: test-buffer-creation-regression intermittent failure on smartos #10166

test: test-buffer-creation-regression intermittent failure on smartos #10166

Comments

mhdawson commented Dec 7, 2016

mhdawson commented Dec 7, 2016

cjihrig commented Dec 7, 2016

Trott commented Dec 7, 2016

mhdawson commented Dec 13, 2016

bnoordhuis commented Dec 13, 2016

bcantrill commented Dec 14, 2016

misterdjules commented Feb 3, 2017

Trott commented Feb 3, 2017

misterdjules commented Feb 4, 2017

thefourtheye commented Feb 5, 2017

misterdjules commented Feb 7, 2017

thefourtheye commented Feb 8, 2017

Trott commented Jul 26, 2017