
test: make cluster test more time tolerant on more platforms #5056

Closed
wants to merge 1 commit into from

Conversation

ncopa
Contributor

@ncopa ncopa commented Feb 3, 2016

The problem that
nodejs/node-v0.x-archive@f3f4e28
resolves also affects Alpine Linux (musl libc), so use the increased
timeout for all platforms and not only for AIX.

// AIX needs more time due to default exit performance
timeout = 1000;
}
// AIX needs more time due to default exit performance
Member


I'd drop the comment if it's no longer AIX-specific.

Contributor Author


Makes sense.

@rvagg
Member

rvagg commented Feb 3, 2016

@nodejs/build @nodejs/testing anyone have a reason we should care about this 5-fold increase? I find it very strange that any platform needs this long, but it doesn't seem like exit performance has anything to do with what this test is addressing, so I guess we're not supposed to mind?

@jbergstroem
Member

ping @Trott as well

@r-52 r-52 added cluster Issues and PRs related to the cluster subsystem. test Issues and PRs related to the tests. lts-watch-v4.x labels Feb 3, 2016
@ncopa
Contributor Author

ncopa commented Feb 3, 2016

Here is a strace: https://gist.github.com/ncopa/ca39045d23f841a5de6b

What happens is:

master process is forked.

master process forks worker processes.

master process dies.

At this point there are 2 worker processes on the system. When they exit they become zombies because their parent did not have a chance to reap them. Calling kill(pid, 0) will report them as alive because nothing has reaped the exited processes.

Since the parent (the master process) died, the worker processes are reparented to pid 1, and pid 1 will eventually reap them. This is what takes 200ms on some systems, and more on others. My system happens to use busybox init, which needs more than 200ms to do the cleanup work.

In other words, the timeout needed depends entirely on how fast pid 1 (the init process) cleans up zombie processes.
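The kill(pid, 0) behavior described above can be shown with a minimal sketch (`isAlive` is an illustrative name assumed to mirror the helper in the test, not code from this PR):

```javascript
// Minimal sketch, assuming POSIX semantics. process.kill(pid, 0)
// delivers no signal -- it only checks whether the pid can be
// signalled. A zombie (exited but not yet reaped by its parent or by
// init) still passes this check, which is why the test must wait for
// pid 1 to reap the workers.
function isAlive(pid) {
  try {
    process.kill(pid, 0); // signal 0: existence check only
    return true;
  } catch (e) {
    return false; // ESRCH: no such process
  }
}

console.log(isAlive(process.pid)); // → true (the current process exists)
```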

@Trott
Member

Trott commented Feb 3, 2016

Not necessarily a deal-breaker, but these changes have the effect of making the tests take longer to execute on all platforms, not just the platforms that need it. That might be OK, but perhaps it would be possible to have a setInterval that polls the status every 200ms or so and then a setTimeout that gives up and throws an error after 1000ms, or something like that.

Even better, as always, would be if we could just find a way to get rid of the arbitrary timeout values entirely. But that's not always straightforward either.
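Trott's poll-plus-deadline suggestion could be sketched like this (illustrative names and values, not code from the actual test): an interval checks liveness every 200ms, and a separate deadline timer gives up after 1000ms.

```javascript
// Sketch of the setInterval/setTimeout idea (illustrative names, not
// the actual test code): poll every POLL_MS, fail after GIVE_UP_MS.
const POLL_MS = 200;
const GIVE_UP_MS = 1000;

function isAlive(pid) {
  try {
    process.kill(pid, 0); // signal 0: existence check only
    return true;
  } catch (e) {
    return false;
  }
}

function pollUntilReaped(pid, done) {
  const poller = setInterval(() => {
    if (!isAlive(pid)) {
      clearInterval(poller);
      clearTimeout(deadline);
      done(null); // worker is gone: success, possibly well before 1000ms
    }
  }, POLL_MS);
  const deadline = setTimeout(() => {
    clearInterval(poller);
    done(new Error(`worker ${pid} still alive after ${GIVE_UP_MS}ms`));
  }, GIVE_UP_MS);
}
```

On fast platforms this completes on the first poll, so the test only pays the full second where the init process is actually slow.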

@Trott
Member

Trott commented Feb 3, 2016

Also, if the reaction to my previous comment is "Let's not over-engineer a test just to shave off a few hundred milliseconds", I can respect that too.

@ncopa
Contributor Author

ncopa commented Feb 3, 2016

Technically, when the master exits with an error, it needs to kill its children and waitpid() for each of them.

When the master exits without an error, we need to either waitpid() each child, or kill(2) and waitpid() each child.

If we do that, we can remove the timeout entirely.

There does not appear to be any child_process.waitpid(), though.

@ncopa ncopa force-pushed the cluster-tests-liveliness-timeout branch from e2abbc2 to 5cb4ae3 Compare February 3, 2016 20:56
@ncopa
Contributor Author

ncopa commented Feb 3, 2016

I rebased the patch and included some explanatory comments on why it takes so long.

@ncopa
Contributor Author

ncopa commented Feb 3, 2016

I also know why busybox init is slow: it has a suboptimal implementation of reaping children. node will trigger a sleep(1) in busybox init that is there to protect it from respawning too fast.

I guess AIX init has a similar issue.

@jasnell
Member

jasnell commented Feb 4, 2016

LGTM if CI is green https://ci.nodejs.org/job/node-test-pull-request/1544/

@ncopa
Contributor Author

ncopa commented Feb 5, 2016

Note that the timeout is a hack in the first place. The parent process (the master) should wait for the children (the workers) using wait/waitpid before it exits instead of letting init (pid 1) do the cleanup.

@jasnell
Member

jasnell commented Apr 21, 2016

@nodejs/testing ... any further thoughts on this one?

@Trott
Member

Trott commented Apr 21, 2016

I dislike the change in test-cluster-master-error.js because the timer is not a timeout for the test itself but is actually necessary for the test to complete. So if we up it from 200ms to 1s, the duration to run the test increases about five-fold on all platforms, not just the platforms that need it.

As I said above, I won't stand in the way if others feel this is acceptable or desirable. But if any of them are possible, I would much prefer one of these solutions:

  • only increase the timeout on platforms that need it
  • dispense with the setTimeout() in favor of a setInterval() that does frequent polling (say, every 200ms)
  • a rewrite that dumps the whole setTimeout() business entirely in favor of triggering via events

// AIX needs more time due to default exit performance
timeout = 1000;
}
// normally 200ms is enough AIX's init and busybox init needs more
Member

@Trott Trott Apr 21, 2016


Nit: . after enough

EDIT: While we're in Nit Mode: needs -> need

@ncopa
Contributor Author

ncopa commented Apr 22, 2016

@Trott I agree that waiting 1 sec always on all platforms is ugly. We can do better than that.

I don't like checking for specific platforms, because the affected platforms may be fixed in the future.

Possible alternative:

@@ -58,15 +62,20 @@ if (cluster.isWorker) {
     // make sure that the master died by purpose
     assert.equal(code, 0);

-    // check worker process status
-    var timeout = 200;
-    if (common.isAix) {
-      // AIX needs more time due to default exit performance
-      timeout = 1000;
-    }
-    setTimeout(function() {
+    // wait for init (pid 1) to collect the worker process
+    // normally 200ms is enough, but it depends on the init (pid 1)
+    // implementation. AIX's init and busybox init need more. We wait
+    // up to 1 sec (1000ms) before we trigger error.
+    var timeout = 1000;
+
+    function waitWorker() {
       alive = isAlive(pid);
-    }, timeout);
+      if (alive && timeout > 0) {
+        setTimeout(waitWorker, 10);
+        timeout -= 10;
+      }
+    }
+    waitWorker();
   });

   process.once('exit', function() {

As mentioned, the proper fix is to have the master process collect all children before exit, but I cannot see that a waitpid wrapper is implemented/exposed.

@ncopa
Contributor Author

ncopa commented Apr 22, 2016

Alternative using setInterval and clearInterval:

@@ -58,15 +62,18 @@ if (cluster.isWorker) {
     // make sure that the master died by purpose
     assert.equal(code, 0);

-    // check worker process status
-    var timeout = 200;
-    if (common.isAix) {
-      // AIX needs more time due to default exit performance
-      timeout = 1000;
-    }
-    setTimeout(function() {
+    // wait for init (pid 1) to collect the worker process
+    // normally 200ms is enough, but it depends on the init (pid 1)
+    // implementation. AIX's init and busybox init need more. We wait
+    // up to 1 sec (1000ms) before we trigger error.
+    var timeout = 1000;
+
+    var waitWorker = setInterval(function() {
+      timeout -= 10;
       alive = isAlive(pid);
-    }, timeout);
+      if (!alive || (timeout <= 0))
+        clearInterval(waitWorker);
+    }, 10);
   });

   process.once('exit', function() {

Which do you prefer?

@Trott
Member

Trott commented Apr 22, 2016

@ncopa Either is fine by me. Anyone object to polling like that instead of the current "wait 1000ms and check once" approach? @nodejs/testing

EDIT: "Either" meaning "either of the two polling approaches you came up with".

The problem that
nodejs/node-v0.x-archive@f3f4e28
resolves also affects Alpine Linux (musl libc), so wait up to 1 second
on all platforms. But poll regularly so we don't need to wait more
than necessary.

Also add some explaining comments on why it takes so long for the child
processes to die.

Signed-off-by: Natanael Copa <ncopa@alpinelinux.org>
@ncopa ncopa force-pushed the cluster-tests-liveliness-timeout branch from 5cb4ae3 to 6e7ef0e Compare May 3, 2016 18:09
@ncopa
Contributor Author

ncopa commented May 3, 2016

Changed to polling with setInterval and rebased against master.

@stefanmb
Contributor

stefanmb commented May 3, 2016

@santigimeno just pointed out that I made a very similar PR here: #6531.

@ncopa With regard to this PR, my concern is that even the 1-second interval is sometimes insufficient, so in #6531 I made these two tests wait indefinitely, the idea being that the (Python) test harness will kill them if they hang and fail them anyway.

It's inelegant, but timeout-based liveness checks are inherently brittle; we may as well centralize on the single master timeout in the test harness instead of picking arbitrary numbers in the tests themselves.

Edit: My original approach relied more on serializing the master/worker exit, but it used an undocumented API, so I adopted the suggested interval approach used here too.

@ncopa
Contributor Author

ncopa commented May 3, 2016

@stefanmb The idea with setting a timeout is to detect breakage as early as possible. I'd be OK with 10 seconds, but if it takes more than that, I'd say something is broken. I also think it does not hurt to poll more frequently than 2 times/sec.

A timeout and an indefinite wait are both equally ugly hacks, so for me either is OK. Maybe the indefinite wait is better since it's simpler.

@ncopa
Contributor Author

ncopa commented May 3, 2016

@Trott Can this or #6531 be merged before more duplicate pull requests show up?

@drewfish
Contributor

drewfish commented May 3, 2016

I wonder if #6193 is related.

@Trott
Member

Trott commented May 3, 2016

@Trott Can this or #6531 be merged before more duplicate pull requests shows up?

@stefanmb has the ability to merge PRs and both PRs have LGTM comments from collaborators, so I'll leave it up to him to figure out which one to merge as he's been paying closer attention than me to this issue. I don't object to either one.

@stefanmb
Contributor

stefanmb commented May 4, 2016

Closing this PR since I've landed the changes in #6531 instead; they should be equivalent. Thanks @ncopa for your help!

@stefanmb stefanmb closed this May 4, 2016
@ncopa ncopa deleted the cluster-tests-liveliness-timeout branch May 4, 2016 20:04