
test: make cluster test more time tolerant on more platforms #5056

Closed
wants to merge 1 commit into from

Conversation

ncopa
Contributor

@ncopa ncopa commented Feb 3, 2016

The problem that
nodejs/node-v0.x-archive@f3f4e28
resolves also affects Alpine Linux (musl libc), so use the increased
timeout for all platforms and not only for AIX.

// AIX needs more time due to default exit performance
timeout = 1000;
}
// AIX needs more time due to default exit performance
Member


I'd drop the comment if it's no longer AIX-specific.

Contributor Author


Makes sense.

@rvagg
Member

rvagg commented Feb 3, 2016

@nodejs/build @nodejs/testing anyone have a reason we should care about this 5-fold increase? I find it very strange that any platform needs this long, but it doesn't seem like exit performance has anything to do with what this test is addressing, so I guess we're not supposed to mind?

@jbergstroem
Member

ping @Trott as well

@r-52 r-52 added cluster Issues and PRs related to the cluster subsystem. test Issues and PRs related to the tests. lts-watch-v4.x labels Feb 3, 2016
@ncopa
Contributor Author

ncopa commented Feb 3, 2016

Here is a strace: https://gist.github.com/ncopa/ca39045d23f841a5de6b

What happens is:

master process is forked.

master process forks worker processes.

master process dies.

At this point there are 2 worker processes on the system. When they exit they become zombies because their parent did not have a chance to reap them. Calling kill(pid, 0) will report them as alive because nothing has reaped the exited processes.

Since the parent (the master process) died, the worker processes are reparented to pid 1, and pid 1 will eventually reap them. This is what takes 200ms on some systems, and more on others. My system happens to use busybox init, which needs more than 200ms to do the cleanup work.

In other words, the timeout needed depends entirely on how fast pid 1 (the init process) cleans up zombie processes.
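The kill(pid, 0) behavior described above can be shown with a minimal sketch (`isAlive` is an illustrative name assumed to mirror the helper in the test, not code from this PR):

```javascript
// Minimal sketch, assuming POSIX semantics. process.kill(pid, 0)
// delivers no signal -- it only checks whether the pid can be
// signalled. A zombie (exited but not yet reaped by its parent or by
// init) still passes this check, which is why the test must wait for
// pid 1 to reap the workers.
function isAlive(pid) {
  try {
    process.kill(pid, 0); // signal 0: existence check only
    return true;
  } catch (e) {
    return false; // ESRCH: no such process
  }
}

console.log(isAlive(process.pid)); // → true (the current process exists)
```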

@Trott
Member

Trott commented Feb 3, 2016

Not necessarily a deal-breaker, but these changes have the effect of making the tests take longer to execute on all platforms, not just the platforms that need it. That might be OK, but perhaps it would be possible to have a setInterval that polls the status every 200ms or so and then a setTimeout that gives up and throws an error after 1000ms, or something like that.

Even better, as always, would be if we could just find a way to get rid of the arbitrary timeout values entirely. But that's not always straightforward either.
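Trott's poll-plus-deadline suggestion could be sketched like this (illustrative names and values, not code from the actual test): an interval checks liveness every 200ms, and a separate deadline timer gives up after 1000ms.

```javascript
// Sketch of the setInterval/setTimeout idea (illustrative names, not
// the actual test code): poll every POLL_MS, fail after GIVE_UP_MS.
const POLL_MS = 200;
const GIVE_UP_MS = 1000;

function isAlive(pid) {
  try {
    process.kill(pid, 0); // signal 0: existence check only
    return true;
  } catch (e) {
    return false;
  }
}

function pollUntilReaped(pid, done) {
  const poller = setInterval(() => {
    if (!isAlive(pid)) {
      clearInterval(poller);
      clearTimeout(deadline);
      done(null); // worker is gone: success, possibly well before 1000ms
    }
  }, POLL_MS);
  const deadline = setTimeout(() => {
    clearInterval(poller);
    done(new Error(`worker ${pid} still alive after ${GIVE_UP_MS}ms`));
  }, GIVE_UP_MS);
}
```

On fast platforms this completes on the first poll, so the test only pays the full second where the init process is actually slow.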

@Trott
Member

Trott commented Feb 3, 2016

Also, if the reaction to my previous comment is "Let's not over-engineer a test just to shave off a few hundred milliseconds", I can respect that too.

@ncopa
Contributor Author

ncopa commented Feb 3, 2016

Technically, when the master exits with an error, it needs to kill its children and waitpid() for each of them.

When the master exits without an error, we need to either waitpid() each child, or kill(2) and waitpid() each child.

If we do that, we can remove the timeout entirely.

There does not appear to be any child_process.waitpid(), though.

@ncopa ncopa force-pushed the cluster-tests-liveliness-timeout branch from e2abbc2 to 5cb4ae3 Compare February 3, 2016 20:56
@ncopa
Contributor Author

ncopa commented Feb 3, 2016

I rebased the patch and included some explanatory comments on why it takes so long.

@ncopa
Contributor Author

ncopa commented Feb 3, 2016

I also know why busybox init is slow: it has a suboptimal implementation of reaping children. node will trigger a sleep(1) in busybox init that is there to protect it from respawning too fast.

I guess AIX init has a similar issue.

@jasnell
Member

jasnell commented Feb 4, 2016

LGTM if CI is green https://ci.nodejs.org/job/node-test-pull-request/1544/

@ncopa
Contributor Author

ncopa commented Feb 5, 2016

Note that the timeout is a hack in the first place. The parent process (the master) should wait for the children (the workers) using wait/waitpid before it exits instead of letting init (pid 1) do the cleanup.

@jasnell
Member

jasnell commented Apr 21, 2016

@nodejs/testing ... any further thoughts on this one?

@Trott
Member

Trott commented Apr 21, 2016

I dislike the change in test-cluster-master-error.js because the timer is not a timeout for the test itself but is actually necessary for the test to complete. So if we up it from 200ms to 1s, the duration to run the test increases about five-fold on all platforms, not just the platforms that need it.

As I said above, I won't stand in the way if others feel this is acceptable or desirable. But if any of them are possible, I would much prefer one of these solutions:

  • only increase the timeout on platforms that need it
  • dispense with the setTimeout() in favor of a setInterval() that does frequent polling (say, every 200ms)
  • a rewrite that dumps the whole setTimeout() business entirely in favor of triggering via events

// AIX needs more time due to default exit performance
timeout = 1000;
}
// normally 200ms is enough AIX's init and busybox init needs more
Member

@Trott Trott Apr 21, 2016


Nit: . after enough

EDIT: While we're in Nit Mode: needs -> need

@ncopa
Contributor Author

ncopa commented Apr 22, 2016

@Trott I agree that waiting 1 sec always on all platforms is ugly. We can do better than that.

I don't like checking for specific platforms, because the affected platforms may be fixed in the future.

Possible alternative:

@@ -58,15 +62,20 @@ if (cluster.isWorker) {
     // make sure that the master died by purpose
     assert.equal(code, 0);

-    // check worker process status
-    var timeout = 200;
-    if (common.isAix) {
-      // AIX needs more time due to default exit performance
-      timeout = 1000;
-    }
-    setTimeout(function() {
+    // wait for init (pid 1) to collect the worker process
+    // normally 200ms is enough, but it depends on the init (pid 1)
+    // implementation. AIX's init and busybox init need more. We wait
+    // up to 1 sec (1000ms) before we trigger error.
+    var timeout = 1000;
+
+    function waitWorker() {
       alive = isAlive(pid);
-    }, timeout);
+      if (alive && timeout > 0) {
+        setTimeout(waitWorker, 10);
+        timeout -= 10;
+      }
+    }
+    waitWorker();
   });

   process.once('exit', function() {

As mentioned, the proper fix is to have the master process collect all children before exit, but I cannot see that a waitpid wrapper is implemented/exposed.

@ncopa
Contributor Author

ncopa commented Apr 22, 2016

Alternative using setInterval and clearInterval:

@@ -58,15 +62,18 @@ if (cluster.isWorker) {
     // make sure that the master died by purpose
     assert.equal(code, 0);

-    // check worker process status
-    var timeout = 200;
-    if (common.isAix) {
-      // AIX needs more time due to default exit performance
-      timeout = 1000;
-    }
-    setTimeout(function() {
+    // wait for init (pid 1) to collect the worker process
+    // normally 200ms is enough, but it depends on the init (pid 1)
+    // implementation. AIX's init and busybox init need more. We wait
+    // up to 1 sec (1000ms) before we trigger error.
+    var timeout = 1000;
+
+    var waitWorker = setInterval(function() {
+      timeout -= 10;
       alive = isAlive(pid);
-    }, timeout);
+      if (!alive || (timeout <= 0))
+        clearInterval(waitWorker);
+    }, 10);
   });

   process.once('exit', function() {

Which do you prefer?

@Trott
Member

Trott commented Apr 22, 2016

@ncopa Either is fine by me. Anyone object to polling like that instead of the current "wait 1000ms and check once" approach? @nodejs/testing

EDIT: "Either" meaning "either of the two polling approaches you came up with".

The problem that
nodejs/node-v0.x-archive@f3f4e28
resolves also affects Alpine Linux (musl libc), so wait up to 1 second
on all platforms. But poll regularly so we don't need to wait more
than necessary.

Also add some explaining comments on why it takes so long for the child
processes to die.

Signed-off-by: Natanael Copa <ncopa@alpinelinux.org>
@ncopa ncopa force-pushed the cluster-tests-liveliness-timeout branch from 5cb4ae3 to 6e7ef0e Compare May 3, 2016 18:09
@ncopa
Contributor Author

ncopa commented May 3, 2016

Changed to polling with setInterval and rebased against master.

@stefanmb
Contributor

stefanmb commented May 3, 2016

@santigimeno just pointed out that I made a very similar PR here: #6531.

@ncopa With regard to this PR, my concern is that even the 1-second interval is sometimes insufficient, so in #6531 I made these two tests wait indefinitely, the idea being that the (Python) test harness will kill them if they hang and fail them anyway.

It's inelegant, but timeout-based liveness checks are inherently brittle; we may as well centralize on the single master timeout in the test harness instead of picking arbitrary numbers in the tests themselves.

Edit: My original approach relied more on serializing the master/worker exit, but it used an undocumented API, so I adopted the suggested interval approach used here too.

@ncopa
Contributor Author

ncopa commented May 3, 2016

@stefanmb The idea with setting a timeout is to detect breakage as early as possible. I'd be OK with 10 seconds, but if it takes more than that, I'd say something is broken. I also think it does not hurt to poll more frequently than 2 times/sec.

A timeout and an indefinite wait are both equally ugly hacks, so for me either is OK. Maybe the indefinite wait is better since it's simpler.

@ncopa
Contributor Author

ncopa commented May 3, 2016

@Trott Can this or #6531 be merged before more duplicate pull requests show up?

@drewfish
Contributor

drewfish commented May 3, 2016

I wonder if #6193 is related.

@Trott
Member

Trott commented May 3, 2016

@Trott Can this or #6531 be merged before more duplicate pull requests shows up?

@stefanmb has the ability to merge PRs and both PRs have LGTM comments from collaborators, so I'll leave it up to him to figure out which one to merge as he's been paying closer attention than me to this issue. I don't object to either one.

@stefanmb
Contributor

stefanmb commented May 4, 2016

Closing this PR since I've landed the changes in #6531 instead; they should be equivalent. Thanks @ncopa for your help!

@stefanmb stefanmb closed this May 4, 2016
@ncopa ncopa deleted the cluster-tests-liveliness-timeout branch May 4, 2016 20:04