This repository has been archived by the owner on Apr 22, 2023. It is now read-only.

Breaking change: cluster connection behavior when between workers #10427

Closed
scottnonnenberg opened this issue Mar 21, 2015 · 4 comments

@scottnonnenberg

On OS X, I've noticed a big difference in how the Node.js master process handles incoming connections when no worker is ready to accept them. In node 0.10.36 (and earlier), the connection was held open, so a worker started after the request arrived still had the chance to handle it. In node 0.12.0, connections that arrive between workers are refused outright.

At the very least, this should be documented.

Example code and output on both 0.10.36 and 0.12.0 follows:

var cluster = require('cluster');
var http = require('http');
var supertest = require('supertest');
var PORT = 3000;

// cluster.schedulingPolicy = cluster.SCHED_NONE;

// worker process: serve requests; a request to /error crashes the worker
if (!cluster.isMaster) {
  http.createServer(function (req, res) {
    if (req.url === '/error') {
      setTimeout(function() {
        throw new Error('something went wrong!');
      }, 500);
    }
    else {
      res.writeHead(200, {'Content-Type': 'text/plain'});
      res.end('Hello World\n');
    }
  }).listen(PORT);

  console.log('Worker %s running at port %s', cluster.worker.id, PORT);
}
// master process: fork one worker, crash it via /error, then immediately
// make a second request to see how the master handles it with no worker up
else {
  var count = 0;
  var request = supertest('http://localhost:' + PORT);

  var hitWorker = function(count) {
    console.log('%s: Worker listening! Hitting it...', count);

    request
      .get('/error')
      .expect(200, function(err, res) {
        console.log('%s: Worker taken down, now making second request', count);

        request
          .get('/')
          .expect('Hello World\n')
          .expect(200, function(err, res) {
            console.log('%s: Second request complete. Error:', count, err);
          });
      });
  };

  cluster.on('disconnect', function() {
    count +=1;
    if (count < 2) {
      cluster.fork();
    }
  });

  cluster.on('listening', function() {
    hitWorker(count);
  });

  // start just one worker
  cluster.fork();

  var interval = setInterval(function() {
    console.log('...');
  }, 1000);
  interval.unref();
}

Output:

node 0.12.0 (scheduling policy does not make a difference):

Worker 1 running at port 3000
0: Worker listening! Hitting it...
/Users/scottnonnenberg/Development/thehelp/cluster/test.js:13
        throw new Error('something went wrong!');
              ^
Error: something went wrong!
    at null._onTimeout (/test.js:13:15)
    at Timer.listOnTimeout (timers.js:110:15)
0: Worker taken down, now making second request
0: Second request complete. Error: { [Error: connect ECONNREFUSED]
  code: 'ECONNREFUSED',
  errno: 'ECONNREFUSED',
  syscall: 'connect' }
Worker 2 running at port 3000
1: Worker listening! Hitting it...
...
/Users/scottnonnenberg/Development/thehelp/cluster/test.js:13
        throw new Error('something went wrong!');
              ^
Error: something went wrong!
    at null._onTimeout (/test.js:13:15)
    at Timer.listOnTimeout (timers.js:110:15)
1: Worker taken down, now making second request
1: Second request complete. Error: { [Error: connect ECONNREFUSED]
  code: 'ECONNREFUSED',
  errno: 'ECONNREFUSED',
  syscall: 'connect' }

node 0.10.36:

Worker 1 running at port 3000
0: Worker listening! Hitting it...

/Users/scottnonnenberg/Development/thehelp/cluster/test.js:13
        throw new Error('something went wrong!');
              ^
Error: something went wrong!
    at null._onTimeout (/test.js:13:15)
    at Timer.listOnTimeout [as ontimeout] (timers.js:112:15)
0: Worker taken down, now making second request
Worker 2 running at port 3000
1: Worker listening! Hitting it...
0: Second request complete. Error: null
...

/Users/scottnonnenberg/Development/thehelp/cluster/test.js:13
        throw new Error('something went wrong!');
              ^
Error: something went wrong!
    at null._onTimeout (/test.js:13:15)
    at Timer.listOnTimeout [as ontimeout] (timers.js:112:15)
1: Worker taken down, now making second request
...
...
...
...
^C

This version hangs because a third worker is never started and the master keeps the connection open. Note also that '0: Second request complete' actually comes after '1: Worker listening!' - that initial second request ends up being handled by the second worker.
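Under the 0.12 behavior, a client that wants something like the old "wait for the next worker" semantics has to retry on its own. A minimal sketch against the same server on port 3000 - the getWithRetry helper, the 250ms retry interval, and the attempt count are illustrative, not part of the repro above:

var http = require('http');

// Retry a GET until a worker is accepting connections again, or until we run
// out of attempts. ECONNREFUSED is what 0.12 returns while between workers.
function getWithRetry(path, attemptsLeft, callback) {
  var req = http.get({ host: 'localhost', port: 3000, path: path }, function(res) {
    callback(null, res);
  });

  req.on('error', function(err) {
    if (err.code === 'ECONNREFUSED' && attemptsLeft > 0) {
      setTimeout(function() {
        getWithRetry(path, attemptsLeft - 1, callback);
      }, 250);
    }
    else {
      callback(err);
    }
  });
}

getWithRetry('/', 20, function(err, res) {
  console.log('Result:', err || res.statusCode);
});

With 0.10's behavior the first attempt simply blocks until a worker is listening; with 0.12 each attempt made between workers surfaces as ECONNREFUSED and the retry loop has to paper over the gap.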

@scottnonnenberg scottnonnenberg changed the title Breaking change: cluster connection behavior between workers Breaking change: cluster connection behavior when between workers Mar 21, 2015
scottnonnenberg added a commit to thehelp/cluster that referenced this issue Mar 23, 2015
* Small tweak to `Graceful._finalLog` - instead of relying on the `winston` callback or a `setTimeout(fn, 250)` to ensure that the final log entry hits the disk, we just do a `setTimeout(fn, 0)` to give it a chance. The tests no longer check for that last entry, because it isn't reliable. We may introduce a feature in the future where the process is allowed to die naturally, since we've already stopped the server, etc.; this would require that we, and the overall client program, `unref()` all timers.
* Small tweak to `Master._restartWorker` - it seems we would sometimes get a `disconnect` event before the worker had been removed from `cluster.workers`, so we occasionally failed to log the very important error 'No workers currently running!' Now we check our own list at `this._workers` (see the sketch after this list).
* Overhaul of tests due to [this breaking change in node 0.12/iojs](nodejs/node-v0.x-archive#10427). Tests were previously assuming that a request immediately after worker crash would hit the next worker; now the connection is refused until the new worker is up.
* Travis now runs on node 0.12 and iojs 1.4/1.5/1.6 only. Didn't feel like making the tests work on 0.10 as well as the new systems.
* Update dev dependencies
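For reference, a minimal sketch of the idea behind the `Master._restartWorker` tweak above - keeping our own worker map instead of trusting `cluster.workers` at `disconnect` time. The `Master` class and its method names here are illustrative, not the actual thehelp/cluster code:

var cluster = require('cluster');
var http = require('http');

function Master() {
  // Track workers ourselves; depending on timing, a worker may or may not
  // already have been removed from cluster.workers when 'disconnect' fires.
  this._workers = {};
}

Master.prototype._startWorker = function() {
  var worker = cluster.fork();
  this._workers[worker.id] = worker;
  return worker;
};

Master.prototype._restartWorker = function(worker) {
  delete this._workers[worker.id];

  if (!Object.keys(this._workers).length) {
    console.error('No workers currently running!');
  }

  this._startWorker();
};

if (cluster.isMaster) {
  var master = new Master();

  cluster.on('disconnect', function(worker) {
    master._restartWorker(worker);
  });

  master._startWorker();
}
else {
  // Worker: in a real app this would be the HTTP server from the repro above.
  http.createServer(function(req, res) {
    res.end('Hello World\n');
  }).listen(3000);
}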
@rpaterson

The OP is being too charitable calling this a "Breaking change". IMO this is a serious regression that is preventing us from upgrading from v0.10 to v0.12. One of the main reasons for using a cluster is to provide a high availability service - in v0.12 that is impossible because if all the workers die the clients will see "connection refused".

@Priyanky

+1

@Fishrock123

Please see nodejs/node#2606 & nodejs/node#1239 (comment)

@jasnell Should we try to backport the docs patch to 0.12?

@jasnell
Member

jasnell commented Sep 3, 2015

Yes, definitely. There are a few updates that we still need to back port. I'd say that's a lower priority tho
