This repository has been archived by the owner on Apr 22, 2023. It is now read-only.

Breaking change: cluster connection behavior when between workers #10427

Closed
scottnonnenberg opened this issue Mar 21, 2015 · 4 comments

@scottnonnenberg

On OS X, I've noticed a big difference in how the Node.js master process handles incoming connections when no worker is ready to accept them. In node 0.10.36 (and earlier), the connection was held open, so a worker started after the request arrived still had the chance to handle it. In node 0.12.0, connections that arrive between workers are refused outright.

At the very least, this should be documented.

Example code and output on both 0.10.36 and 0.12.0 follows:

var cluster = require('cluster');
var http = require('http');
var supertest = require('supertest');
var PORT = 3000;

// cluster.schedulingPolicy = cluster.SCHED_NONE;

// worker process: serve requests; a request to /error crashes the worker
if (!cluster.isMaster) {
  http.createServer(function (req, res) {
    if (req.url === '/error') {
      setTimeout(function() {
        throw new Error('something went wrong!');
      }, 500);
    }
    else {
      res.writeHead(200, {'Content-Type': 'text/plain'});
      res.end('Hello World\n');
    }
  }).listen(PORT);

  console.log('Worker %s running at port %s', cluster.worker.id, PORT);
}
// master process: fork one worker, crash it via /error, then immediately
// make a second request to see how the master handles it with no worker up
else {
  var count = 0;
  var request = supertest('http://localhost:' + PORT);

  var hitWorker = function(count) {
    console.log('%s: Worker listening! Hitting it...', count);

    request
      .get('/error')
      .expect(200, function(err, res) {
        console.log('%s: Worker taken down, now making second request', count);

        request
          .get('/')
          .expect('Hello World\n')
          .expect(200, function(err, res) {
            console.log('%s: Second request complete. Error:', count, err);
          });
      });
  };

  cluster.on('disconnect', function() {
    count +=1;
    if (count < 2) {
      cluster.fork();
    }
  });

  cluster.on('listening', function() {
    hitWorker(count);
  });

  // start just one worker
  cluster.fork();

  var interval = setInterval(function() {
    console.log('...');
  }, 1000);
  interval.unref();
}

Output:

node 0.12.0 (scheduling policy does not make a difference):

Worker 1 running at port 3000
0: Worker listening! Hitting it...
/Users/scottnonnenberg/Development/thehelp/cluster/test.js:13
        throw new Error('something went wrong!');
              ^
Error: something went wrong!
    at null._onTimeout (/test.js:13:15)
    at Timer.listOnTimeout (timers.js:110:15)
0: Worker taken down, now making second request
0: Second request complete. Error: { [Error: connect ECONNREFUSED]
  code: 'ECONNREFUSED',
  errno: 'ECONNREFUSED',
  syscall: 'connect' }
Worker 2 running at port 3000
1: Worker listening! Hitting it...
...
/Users/scottnonnenberg/Development/thehelp/cluster/test.js:13
        throw new Error('something went wrong!');
              ^
Error: something went wrong!
    at null._onTimeout (/test.js:13:15)
    at Timer.listOnTimeout (timers.js:110:15)
1: Worker taken down, now making second request
1: Second request complete. Error: { [Error: connect ECONNREFUSED]
  code: 'ECONNREFUSED',
  errno: 'ECONNREFUSED',
  syscall: 'connect' }

node 0.10.36:

Worker 1 running at port 3000
0: Worker listening! Hitting it...

/Users/scottnonnenberg/Development/thehelp/cluster/test.js:13
        throw new Error('something went wrong!');
              ^
Error: something went wrong!
    at null._onTimeout (/test.js:13:15)
    at Timer.listOnTimeout [as ontimeout] (timers.js:112:15)
0: Worker taken down, now making second request
Worker 2 running at port 3000
1: Worker listening! Hitting it...
0: Second request complete. Error: null
...

/Users/scottnonnenberg/Development/thehelp/cluster/test.js:13
        throw new Error('something went wrong!');
              ^
Error: something went wrong!
    at null._onTimeout (/test.js:13:15)
    at Timer.listOnTimeout [as ontimeout] (timers.js:112:15)
1: Worker taken down, now making second request
...
...
...
...
^C

This version hangs because a third worker is never started and the master keeps the connection open. Note also that '0: Second request complete' actually comes after '1: Worker listening!' - that initial second request ends up being handled by the second worker.
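Under the 0.12 behavior, a client that wants something like the old "wait for the next worker" semantics has to retry on its own. A minimal sketch against the same server on port 3000 - the getWithRetry helper, the 250ms retry interval, and the attempt count are illustrative, not part of the repro above:

var http = require('http');

// Retry a GET until a worker is accepting connections again, or until we run
// out of attempts. ECONNREFUSED is what 0.12 returns while between workers.
function getWithRetry(path, attemptsLeft, callback) {
  var req = http.get({ host: 'localhost', port: 3000, path: path }, function(res) {
    callback(null, res);
  });

  req.on('error', function(err) {
    if (err.code === 'ECONNREFUSED' && attemptsLeft > 0) {
      setTimeout(function() {
        getWithRetry(path, attemptsLeft - 1, callback);
      }, 250);
    }
    else {
      callback(err);
    }
  });
}

getWithRetry('/', 20, function(err, res) {
  console.log('Result:', err || res.statusCode);
});

With 0.10's behavior the first attempt simply blocks until a worker is listening; with 0.12 each attempt made between workers surfaces as ECONNREFUSED and the retry loop has to paper over the gap.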

@scottnonnenberg scottnonnenberg changed the title Breaking change: cluster connection behavior between workers Breaking change: cluster connection behavior when between workers Mar 21, 2015
scottnonnenberg added a commit to thehelp/cluster that referenced this issue Mar 23, 2015
* Small tweak to `Graceful._finalLog` - instead of relying on the `winston` callback or a `setTimeout(fn, 250)` to ensure that the final log entry hits the disk, we just do a `setTimeout(fn, 0)` to give it a chance. The tests no longer check for that last entry, because it isn't reliable. We may introduce a feature in the future where the process is allowed to die naturally, since we've already stopped the server, etc.; this would require that we, and the overall client program, `unref()` all timers.
* Small tweak to `Master._restartWorker` - it seems we would sometimes get a `disconnect` event before the worker had been removed from `cluster.workers`, so we occasionally failed to log the very important error 'No workers currently running!' Now we check our own list at `this._workers` (see the sketch after this list).
* Overhaul of tests due to [this breaking change in node 0.12/iojs](nodejs/node-v0.x-archive#10427). Tests were previously assuming that a request immediately after worker crash would hit the next worker; now the connection is refused until the new worker is up.
* Travis now runs on node 0.12 and iojs 1.4/1.5/1.6 only. Didn't feel like making the tests work on 0.10 as well as the new systems.
* Update dev dependencies
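For reference, a minimal sketch of the idea behind the `Master._restartWorker` tweak above - keeping our own worker map instead of trusting `cluster.workers` at `disconnect` time. The `Master` class and its method names here are illustrative, not the actual thehelp/cluster code:

var cluster = require('cluster');
var http = require('http');

function Master() {
  // Track workers ourselves; depending on timing, a worker may or may not
  // already have been removed from cluster.workers when 'disconnect' fires.
  this._workers = {};
}

Master.prototype._startWorker = function() {
  var worker = cluster.fork();
  this._workers[worker.id] = worker;
  return worker;
};

Master.prototype._restartWorker = function(worker) {
  delete this._workers[worker.id];

  if (!Object.keys(this._workers).length) {
    console.error('No workers currently running!');
  }

  this._startWorker();
};

if (cluster.isMaster) {
  var master = new Master();

  cluster.on('disconnect', function(worker) {
    master._restartWorker(worker);
  });

  master._startWorker();
}
else {
  // Worker: in a real app this would be the HTTP server from the repro above.
  http.createServer(function(req, res) {
    res.end('Hello World\n');
  }).listen(3000);
}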
@rpaterson

The OP is being too charitable calling this a "Breaking change". IMO this is a serious regression that is preventing us from upgrading from v0.10 to v0.12. One of the main reasons for using a cluster is to provide a high availability service - in v0.12 that is impossible because if all the workers die the clients will see "connection refused".

@Priyanky

+1

@Fishrock123

Please see nodejs/node#2606 & nodejs/node#1239 (comment)

@jasnell Should we try to backport the docs patch to 0.12?

@jasnell
Member

jasnell commented Sep 3, 2015

Yes, definitely. There are a few updates that we still need to back port. I'd say that's a lower priority tho
