Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cluster crash in main process, callback undefined when handling internal messages #3072

Closed
monsanto opened this issue Sep 26, 2015 · 128 comments
Closed
Labels
cluster Issues and PRs related to the cluster subsystem.

Comments

@monsanto
Copy link
Contributor

Found this issue in production on 4.1.1:

The main process crashed with this stack trace:
TypeError: Cannot read property 'apply' of undefined
    at ChildProcess.<anonymous> (cluster.js:710:7)
    at emitTwo (events.js:92:20)
    at ChildProcess.emit (events.js:172:7)
    at handleMessage (internal/child_process.js:685:10)
    at Pipe.channel.onread (internal/child_process.js:440:11)

I am unable to create a reproducible test case, and we can only get this error to occur at scale. Works in v2.3.3, the previous version we were using.

My take on this: if you look at the error line, this can only happen if cb was undefined, or if there is no entry in callbacks for message.ack. The first case is not possible as cb is always onmessage. So that leaves the second case. One idea is that a message somehow gets duplicated (maybe a bug in libuv?), which would guarantee that there be no entry in callbacks as line 708 would delete the entry the first time the message was received.

@brendanashworth brendanashworth added the cluster Issues and PRs related to the cluster subsystem. label Sep 26, 2015
@Zarel
Copy link
Contributor

Zarel commented Sep 26, 2015

More details:

This crash has happened twice yesterday and three times today, on the main server of Pokémon Showdown, which gets approximately 100,000 users/day and 10,000 concurrent users at peak. The crashes don't happen at peak, though, they seem to happen randomly (I don't know if there's a DoS attacker involved).

On the other hand, our smaller server (which gets closer to 1000 users/day and 100 concurrent users) hasn't crashed a single time, so this is most likely an issue that only occurs at scale.

The architecture uses cluster to spawn one child process, which sets up a SockJS server.

@rvagg
Copy link
Member

rvagg commented Sep 29, 2015

/cc @bnoordhuis, blame suggests 41b75ca as a starting point for this

@bnoordhuis
Copy link
Member

If it works with v2.3.3, then it's unlikely that 41b75ca is the culprit, that dates back to v0.11.

I don't spot any glaring bugs in lib/cluster.js. If you have a test case or some way for me to reproduce, that would certainly help.

@Zarel
Copy link
Contributor

Zarel commented Sep 29, 2015

Needless to say, I've not seen this crash outside of production, where the server with 10k concurrent users had five crashes at seemingly-random times, while the server with 100 concurrent users did not crash at all.

I don't have any more information to offer you, although I would be willing to add whatever debug code you'd like to the production server and wait for it to crash again.

@bnoordhuis
Copy link
Member

Maybe this?

diff --git a/lib/cluster.js b/lib/cluster.js
index 602cc8d..bc1bfa1 100644
--- a/lib/cluster.js
+++ b/lib/cluster.js
@@ -690,7 +690,7 @@ var callbacks = {};
 function sendHelper(proc, message, handle, cb) {
   // Mark message as internal. See INTERNAL_PREFIX in lib/child_process.js
   message = util._extend({ cmd: 'NODE_CLUSTER' }, message);
-  if (cb) callbacks[seq] = cb;
+  if (cb) callbacks[seq] = cb, console.log('insert', seq);
   message.seq = seq;
   seq += 1;
   proc.send(message, handle);
@@ -706,7 +706,9 @@ function internal(worker, cb) {
     if (message.ack !== undefined) {
       fn = callbacks[message.ack];
       delete callbacks[message.ack];
+      console.log('delete', message.ack);
     }
+    if (typeof fn !== 'function') console.trace(arguments);
     fn.apply(worker, arguments);
   };
 }

@Zarel
Copy link
Contributor

Zarel commented Oct 3, 2015

stdout:

insert 188723
delete 188723
insert 188724
delete 188724
insert 188725
delete 188725
insert 188726
delete 188726
insert 188727
delete 188727
insert 188728
delete 188728
insert 188729
delete 188729
insert 188730
delete 188730
insert 188731
delete 188731
insert 188732
delete 188732
insert 188733
delete 188733

CRASH: TypeError: Cannot read property '_externalStream' of null
    at Server.connectionListener (_http_server.js:332:32)
    at emitOne (events.js:77:13)
    at Server.emit (events.js:169:7)
    at Object.onconnection (net.js:1427:8)
    at onconnection (cluster.js:646:26)
    at Worker.onmessage (cluster.js:527:9)
    at process.<anonymous> (cluster.js:712:16)
    at emitTwo (events.js:92:20)
    at process.emit (events.js:172:7)
    at handleMessage (internal/child_process.js:686:10)

insert 188734
delete 188733

(I have a crashguard running)

stderr:

node: ../src/stream_base.h:203: void node::StreamBase::Consume(): Assertion `(consumed_) == (false)' failed. 
Trace: { '0': { cmd: 'NODE_CLUSTER', accepted: true, ack: 188733, seq: 188737 },
  '1': undefined } 
    at ChildProcess.<anonymous> (cluster.js:711:43) 
    at emitTwo (events.js:92:20) 
    at ChildProcess.emit (events.js:172:7) 
    at handleMessage (internal/child_process.js:686:10) 
    at Pipe.channel.onread (internal/child_process.js:440:11) 

@bnoordhuis
Copy link
Member

/cc @indutny - see the assert in src/stream_base.h. Sounds like cluster is just surfacing it.

@Zarel
Copy link
Contributor

Zarel commented Oct 3, 2015

The TypeError and the assert in stream_base seem new. I'm using v4.1.2-proposal.

I'm still willing to add whatever other debug code you'd like to the production server and wait for it to crash again.

@indutny
Copy link
Member

indutny commented Oct 3, 2015

@bnoordhuis this assertion can happen if cluster is emitting the same handle object twice, and the JS error, is rather odd too. connection event is emitted with socket._handle === null...

@bnoordhuis
Copy link
Member

@indutny I'm suspecting 291b310 ("stream_base: various improvements"), it introduced the _externalStream property from the error message.

lib/cluster.js itself hasn't seen significant changes between v2.3.3 and v4.1.1, 291b310 seems like the most likely culprit.

@indutny
Copy link
Member

indutny commented Oct 4, 2015

True, but this change is not that major too. It looks like there is a socket created by a cluster, that does not have a handle on it. Before _externalStream it won't fail, but it won't succeed too - the socket will be just hanging without any reads/writes happening on it.

Rephrasing this a bit, I may be wrong, but it looks like 291b310 just revealed existing problem, not introduced a new one.

@Zarel
Copy link
Contributor

Zarel commented Oct 7, 2015

I don't want to sound impatient or anything, but this bug is forcing me to stay on a really old version of iojs (even iojs 3.x crashes, although it's a different crash).

I'm still willing to add whatever other debug code you'd like to the production server and wait for it to crash again.

@bnoordhuis
Copy link
Member

I've been going over the changes in lib/ and src/ since v2.3.3 and 291b310 still seems like the most likely suspect to me.

@Zarel Do you still get the error with env NODE_CLUSTER_SCHED_POLICY=none node app.js?

@indutny
Copy link
Member

indutny commented Oct 7, 2015

@bnoordhuis perhaps there might be a reason to investigate it further?

@bnoordhuis
Copy link
Member

Most certainly. How though?

@indutny
Copy link
Member

indutny commented Oct 7, 2015

@bnoordhuis oops, didn't see your SCHED_POLICY comment.

@Zarel
Copy link
Contributor

Zarel commented Oct 12, 2015

Using env NODE_CLUSTER_SCHED_POLICY=none node app.js:

Crashed, after 18 hours (way more uptime than before, but I don't think that's relevant).

The main process crashed with this stack trace:
TypeError: Cannot read property 'apply' of undefined
    at ChildProcess.<anonymous> (cluster.js:710:7)
    at emitTwo (events.js:92:20)
    at ChildProcess.emit (events.js:172:7)
    at handleMessage (internal/child_process.js:686:10)
    at Pipe.channel.onread (internal/child_process.js:440:11)

@srlowe
Copy link

srlowe commented Oct 16, 2015

Just in case it's helpful - I've had this error in node versions 4.11, 4.12, 4.20

@srlowe
Copy link

srlowe commented Oct 16, 2015

Got a similar (though different) error yesterday. Not sure if related. Please see: #3395

@srlowe
Copy link

srlowe commented Oct 27, 2015

@bnoordhuis I've found a way to provoke this error consistently - Just start two instances of a node app on the same port (an app using cluster). Along with the expected EADDRINUSE error, this error appears.

@bnoordhuis
Copy link
Member

Haven't forgotten about this issue, it's next on my list. I don't think #3677 addresses this but it wouldn't hurt to try it if you have some spare cycles.

@Zarel
Copy link
Contributor

Zarel commented Nov 24, 2015

I tried 5.1.0 (I assume that one has #3677) in it, and I also tried spawning more cluster processes (4 instead of 1, so ~3000 connections/process instead of ~12000), and it's still crashing for me.

It managed to last multiple days of uptime instead of multiple hours, though! I think the process count had something to do with it.

@Zarel
Copy link
Contributor

Zarel commented Nov 28, 2015

This worries me:

https://nodejs.org/en/blog/vulnerability/cve-2015-8027_cve-2015-6764/

Is it possible to release a 2.x release that fixes this vulnerability? Newer versions crash Pokémon Showdown, and older versions are incompatible.

@bnoordhuis
Copy link
Member

@Zarel The v2.x release branch is abandoned. I'll be happy to point you to the commits after the v4.x / v5.x release so you can try to back-port them but that's where the supports ends.

@Zarel
Copy link
Contributor

Zarel commented Nov 29, 2015

I was afraid I'd have to backport it... yes, that's probably the best solution in that case.

@bnoordhuis
Copy link
Member

Ping me after the release and I'll link you to the commits.

@srlowe
Copy link

srlowe commented Oct 14, 2016

@petermp The file descriptor limits on our servers are set to 200000 (hard and soft), and we are still regularly encountering this bug, so I don't think this is the cause. Perhaps a lower limit increases the chances of it happening.

@petermp
Copy link

petermp commented Oct 14, 2016

@bgSosh Yes, I recognized that. My encounter is one of the scenarios. I think the statement leading to the crash indicates the error leading to '_externalStream' crash. In my case, it was:

_http_server.js:340
socket.on('drain', socketOnDrain);
^

I remember running into the same type of crash with different leading error.
I start keeping track of them ... Will do more stress tests.

@mfractal
Copy link

getting quite a few of these as well

TypeError:` Cannot read property '_externalStream' of null
    at Server.connectionListener (_http_server.js:348:32)
    at emitOne (events.js:96:13)
    at Server.emit (events.js:188:7)
    at Object.onconnection (net.js:1458:8)
    at onconnection (cluster.js:685:26)
    at Worker.onmessage (cluster.js:564:9)
    at process.<anonymous> (cluster.js:766:8)
    at emitTwo (events.js:111:20)
    at process.emit (events.js:191:7)
    at process.nextTick (internal/child_process.js:744:12)

Not running with cluster module but am running pm2 in cluster mode.

@Jevin23
Copy link

Jevin23 commented Nov 10, 2016

sometime have too this error on hi-load nodes with http and net
node v.7.0.0


TypeError: Cannot read property '_externalStream' of null
    at Server.connectionListener (_http_server.js:349:32)
    at emitOne (events.js:96:13)
    at Server.emit (events.js:188:7)
    at Object.onconnection (net.js:1454:8)
    at onconnection (cluster.js:685:26)
    at Worker.onmessage (cluster.js:564:9)
    at process.<anonymous> (cluster.js:766:8)
    at emitTwo (events.js:111:20)
    at process.emit (events.js:191:7)
    at process.nextTick (internal/child_process.js:744:12)

@petermp
Copy link

petermp commented Nov 11, 2016

@mfractal @Jevin23 Do you see any error messages immediately prior to _externalStream error messages? That could give a hint to why _externalStream happened.

@kirill-oficerov
Copy link

just in case I'll say that I stumbled upon the error:

node::StreamBase::Consume(): Assertion `(consumed_) == (false)' failed

while was emitting:

io.emit('connection', data);

and the reason was circular data in variable "data". I found it by trying JSON.stringify() resulted in error. Fixed the problem by removing circular data.

@bnoordhuis
Copy link
Member

That's interesting. If you have a standalone test case (one that doesn't depend on third-party modules like socket.io), I'll be happy to take a look.

@kirill-oficerov
Copy link

it's a big project, I can't. Really sorry about that.

io.emit('connection', data);

data here is a socket. Either Websocket (library "ws") or Socket (from callback of net.createServer())

@MichaelErmer
Copy link

MichaelErmer commented Nov 20, 2016

I think this error is happening if you use worker.send with big messages under high load, at least it is gone since we stoped using the messaging part of cluster.
We used to JSON.parse() the message content in the moment it was received, which caused thne workers express instances to idle a lot of connections, which then probably lead to the error above, probably because of FD limits.

@JelteF
Copy link

JelteF commented Feb 3, 2017

I'm experiencing this crash reliably when opening thousands of connections to a Faye server: https://faye.jcoglan.com/

Edit: I run two faye servers combined with pm2.

Edit2:

TypeError: Cannot read property '_externalStream' of null
    at Server.connectionListener (_http_server.js:348:32)
    at emitOne (events.js:101:20)
    at Server.emit (events.js:188:7)
    at Object.onconnection (net.js:1465:8)
    at onconnection (cluster.js:684:26)
    at Worker.onmessage (cluster.js:563:9)
    at process.<anonymous> (cluster.js:765:8)
    at emitTwo (events.js:111:20)
    at process.emit (events.js:191:7)
    at process.nextTick (internal/child_process.js:744:12)
    at _combinedTickCallback (internal/process/next_tick.js:67:7)
    at process._tickDomainCallback [as _tickCallback] (internal/process/next_tick.js:122:9)

@nekrasoft
Copy link

Same here on node v7.4.0 during highload tests with 24K connections

TypeError: Cannot read property '_externalStream' of null#012
    at Server.connectionListener (_http_server.js:324:32)#012
    at emitOne (events.js:101:20)#012
    at Server.emit (events.js:188:7)#012
    at Object.onconnection (net.js:1454:8)#012
    at onconnection (cluster.js:686:26)#012
    at Worker.onmessage (cluster.js:565:9)#012
    at process.<anonymous> (cluster.js:768:8)#012
    at emitTwo (events.js:111:20)#012
    at process.emit (events.js:191:7)#012
    at process.nextTick (internal/child_process.js:752:12)

@bnoordhuis
Copy link
Member

@nekrasoft Does it also happen with the latest v7?

@nekrasoft
Copy link

@bnoordhuis unfortunately, yes

Exactly the same error,
a minute ago we've finished tests on node v7.7.1
(I know 7.7.2 is already the latest, but forgive us, it took us too long to update to 7.7.1)

TypeError: Cannot read property '_externalStream' of null#012
    at Server.connectionListener (_http_server.js:326:32)#012
    at emitOne (events.js:101:20)#012
    at Server.emit (events.js:191:7)#012
    at Object.onconnection (net.js:1456:8)#012
    at onconnection (internal/cluster/child.js:168:12)#012
    at Worker.onmessage (internal/cluster/child.js:43:7)#012
    at process.onInternalMessage (internal/cluster/utils.js:42:8)#012
    at emitTwo (events.js:111:20)#012
    at process.emit (events.js:194:7)#012
    at process.nextTick (internal/child_process.js:766:12)

@Cmato
Copy link

Cmato commented Apr 24, 2017

I am not sure if this is the same bug, but hope it's gonna help somehow.

I have tested it on "xnu-3789.51.2~3/RELEASE_X86_64" with node v6.10.2 and v7.9.0 with the same result.

Stack trace:

TypeError: Cannot read property 'removeListener' of undefined
    at httpSocketSetup (_http_common.js:218:9)
    at Server.connectionListener (_http_server.js:268:3)
    at emitOne (events.js:96:13)
    at Server.emit (events.js:191:7)
    at process.<anonymous> (test.js:26:24)
    at emitTwo (events.js:111:20)
    at process.emit (events.js:194:7)
    at process.nextTick (internal/child_process.js:766:12)
    at _combinedTickCallback (internal/process/next_tick.js:73:7)
    at process._tickCallback (internal/process/next_tick.js:104:9) 

Here is a code: Test

With just one request from testClient.js it works fine.
But after few of them the connection parameter in test.js:23 happens to be undefined and causes the error.

@santigimeno
Copy link
Member

@Cmato I've been looking to your test case. The problem is that the undefined handle is indeed closed when sent to the worker. The reason may not seem evident but it's happening because how the the passing of handles is implemented: when sending multiple handles from one process to another, these are sent sequentially, so until a handle is not successfully sent to the target process (and this means sending the handle and receiving an acknowledgement), the next handle is not sent.

In your test case, when the time comes for a specific handle (never the first, usually 3rd or the 4th) to be sent to the worker, it has already been closed thus the undefined handle. So in your code, in the message handler, you should take that case into account and check that the handle is indeed defined before trying to access to it.

@Zarel
Copy link
Contributor

Zarel commented Apr 25, 2017

@santigimeno, it seems like the error message is still wrong in this case. If it's possible for a connection to have an undefined handle, it seems like something should be testing for that before it gets to the crash there.

@santigimeno
Copy link
Member

@Zarel I'm not sure I follow. The error comes from executing this code with connection === undefined.

@cjihrig
Copy link
Contributor

cjihrig commented May 23, 2017

@Zarel @santigimeno a fix for the problem is here. I'm unsure about it because while it reduces crashes, it also makes failures more silent. The reason, as @santigimeno said, is that the connections are closing while waiting in the queue to be sent to the worker.

That commit, combined with exposing pauseOnConnect to the HTTP server let me execute your example with no failures.

@Zarel
Copy link
Contributor

Zarel commented May 24, 2017

@cjihrig Yes, I've already been running something similar on my server for years, but with more diagnostic information. I've been manually merging it into each version of Node since 4.1. Specifically:

diff --git a/lib/_http_server.js b/lib/_http_server.js
index 46d0dbc..4fafa0f 100644
--- a/lib/_http_server.js
+++ b/lib/_http_server.js
@@ -321,11 +321,15 @@ function connectionListener(socket) {
   socket.on = socketOnWrap;
 
   // We only consume the socket if it has never been consumed before.
-  var external = socket._handle._externalStream;
-  if (!socket._handle._consumed && external) {
-    parser._consumed = true;
-    socket._handle._consumed = true;
-    parser.consume(external);
+  if (socket._handle) {
+    var external = socket._handle._externalStream;
+    if (!socket._handle._consumed && external) {
+      parser._consumed = true;
+      socket._handle._consumed = true;
+      parser.consume(external);
+    }
+  } else {
+    console.trace("socket._handle was " + socket._handle);
   }
   parser[kOnExecute] =
     onParserExecute.bind(undefined, this, socket, parser, state);
diff --git a/lib/internal/child_process.js b/lib/internal/child_process.js
index da6ed28..79283ee 100644
--- a/lib/internal/child_process.js
+++ b/lib/internal/child_process.js
@@ -473,10 +473,14 @@ function setupChannel(target, channel) {
         // There will be at most one NODE_HANDLE message in every chunk we
         // read because SCM_RIGHTS messages don't get coalesced. Make sure
         // that we deliver the handle with the right message however.
-        if (message && message.cmd === 'NODE_HANDLE')
+        if (message && message.cmd === 'NODE_HANDLE') {
+          if (!recvHandle) {
+            console.error(new Error('recvHandle === ' + recvHandle).stack);
+          }
           handleMessage(target, message, recvHandle);
-        else
+        } else {
           handleMessage(target, message, undefined);
+        }
       }
       jsonBuffer = incompleteChunk;
       this.buffering = jsonBuffer.length !== 0;
diff --git a/lib/internal/cluster/utils.js b/lib/internal/cluster/utils.js
index ba72ff9..50b1c93 100644
--- a/lib/internal/cluster/utils.js
+++ b/lib/internal/cluster/utils.js
@@ -39,6 +39,10 @@ function internal(worker, cb) {
       delete callbacks[message.ack];
     }
 
-    fn.apply(worker, arguments);
+    if (fn) {
+      fn.apply(worker, arguments);
+    } else {
+      console.error("Duplicate ack for " + JSON.stringify(message));
+    }
   };
 }

This doesn't actually fix the bug, though; it only prevents it from crashing the server.

@santigimeno
Copy link
Member

@cjihrig Why is a fix needed? Wouldn't checking the existence of the handle on reception suffice?

@cjihrig
Copy link
Contributor

cjihrig commented May 24, 2017

@santigimeno yea, it does suffice. It still seems like buggy behavior though.

@santigimeno
Copy link
Member

Can somebody check if #13235 fixes the issue?

@marianogenovese
Copy link

marianogenovese commented May 30, 2017

Hi, i have this error (like errors described above):

_http_server.js:324
  var external = socket._handle._externalStream;
                               ^

TypeError: Cannot read property '_externalStream' of null
    at Server.connectionListener (_http_server.js:324:32)
    at emitOne (events.js:101:20)
    at Server.emit (events.js:191:7)
    at Object.onconnection (net.js:1490:8)
    at onconnection (internal/cluster/child.js:168:12)
    at Worker.onmessage (internal/cluster/child.js:43:7)
    at process.onInternalMessage (internal/cluster/utils.js:42:8)
    at emitTwo (events.js:111:20)
    at process.emit (events.js:194:7)
    at process.nextTick (internal/child_process.js:766:12)
    at _combinedTickCallback (internal/process/next_tick.js:73:7)
    at process._tickCallback (internal/process/next_tick.js:104:9)

Im doing a stress test to my node app, trying to open 100K concurrent connections (using external clients from another servers) but this error occurs with 60K connection and my app crash. Im using cluster module and use fork method.

The super weird thing is... this mistake started today ...

What can i do? Help.

@petermp
Copy link

petermp commented May 30, 2017

Try to increase the max number of file descriptors in your system to higher than 60K to see if it goes away.

@marianogenovese
Copy link

@petermp Thanks!, thats works.

@bnoordhuis
Copy link
Member

I'm fairly sure #13235 fixed this. I'll close out the issue.

@Zarel
Copy link
Contributor

Zarel commented Feb 7, 2018

🎉

I haven't encountered this error in a while. I'm glad it's finally solved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cluster Issues and PRs related to the cluster subsystem.
Projects
None yet
Development

No branches or pull requests