Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault in node::StreamBase::GetFD #2847

Closed
ctizen opened this issue Sep 13, 2015 · 34 comments
Closed

Segfault in node::StreamBase::GetFD #2847

ctizen opened this issue Sep 13, 2015 · 34 comments
Labels
c++ Issues and PRs that require attention from people who are familiar with C++.

Comments

@ctizen
Copy link

ctizen commented Sep 13, 2015

Hello,

While performing the load testing on my application I've noticed a segfault. My environment and conditions are:

  • VPS under OpenVZ, Ubuntu 14.04
  • Node v4.0.0
  • Using PM2 v0.14.7 (cluster mode, 4 processes in cluster)
  • Big application (about 1000+ files, total codebase size ~2MB without node_modules)
  • All node_modules are built on current version of Node
  • At pm2's config I've set the memory limit of 500 MB per process.

So, under load (about 50-100 rps) when some process runs out of memory, it's killed and restarted by PM2, but sometimes after that the parent PM2 process just fails with "Segmentation fault".

I've researched that for a while (compiled debug version, ran under gdb, etc) and found the source of segfault in src/stream_base-inl.h:

template <class Base>
void StreamBase::GetFD(Local<String> key,
                       const PropertyCallbackInfo<Value>& args) {
  StreamBase* wrap = Unwrap<Base>(args.Holder());

  if (!wrap->IsAlive()) // SEGFAULT. 
    return args.GetReturnValue().Set(UV_EINVAL);

  args.GetReturnValue().Set(wrap->GetFD());
}

I assumed that somehow the wrap pointer becomes NULL. I didn't research deeper, so I assumed that Unwrap function failed to do what it's needed for and just returned NULL, but the code of GetFD didn't handle this correctly. So I tried this small fix:

-  if (!wrap->IsAlive()) 
+  if (!wrap || !wrap->IsAlive())
     return args.GetReturnValue().Set(UV_EINVAL);

And it seems to work so far - no single failure for an hour of load testing, and I'm going to leave it working for some days to be sure.

Does this solution seem to be correct? Should I make a PR for it?

@mscdex mscdex added the c++ Issues and PRs that require attention from people who are familiar with C++. label Sep 13, 2015
@mscdex
Copy link
Contributor

mscdex commented Sep 13, 2015

/cc @indutny

@indutny
Copy link
Member

indutny commented Sep 13, 2015

@heilage-nsk may I ask you to publish a backtrace(s) of the crash? This is pretty odd to see it NULL there. Thank you!

@indutny
Copy link
Member

indutny commented Sep 13, 2015

Oh, one more thing. Is there any way to reproduce it?

@ctizen
Copy link
Author

ctizen commented Sep 14, 2015

@indutny
Here is what I managed to get from gdb:

Program received signal SIGSEGV, Segmentation fault.
0x000000000152cbdb in void node::StreamBase::GetFD<node::StreamWrap>(v8::Local<v8::String>, v8::PropertyCallbackInfo<v8::Value> const&) () at ../src/stream_base-inl.h:75
75    if (!wrap->IsAlive())
   0x000000000152cbd7 <_ZN4node10StreamBase5GetFDINS_10StreamWrapEEEvN2v85LocalINS3_6StringEEERKNS3_20PropertyCallbackInfoINS3_5ValueEEE+57>:   48 8b 45 e8 mov    rax,QWORD PTR [rbp-0x18]
=> 0x000000000152cbdb <_ZN4node10StreamBase5GetFDINS_10StreamWrapEEEvN2v85LocalINS3_6StringEEERKNS3_20PropertyCallbackInfoINS3_5ValueEEE+61>:   48 8b 00    mov    rax,QWORD PTR [rax]
   0x000000000152cbde <_ZN4node10StreamBase5GetFDINS_10StreamWrapEEEvN2v85LocalINS3_6StringEEERKNS3_20PropertyCallbackInfoINS3_5ValueEEE+64>:   48 83 c0 40 add    rax,0x40
   0x000000000152cbe2 <_ZN4node10StreamBase5GetFDINS_10StreamWrapEEEvN2v85LocalINS3_6StringEEERKNS3_20PropertyCallbackInfoINS3_5ValueEEE+68>:   48 8b 00    mov    rax,QWORD PTR [rax]
   0x000000000152cbe5 <_ZN4node10StreamBase5GetFDINS_10StreamWrapEEEvN2v85LocalINS3_6StringEEERKNS3_20PropertyCallbackInfoINS3_5ValueEEE+71>:   48 8b 55 e8 mov    rdx,QWORD PTR [rbp-0x18]
   0x000000000152cbe9 <_ZN4node10StreamBase5GetFDINS_10StreamWrapEEEvN2v85LocalINS3_6StringEEERKNS3_20PropertyCallbackInfoINS3_5ValueEEE+75>:   48 89 d7    mov    rdi,rdx
   0x000000000152cbec <_ZN4node10StreamBase5GetFDINS_10StreamWrapEEEvN2v85LocalINS3_6StringEEERKNS3_20PropertyCallbackInfoINS3_5ValueEEE+78>:   ff d0   call   rax
   0x000000000152cbee <_ZN4node10StreamBase5GetFDINS_10StreamWrapEEEvN2v85LocalINS3_6StringEEERKNS3_20PropertyCallbackInfoINS3_5ValueEEE+80>:   83 f0 01    xor    eax,0x1
   0x000000000152cbf1 <_ZN4node10StreamBase5GetFDINS_10StreamWrapEEEvN2v85LocalINS3_6StringEEERKNS3_20PropertyCallbackInfoINS3_5ValueEEE+83>:   84 c0   test   al,al
   0x000000000152cbf3 <_ZN4node10StreamBase5GetFDINS_10StreamWrapEEEvN2v85LocalINS3_6StringEEERKNS3_20PropertyCallbackInfoINS3_5ValueEEE+85>:   74 23   je     0x152cc18 <_ZN4node10StreamBase5GetFDINS_10StreamWrapEEEvN2v85LocalINS3_6StringEEERKNS3_20PropertyCallbackInfoINS3_5ValueEEE+122>
(gdb) bt
#0  0x000000000152cbdb in void node::StreamBase::GetFD<node::StreamWrap>(v8::Local<v8::String>, v8::PropertyCallbackInfo<v8::Value> const&) () at ../src/stream_base-inl.h:75
#1  0x0000237dfc3569b6 in ?? ()
#2  0x00007fffffffd7b0 in ?? ()
#3  0x0000237dfc3568c1 in ?? ()
#4  0x00007fffffffd780 in ?? ()
#5  0x00007fffffffd858 in ?? ()
#6  0x0000237dfc51899b in ?? ()
#7  0x00001ff6c0b9e919 in ?? ()
#8  0x00001487c81e50c9 in ?? ()
#9  0x0000000001e16df0 in ?? ()
#10 0x00001ff6c0b04131 in ?? ()
#11 0x00001ff6c0b04131 in ?? ()
#12 0x00001ff6c0baa171 in ?? ()
#13 0x00001487c81e50c9 in ?? ()
#14 0x0000150e070c3f01 in ?? ()
#15 0x00000275e774d399 in ?? ()
#16 0x0000000100000000 in ?? ()
#17 0x0000001600000000 in ?? ()
#18 0x0000150e070d6551 in ?? ()
#19 0x0000000100000000 in ?? ()
#20 0x00001487c81e50c9 in ?? ()
#21 0x00001ff6c0b9e919 in ?? ()
#22 0x00001ff6c0b04131 in ?? ()
#23 0x00001ff6c0b04131 in ?? ()
#24 0x00001ff6c0b04131 in ?? ()
#25 0x0000150e070d6509 in ?? ()
#26 0x00001ff6c0b9e919 in ?? ()
#27 0x00001487c8129839 in ?? ()
#28 0x00001487c8121bd9 in ?? ()
#29 0x00007fffffffd908 in ?? ()
#30 0x0000237dfc518a58 in ?? ()
#31 0x00001487c81e50c9 in ?? ()
#32 0x0000150e070c3f01 in ?? ()
#33 0x00001ff6c0b04131 in ?? ()
#34 0x00001ff6c0b04131 in ?? ()
#35 0x00001487c8129839 in ?? ()
#36 0x00001ff6c0b9ed89 in ?? ()
#37 0x0000150e070d5de9 in ?? ()
#38 0x0000001200000000 in ?? ()
#39 0x0000002600000000 in ?? ()
#40 0x0000150e070d5f21 in ?? ()
#41 0x0000000100000000 in ?? ()
#42 0x0000186f62165c31 in ?? ()
#43 0x00001ff6c0b9ed89 in ?? ()
#44 0x00001ff6c0b04131 in ?? ()
#45 0x00001ff6c0b04131 in ?? ()
#46 0x00001ff6c0b04131 in ?? ()
#47 0x0000150e070d5de9 in ?? ()
#48 0x00001ff6c0b9ed89 in ?? ()
#49 0x00001487c8129839 in ?? ()
#50 0x00001487c8121bd9 in ?? ()
#51 0x00007fffffffd9b8 in ?? ()
#52 0x0000237dfc518a58 in ?? ()
#53 0x0000186f62165c31 in ?? ()
#54 0x0000150e070c3f01 in ?? ()
#55 0x00001ff6c0b04131 in ?? ()
#56 0x00001ff6c0b04131 in ?? ()
#57 0x00001487c8129839 in ?? ()
#58 0x00001ff6c0b9d2d1 in ?? ()
#59 0x0000150e070d59f9 in ?? ()
#60 0x0000000700000000 in ?? ()
#61 0x0000001a00000000 in ?? ()
#62 0x0000150e070d5bb1 in ?? ()
#63 0x0000000100000000 in ?? ()
#64 0x0000186f6210fdf9 in ?? ()
#65 0x00001ff6c0b9d2d1 in ?? ()
#66 0x00001ff6c0b04131 in ?? ()
#67 0x00001ff6c0b04131 in ?? ()
#68 0x00001ff6c0b04131 in ?? ()
#69 0x0000150e070d59f9 in ?? ()
#70 0x00001ff6c0b9d2d1 in ?? ()
#71 0x00001487c8129839 in ?? ()
#72 0x00001487c8121bd9 in ?? ()
#73 0x00007fffffffda68 in ?? ()
#74 0x0000237dfc518a58 in ?? ()
#75 0x0000186f6210fdf9 in ?? ()
#76 0x0000150e070c3f01 in ?? ()
#77 0x00001ff6c0b04131 in ?? ()
#78 0x00001ff6c0b04131 in ?? ()
#79 0x00001487c8129839 in ?? ()
#80 0x0000117e4316c169 in ?? ()
#81 0x0000150e070c3fb9 in ?? ()
#82 0x0000000400000000 in ?? ()
#83 0x0000000500000000 in ?? ()
#84 0x0000150e070c4081 in ?? ()
#85 0x0000000100000000 in ?? ()
#86 0x0000186f621f5a89 in ?? ()
#87 0x0000117e4316c169 in ?? ()
#88 0x00001ff6c0b04131 in ?? ()
#89 0x00001ff6c0b04131 in ?? ()
#90 0x00001ff6c0b04131 in ?? ()
#91 0x0000150e070c3fb9 in ?? ()
#92 0x0000117e4316c169 in ?? ()
#93 0x00001487c8129839 in ?? ()
#94 0x00001487c8121bd9 in ?? ()
#95 0x00007fffffffdac0 in ?? ()
#96 0x0000237dfc62f382 in ?? ()
#97 0x0000186f621f5a89 in ?? ()
#98 0x0000150e070c3f01 in ?? ()
#99 0x00001ff6c0b04131 in ?? ()
#100 0x00001ff6c0b04131 in ?? ()
#101 0x00001ff6c0b04131 in ?? ()
#102 0x00001487c8112089 in ?? ()
#103 0x000015c25d9ffb49 in ?? ()
#104 0x000015c25d9ffc91 in ?? ()
#105 0x000015c25d9ffb49 in ?? ()
#106 0x00007fffffffdaf8 in ?? ()
#107 0x0000237dfc590f25 in ?? ()
#108 0x0000186f621f5a89 in ?? ()
#109 0x000034620f0c74a1 in ?? ()
#110 0x000015c25d9ffc91 in ?? ()
#111 0x00003fc95e581dc9 in ?? ()
#112 0x0000152a146f69f1 in ?? ()
#113 0x00007fffffffdb58 in ?? ()
#114 0x0000237dfc5908f1 in ?? ()
#115 0x0000186f621f5a89 in ?? ()
#116 0x000015c25d9ffa69 in ?? ()
#117 0x00003fc95e581dc9 in ?? ()
#118 0x000034620f08ec39 in ?? ()
#119 0x00001ff6c0b04131 in ?? ()
#120 0x00001ff6c0b04131 in ?? ()
#121 0x00001ff6c0b04131 in ?? ()
#122 0x00001ff6c0b04131 in ?? ()
#123 0x000034620f08ec39 in ?? ()
#124 0x0000186f621bfbe9 in ?? ()
#125 0x00007fffffffdba0 in ?? ()
#126 0x0000237dfc6b8819 in ?? ()
#127 0x0000186f621ea3a9 in ?? ()
#128 0x000034620f08ec39 in ?? ()
#129 0x00001ff6c0b04111 in ?? ()
#130 0x00001ff6c0b04131 in ?? ()
#131 0x0000150e070ab421 in ?? ()
#132 0x0000150e070b5599 in ?? ()
#133 0x0000150e070b5561 in ?? ()
#134 0x00007fffffffdbd8 in ?? ()
#135 0x0000237dfc3194b7 in ?? ()
#136 0x00001ff6c0b04111 in ?? ()
#137 0x00001ff6c0b04131 in ?? ()
#138 0x0000000200000000 in ?? ()
#139 0x0000150e070b5599 in ?? ()
#140 0x0000000a00000000 in ?? ()
#141 0x00007fffffffdc18 in ?? ()
#142 0x0000237dfc6b6e0c in ?? ()
#143 0x00001ff6c0b043f1 in ?? ()
#144 0x00001ff6c0b04111 in ?? ()
#145 0x00001ff6c0b04131 in ?? ()
#146 0x0000150e070b5599 in ?? ()
#147 0x0000150e070b5679 in ?? ()
#148 0x0000150e070b5621 in ?? ()
#149 0x00007fffffffdc48 in ?? ()
#150 0x0000237dfc635874 in ?? ()
#151 0x0000150e070b5789 in ?? ()
#152 0x00001ff6c0b975c1 in ?? ()
#153 0x0000150e070b5741 in ?? ()
#154 0x0000150e070b56c1 in ?? ()
#155 0x00007fffffffdcc0 in ?? ()
#156 0x0000237dfc4ad281 in ?? ()
#157 0x0000150e070b5789 in ?? ()
#158 0x0000150e070b5741 in ?? ()
#159 0x00003a4c496a9a31 in ?? ()
#160 0x00001ff6c0b04131 in ?? ()
#161 0x0000150e07004311 in ?? ()
#162 0x00001ff6c0b043f1 in ?? ()
#163 0x0000150e070b5789 in ?? ()
#164 0x0000006500000000 in ?? ()
#165 0x000728b700000000 in ?? ()
#166 0x0000150e070b5941 in ?? ()
#167 0x0000006400000000 in ?? ()
#168 0x00003a4c496fc051 in ?? ()
#169 0x00003a4c496a9a31 in ?? ()
#170 0x00007fffffffdcf8 in ?? ()
#171 0x0000237dfc31a09d in ?? ()
#172 0x0000150e070b5941 in ?? ()
#173 0x00003a4c496fc051 in ?? ()
#174 0x0000237dfc319fa1 in ?? ()
#175 0x0000000800000000 in ?? ()
#176 0x0000000000000000 in ?? ()
(gdb) gcore core.dump
Couldn't get registers: No such process.
(gdb) info threads
  Id   Target Id         Frame 
  10   Thread 0x7fffe61fc700 (LWP 1965) "node" (running)
  9    Thread 0x7fffe6bfd700 (LWP 1964) "node" (running)
  8    Thread 0x7fffe75fe700 (LWP 1963) "node" (running)
  7    Thread 0x7fffe7fff700 (LWP 1962) "node" (running)
  5    Thread 0x7ffff4d5f700 (LWP 1937) "V8 WorkerThread" (running)
  4    Thread 0x7ffff5760700 (LWP 1936) "V8 WorkerThread" (running)
  3    Thread 0x7ffff6161700 (LWP 1935) "V8 WorkerThread" (running)
  2    Thread 0x7ffff6b62700 (LWP 1934) "V8 WorkerThread" (running)
* 1    Thread 0x7ffff7fee740 (LWP 1931) "node" 0x000000000152cbdb in void node::StreamBase::GetFD<node::StreamWrap>(v8::Local<v8::String>, v8::PropertyCallbackInfo<v8::Value> const&) () at ../src/stream_base-inl.h:75

Almost no useful info from backtrace, and also I couldn't generate core dump :( What else can I do to gather more info?

I'm not sure if one can reproduce this segfault easily, but I'll try to create a test case.

@ctizen
Copy link
Author

ctizen commented Sep 14, 2015

Oh. And my naive fix didn't help :) Seems like it's not that easy.

@indutny
Copy link
Member

indutny commented Sep 14, 2015

@heilage-nsk thank you! How did you collect this information? It looks like a debug mode build, but... the stack trace has lots of gaps. It can't just get straight from JS to the C++ code, there should be several v8 API C++ helpers on stack.

A couple more questions:

  • How have you built the debug-mode node?
  • Have you tried to turn on core dumps with ulimit -c unlimited?

Thanks!

@ctizen
Copy link
Author

ctizen commented Sep 14, 2015

@indutny
I compiled and ran node/pm2 as follows:

./configure --prefix=/usr --gdb --debug
make
gdb --args out/Debug/node /usr/bin/pm2 /var/www/project-startup.json --no-daemon

I tried to do ulimit -c unlimited with no success. Since I already had some debugging problems on this machine, I think I should try it on some other VM and/or gdb version (currently 7.4, pretty old). I will inform when I get some results.

@indutny
Copy link
Member

indutny commented Sep 14, 2015

@heilage-nsk thank you very much for this. I appreciate your help.

@ctizen
Copy link
Author

ctizen commented Sep 16, 2015

@indutny
I just tried to reproduce this on fresh Ubuntu 14.04.2 install with gdb 7.7 - it's still failing, backtrace still shows no useful information, no core dump, but at least I was able to inspect the value of local variable:

Program received signal SIGSEGV, Segmentation fault.
0x000000000154cc09 in node::StreamBase::GetFD<node::StreamWrap> (key=..., args=...) at ../src/stream_base-inl.h:75
75    if (!wrap->IsAlive())
(gdb) info locals
wrap = 0x0

I tried to set a conditional breakpoint before this line, but in case when wrap == 0 the breakpoint did not trigger for some reason. I do not have much experience in debugging native (and especially multithreaded) apps under *nix, so it seems weird to me.

So I reapplied my naive fix and re-ran load testing under gdb, and it still works fine after several hours. For now I'm going to inspect nodejs sources a bit deeper to know what could go wrong here, any ideas are appreciated :)

@indutny
Copy link
Member

indutny commented Sep 17, 2015

@heilage-nsk I think I have an idea on why it could be happening, working on the test case.

In a few words, handle.close() releases the structure and nulls the internal field on a next tick. Previous code accounted for this properly, but with TLSWrap it may be broken.

Are you using tls/https in your application?

@ctizen
Copy link
Author

ctizen commented Sep 17, 2015

@indutny
Well, we use https, but it's provided by nginx, I don't think that proxy_pass-ing requests to nodejs can invoke its TLS internals. Moreover, the error seems to happen in node::StreamWrap instance of GetFD template function.
Important note: the segfault seems to happen only when a process is being killed by SIGTERM or SIGINT like here: https://github.com/Unitech/PM2/blob/master/lib/God/Methods.js#L236 - in all other cases it works fine.

@indutny
Copy link
Member

indutny commented Sep 17, 2015

@heilage-nsk no https requests either?

@ctizen
Copy link
Author

ctizen commented Sep 17, 2015

@indutny
Retried with http-only URLs list - still fails (without any fixes).

@indutny
Copy link
Member

indutny commented Sep 21, 2015

The error is pretty easy to reproduce, but how it could be happening in natural environment is less evident to me at the moment:

var net = require('net');

var socket = net.connect(1443);
var handle = socket._handle;
socket.destroy();
socket.on('close', function() {
  setImmediate(function() {
    console.log('hm..');
    console.log(handle.fd);
  });
});

Investigating

@indutny
Copy link
Member

indutny commented Sep 21, 2015

Perhaps it is related to cluster module.

@indutny
Copy link
Member

indutny commented Sep 22, 2015

@heilage-nsk do you know if it happens in master or in worker process?

@indutny
Copy link
Member

indutny commented Sep 22, 2015

Actually, I have one patch that may reveal some useful information about this crash:

diff --git a/src/stream_base-inl.h b/src/stream_base-inl.h
index 81114a2..114ba39 100644
--- a/src/stream_base-inl.h
+++ b/src/stream_base-inl.h
@@ -71,6 +71,17 @@ template <class Base>
 void StreamBase::GetFD(Local<String> key,
                        const PropertyCallbackInfo<Value>& args) {
   StreamBase* wrap = Unwrap<Base>(args.Holder());
+  if (wrap == nullptr) {
+    Environment* env = Environment::GetCurrent(args.GetIsolate());
+    Local<Object> global = env->context()->Global();
+    Local<Object> console = global->Get(
+        String::NewFromUtf8(env->isolate(), "console")).As<Object>();
+    Local<v8::Function> trace = console->Get(
+        String::NewFromUtf8(env->isolate(), "trace")).As<v8::Function>();
+
+    v8::MaybeLocal<Value> res =
+        trace->Call(env->context(), console, 0, nullptr);
+  }

   if (!wrap->IsAlive())
     return args.GetReturnValue().Set(UV_EINVAL);

@heilage-nsk may I ask you to give it a try? It should just print a stacktrace before crashing on nullptr wrap.

@ctizen
Copy link
Author

ctizen commented Sep 22, 2015

@indutny
Thank you!

Trace has shown this:

2015-09-22 10:50:00: 25515 pid can not be killed { [Error: kill ESRCH] code: 'ESRCH', errno: 'ESRCH', syscall: 'kill' }
2015-09-22 10:50:00: Trace
    at safeDeepClone (/usr/lib/node_modules/pm2/node_modules/safe-clone-deep/src/index.js:50:27)
    at safeDeepClone (/usr/lib/node_modules/pm2/node_modules/safe-clone-deep/src/index.js:53:22)
    at safeDeepClone (/usr/lib/node_modules/pm2/node_modules/safe-clone-deep/src/index.js:53:22)
    at safeDeepClone (/usr/lib/node_modules/pm2/node_modules/safe-clone-deep/src/index.js:53:22)
    at cloneWrap (/usr/lib/node_modules/pm2/node_modules/safe-clone-deep/src/index.js:65:10)
    at Object.module.exports.clone (/usr/lib/node_modules/pm2/lib/Utility.js:39:12)
    at Object.Common.deepCopy.Common.serialize.Common.clone (/usr/lib/node_modules/pm2/lib/Common.js:165:18)
    at Object.getFormatedProcesses (/usr/lib/node_modules/pm2/lib/God/Methods.js:53:21)
    at /usr/lib/node_modules/pm2/lib/God/ActionMethods.js:280:31
    at Object.God.killProcess (/usr/lib/node_modules/pm2/lib/God/Methods.js:247:14)
Segmentation fault (core dumped)

So it also looks like a bug in PM2 process management code. I tried to run the app with just cluster module and it doesn't fail (but has some other bug not related to current issue :) ).

I've made some more debugging and basic tracing and came to v8's ReadField function. In deps/v8/include/v8.h I modified function to this state:

7091   template <typename T>
7092   V8_INLINE static T ReadField(const internal::Object* ptr, int offset) {
7093     const uint8_t* addr =
7094         reinterpret_cast<const uint8_t*>(ptr) + offset - kHeapObjectTag;
7095     const T* retval = reinterpret_cast<const T*>(addr);
7096     if (!*retval) {
7097       std::cout << "Ouch! ReadField got NULL with offset = " << offset << " and kHeap = " << kHeapObjectTag << std::endl;
7098     }
7099     return *retval;
7100   }

The output was:

....
2015-09-22 19:23:00: Process with pid 14620 killed
Ouch! ReadField got NULL with offset = 24 and kHeap = 1
Ouch! ReadField got NULL with offset = 24 and kHeap = 1
Ouch! ReadField got NULL with offset = 24 and kHeap = 1
2015-09-22 19:23:00: Trace
    at safeDeepClone (/usr/lib/node_modules/pm2/node_modules/safe-clone-deep/src/index.js:50:27)
....
Segmentation fault (core dumped)

For me it looks like some (internal?) fields of an object were already cleaned up when process attempted to exit, but the object itself was still there. Maybe this will help somehow.

It's very strange behavior, but I think node should not crash in this case, maybe it should throw an exception or just do nothing?
For now, I'm going to examine pm2's code, I will report here if I find any suspicious code.

@indutny
Copy link
Member

indutny commented Sep 22, 2015

@heilage-nsk I have a possible fix:

diff --git a/lib/internal/child_process.js b/lib/internal/child_process.js
index 328a67d..890f80a 100644
--- a/lib/internal/child_process.js
+++ b/lib/internal/child_process.js
@@ -446,9 +446,8 @@ function setupChannel(target, channel) {

     } else {
       this.buffering = false;
-      target.disconnect();
       channel.onread = nop;
-      channel.close();
+      target._disconnect();
       maybeClose(target);
     }
   };

May I ask you to give it a try? Thanks!

@indutny
Copy link
Member

indutny commented Sep 22, 2015

Updated the patch.

@indutny
Copy link
Member

indutny commented Sep 22, 2015

Wait, there are some test failures with this patch. Please do not apply just yet.

@indutny
Copy link
Member

indutny commented Sep 22, 2015

Ok, this patch is better:

diff --git a/lib/internal/child_process.js b/lib/internal/child_process.js
index 328a67d..bea139b 100644
--- a/lib/internal/child_process.js
+++ b/lib/internal/child_process.js
@@ -449,6 +449,7 @@ function setupChannel(target, channel) {
       target.disconnect();
       channel.onread = nop;
       channel.close();
+      target._channel = null;
       maybeClose(target);
     }
   };

Please give it a try ;)

@ctizen
Copy link
Author

ctizen commented Sep 23, 2015

@indutny
Thank you!
It doesn't fail for 10 minutes already! (_)/
I'll leave it working for some hours or days to be sure, but it seems to be fixed. Without this patch it crashed after 1-2 minutes.

@indutny
Copy link
Member

indutny commented Sep 23, 2015

This is a good result, I guess? ;)

@ctizen
Copy link
Author

ctizen commented Sep 23, 2015

Absoultely :)
Still works fine.
If it will go fine until tomorrow, I think I will close this issue, and will wait for this patch in the upstream :)

@indutny
Copy link
Member

indutny commented Sep 23, 2015

Thanks!

@indutny
Copy link
Member

indutny commented Sep 24, 2015

@heilage-nsk heya! Any news? ;)

@ctizen
Copy link
Author

ctizen commented Sep 24, 2015

@indutny
Seems OK! Thank you!
Waiting for this patch in upstream :)

@ctizen ctizen closed this as completed Sep 24, 2015
Fishrock123 pushed a commit that referenced this issue Mar 2, 2016
The test is still failing sometimes because when trying to establish the
second connection, the server is already closed. Bring back the code
that handled this case and was removed in the last refactoring of the
test. Also ignore the errors that might happen when sending the second
handle to the worker because it may already have exited.

PR-URL: #5422
Reviewed-By: Rich Trott <rtrott@gmail.com>
MylesBorins pushed a commit that referenced this issue Mar 2, 2016
The test would sometimes time out on some platforms. Take the initial
version of the test and instead of sending 100 handles, just send 2.
It still fails when run with the code before the change was introduced
and passes afterwards.

Remove test from parallel.status.

PR-URL: #5121
Reviewed-By: Brian White <mscdex@mscdex.net>
Reviewed-By: Rich Trott <rtrott@gmail.com>
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Fixes: #3635
MylesBorins pushed a commit that referenced this issue Mar 2, 2016
Windows is still sometimes failing with ECONNRESET. Bring back the
handling of this error as was initially introduced in PR #4442.

PR-URL: #5179
Reviewed-By: Rich Trott <rtrott@gmail.com>
Fixes: #3635
MylesBorins pushed a commit that referenced this issue Mar 17, 2016
The test is still failing sometimes because when trying to establish the
second connection, the server is already closed. Bring back the code
that handled this case and was removed in the last refactoring of the
test. Also ignore the errors that might happen when sending the second
handle to the worker because it may already have exited.

PR-URL: #5422
Reviewed-By: Rich Trott <rtrott@gmail.com>
MylesBorins pushed a commit that referenced this issue Mar 21, 2016
The test is still failing sometimes because when trying to establish the
second connection, the server is already closed. Bring back the code
that handled this case and was removed in the last refactoring of the
test. Also ignore the errors that might happen when sending the second
handle to the worker because it may already have exited.

PR-URL: #5422
Reviewed-By: Rich Trott <rtrott@gmail.com>
scovetta pushed a commit to scovetta/node that referenced this issue Apr 2, 2016
Windows would die with ECONNRESET most times when running
this particular test. This commit makes handling these errors
more tolerable.

PR-URL: nodejs#4442
Reviewed-By: Fedor Indutny <fedor.indutny@gmail.com>
Reviewed-By: Rich Trott <rtrott@gmail.com>
santigimeno added a commit to santigimeno/node that referenced this issue Oct 8, 2016
It's not guaranteed that the first socket that tries to connect is the
first that succeeds so the rest of assumptions made in the test are not
correct.
Fix it by making sure the second socket does not try to connect until
the first has succeeded.
The IPC channel can already be closed when sending the second socket. It
should be allowed.
Also, don't start sending messages until the worker is online.

Fixes: nodejs#8950
PR-URL: nodejs#8954
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Brian White <mscdex@mscdex.net>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
jasnell pushed a commit that referenced this issue Oct 10, 2016
It's not guaranteed that the first socket that tries to connect is the
first that succeeds so the rest of assumptions made in the test are not
correct.
Fix it by making sure the second socket does not try to connect until
the first has succeeded.
The IPC channel can already be closed when sending the second socket. It
should be allowed.
Also, don't start sending messages until the worker is online.

Fixes: #8950
PR-URL: #8954
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Brian White <mscdex@mscdex.net>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Fishrock123 pushed a commit that referenced this issue Oct 11, 2016
It's not guaranteed that the first socket that tries to connect is the
first that succeeds so the rest of assumptions made in the test are not
correct.
Fix it by making sure the second socket does not try to connect until
the first has succeeded.
The IPC channel can already be closed when sending the second socket. It
should be allowed.
Also, don't start sending messages until the worker is online.

Fixes: #8950
PR-URL: #8954
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Brian White <mscdex@mscdex.net>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Trott added a commit to Trott/io.js that referenced this issue Nov 18, 2016
Throw an error if either `err.message` or `err.code` is wrong.
Previously, it was only throwing an error if one or the other was
happening.
MylesBorins pushed a commit that referenced this issue Nov 22, 2016
It's not guaranteed that the first socket that tries to connect is the
first that succeeds so the rest of assumptions made in the test are not
correct.
Fix it by making sure the second socket does not try to connect until
the first has succeeded.
The IPC channel can already be closed when sending the second socket. It
should be allowed.
Also, don't start sending messages until the worker is online.

Fixes: #8950
PR-URL: #8954
Reviewed-By: Michael Dawson <michael_dawson@ca.ibm.com>
Reviewed-By: Brian White <mscdex@mscdex.net>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
ryzokuken added a commit to ryzokuken/node that referenced this issue Mar 28, 2018
Rename test-dgram-regress-4496 to test-dgram-typeerror-buffer to test-dgram-send-invalid-msg-type
Rename test-http-regr-nodejsgh-2821 to test-http-request-large-payload
Rename test-child-process-fork-regr-nodejsgh-2847 to test-child-process-fork-closed-channel-segfault
Rename test-http-pipeline-regr-2639 to test-http-pipeline-serverresponse-assertionerror
Rename test-http-pipeline-regr-3332 to test-http-pipeline-requests-connection-leak
Rename test-http-pipeline-regr-3508 to test-http-pipeline-socket-parser-typeerror

Refs: nodejs#19105
Refs: https://github.com/nodejs/node/blob/master/doc/guides/writing-tests.md#test-structure
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c++ Issues and PRs that require attention from people who are familiar with C++.
Projects
None yet
Development

No branches or pull requests

4 participants