-
Notifications
You must be signed in to change notification settings - Fork 30.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test: close all connections to prevent server hang #47038
Conversation
I ran this test BEFORE applying this patch on IBM i, AIX, and Linux. The following were my results: IBM i$ NODE_DEBUG=net,http,timers ./node test/parallel/test-http-pipeline-flood.js
...
HTTP 388536: write ret = true
HTTP 388536: outgoing message end.
HTTP 388536: SERVER socketOnParserExecute 65535
HTTP 388536: pause parser
NET 388536: _onTimeout
NET 388536: SERVER _emitCloseIfDrained
NET 388536: SERVER handle? false connections? 1
NET 388536: _onTimeout AIX$ NODE_DEBUG=net,http,timers ../node-v18.15.0-aix-ppc64/bin/node test/parallel/test-http-pipeline-flood.js
...
HTTP 12779906: write ret = true
HTTP 12779906: outgoing message end.
HTTP 12779906: SERVER socketOnParserExecute 32768
HTTP 12779906: pause parser
NET 12779906: _onTimeout
NET 12779906: destroy
NET 12779906: close
NET 12779906: close handle
NET 12779906: has server
NET 12779906: SERVER _emitCloseIfDrained
NET 12779906: SERVER handle? true connections? 0
NET 12779906: emit close
HTTP 12779906: server socket close
NET 12779906: SERVER _emitCloseIfDrained
NET 12779906: SERVER: emit close
Linux$ NODE_DEBUG=net,http,timers node test/parallel/test-http-pipeline-flood.js
...
HTTP 35447: write ret = true
HTTP 35447: outgoing message end.
HTTP 35447: SERVER socketOnParserExecute 65464
HTTP 35447: pause parser
NET 35447: _onTimeout
NET 35447: destroy
NET 35447: close
NET 35447: close handle
NET 35447: has server
NET 35447: SERVER _emitCloseIfDrained
NET 35447: SERVER handle? true connections? 0
NET 35447: emit close
HTTP 35447: server socket close
NET 35447: SERVER _emitCloseIfDrained
NET 35447: SERVER: emit close The main difference is that on IBM i destroying and closing the client socket doen't happen after the timeout:
Instead the timeout gets called a second time and the server just hangs until the test runner times out. After Applying this patch the output on IBM i is: HTTP 388626: write ret = true
HTTP 388626: outgoing message end.
HTTP 388626: SERVER socketOnParserExecute 65535
HTTP 388626: pause parser
NET 388626: _onTimeout
NET 388626: destroy
NET 388626: close
NET 388626: close handle
NET 388626: has server
NET 388626: SERVER _emitCloseIfDrained
NET 388626: SERVER handle? true connections? 0
NET 388626: SERVER _emitCloseIfDrained
NET 388626: SERVER: emit close
NET 388626: emit close
HTTP 388626: server socket close Any Ideas on how we can get this to work properly without having to call closeAllConnections? |
CC @nodejs/platform-ibmi |
On IBM i calling server.close after the child process closes is not enough to end the server. For some reason the connection still lingers around with the server even though the process was closed. Calling closeAllConnections cleans up the lingering connection. Then calling server.close stops the server from listening for new connections and ends the server. On other platforms, closing the child client process causes the connection to close properly without having to call closeAllConnections. I ran this test case on Linux and AIX. This fix should help resolve the timeout in the following test case: - https://ci.nodejs.org/job/node-test-commit-ibmi/nodes=ibmi73-ppc64/1089/testReport/(root)/test/parallel_test_http_pipeline_flood/
I put this in draft mode. As I would like to open this issue to the Node experts here for comments. Why isn't destroy being auto called? Are there any further steps I can take to debug this issue. |
// Calling closeAllConnections cleans up the lingering connection. | ||
// Then calling server.close stops the server from listening | ||
// for new connections and ends the server. | ||
server.closeAllConnections(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this invalidates the test. The child process gets killed after 200 ms and that should result in the TCP connections getting terminated. Does that happen on ibm i?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After 200 ms the child process gets killed but the client connection does not get destroyed. After the first timeout, the server timeouts a second time and tries to kill the already killed child process. Then the server just hangs as its no longer listening for new connections but still keeps handle to the defunct child process connection. Which I believe prevents the process from ending causing a timeout in the test runner.
Please see #47038 (comment) as I documented the different debug output from testing on Linux, AIX, and IBM i.
I want to know:
- Why the child process connection does not auto destroy after the child is killed?
- What code path handles that?
- Why is that not being called in the IBM i case?
Any hints in the right direction to these questions would be greatly appreciated!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test expects the operating system to clean up the child process's TCP connections when it's terminated. That in turn should cause the TCP connections in the parent process to hang up with e.g. a ECONNRESET error.
If that doesn't happen, then the problem is probably somewhere lower in the stack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think the root cause is the child process connection not getting cleaned up on termination. I will need to further investigate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Node.js side is there a way to get the TCP connection status?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you mean like a getsockopt() option: no.
On IBM i calling server.close after the child process closes is not enough to end the server. For some reason the connection still lingers around with the server even though the process was closed. Calling closeAllConnections cleans up the lingering connection. Then calling server.close stops the server from listening for new connections and ends the server. On other platforms, closing the child client process causes the connection to close properly without having to call closeAllConnections. I ran this test case on Linux and AIX.
This fix should help resolve the timeout in the following test case: