Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: close all connections to prevent server hang #47038

Closed
wants to merge 1 commit into from
Closed

test: close all connections to prevent server hang #47038

wants to merge 1 commit into from

Conversation

abmusse
Copy link
Contributor

@abmusse abmusse commented Mar 10, 2023

On IBM i calling server.close after the child process closes is not enough to end the server. For some reason the connection still lingers around with the server even though the process was closed. Calling closeAllConnections cleans up the lingering connection. Then calling server.close stops the server from listening for new connections and ends the server. On other platforms, closing the child client process causes the connection to close properly without having to call closeAllConnections. I ran this test case on Linux and AIX.

This fix should help resolve the timeout in the following test case:

@nodejs-github-bot nodejs-github-bot added needs-ci PRs that need a full CI run. test Issues and PRs related to the tests. labels Mar 10, 2023
@abmusse
Copy link
Contributor Author

abmusse commented Mar 10, 2023

I ran this test BEFORE applying this patch on IBM i, AIX, and Linux. The following were my results:

IBM i

$ NODE_DEBUG=net,http,timers ./node test/parallel/test-http-pipeline-flood.js
...

HTTP 388536: write ret = true
HTTP 388536: outgoing message end.
HTTP 388536: SERVER socketOnParserExecute 65535
HTTP 388536: pause parser
NET 388536: _onTimeout
NET 388536: SERVER _emitCloseIfDrained
NET 388536: SERVER handle? false   connections? 1
NET 388536: _onTimeout

AIX

$ NODE_DEBUG=net,http,timers ../node-v18.15.0-aix-ppc64/bin/node test/parallel/test-http-pipeline-flood.js

...

HTTP 12779906: write ret = true
HTTP 12779906: outgoing message end.
HTTP 12779906: SERVER socketOnParserExecute 32768
HTTP 12779906: pause parser
NET 12779906: _onTimeout
NET 12779906: destroy
NET 12779906: close
NET 12779906: close handle
NET 12779906: has server
NET 12779906: SERVER _emitCloseIfDrained
NET 12779906: SERVER handle? true   connections? 0
NET 12779906: emit close
HTTP 12779906: server socket close
NET 12779906: SERVER _emitCloseIfDrained
NET 12779906: SERVER: emit close

Linux

$ NODE_DEBUG=net,http,timers node test/parallel/test-http-pipeline-flood.js
...

HTTP 35447: write ret = true
HTTP 35447: outgoing message end.
HTTP 35447: SERVER socketOnParserExecute 65464
HTTP 35447: pause parser
NET 35447: _onTimeout
NET 35447: destroy
NET 35447: close
NET 35447: close handle
NET 35447: has server
NET 35447: SERVER _emitCloseIfDrained
NET 35447: SERVER handle? true   connections? 0
NET 35447: emit close
HTTP 35447: server socket close
NET 35447: SERVER _emitCloseIfDrained
NET 35447: SERVER: emit close

The main difference is that on IBM i destroying and closing the client socket doen't happen after the timeout:

NET 35447: _onTimeout
NET 35447: destroy
NET 35447: close
NET 35447: close handle

Instead the timeout gets called a second time and the server just hangs until the test runner times out.

After Applying this patch the output on IBM i is:

HTTP 388626: write ret = true
HTTP 388626: outgoing message end.
HTTP 388626: SERVER socketOnParserExecute 65535
HTTP 388626: pause parser
NET 388626: _onTimeout
NET 388626: destroy
NET 388626: close
NET 388626: close handle
NET 388626: has server
NET 388626: SERVER _emitCloseIfDrained
NET 388626: SERVER handle? true   connections? 0
NET 388626: SERVER _emitCloseIfDrained
NET 388626: SERVER: emit close
NET 388626: emit close
HTTP 388626: server socket close

Any Ideas on how we can get this to work properly without having to call closeAllConnections?

@abmusse
Copy link
Contributor Author

abmusse commented Mar 10, 2023

CC @nodejs/platform-ibmi
CC @richardlau

On IBM i calling server.close after the child process closes
is not enough to end the server. For some reason the connection still
lingers around with the server even though the process was closed.
Calling closeAllConnections cleans up the lingering connection.
Then calling server.close stops the server from listening
for new connections and ends the server. On other platforms,
closing the child client process causes the connection to close
properly without having to call closeAllConnections.
I ran this test case on Linux and AIX.

This fix should help resolve the timeout in the following test case:

- https://ci.nodejs.org/job/node-test-commit-ibmi/nodes=ibmi73-ppc64/1089/testReport/(root)/test/parallel_test_http_pipeline_flood/
@abmusse abmusse marked this pull request as draft March 10, 2023 22:51
@abmusse
Copy link
Contributor Author

abmusse commented Mar 10, 2023

I put this in draft mode. As I would like to open this issue to the Node experts here for comments. Why isn't destroy being auto called? Are there any further steps I can take to debug this issue.

// Calling closeAllConnections cleans up the lingering connection.
// Then calling server.close stops the server from listening
// for new connections and ends the server.
server.closeAllConnections();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this invalidates the test. The child process gets killed after 200 ms and that should result in the TCP connections getting terminated. Does that happen on ibm i?

Copy link
Contributor Author

@abmusse abmusse Mar 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After 200 ms the child process gets killed but the client connection does not get destroyed. After the first timeout, the server timeouts a second time and tries to kill the already killed child process. Then the server just hangs as its no longer listening for new connections but still keeps handle to the defunct child process connection. Which I believe prevents the process from ending causing a timeout in the test runner.

Please see #47038 (comment) as I documented the different debug output from testing on Linux, AIX, and IBM i.

I want to know:

  • Why the child process connection does not auto destroy after the child is killed?
  • What code path handles that?
  • Why is that not being called in the IBM i case?

Any hints in the right direction to these questions would be greatly appreciated!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test expects the operating system to clean up the child process's TCP connections when it's terminated. That in turn should cause the TCP connections in the parent process to hang up with e.g. a ECONNRESET error.

If that doesn't happen, then the problem is probably somewhere lower in the stack.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think the root cause is the child process connection not getting cleaned up on termination. I will need to further investigate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Node.js side is there a way to get the TCP connection status?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you mean like a getsockopt() option: no.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-ci PRs that need a full CI run. test Issues and PRs related to the tests.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants