Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tracing: remove shutdown-on-signal #22734

Closed
wants to merge 1 commit into from
Closed

Conversation

addaleax
Copy link
Member

@addaleax addaleax commented Sep 6, 2018

This feature cannot be reasonably implemented this way
without inherently being susceptible to race conditions
that lead to hangs, crashes, etc.

What’s more, implementing this for some signals only
(and it can only be implemented for some signals at all)
may lead to the impression that it is a guaranteed
feature, when really consumers of the tracing output
need to be able to handle abrupt ends meaningfully.

Fixes: #14802
Fixes: #22528

@nodejs/diagnostics @jasnell @BridgeAR

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • commit message follows commit guidelines

This feature cannot be reasonably implemented this way
without inherently being susceptible to race conditions
that lead to hangs, crashes, etc.

What’s more, implementing this for some signals only
(and it can only be implemented for some signals at all)
may lead to the impression that it is a guaranteed
feature, when really consumers of the tracing output
*need* to be able to handle abrupt ends meaningfully.

Fixes: nodejs#14802
Fixes: nodejs#22528
@nodejs-github-bot nodejs-github-bot added the c++ Issues and PRs that require attention from people who are familiar with C++. label Sep 6, 2018
@addaleax addaleax added the trace_events Issues and PRs related to V8, Node.js core, and userspace code trace events. label Sep 6, 2018
@addaleax addaleax mentioned this pull request Sep 6, 2018
@AndreasMadsen
Copy link
Member

Yeah, buy how can we flush the tracking stream. That is also a pretty big problem. I have honestly been thinking about just making a while (check) {} loop. But I don't think that is a particular good solution either.

I definitely don't see how it fixes #14802 #22528. I agree that tracing consumers should be able to parse a partial file. But there is a big difference between a SIGKILL and a SIGINT. For the purpose of #22528, the implementation actually SIGINTs the process, which alreay can lead to no tracking output at all. I can't imagine this makes it any better.

Also please keep in mind that it is only users of tracking that can get a segfault from this.

@mcollina
Copy link
Member

mcollina commented Sep 7, 2018

I think this fixes the crashes, but it worsen the problem of flushing things out. That last piece of data could be extremely important in case of an error situation.

If we cannot really do this in the C++ signal handler, can we do it afterwards, e.g. before the process exits?

@AndreasMadsen
Copy link
Member

AndreasMadsen commented Sep 7, 2018

If we cannot really do this in the C++ signal handler, can we do it afterwards, e.g. before the process exits?

As far as I know, there is nothing after the signal handler. I think the solution is to use something that doesn't depend on mutex. I'm no POSIX expert, but maybe kill/wait, send/recv, or connect/close.

You can read about signal safty here: http://man7.org/linux/man-pages/man7/signal-safety.7.html.

@mcollina
Copy link
Member

mcollina commented Sep 7, 2018

As far as I know, there is nothing after the signal handler. I think the solution is to use something that doesn't depend on mutex. I'm no POSIX expert, but maybe kill/wait, send/recv, or connect/close.

@AndreasMadsen I haven't dig into how process.on('SIGINT') is hooked, but I bet the JS code is signal-safe.

@addaleax
Copy link
Member Author

addaleax commented Sep 7, 2018

As far as I know, there is nothing after the signal handler. I think the solution is to use something that doesn't depend on mutex. I'm no POSIX expert, but maybe kill/wait, send/recv, or connect/close.

@AndreasMadsen https://linux.die.net/man/7/signal has a list of things that are okay to call. It’s not just mutexes – even allocating or freeing memory is not okay (and not just in theory).

We could write data to and/or close file descriptors, yes. That’s really far from trivial to do for the tracing buffers in the current design, though. (I’d be happy to review any PR that replaces the current code with something that is not broken, but for now, this ‘feature’ is broken so it should be removed.)

@mcollina process.on('SIGINT') is asynchronous, which means that we can’t use it.

Copy link
Member

@mcollina mcollina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jasnell
Copy link
Member

jasnell commented Sep 7, 2018

Just thinking out loud... Could we potentially handle this by setting a flag within SignalExit that the trace agent watches for then flushes?

@addaleax
Copy link
Member Author

addaleax commented Sep 7, 2018

@jasnell I think @mcollina’s suggestion was along those lines, but that only works if there is code running after the SignalExit method; but SignalExit exits the current process…

@ofrobots
Copy link
Contributor

ofrobots commented Sep 7, 2018

That last piece of data could be extremely important in case of an error situation.

IMHO, dealing with this kinds of crashes requires tooling that digs out the data from a core-dump. It is going to be fairly hard to write this flushing correctly and what we have today doesn't work.

@AndreasMadsen
Copy link
Member

https://linux.die.net/man/7/signal has a list of things that are okay to call. It’s not just mutexes – even allocating or freeing memory is not okay (and not just in theory).

Yes, my suggestions were based on the same list, just a different source.

(I’d be happy to review any PR that replaces the current code with something that is not broken, but for now, this ‘feature’ is broken so it should be removed.)

That is acceptable. But we can't say that it fixes #14802 or #22528. If you insist on that, another issue should be creased.

@addaleax
Copy link
Member Author

addaleax commented Sep 9, 2018

CI: https://ci.nodejs.org/job/node-test-pull-request/17093/

I do think that it fixes those issues – just not in a way that makes anybody happy. I’ll create a new one after this PR has been merged.

@addaleax addaleax added the author ready PRs that have at least one approval, no pending requests for changes, and a CI started. label Sep 9, 2018
@addaleax
Copy link
Member Author

addaleax commented Sep 12, 2018

Windows-only rebuild: https://ci.nodejs.org/job/node-test-commit-windows-fanned/20682/

(needs re-run)

@addaleax addaleax added the needs-ci PRs that need a full CI run. label Sep 14, 2018
@joaocgreis
Copy link
Member

Last CI run was not successful for unrelated reasons (nodejs/build#1495).
Resumed: https://ci.nodejs.org/job/node-test-pull-request/17195/
Windows rebuild: https://ci.nodejs.org/job/node-test-commit-windows-fanned/20720/

@addaleax addaleax removed the needs-ci PRs that need a full CI run. label Sep 16, 2018
@addaleax
Copy link
Member Author

Landed in 80076cb

@addaleax addaleax closed this Sep 16, 2018
@addaleax addaleax deleted the fix-22528 branch September 16, 2018 07:56
addaleax added a commit that referenced this pull request Sep 16, 2018
This feature cannot be reasonably implemented this way
without inherently being susceptible to race conditions
that lead to hangs, crashes, etc.

What’s more, implementing this for some signals only
(and it can only be implemented for some signals at all)
may lead to the impression that it is a guaranteed
feature, when really consumers of the tracing output
*need* to be able to handle abrupt ends meaningfully.

Fixes: #14802
Fixes: #22528

PR-URL: #22734
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Ali Ijaz Sheikh <ofrobots@google.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
targos pushed a commit that referenced this pull request Sep 17, 2018
This feature cannot be reasonably implemented this way
without inherently being susceptible to race conditions
that lead to hangs, crashes, etc.

What’s more, implementing this for some signals only
(and it can only be implemented for some signals at all)
may lead to the impression that it is a guaranteed
feature, when really consumers of the tracing output
*need* to be able to handle abrupt ends meaningfully.

Fixes: #14802
Fixes: #22528

PR-URL: #22734
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Ali Ijaz Sheikh <ofrobots@google.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
targos pushed a commit that referenced this pull request Sep 19, 2018
This feature cannot be reasonably implemented this way
without inherently being susceptible to race conditions
that lead to hangs, crashes, etc.

What’s more, implementing this for some signals only
(and it can only be implemented for some signals at all)
may lead to the impression that it is a guaranteed
feature, when really consumers of the tracing output
*need* to be able to handle abrupt ends meaningfully.

Fixes: #14802
Fixes: #22528

PR-URL: #22734
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Ali Ijaz Sheikh <ofrobots@google.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
targos pushed a commit that referenced this pull request Sep 20, 2018
This feature cannot be reasonably implemented this way
without inherently being susceptible to race conditions
that lead to hangs, crashes, etc.

What’s more, implementing this for some signals only
(and it can only be implemented for some signals at all)
may lead to the impression that it is a guaranteed
feature, when really consumers of the tracing output
*need* to be able to handle abrupt ends meaningfully.

Fixes: #14802
Fixes: #22528

PR-URL: #22734
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Ali Ijaz Sheikh <ofrobots@google.com>
Reviewed-By: Ruben Bridgewater <ruben@bridgewater.de>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
author ready PRs that have at least one approval, no pending requests for changes, and a CI started. c++ Issues and PRs that require attention from people who are familiar with C++. trace_events Issues and PRs related to V8, Node.js core, and userspace code trace events.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Segfault on master tracing: unsafe Agent::Stop() call in signal handler
8 participants