-
Notifications
You must be signed in to change notification settings - Fork 30.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
promises: add --abort-on-unhandled-rejection #15335
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM modulo comments.
doc/api/cli.md
Outdated
|
||
This option only affects `Promise` rejections. It is only available on POSIX | ||
systems. It is implemented by keeping `fork()`ed copies of the process alive | ||
until it is known whether the `Promise` is handled within the same eventloop |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'event loop'
doc/api/cli.md
Outdated
This option only affects `Promise` rejections. It is only available on POSIX | ||
systems. It is implemented by keeping `fork()`ed copies of the process alive | ||
until it is known whether the `Promise` is handled within the same eventloop | ||
iteration or is left unhandled, which may come with performance implications. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the current wording it's somewhat ambiguous if 'performance implications' applies to forking or unhandled rejections.
doc/node.1
Outdated
systems. It is implemented by keeping \fBfork()\fRed copies of the process alive | ||
until it is known whether the \fIPromise\fR is handled within the same event | ||
loop iteration or is left unhandled, which may come with performance | ||
implications. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
lib/internal/process/promises.js
Outdated
@@ -1,11 +1,13 @@ | |||
'use strict'; | |||
|
|||
const { safeToString } = process.binding('util'); | |||
const abort_regex = /^--abort[_-]on[_-]unhandled[_-]rejection$/; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
abortRegex
?
lib/internal/process/promises.js
Outdated
if (wantAbort) { | ||
// Defer core dump closing until the next tick since another Promise | ||
// might become an unhandled rejection by adopting this promise's state. | ||
process.nextTick(process._closeCoreDump, promise, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd assign process._closeCoreDump
to a const
and delete it from process
.
src/node.cc
Outdated
} | ||
if (pid == 0) { | ||
bool do_abort; | ||
int rc = read(pipes[0], &do_abort, sizeof(do_abort)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to retry on EINTR:
int rc;
do
rc = read(...);
while (rc == -1 && errno == EINTR);
(And #include <errno.h>
at the top.)
As well, I'd make do_abort
a char, not a bool. sizeof(char)
is always 1, sizeof(bool)
need not be.
src/node.cc
Outdated
if (do_abort) { | ||
// This may happen when e.g. fork()ing itself failed due to resource | ||
// constraints. | ||
abort(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ABORT()
, here and elsewhere? It prints a stack trace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’ve added it here, but the other instances of abort();
are used for an “expected” abort after printing the Promise’s stack trace.
src/node.cc
Outdated
if (info->refcount > 0) return; | ||
} | ||
env->promise_reject_reason_map()->Delete(promise->Result()); | ||
if (write(info->fd, &do_abort, sizeof(do_abort)) <= 0) abort(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Retry on EINTR.
src/node.cc
Outdated
if (write(info->fd, &do_abort, sizeof(do_abort)) <= 0) abort(); | ||
if (close(info->fd) == -1) abort(); | ||
int status = 0; | ||
if (waitpid(info->pid, &status, 0) == -1) abort(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM once @bnoordhuis nits are addressed.
60cff61
to
bebefa8
Compare
Updated with @bnoordhuis nits addressed |
bebefa8
to
0eca52a
Compare
@@ -1,11 +1,13 @@ | |||
'use strict'; | |||
|
|||
const { safeToString } = process.binding('util'); | |||
const abortRegex = /^--abort[_-]on[_-]unhandled[_-]rejection$/; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to support the _
variant? We do for --abort-on-uncaught-exception
, but that's because it's a V8 option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It’s here for parity with --abort_on_uncaught_exception
but yes, we can choose to not support it if we like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't cost much to keep parity; I would suggest keeping it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm +0 either way but if we do support it then we should probably have a test variant for it.
0eca52a
to
58ef2eb
Compare
@@ -3861,6 +4008,7 @@ static void CheckIfAllowedInEnv(const char* exe, bool is_env, | |||
"--force-fips", | |||
"--openssl-config", | |||
"--icu-data-dir", | |||
"--abort_on_unhandled_rejection", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also add this to the documented list of options that are allowed in NODE_OPTIONS
in doc/api/cli.md
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@richardlau done!
e1271fb
to
29fcd36
Compare
Promise.reject(new Error('child error')); | ||
} else { | ||
run('', null); | ||
if (!common.isWindows) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we common.skip()
on Windows, or perhaps assert that the flag is rejected?
systems. It is implemented by keeping `fork()`ed copies of the process alive | ||
until it is known whether the `Promise` is handled within the same event loop | ||
iteration or is left unhandled. | ||
This makes unhandled rejections a significantly more complex operation, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As scary as this already sounds, I think it remains a massive understatement. Fork, particularly at high frequency, has some pretty significant performance impact:
- fork itself can take a bunch of time
- after the fork completes, performance is impacted in both parent and child processes as subsequent writes require additional page faults, even if the memory is already resident
- on systems that avoid overcommitting swap (including illumos and Linux, when configured to do so), the memory used by the child process is counted against the swap cap. In cloud environments with tuned caps, this can quickly exhaust swap, causing lots of other operations to fail.
- If the child ends up writing to the shared pages, then the total memory used by the parent and child goes up, which can result in paging (again, particularly in environments with memory caps configured) that can dramatically reduce performance.
I think it's good that the documentation mentions the use of fork
, and I think it would benefit from a firmer message about the possible impact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fork, particularly at high frequency, has some pretty significant performance impact:
Yes, I am aware that that makes this option unsuitable depending on your application type. I can try to incorporate some of your points into the doc.
If the child ends up writing to the shared pages, then the total memory used by the parent and child goes up, which can result in paging (again, particularly in environments with memory caps configured) that can dramatically reduce performance.
Why would this happen (for more than one or two pages of the stack)? The child process really isn’t doing much here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any pages marked copy-on-write by fork() get cloned when the parent mutates them. If the parent does (for example) a compacting GC, that's a significant number of pages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any way we can limit GC or at least discourage it while the forked process lives?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be obvious but I wonder if reminding people they should ensure ulimit is set to allow core dumps would be helpful.
|
||
if (pipe(pipes) == -1) return; | ||
pid = fork(); | ||
if (pid == -1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I reading this correctly that if we fail to fork (or do any of these operations), we just won't get a core file for that, and the user will never know? I think the documentation should probably reflect that this is a best-effort approach. It would also be great if there were some way for a user to know that a dump has been skipped because of a failure like this. I'm afraid that when combined with some of the problems I mentioned earlier (e.g., swap exhaustion, or exhaustion of some other cap -- like number of processes), users won't get a core file, and they won't know that they won't get one until it's too late.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, the original process itself will abort()
once it finds that there is no promise_abort_info
struct attached to the promise when it would otherwise signal the child process to abort:
It’s certainly not ideal, but what’s the alternative?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that aborting the original process if we can't be sure is the reasonable approach.
pid_t pid; | ||
|
||
if (pipe(pipes) == -1) return; | ||
pid = fork(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we avoid the problem where another thread is holding a lock that will remain held in the child process with no thread to unlock it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That doesn't matter, does it? The child does nothing but read()
and _exit()
, it doesn't take out any locks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fork()
only clones the calling thread into the child process, if that answers your question. Plus, what Ben said.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true, although it relies on nobody adding anything more sophisticated to the child process in the future.
That also raises something I missed the first time: this only captures stack information about the forking thread. State about thread pool threads or other threads created by libraries or other add-ons will be missing from the core file. That might be another caveat worth documenting.
(edit: Sorry for the redundant note. I hadn't caught up to Julien's comment.)
I've made a couple of notes on the implementation, but just to summarize: this has come up several times, and major issues were brought up with the approach. I think it's pretty likely that end users won't appreciate the severity of the impact that this can have, or the limitations of the resulting core files, until it's too late. |
@davepacheco I know this isn’t ideal, but I’ve spent a lot of time thinking about this issue too. I’ll try to write up a bit more advisory text tomorrow; but generally, since this is an opt-in feature that – as you mention – has been brought up a few times, I think it’s something worth trying out. Maybe we can get V8 to give better predictions of whether an rejection is unhandled, but if you want a core dump at failure time, I’ve come to the conclusion that this is as good as it gets. Also, regarding the performance issues: In the end, for all suggested solutions that I know of, avoiding uncaught rejections as much as possible is the key to reasonable performance. One nice thing is that with the increasing adoption of |
I appreciate the work on this and will do a more thorough review later, but I have to say that I'm not too keen on approaches for this that only work on |
All the postmortem analysis tools that care about core dumps for Node.js don't work on windows at the moment anyway IIRC ( @nodejs/platform-windows - please do correct me if there is news or I missed anything?). In particular, I'd like to ping @misterdjules to assert that this solves the promises post-mortem issue once and for all. The approach here is the one we discussed during nodejs/post-mortem#45 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the best path forward in terms of making sure post mortem analysis tools are compatible with promises.
Is it correct that core dumps generated by such a mechanism would have information about only one thread, that is the thread where the promise reject callback is called? If so, maybe that's one more caveat to add to the documentation.
With all the potential problems that were mentioned in this GH issue, it doesn't seem to me that this would solve problems with promises and post-mortem debugging "once and for all". |
Since the whole process is forked I would assume this would contain all information about the process ( Promise rejections are supposed to be rare, promise rejections that are handled asynchronously are supposed to be very rare so I'd assume the penalty for the average program would be low and I intend to run my own code with this flag when I can.
Would you mind elaborating? I would like us to set up action items and a plan after which we can all agree that promises are compatible with post-mortem analysis tools. In case it isn't abundantly clear - a lot of us (me included) are not particularly versed with these tools and take core dumps rarely and we could really use your knowledge on the topic. With this pull request we don't lose any stack/heap information about the promise which I think is pretty neat and it's what I've been advocating for at nodejs/post-mortem#45 |
From
This is confirmed empirically by running the following (
In the output above, we can see that two core files where generated (I thought only one was supposed to be generated after reading https://github.com/nodejs/node/pull/15335/files#diff-cd53544f44aab2c697bcd7b6a57f23ccR1395), and the one with the promise reject callback on the stack has only one thread, while the other has more than one thread, that is all the threads you'd expected from a node process.
I'm not sure these are safe assumptions to make. At scale, in pathological cases, or in any other unforeseen circumstance what used to be rare and insignificant can become quite problematic. Moreover, when occurrences of a bug are rare, we want the tools we use to investigate them to be robust, reliable and as complete as possible.
@davepacheco listed many potential problems with this approach with his comments on this PR, and in many other GH issues/PRs.
It's definitely an interesting experiment and contribution, but as a post-mortem debugging user I don't think I would use it in its current form in production if it landed because of, among other things, the potential problems that @davepacheco and myself presented in this PR. |
@uzun FYI as another approach being taken to capture additional info related to promises. |
fork() only preserves the single calling thread, but since that is the main node loop you'll get the thread and JavaScript stack that you need to diagnose the problem, also the entire heap. Missing the other (e.g libuv) native threads is not a show-stopper. I agree we should mention this in the documentation. |
@uzun can you also run the test cases you were building in nodejs/post-mortem#45 with this to see what the results are. |
The idea in the thread you linked to there was to start the cluster and treat the first unhandled rejection differently. This is similar (by forking the process and relying on CoW) but I think the performance profile can be very different. I think in the vast majority of bugs this approach can work well.
You are absolutely right, I'm going to send a very angry mail to my old CS professor :D Thanks.
I agree with everything you've written. I think that at this point we have to make painful tradeoffs and Node.js has been reasonably good at making those unsafe assumptions responsibly. I think determinism when debugging is extremely important and I acknowledge the value of postmortem analysis. I don't think there is a way to solve the fundamental issue with errors thrown asynchronously (would a Given all that - I would really like an actionable list of items from the post-mortem side of things that should happen. I know very little beyond a consumer perspective about post-mortem analysis but I've been involved with promise work for more than 4 years. Except for @mhdawson the post-mortem working group has been awfully silent about making post-mortem analysis better with promises in the last year. By "once and for all" I meant achieving the best outcome given these constraints. I realize this might require a lot of work but I think it's worth exploring.
If this is sufficient in the general case it sounds good to me. I've only used core dumps a handful of times in order to debug things and it was always the JavaScript or |
First of all, I appreciate all the work people have done to try to satisfy both the postmortem debugging use-cases and people's desires to use promises. I know this PR represents a lot of work, and as a very heavy postmortem debugging user I'm grateful that Node is trying so hard to support this use-case. I don't think it's fair to say that members of the postmortem working group have been silent on this issue. Several of us have weighed in on several different threads now explaining why we believe it's impossible to satisfy both use cases. We're not saying that it's hard, or that people just haven't figured out how to do it yet, but that we believe the requirements are mutually exclusive. I understand people want to explore the space anyway. That's great! But I can't give you "an actionable list of items from the post-mortem side of things that should happen" because I don't think there's anything that can be done to resolve the starting constraints. I think @addaleax is correct in concluding that this is likely as good as it gets. Unfortunately, it's still a non-starter for a lot of postmortem debugging users. I think the best we can hope for is that this approach is helpful for some users, and that the documentation warns them about the limitations and risks of this approach so that they're not caught off guard if they wind up with dozens of extra processes in production or core files that don't have some of the state they expect. Relatedly, I think it would be useful to know if there are people who use core-file-based postmortem debugging today who are ready to deploy with the proposed --abort-on-unhandled-rejection. It would be especially useful to know if any of those people have already used this in its prototype form to debug real issues. I know I haven't waded into use of promises much myself yet because of these issues, and it sounds like some of the people interested in this proposal haven't used core files much before, so it would be good to know from somebody firmly in both camps that we're really addressing their issue. I apologize for not knowing already if there are end users consuming this work already! For those new to these discussions, to summarize the problem again most concisely: postmortem debugging with --abort-on-uncaught-exception collects the entire JavaScript and native state of the program (including all threads) in the precise context of the failure (so that stack state is preserved) with zero runtime cost until the failure happens, while promises explicitly specify that the program must continue executing after the point of the failure, meaning that state is necessarily mutated. Forking is the closest thing to having it both ways, but it has all of the drawbacks brought up here -- it's a heavy cost and it doesn't include all of the same state. |
It seems @yunong would potentially be a good person to ask. |
I tried the PR with a simple testcase, on Ubuntu 16.04. Running vanilla, as expected I get With |
@rnchamberlain do you have the example code you used for that? |
Ping @addaleax |
The assert in node.cc line 1306:
|
@rnchamberlain I still can’t reproduce this :( It sounds like a V8 bug though, because |
https://cs.chromium.org/chromium/src/v8/src/builtins/builtins-promise-gen.cc?rcl=9c4f80520da791c712944265478d0272293e5d2b&l=980 does call out to PromiseRejectCallback before fulfilling the promise, so it is possible. It does seem odd that this isn't guaranteed by the API. Filed https://bugs.chromium.org/p/v8/issues/detail?id=6880 to track this. Thanks @addaleax ! |
We reproduced this in macOS 10.13.
|
Fwiw it sounds like @gsathya already figured out what the issue in V8 is, so it seems like all we need is an upstream change that we’ll cherry-pick into this PR |
Support creating core dumps of unhandled promise rejections on POSIX systems by keeping `fork()`s of the process alive until it is known whether they are handled synchronously or not.
29fcd36
to
eff4265
Compare
Is there any update here? @gsathya Is there a fix in V8 that we can cherry-pick? |
@BridgeAR I think for now I’m abandoning this in favor of looking into more creative userland solutions … if anybody else thinks this is a good idea and wants to push it over the line, please feel free to do that |
@addaleax thanks for the info. I am closing this therefore. If anyone feels it should be reopened, please feel free to do so. |
Support creating core dumps of unhandled promise rejections on POSIX systems by keeping
fork()
s of the process alive until it is known whether they are handled synchronously or not.This is a kind of crude approach, but with some basic experimentation it looks like it can work well enough.
@nodejs/diagnostics @nodejs/post-mortem
Checklist
make -j4 test
(UNIX), orvcbuild test
(Windows) passesAffected core subsystem(s)
promises