Investigation of post-mortem debugging with Promises. #45

uzun · 2017-06-06T16:18:16Z

I am a graduate student at the IBM Centre for Advanced Studies—Atlantic at the University of New Brunswick working with @mhdawson. We are interested in working on post-mortem debugging for promises in Node.js as part of our research and are hoping to get the group’s input as we tackle this problem. I realize from reading the Issues that there isn’t consensus on a solution and although I am not yet a Node expert myself, I am willing to dedicate time to contribute towards a possible fix.

Our current plan would be to:

Develop a set of test cases that show the problem of not being able to capture stack traces, core dumps etc. for unhandled exceptions that cause promise rejections.
Implement and test a number of different options based on the suggestions in the pull requests so far, as well as new ideas we come up with.
Document what can or cannot be captured in each option and the overhead of doing so.
Receive input on different options, tradeoffs, and iterate as required.
Submit a final pull request to add the most suitable solution from the investigation above.

Some of the ideas we would plan to test would include:

Capturing different levels of information (stack trace, core dump etc.) to find a good balance between the overhead and information captured.
Using the V8 API to determine the reason for a promise rejection.
Adopting a double-run approach. The call stack and relevant metadata for an unhandled rejection are captured on the first pass. Then during the second pass, the rejections are tracked and compared with previous data to determine if a core dump is necessary. We want to analyze when to output or abort and whether it is preferential to do so immediately.
Experimenting with when to capture information or abort and the impact this has on existing code using promises.
Building up more information over time to help make better decisions on when to capture information.

We plan to keep this issue up to date with pointers to a fork that includes the code for the options we experiment with.

If you have any suggestions on what we should consider looking at or would like to collaborate on this effort just let me know.

benjamingr · 2017-06-06T18:30:21Z

Hey, have been thinking on this for a while - can we discuss?

davepacheco · 2017-06-06T18:33:30Z

@misterdjules has done quite a bit of work to identify problematic cases and challenges in addressing them. I don't have the links handy, but perhaps he (or somebody else) does.

misterdjules · 2017-06-06T19:25:40Z

@misterdjules has done quite a bit of work to identify problematic cases and challenges in addressing them. I don't have the links handy, but perhaps he (or somebody else) does.

I tried to document the challenges of using post-mortem debugging with code that uses promises at https://gist.github.com/misterdjules/2969aa1b5e6440a7e401. I tried to do the same with regards to using post-mortem debugging with code that uses async/await at https://gist.github.com/misterdjules/30db8fc651746c6f917a.

Both documents are works in progress because shortly after I started writing them, the pull request that had been submitted to node core to provide support for promises was closed, and so that work became low priority.

Both documents focused on trying to find a solution so that we could use the exact same post-mortem debugging tools that we use at Joyent (mdb and mdb_v8) to inspect core files generated after an uncaught error is thrown within a promise handler.

The main obstacle to making more progress was, in my opinion, finding out what changes to the behavior of programs using promises would be acceptable to accommodate the needs of post-mortem debugging users. That would have required discussing the design of promises with other people (e.g TC39 members, etc.) and exploring what could be changed about it, but it hasn't been done.

If I understand correctly, this issue introduces an effort that will explore other avenues, so maybe the two gists I linked above and the context I just gave are not that relevant. However, if you have any question, feel free to ask and I'll be happy to try to help!

mhdawson · 2017-06-06T22:38:03Z

@benjamingr I can facilitate setting up a meeting to discuss for those who are interested. Doodle here: http://doodle.com/poll/mpmrqsxi3tq4msiq. I only suggested a few times, if these don't work out we can find a different set. Anybody who is interested and @uzun please fill out the poll.

mhdawson · 2017-06-08T20:30:03Z

Starting point for examples discussion with Max today:

Archive.zip

benjamingr · 2017-06-09T00:15:41Z

So on the 12th, anyone else wants to join the meeting?

mhdawson · 2017-06-09T13:36:07Z

Just set the time to 12 EST on Monday the 12th of June based on the doodle, I'll post a link to a hangout in this issue just before 12 on Monday. If anybody else is interested pleas join in.

benjamingr · 2017-06-12T16:02:38Z

@mhdawson ?

mhdawson · 2017-06-12T16:02:42Z

Link https://hangouts.google.com/hangouts/_/ytl/8DjSM5K6BFZWZG0AEKsTmlc7le8YVbb6ZbMW-E8AF-4=?eid=100598160817214911030

mhdawson · 2017-06-12T16:04:02Z

@benjamingr @uzun

mhdawson · 2017-06-12T16:43:43Z

Meeting recording here: https://www.youtube.com/watch?v=WeQAIUPbHOA although we started it a bit late so did not capture all of the discussion.

We also agreed to have a regular meeting 12 EST on Mondays to get started. @benjamingr is going to list a few more test cases we should cover and he also provided a link to the v8 API's that let you register for rejections (max will add link to this issue)

benjamingr · 2017-06-12T18:47:47Z

In my opinion, the most interesting case is:

function test() {
  var p = Promise.reject(new Error());
  process.nextTick(() => modifyHeap()); 
  // if you remove me, a core dump should return an unmodified heap
  setTimeout(() => { p.catch(function(() => {}); 
}

The promise is handled in a macrotick - but the heap is modified in a microtick. This is the "core" case @misterdjules and I have been discussing. Unless we can get a core dump "fast enough" to take in this case (which we can't if I understand correctly, since it's not viable to only capture the page changed in modifyHeap() and the rest later if it's unhandled).

rnchamberlain · 2017-06-13T10:53:46Z

It might be possible to have an option/hook to fork the process at the reject point (i.e. before modifyHeap()). The parent and child then run as copy-on-write. Then if/when the decision is made that a core dump is needed, we request the child (which has been idle) to dump core.

Alternatively, we capture a lightweight node-report or similar at the unhandled point - first failure. Then using the information it contains (eg stack trace?) we set a option/hook to trigger a core dump at the reject point on a subsequent run.

benjamingr · 2017-06-13T16:25:19Z

It might be possible to have an option/hook to fork the process at the reject point (i.e. before modifyHeap()). The parent and child then run as copy-on-write. Then if/when the decision is made that a core dump is needed, we request the child (which has been idle) to dump core.

I was under the impression that this solution is impossible because taking a copy-on-write core dump is too expensive. What if we fork and then just block - and then take the dump only when we're sure?

Is forking 2 times a second without doing any memory touching too expensive?

rnchamberlain · 2017-06-13T16:30:56Z

Yes, that's what I mean, the child just blocks, so there is no disk i/o until we are sure we want to dump. Just the cost of the memory copies in the parent process - we'd need to see how expensive that is.

benjamingr · 2017-06-13T16:40:12Z

@rnchamberlain if it's possible - it would be the best path forward. I think forking should be fast enough.

davepacheco · 2017-06-13T18:45:15Z

I originally said this elsewhere, but it applies here:

Forking is dicey in multi-threaded programs, particularly if one component is not responsible for all of the threads. It's also dicey in a context where you know the program has encountered a non-operational error (i.e., a bug). You could cause the program to hang instead of either dumping core or proceeding on. Plus, important state about the program can change in the child process. This approach might be fine in some specific programs, but I wouldn't recommend it as a general approach.

Besides that, the performance cost is not trivial. Even if the child process were to immediately exit, a lot of work has to happen as part of fork: all threads need to be stopped, the entire list of open file descriptors needs to be duplicated, and all of the parent process's page tables need to be marked copy-on-write. Besides that immediate cost, there's a subsequent performance impact on both the parent and the child after the fork() because of COW. fork() is not a lightweight operation.

sam-github · 2017-06-13T19:01:57Z

@davepacheco Agreed that taking core dumps is not a light-weight exercise and involves some technical challenges to overcome, but getting a core dump is a valuable feature. Can you suggest an alternate approach?

davepacheco · 2017-06-13T20:38:54Z

@sam-github Core dumps are indeed somewhat expensive, but that cost is only incurred when the program has already failed fatally, so that's often not a big deal. My concerns were about using fork() in error cases that are not necessarily fatal.

To answer your question: I don't know that it's possible to satisfy the competing goals here (namely, to record program state at the point of fatal failure and also allow programs to handle an otherwise fatal failure after it's already happened). For me, this makes the use of promises as specified a non-starter. I understand that it's a tradeoff, and I'm glad people are looking into options to make these work better, but I think using fork() here would only compound the challenges of using promises with postmortem debugging.

rnchamberlain · 2017-06-15T08:17:18Z

The alternative strategy is to capture enough information at the 'unhandled' point - i.e. on first failure - to allow a selective trap to be set during subsequent runs of the application. The trap has to be sufficiently selective so that we can trigger core dumps at the 'reject' point, without triggering an excessive number (we'd need a count/limit anyway).

mhdawson · 2017-06-16T12:57:55Z

@rnchamberlain experimenting with the selective approach is part of the plan.

mhdawson · 2017-06-19T16:01:10Z

Hangout for this weeks call: https://hangouts.google.com/hangouts/_/ytl/Z-z4RMwrmwuiryRsNwGhdBBlzNA4Yt3LhBEg6_DAq_Y=?eid=100598160817214911030

mhdawson · 2017-06-19T16:03:48Z

@benjamingr you going to make it this week ?

uzun · 2017-06-19T16:10:31Z

Link from last week that @benjamingr mentioned in chat: https://github.com/nodejs/node/blob/master/lib/internal/process/promises.js

benjamingr · 2017-06-19T17:05:32Z

Sorry, I was pulled into a meeting.

mhdawson · 2017-06-21T17:21:10Z

Of interest: https://www.npmjs.com/package/gencore once we get to the point were we want to generate cores on a second run.

mhdawson · 2017-06-21T17:21:32Z

@benjamingr np, hope to see you next week.

mhdawson · 2017-06-26T16:03:18Z

Will post hangout in just a minute

mhdawson · 2017-06-26T16:05:37Z

https://hangouts.google.com/hangouts/_/ytl/WjFYvzsPzYqQ-ZGObALX7W4MryeRbkSsHNOITNJPJS4=?eid=100598160817214911030

rnchamberlain · 2017-07-25T09:07:26Z

If we don't want to continue running after producing the core dump, the SIGSEGV/SIGABRT mechanism is probably easiest. It would match what we do for --abort -on-uncaught-exception - see the base::OS::Abort() call in node/deps/v8/src/isolate.cc, function Isolate::Throw().

benjamingr · 2017-07-25T09:25:01Z

@uzun can you please let me know if the above comments help with the questions you've had about how to generate core dumps from Node core?

mhdawson · 2017-07-31T12:04:19Z

There is only a 50% change I can make the meeting today as I have some family related issues. If I can't I'll catch up with @uzun later in the week.

benjamingr · 2017-07-31T14:22:37Z

@uzun do we have any reason to meet today?

uzun · 2017-07-31T14:31:10Z

@benjamingr @mhdawson I think we can just skip this week then and catch up next Monday.

uzun · 2017-08-07T15:31:41Z

Just a heads up, I won't be able to make it this week, so no meeting today.

mhdawson · 2017-08-14T13:42:34Z

Are we on for today ?

uzun · 2017-08-14T14:41:54Z

I'm attending a thesis defence that might overlap with it this afternoon. We can hold off until next Monday or reschedule for later in the week. Not as much progress to report as I'd like, so I could use the extra time.

mhdawson · 2017-08-14T16:02:55Z

Ok lets do it next week as I'm just catching up after being away for a week myself.

uzun · 2017-08-28T15:47:04Z

I'm out sick today and can't make it, I'll post some progress updates this week.

mhdawson · 2017-08-28T15:55:29Z

ok, we might want to pick another time this week as we have not met for a while.

Once you feel better can you propose a time.

benjamingr · 2017-08-28T16:01:46Z

Got it, next week or earlier then.

benjamingr · 2017-09-05T17:09:29Z

Hi, given there were no meetings for a few weeks now - is this actually happening? Are you still actively pursuing this @uzun ?

benjamingr · 2017-09-12T12:07:49Z

I think given nodejs/node#15335 this issue can be closed.

rnchamberlain · 2017-09-12T12:15:31Z

@benjamingr maybe wait until nodejs/node#15335 has landed

uzun · 2018-01-30T18:20:30Z

I'm looking for open source projects that extensively use promises. If anyone has any suggestions, let me know.

styfle · 2018-01-30T18:39:53Z

@uzun You might try GitHub or npm. Or try asking around on twitter.

Here's a project that utilizes promises: https://github.com/bevacqua/promisees

mhdawson · 2018-01-30T20:14:06Z

It was mentioned in the recent benchmarking meeting that the latest HAPI framework uses promises exclusively, so maybe something built on that?

littledan · 2018-02-06T14:43:43Z

Has anyone considered using catch prediction, rather than HostPromiseRejectionTracker, for deciding when to capture a core dump?

@bmeurer has pointed out that the old logic didn't do such a good
job of capturing the stack trace anyway: An Error is created from the
PromisePerformThen path, which is already executing from a very
different place from where the Promise was originally rejected.
Basically, if you have a chain of Promises, and the first one rejects,
which triggers the last one to be rejected as well, and that's the
only unhandled one, the callback is called on that last Promise after
working through all the logic in the chain. Really, the best place to
capture the stack trace and system state would be where the rejection
was created in the first place. Unfortunately, at that point, it's
unclear whether the rejection will be handled or not.

Catch prediction is meant to solve this problem in Chrome DevTools:
it's a V8 feature which backs DevTool's "pause on uncaught exception"
feature. That feature lets you run a web page until an exception is
thrown but not caught, pausing at the moment it is thrown, with the
whole system state in place at that point. It's a heuristic--not 100%
accurate, and it gets a bit worse with Promises (e.g., not
understanding the catch in Promise.reject().catch(() => {})), but
it's a tool V8 has already which can get at the specific time when a
Promise rejection starts. There's some performance overhead, though; I
don't know if it's feasible to leave on in a production system.

If post-mortem stacks and core dumps are done via catch prediction
rather than HostPromiseRejectionTracker, we might get improved accuracy of core dumps and better guesses about whether to trigger one (as catch prediction is integrated with async/await).

mhdawson · 2018-02-06T19:03:43Z

@littledan how would we experiment with Catch prediction (pointers to which files to look it, what to search for etc,)? @uzun was looking at benchmarking different options and that sounds like one that he should include.

littledan · 2018-02-06T21:02:20Z

@mhdawson The API is driven by ChangeBreakOnException.

benjamingr · 2018-02-07T14:55:49Z

@uzun sequelize, most modern DB drivers, koa and most of the newer ecosystem.

richardlau · 2018-08-15T16:53:29Z

I believe this work is now being done in https://github.com/nodejs/promises-debugging. If this needs to be tracked outside of that repository, please open an issue over in https://github.com/nodejs/diagnostics.

@addaleax

After reading this comment, I walk away feeling like the things I have cared about for a long time aren't taken seriously by the author, though I suspect that was not the intent. Promises changing the state of a process before it exits has been a known problem for nearly as long as I've been in the Node.js community, and effort has been invested by quite a few people (i.e. the amazing @addaleax!) in this space. nodejs/post-mortem#45 (speaks directly to it as if it were a known long-standing issue) nodejs/promises#26 (doesn't reference core dumps but speaks to changing state on process exit) We are still resisting promises in the restify codebase due to this class of problem: restify/node-restify#1304 I do think it is fair to say: * Those with a vested interest in operating Node.js in production at scale haven't shown up with the resources to solve for operating promises. I'm starting to see this change (yay!). * Many of conversations around this happened outside of the Node.js issue tracker, often in echo chambers and hallway tracks. If you weren't actively looking for these concerns, they would have been easy to miss. I believe this is already changing (also yay!). I also think it is fair to say: * Many of those who had a vested interest in operating Node.js in production at scale _did_ show up and broadcast their concerns about the form of the promises specification when it was being proposed. The ones I saw were met with something between dismissal and hostility. * Those I know who _did_ show up left this feeling as if promises in their current form were both rushed through the specification process and forced upon this part of the community without much regard for their input. After this, many wholesale rejected promises as a viable primitive which left adoption to be driven by those who didn't have this class of concern. To put it another way, if you cared about post-mortem and your concerns weren't heard for the spec, you didn't use promises and since you didn't use promises you wouldn't be actively engaging in issue trackers about promises. I don't bring up these two points to add fuel to an old argument, but to try to cultivate empathy for why these concerns are only now starting to shake out of the woodwork from some people's perspectives. The part of the community that cares about this class of problem - and that we want investment from to solve this class of problem - was ostracized early on in the story of promises. We should be making efforts to draw those who abandoned promises back into the community, I don't believe the tone of this section advances that cause.

mhdawson mentioned this issue Jun 6, 2017

Post Mortem Analysis with Async APIs nodejs/CTC#114

Closed

davepacheco mentioned this issue Sep 11, 2017

promises: add --abort-on-unhandled-rejection nodejs/node#15335

Closed

4 tasks

retrohacker mentioned this issue Dec 2, 2017

Restify doesn't provide a way to catch Promise rejections in a middleware restify/node-restify#1304

Closed

littledan mentioned this issue Feb 6, 2018

[Idea] Avoid redundant HostPromiseRejectionTrackerEvents tc39/ecma262#1092

Closed

benjamingr mentioned this issue Jun 6, 2018

Action Items nodejs/promise-use-cases#26

Closed

richardlau closed this as completed Aug 15, 2018

retrohacker mentioned this issue Aug 15, 2018

Update extras-5.md nodejs/promise-use-cases#30

Closed

Investigation of post-mortem debugging with Promises. #45

Investigation of post-mortem debugging with Promises. #45

Comments

uzun commented Jun 6, 2017

benjamingr commented Jun 6, 2017

davepacheco commented Jun 6, 2017

misterdjules commented Jun 6, 2017

mhdawson commented Jun 6, 2017 • edited Loading

mhdawson commented Jun 8, 2017

benjamingr commented Jun 9, 2017

mhdawson commented Jun 9, 2017 • edited Loading

benjamingr commented Jun 12, 2017

mhdawson commented Jun 12, 2017

mhdawson commented Jun 12, 2017

mhdawson commented Jun 12, 2017

benjamingr commented Jun 12, 2017

rnchamberlain commented Jun 13, 2017

benjamingr commented Jun 13, 2017

rnchamberlain commented Jun 13, 2017

benjamingr commented Jun 13, 2017

davepacheco commented Jun 13, 2017

sam-github commented Jun 13, 2017

davepacheco commented Jun 13, 2017

rnchamberlain commented Jun 15, 2017

mhdawson commented Jun 16, 2017

mhdawson commented Jun 19, 2017

mhdawson commented Jun 19, 2017

uzun commented Jun 19, 2017

benjamingr commented Jun 19, 2017

mhdawson commented Jun 21, 2017

mhdawson commented Jun 21, 2017

mhdawson commented Jun 26, 2017

mhdawson commented Jun 26, 2017

rnchamberlain commented Jul 25, 2017

benjamingr commented Jul 25, 2017

mhdawson commented Jul 31, 2017

benjamingr commented Jul 31, 2017

uzun commented Jul 31, 2017

uzun commented Aug 7, 2017

mhdawson commented Aug 14, 2017

uzun commented Aug 14, 2017

mhdawson commented Aug 14, 2017

uzun commented Aug 28, 2017

mhdawson commented Aug 28, 2017

benjamingr commented Aug 28, 2017

benjamingr commented Sep 5, 2017

benjamingr commented Sep 12, 2017

rnchamberlain commented Sep 12, 2017

uzun commented Jan 30, 2018

styfle commented Jan 30, 2018

mhdawson commented Jan 30, 2018

littledan commented Feb 6, 2018

mhdawson commented Feb 6, 2018

littledan commented Feb 6, 2018

benjamingr commented Feb 7, 2018

richardlau commented Aug 15, 2018

mhdawson commented Jun 6, 2017 •

edited

Loading

mhdawson commented Jun 9, 2017 •

edited

Loading