Skip to content
This repository has been archived by the owner on Aug 7, 2023. It is now read-only.

Investigation of post-mortem debugging with Promises. #45

Closed
uzun opened this issue Jun 6, 2017 · 78 comments
Closed

Investigation of post-mortem debugging with Promises. #45

uzun opened this issue Jun 6, 2017 · 78 comments

Comments

@uzun
Copy link

uzun commented Jun 6, 2017

I am a graduate student at the IBM Centre for Advanced Studies—Atlantic at the University of New Brunswick working with @mhdawson. We are interested in working on post-mortem debugging for promises in Node.js as part of our research and are hoping to get the group’s input as we tackle this problem. I realize from reading the Issues that there isn’t consensus on a solution and although I am not yet a Node expert myself, I am willing to dedicate time to contribute towards a possible fix.

Our current plan would be to:

  1. Develop a set of test cases that show the problem of not being able to capture stack traces, core dumps etc. for unhandled exceptions that cause promise rejections.
  2. Implement and test a number of different options based on the suggestions in the pull requests so far, as well as new ideas we come up with.
  3. Document what can or cannot be captured in each option and the overhead of doing so.
  4. Receive input on different options, tradeoffs, and iterate as required.
  5. Submit a final pull request to add the most suitable solution from the investigation above.

Some of the ideas we would plan to test would include:

  1. Capturing different levels of information (stack trace, core dump etc.) to find a good balance between the overhead and information captured.
  2. Using the V8 API to determine the reason for a promise rejection.
  3. Adopting a double-run approach. The call stack and relevant metadata for an unhandled rejection are captured on the first pass. Then during the second pass, the rejections are tracked and compared with previous data to determine if a core dump is necessary. We want to analyze when to output or abort and whether it is preferential to do so immediately.
  4. Experimenting with when to capture information or abort and the impact this has on existing code using promises.
  5. Building up more information over time to help make better decisions on when to capture information.

We plan to keep this issue up to date with pointers to a fork that includes the code for the options we experiment with.

If you have any suggestions on what we should consider looking at or would like to collaborate on this effort just let me know.

@benjamingr
Copy link
Member

Hey, have been thinking on this for a while - can we discuss?

@davepacheco
Copy link

@misterdjules has done quite a bit of work to identify problematic cases and challenges in addressing them. I don't have the links handy, but perhaps he (or somebody else) does.

@misterdjules
Copy link

@misterdjules has done quite a bit of work to identify problematic cases and challenges in addressing them. I don't have the links handy, but perhaps he (or somebody else) does.

I tried to document the challenges of using post-mortem debugging with code that uses promises at https://gist.github.com/misterdjules/2969aa1b5e6440a7e401. I tried to do the same with regards to using post-mortem debugging with code that uses async/await at https://gist.github.com/misterdjules/30db8fc651746c6f917a.

Both documents are works in progress because shortly after I started writing them, the pull request that had been submitted to node core to provide support for promises was closed, and so that work became low priority.

Both documents focused on trying to find a solution so that we could use the exact same post-mortem debugging tools that we use at Joyent (mdb and mdb_v8) to inspect core files generated after an uncaught error is thrown within a promise handler.

The main obstacle to making more progress was, in my opinion, finding out what changes to the behavior of programs using promises would be acceptable to accommodate the needs of post-mortem debugging users. That would have required discussing the design of promises with other people (e.g TC39 members, etc.) and exploring what could be changed about it, but it hasn't been done.

If I understand correctly, this issue introduces an effort that will explore other avenues, so maybe the two gists I linked above and the context I just gave are not that relevant. However, if you have any question, feel free to ask and I'll be happy to try to help!

@mhdawson
Copy link
Member

mhdawson commented Jun 6, 2017

@benjamingr I can facilitate setting up a meeting to discuss for those who are interested. Doodle here: http://doodle.com/poll/mpmrqsxi3tq4msiq. I only suggested a few times, if these don't work out we can find a different set. Anybody who is interested and @uzun please fill out the poll.

@mhdawson
Copy link
Member

mhdawson commented Jun 8, 2017

Starting point for examples discussion with Max today:

Archive.zip

@benjamingr
Copy link
Member

So on the 12th, anyone else wants to join the meeting?

@mhdawson
Copy link
Member

mhdawson commented Jun 9, 2017

Just set the time to 12 EST on Monday the 12th of June based on the doodle, I'll post a link to a hangout in this issue just before 12 on Monday. If anybody else is interested pleas join in.

@benjamingr
Copy link
Member

@mhdawson ?

@mhdawson
Copy link
Member

@benjamingr @uzun

@mhdawson
Copy link
Member

Meeting recording here: https://www.youtube.com/watch?v=WeQAIUPbHOA although we started it a bit late so did not capture all of the discussion.

We also agreed to have a regular meeting 12 EST on Mondays to get started. @benjamingr is going to list a few more test cases we should cover and he also provided a link to the v8 API's that let you register for rejections (max will add link to this issue)

@benjamingr
Copy link
Member

In my opinion, the most interesting case is:

function test() {
  var p = Promise.reject(new Error());
  process.nextTick(() => modifyHeap()); 
  // if you remove me, a core dump should return an unmodified heap
  setTimeout(() => { p.catch(function(() => {}); 
}

The promise is handled in a macrotick - but the heap is modified in a microtick. This is the "core" case @misterdjules and I have been discussing. Unless we can get a core dump "fast enough" to take in this case (which we can't if I understand correctly, since it's not viable to only capture the page changed in modifyHeap() and the rest later if it's unhandled).

@rnchamberlain
Copy link
Contributor

It might be possible to have an option/hook to fork the process at the reject point (i.e. before modifyHeap()). The parent and child then run as copy-on-write. Then if/when the decision is made that a core dump is needed, we request the child (which has been idle) to dump core.

Alternatively, we capture a lightweight node-report or similar at the unhandled point - first failure. Then using the information it contains (eg stack trace?) we set a option/hook to trigger a core dump at the reject point on a subsequent run.

@benjamingr
Copy link
Member

It might be possible to have an option/hook to fork the process at the reject point (i.e. before modifyHeap()). The parent and child then run as copy-on-write. Then if/when the decision is made that a core dump is needed, we request the child (which has been idle) to dump core.

I was under the impression that this solution is impossible because taking a copy-on-write core dump is too expensive. What if we fork and then just block - and then take the dump only when we're sure?

Is forking 2 times a second without doing any memory touching too expensive?

@rnchamberlain
Copy link
Contributor

Yes, that's what I mean, the child just blocks, so there is no disk i/o until we are sure we want to dump. Just the cost of the memory copies in the parent process - we'd need to see how expensive that is.

@benjamingr
Copy link
Member

@rnchamberlain if it's possible - it would be the best path forward. I think forking should be fast enough.

@davepacheco
Copy link

I originally said this elsewhere, but it applies here:

Forking is dicey in multi-threaded programs, particularly if one component is not responsible for all of the threads. It's also dicey in a context where you know the program has encountered a non-operational error (i.e., a bug). You could cause the program to hang instead of either dumping core or proceeding on. Plus, important state about the program can change in the child process. This approach might be fine in some specific programs, but I wouldn't recommend it as a general approach.

Besides that, the performance cost is not trivial. Even if the child process were to immediately exit, a lot of work has to happen as part of fork: all threads need to be stopped, the entire list of open file descriptors needs to be duplicated, and all of the parent process's page tables need to be marked copy-on-write. Besides that immediate cost, there's a subsequent performance impact on both the parent and the child after the fork() because of COW. fork() is not a lightweight operation.

@sam-github
Copy link

@davepacheco Agreed that taking core dumps is not a light-weight exercise and involves some technical challenges to overcome, but getting a core dump is a valuable feature. Can you suggest an alternate approach?

@davepacheco
Copy link

@sam-github Core dumps are indeed somewhat expensive, but that cost is only incurred when the program has already failed fatally, so that's often not a big deal. My concerns were about using fork() in error cases that are not necessarily fatal.

To answer your question: I don't know that it's possible to satisfy the competing goals here (namely, to record program state at the point of fatal failure and also allow programs to handle an otherwise fatal failure after it's already happened). For me, this makes the use of promises as specified a non-starter. I understand that it's a tradeoff, and I'm glad people are looking into options to make these work better, but I think using fork() here would only compound the challenges of using promises with postmortem debugging.

@rnchamberlain
Copy link
Contributor

The alternative strategy is to capture enough information at the 'unhandled' point - i.e. on first failure - to allow a selective trap to be set during subsequent runs of the application. The trap has to be sufficiently selective so that we can trigger core dumps at the 'reject' point, without triggering an excessive number (we'd need a count/limit anyway).

@mhdawson
Copy link
Member

@rnchamberlain experimenting with the selective approach is part of the plan.

@mhdawson
Copy link
Member

@mhdawson
Copy link
Member

@benjamingr you going to make it this week ?

@uzun
Copy link
Author

uzun commented Jun 19, 2017

Link from last week that @benjamingr mentioned in chat: https://github.com/nodejs/node/blob/master/lib/internal/process/promises.js

@benjamingr
Copy link
Member

Sorry, I was pulled into a meeting.

@mhdawson
Copy link
Member

Of interest: https://www.npmjs.com/package/gencore once we get to the point were we want to generate cores on a second run.

@mhdawson
Copy link
Member

@benjamingr np, hope to see you next week.

@mhdawson
Copy link
Member

Will post hangout in just a minute

@rnchamberlain
Copy link
Contributor

If we don't want to continue running after producing the core dump, the SIGSEGV/SIGABRT mechanism is probably easiest. It would match what we do for --abort -on-uncaught-exception - see the base::OS::Abort() call in node/deps/v8/src/isolate.cc, function Isolate::Throw().

@benjamingr
Copy link
Member

@uzun can you please let me know if the above comments help with the questions you've had about how to generate core dumps from Node core?

@mhdawson
Copy link
Member

There is only a 50% change I can make the meeting today as I have some family related issues. If I can't I'll catch up with @uzun later in the week.

@benjamingr
Copy link
Member

@uzun do we have any reason to meet today?

@uzun
Copy link
Author

uzun commented Jul 31, 2017

@benjamingr @mhdawson I think we can just skip this week then and catch up next Monday.

@uzun
Copy link
Author

uzun commented Aug 7, 2017

Just a heads up, I won't be able to make it this week, so no meeting today.

@mhdawson
Copy link
Member

Are we on for today ?

@uzun
Copy link
Author

uzun commented Aug 14, 2017

I'm attending a thesis defence that might overlap with it this afternoon. We can hold off until next Monday or reschedule for later in the week. Not as much progress to report as I'd like, so I could use the extra time.

@mhdawson
Copy link
Member

Ok lets do it next week as I'm just catching up after being away for a week myself.

@uzun
Copy link
Author

uzun commented Aug 28, 2017

I'm out sick today and can't make it, I'll post some progress updates this week.

@mhdawson
Copy link
Member

ok, we might want to pick another time this week as we have not met for a while.

Once you feel better can you propose a time.

@benjamingr
Copy link
Member

Got it, next week or earlier then.

@benjamingr
Copy link
Member

Hi, given there were no meetings for a few weeks now - is this actually happening? Are you still actively pursuing this @uzun ?

@benjamingr
Copy link
Member

I think given nodejs/node#15335 this issue can be closed.

@rnchamberlain
Copy link
Contributor

@benjamingr maybe wait until nodejs/node#15335 has landed

@uzun
Copy link
Author

uzun commented Jan 30, 2018

I'm looking for open source projects that extensively use promises. If anyone has any suggestions, let me know.

@styfle
Copy link
Member

styfle commented Jan 30, 2018

@uzun You might try GitHub or npm. Or try asking around on twitter.

Here's a project that utilizes promises: https://github.com/bevacqua/promisees

@mhdawson
Copy link
Member

It was mentioned in the recent benchmarking meeting that the latest HAPI framework uses promises exclusively, so maybe something built on that?

@littledan
Copy link

Has anyone considered using catch prediction, rather than HostPromiseRejectionTracker, for deciding when to capture a core dump?

@bmeurer has pointed out that the old logic didn't do such a good
job of capturing the stack trace anyway: An Error is created from the
PromisePerformThen path, which is already executing from a very
different place from where the Promise was originally rejected.
Basically, if you have a chain of Promises, and the first one rejects,
which triggers the last one to be rejected as well, and that's the
only unhandled one, the callback is called on that last Promise after
working through all the logic in the chain. Really, the best place to
capture the stack trace and system state would be where the rejection
was created in the first place. Unfortunately, at that point, it's
unclear whether the rejection will be handled or not.

Catch prediction is meant to solve this problem in Chrome DevTools:
it's a V8 feature which backs DevTool's "pause on uncaught exception"
feature. That feature lets you run a web page until an exception is
thrown but not caught, pausing at the moment it is thrown, with the
whole system state in place at that point. It's a heuristic--not 100%
accurate, and it gets a bit worse with Promises (e.g., not
understanding the catch in Promise.reject().catch(() => {})), but
it's a tool V8 has already which can get at the specific time when a
Promise rejection starts. There's some performance overhead, though; I
don't know if it's feasible to leave on in a production system.

If post-mortem stacks and core dumps are done via catch prediction
rather than HostPromiseRejectionTracker, we might get improved accuracy of core dumps and better guesses about whether to trigger one (as catch prediction is integrated with async/await).

@mhdawson
Copy link
Member

mhdawson commented Feb 6, 2018

@littledan how would we experiment with Catch prediction (pointers to which files to look it, what to search for etc,)? @uzun was looking at benchmarking different options and that sounds like one that he should include.

@littledan
Copy link

@mhdawson The API is driven by ChangeBreakOnException.

@benjamingr
Copy link
Member

@uzun sequelize, most modern DB drivers, koa and most of the newer ecosystem.

@richardlau
Copy link
Member

I believe this work is now being done in https://github.com/nodejs/promises-debugging. If this needs to be tracked outside of that repository, please open an issue over in https://github.com/nodejs/diagnostics.

retrohacker pushed a commit to retrohacker/promise-use-cases that referenced this issue Aug 15, 2018
After reading this comment, I walk away feeling like the things I have cared about for a long time aren't taken seriously by the author, though I suspect that was not the intent.

Promises changing the state of a process before it exits has been a known problem for nearly as long as I've been in the Node.js community, and effort has been invested by quite a few people (i.e. the amazing @addaleax!) in this space.

nodejs/post-mortem#45 (speaks directly to it as if it were a known long-standing issue)
nodejs/promises#26 (doesn't reference core dumps but speaks to changing state on process exit)

We are still resisting promises in the restify codebase due to this class of problem: restify/node-restify#1304

I do think it is fair to say:

* Those with a vested interest in operating Node.js in production at scale haven't shown up with the resources to solve for operating promises. I'm starting to see this change (yay!).
* Many of conversations around this happened outside of the Node.js issue tracker, often in echo chambers and hallway tracks. If you weren't actively looking for these concerns, they would have been easy to miss. I believe this is already changing (also yay!).

I also think it is fair to say:

* Many of those who had a vested interest in operating Node.js in production at scale _did_ show up and broadcast their concerns about the form of the promises specification when it was being proposed. The ones I saw were met with something between dismissal and hostility.
* Those I know who _did_ show up left this feeling as if promises in their current form were both rushed through the specification process and forced upon this part of the community without much regard for their input. After this, many wholesale rejected promises as a viable primitive which left adoption to be driven by those who didn't have this class of concern. To put it another way, if you cared about post-mortem and your concerns weren't heard for the spec, you didn't use promises and since you didn't use promises you wouldn't be actively engaging in issue trackers about promises.

I don't bring up these two points to add fuel to an old argument, but to try to cultivate empathy for why these concerns are only now starting to shake out of the woodwork from some people's perspectives. The part of the community that cares about this class of problem - and that we want investment from to solve this class of problem - was ostracized early on in the story of promises. We should be making efforts to draw those who abandoned promises back into the community, I don't believe the tone of this section advances that cause.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests