Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix potential deadlock when generating a crash report #580

Merged
merged 14 commits into from
May 13, 2020

Conversation

robinmacharg
Copy link
Contributor

@robinmacharg robinmacharg commented May 4, 2020

[Note: this PR supersedes #577, and incorporates feedback from initial review.]

Goal

An issue has been identified that can lead to an app deadlocking when reporting a crash.

The sequence of events is:

  1. Some program component accesses dyld and takes a lock out.
  2. A crash is reported.
  3. Bugsnag, as part of its normal crash reporting behaviour, suspends all but a few critical threads, including the thread that holds the dyld lock.
  4. Bugsnag attempts to read the Mach binary images loaded into the app. This also attempts to take out a lock but is unable to, leading to deadlock.

This PR fixes this issue.

Design

There are two main mechanisms by which a program may find out about dyld behaviour: by direct querying (e.g. _dyld_get_image_header() etc., used by the KSCrash component of Bugsnag prior to this issue) and by callback (_dyld_register_func_for_add_image() etc.). This change modifies the KSCrash component to use callbacks and caches information about loaded binary images for when a crash does occur. Thus, when threads are suspended and a crash report is generated no additional calls to dyld need to be made and deadlock is avoided.

Thread safety is a concern. Reading the released dyld source code implies that access to the internal binary image storage in modern OS releases is mediated by a locking mechanism. However to date no documented guarantee of this has been found. For the avoidance of doubt the KSCrash-level add/remove callbacks use a legacy-compatible shared lock to ensure synchronisation. By the time a crash report is being written all other non-KS threads will have been suspended and so no additional locking is required.

Performance is also a concern, especially given that the proposed solution moves processing from post-crash (where users are likely to be less concerned about a couple of lost seconds) to up-front. The library startup code was crudely timed in the sample iOS app on a real device (iPhone XS), with results as follows:

    // Placed in AppDelegate
    CFTimeInterval start = CACurrentMediaTime();
    [Bugsnag startBugsnagWithApiKey:@"<API KEY>"];
    CFTimeInterval end = CACurrentMediaTime();
    NSLog(@"Took: %f", end - start);
Without fix With fix
1 0.010215 0.011997
2 0.014457 0.011432
3 0.011843 0.015702
4 0.014623 0.013538
5 0.011528 0.01601
6 0.015008 0.017804
7 0.01258 0.015619
8 0.012356 0.012655
9 0.011991 0.014726
10 0.012292 0.013338
Avg 0.0126893 0.0142821

i.e. a difference of ~2ms. This was for 385 binary images. (Anecdotally, turning on debugging adds ~0.1s to the startup time).

Changeset

dyld binary image capture was moved from BSG_KSCrashReport.c to BSG_KSCrash.m. A file-static dynamic array cache is used. An additional .m file contains functions to ease bridging between Objective C and C.

Bugsnag/KSCrash makes use of dyld in two main - but related - places: BSG_KSCrashReport.c writing and BSG_KSDynamicLinker.c. These were modified to use an internal API similar to that presented by dyld that accesses the cache of headers.

Tests

  • A dynamic C array is used to record loaded binary images. Its behaviour is tested.
  • Manual performance testing has been carried out to ascertain the impact on app startup times of recording binary images.
  • The generated json reports are have been manually tested.
  • Both symbolicated and unsymbolicated crash reports need checked on the dashboard for correctness.

Review

Outstanding Questions

  • This pull request is ready for:
    • Initial review of the intended approach, not yet feature complete
    • Structural review of the classes, functions, and properties modified
    • Final review
    • Release
  • The correct target branch has been selected (master for fixes, next for
    features)
  • If this is intended for release have all of the pre-release checks been considered?
  • Consistency across platforms for structures or concepts added or modified
  • Consistency between the changeset and the goal stated above
  • Internal consistency with the rest of the library - is there any overlap between existing interfaces and any which have been added?
  • Usage friction - is the proposed change in usage cumbersome or complicated?
  • Performance and complexity - are there any cases of unexpected O(n^3) when iterating, recursing, flat mapping, etc?
  • Concurrency concerns - if components are accessed asynchronously, what issues will arise
  • Thoroughness of added tests and any missing edge cases
  • Idiomatic use of the language

@robinmacharg robinmacharg force-pushed the fix-dyld-deadlock2 branch 3 times, most recently from 6ac538e to 678db0c Compare May 4, 2020 16:08
@robinmacharg robinmacharg marked this pull request as ready for review May 5, 2020 08:05
Copy link
Contributor

@tomlongridge tomlongridge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy, apart from two small nits.

@fractalwrench - can you take a look?

@robinmacharg robinmacharg added awaiting feedback Awaiting a response from a customer. Will be automatically closed after approximately 2 weeks. bug Confirmed bug wip There is work in progress labels May 6, 2020
@robinmacharg robinmacharg added scheduled Work is starting on this feature/bug and removed wip There is work in progress labels May 6, 2020
Copy link
Contributor

@tomlongridge tomlongridge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@robinmacharg robinmacharg removed the awaiting feedback Awaiting a response from a customer. Will be automatically closed after approximately 2 weeks. label May 13, 2020
@robinmacharg robinmacharg merged commit 89aa9a1 into master May 13, 2020
@fractalwrench fractalwrench deleted the fix-dyld-deadlock2 branch June 22, 2020 14:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Confirmed bug scheduled Work is starting on this feature/bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants