[PLAT-7848] Improve handling of concurrent crashes #1286
Merged
Goal
Successfully report crashes in the event that multiple threads crash at the same time.
In the current release, a secondary crash that occurs while the first crash-handling thread is still working (and presumably has not yet suspended other threads) is incorrectly identified as a "crash in the crash reporter" and triggers an internal error report (via a recrash report).
There are also other race conditions that could cause corruption of the crash reporting process.
Changeset
Crash reporting is now explicitly one-shot (in practical terms it already was, since a single crash report path is configured per process lifetime), so that writing a crash report is attempted only once.
A recrash report is now only written for a secondary crash which occurred in the original crash reporting thread.
Both of these checks are performed in bsg_kscrashsentry_beginHandlingCrash(), which now takes the offending (crashed) thread as its argument. An atomic compare-exchange operation is used to ensure only a single thread can win the race to be reported.
Crashes in secondary threads are prevented from immediately killing the process by waiting in bsg_kscrashsentry_beginHandlingCrash() until the crash handling has finished.
Testing
An E2E scenario that reliably reproduces the problem has been added, and the fix has been verified over multiple runs. Attempts to include checks of the stack frame contents failed because C++ exception stack traces are not currently recorded when Bugsnag is linked dynamically, which it is for the Mac fixture.
Existing tests for recrash reports have also been run successfully.
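Illustrative sketch
For readers unfamiliar with the pattern, the one-shot gate described under Changeset could look roughly like the C11 sketch below. It is not the code in this PR: every name (crash_gate_enter, crash_gate_finish, crash_gate_wait_until_finished, the result enum) is hypothetical, threads are represented by a plain integer ID with 0 reserved for "no thread", and the wait is a simple yield loop chosen for brevity rather than whatever the real handler does.

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; an assumption-laden sketch, not Bugsnag's code. */
typedef enum {
    CRASH_GATE_FIRST,     /* caller won the race and should write the report */
    CRASH_GATE_RECRASH,   /* the reporting thread itself crashed again */
    CRASH_GATE_SECONDARY  /* a different thread crashed during or after reporting */
} crash_gate_result;

#define NO_THREAD ((uintptr_t)0) /* assumption: 0 is never a valid thread ID */

static _Atomic uintptr_t g_handling_thread = NO_THREAD;
static atomic_bool g_handling_finished = false;

/* Decide how a crash on offending_thread should be treated. */
crash_gate_result crash_gate_enter(uintptr_t offending_thread) {
    uintptr_t expected = NO_THREAD;
    /* Only the first caller can swap NO_THREAD for its own ID. */
    if (atomic_compare_exchange_strong(&g_handling_thread, &expected,
                                       offending_thread)) {
        return CRASH_GATE_FIRST;
    }
    /* On failure, expected now holds the ID of the thread that won. */
    return expected == offending_thread ? CRASH_GATE_RECRASH
                                        : CRASH_GATE_SECONDARY;
}

/* Called by the winning thread once the report has been written. */
void crash_gate_finish(void) {
    atomic_store(&g_handling_finished, true);
}

/* Called by secondary threads so they do not kill the process early. */
void crash_gate_wait_until_finished(void) {
    while (!atomic_load(&g_handling_finished)) {
        sched_yield();
    }
}
```

In this sketch a handler would call crash_gate_enter() first: on CRASH_GATE_FIRST it writes the report and then calls crash_gate_finish(), on CRASH_GATE_RECRASH it writes a recrash report, and on CRASH_GATE_SECONDARY it calls crash_gate_wait_until_finished() before letting the crash proceed. The gate is never reset, which matches the one-shot behaviour described above.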