Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix stack overflow handling on Unix #89799

Closed

Conversation

janvorli
Copy link
Member

@janvorli janvorli commented Aug 1, 2023

There was a race between cleaning up signal handlers at process abort and multiple threads failing with stack overflow at the same time.
When a process exits due to a stack overflow, we should not restore the SIGSEGV handler so that all other threads failing due to the same condition just wait until the process abort completes.

Close #46175

There was a race between cleaning up signal handlers at process
abort and multiple threads failing with stack overflow at the
same time.
When a process exits due to a stack overflow, we should not
restore the SIGSEGV handler so that all other threads failing
due to the same condition just wait until the process abort
completes.
@janvorli janvorli added this to the 8.0.0 milestone Aug 1, 2023
@janvorli janvorli requested a review from jkotas August 1, 2023 17:54
@janvorli janvorli self-assigned this Aug 1, 2023
@jkotas
Copy link
Member

jkotas commented Aug 1, 2023

Build errors

@janvorli
Copy link
Member Author

janvorli commented Aug 1, 2023

The test is failing on Unix because the output of the test contains the following log from createdump after the stack trace and so the expected last line with "Main" is not found.

"[createdump] Invalid process id: task_for_pid(22718) FAILED (os/kern) failure (5)"
"[createdump] This failure may be because createdump or the application is not properly signed and entitled."
"[createdump] Failure took 0ms"
"waitpid() returned successfully (wstatus 0000ff00) WEXITSTATUS ff WTERMSIG 0"

I need to update the test to be resilient to this.
I've also forgotten to keep it disabled on arm64 (at least the large frame ones) as the JIT still generates probes in a way that prevents telling stack overflow from regular sigsegv in some edge cases.

Also make it resilient to extra output after the stack trace
like createdump output
@janvorli
Copy link
Member Author

janvorli commented Aug 2, 2023

I've added disabling of minidumps for the child processes that are supposed to crash with stack overflow. It is not necessary and consuming resources / time on the CI machines. We should probably do that for the other coreclr tests that have intentionally crashing children too.

@@ -2524,7 +2524,7 @@ PROCAbort(int signal, siginfo_t* siginfo)

// Restore all signals; the SIGABORT handler to prevent recursion and
// the others to prevent multiple core dumps from being generated.
SEHCleanupSignals();
SEHCleanupSignals(false /* isChildProcess */);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How high is the risk of reentrancy here? We probably don't want to end in a recursive case here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ProcAbort is calling abort() right after this line, so I don't see how it could be reentered on the same thread.

@jkotas
Copy link
Member

jkotas commented Aug 5, 2023

multiple threads failing with stack overflow at the same time.

Can the same problem happen with other signals? For example, what happens if there are multiple threads crashing with NullReferenceException? If I am reading the code correctly, the first one is going to unregister the signal handler and the next one is going bring the process down immediately since our signal handler won't be registered anymore.

Should we keep the signal handlers registered and instead block in them if a fatal process shutdown is underway?

@janvorli
Copy link
Member Author

This is a targeted fix for an issue specific to the current implementation of stack overflow handling. The reason why I have originally (when implementing the stack overflow handling in the past) added the logic that lets through only the first stack overflow and makes all others wait was that I needed to create an extra thread for the stack trace logging to be reliable in the case of stack overflow and it didn't seem to make sense to have many of them writing the stack trace past each other.
The non-stack overflow hardware exception cases are different in that the handler runs on the faulting thread (thus the same stack). Also, the non-stack overflow exceptions can get handled while the stack overflow always fails fast.

In general, I think that it would make sense to not to unregister the hardware exception signal handlers at all at process abort and implement some waiting mechanism in the handlers, but we would need to solve an additional race between the time the abort is called and the time the hardware exception signal handler is called. In fact, thanks to your feedback I have just realized that even with the fix in this PR, we can have a race causing the same problem as before if we had any unhandled exception or anything causing process abort on one thread and stack overflow on another one.

So, it seems it would make sense to modify this PR to handle all the cases, including the ones you've mentioned, properly (and move it to .NET 9)

@janvorli janvorli modified the milestones: 8.0.0, 9.0.0 Aug 14, 2023
@mangod9
Copy link
Member

mangod9 commented Sep 11, 2023

@janvorli is this ready to merge or more work is required as part of this change?

@janvorli
Copy link
Member Author

@mangod9 please see the end of my comment above: #89799 (comment). Based on JanK's feedback, I've decided making some more changes for other hardware failure signals. I have also found there is a possibility of a different race than the one I've fixed. So it needs some additional work.
I can close this PR and reopen it once the extra work is done if preferred.

@janvorli janvorli closed this Dec 20, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Jan 20, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

baseservices/exceptions/stackoverflow/stackoverflowtester/stackoverflowtester fails on Linux
4 participants