Fix stack overflow handling on Unix #89799

janvorli · 2023-08-01T17:54:18Z

There was a race between cleaning up signal handlers at process abort and multiple threads failing with stack overflow at the same time.
When a process exits due to a stack overflow, we should not restore the SIGSEGV handler so that all other threads failing due to the same condition just wait until the process abort completes.

Close #46175

There was a race between cleaning up signal handlers at process abort and multiple threads failing with stack overflow at the same time. When a process exits due to a stack overflow, we should not restore the SIGSEGV handler so that all other threads failing due to the same condition just wait until the process abort completes.

jkotas · 2023-08-01T19:47:49Z

Build errors

janvorli · 2023-08-01T22:42:09Z

The test is failing on Unix because the output of the test contains the following log from createdump after the stack trace and so the expected last line with "Main" is not found.

"[createdump] Invalid process id: task_for_pid(22718) FAILED (os/kern) failure (5)"
"[createdump] This failure may be because createdump or the application is not properly signed and entitled."
"[createdump] Failure took 0ms"
"waitpid() returned successfully (wstatus 0000ff00) WEXITSTATUS ff WTERMSIG 0"

I need to update the test to be resilient to this.
I've also forgotten to keep it disabled on arm64 (at least the large frame ones) as the JIT still generates probes in a way that prevents telling stack overflow from regular sigsegv in some edge cases.

Also make it resilient to extra output after the stack trace like createdump output

janvorli · 2023-08-02T23:49:23Z

I've added disabling of minidumps for the child processes that are supposed to crash with stack overflow. It is not necessary and consuming resources / time on the CI machines. We should probably do that for the other coreclr tests that have intentionally crashing children too.

hoyosjs · 2023-08-04T20:38:39Z

src/coreclr/pal/src/thread/process.cpp

@@ -2524,7 +2524,7 @@ PROCAbort(int signal, siginfo_t* siginfo)

    // Restore all signals; the SIGABORT handler to prevent recursion and
    // the others to prevent multiple core dumps from being generated.
-    SEHCleanupSignals();
+    SEHCleanupSignals(false /* isChildProcess */);


How high is the risk of reentrancy here? We probably don't want to end in a recursive case here.

ProcAbort is calling abort() right after this line, so I don't see how it could be reentered on the same thread.

jkotas · 2023-08-05T23:20:25Z

multiple threads failing with stack overflow at the same time.

Can the same problem happen with other signals? For example, what happens if there are multiple threads crashing with NullReferenceException? If I am reading the code correctly, the first one is going to unregister the signal handler and the next one is going bring the process down immediately since our signal handler won't be registered anymore.

Should we keep the signal handlers registered and instead block in them if a fatal process shutdown is underway?

janvorli · 2023-08-14T15:19:31Z

This is a targeted fix for an issue specific to the current implementation of stack overflow handling. The reason why I have originally (when implementing the stack overflow handling in the past) added the logic that lets through only the first stack overflow and makes all others wait was that I needed to create an extra thread for the stack trace logging to be reliable in the case of stack overflow and it didn't seem to make sense to have many of them writing the stack trace past each other.
The non-stack overflow hardware exception cases are different in that the handler runs on the faulting thread (thus the same stack). Also, the non-stack overflow exceptions can get handled while the stack overflow always fails fast.

In general, I think that it would make sense to not to unregister the hardware exception signal handlers at all at process abort and implement some waiting mechanism in the handlers, but we would need to solve an additional race between the time the abort is called and the time the hardware exception signal handler is called. In fact, thanks to your feedback I have just realized that even with the fix in this PR, we can have a race causing the same problem as before if we had any unhandled exception or anything causing process abort on one thread and stack overflow on another one.

So, it seems it would make sense to modify this PR to handle all the cases, including the ones you've mentioned, properly (and move it to .NET 9)

mangod9 · 2023-09-11T15:29:28Z

@janvorli is this ready to merge or more work is required as part of this change?

janvorli · 2023-09-11T15:57:24Z

@mangod9 please see the end of my comment above: #89799 (comment). Based on JanK's feedback, I've decided making some more changes for other hardware failure signals. I have also found there is a possibility of a different race than the one I've fixed. So it needs some additional work.
I can close this PR and reopen it once the extra work is done if preferred.

janvorli added the area-VM-coreclr label Aug 1, 2023

janvorli added this to the 8.0.0 milestone Aug 1, 2023

janvorli requested a review from jkotas August 1, 2023 17:54

janvorli self-assigned this Aug 1, 2023

Fix macOS build break

b3ba84a

Modify the test to exclude large frame tests on arm64

b7a77b4

Also make it resilient to extra output after the stack trace like createdump output

build-analysis bot mentioned this pull request Aug 2, 2023

Assert in System.Net.Sockets.SocketAsyncEventArgs.CompleteAcceptOperation #89806

Closed

Prevent generating dumps for the child processes

2b65225

hoyosjs reviewed Aug 4, 2023

View reviewed changes

janvorli modified the milestones: 8.0.0, 9.0.0 Aug 14, 2023

janvorli closed this Dec 20, 2023

github-actions bot locked and limited conversation to collaborators Jan 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix stack overflow handling on Unix #89799

Fix stack overflow handling on Unix #89799

janvorli commented Aug 1, 2023

jkotas commented Aug 1, 2023

janvorli commented Aug 1, 2023

janvorli commented Aug 2, 2023

hoyosjs Aug 4, 2023

janvorli Aug 4, 2023

jkotas commented Aug 5, 2023

janvorli commented Aug 14, 2023

mangod9 commented Sep 11, 2023

janvorli commented Sep 11, 2023

Fix stack overflow handling on Unix #89799

Fix stack overflow handling on Unix #89799

Conversation

janvorli commented Aug 1, 2023

jkotas commented Aug 1, 2023

janvorli commented Aug 1, 2023

janvorli commented Aug 2, 2023

hoyosjs Aug 4, 2023

Choose a reason for hiding this comment

janvorli Aug 4, 2023

Choose a reason for hiding this comment

jkotas commented Aug 5, 2023

janvorli commented Aug 14, 2023

mangod9 commented Sep 11, 2023

janvorli commented Sep 11, 2023