Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Display stack trace at stack overflow #32167

Merged

Conversation

janvorli
Copy link
Member

This is a fixed version of the reverted commit.

This change enables printing stack trace at stack overflow to the
console. In case multiple threads fail with stack overflow in parallel,
only the trace of the first thread is printed.

There are couple of interesting details:

The stack trace is printed in a compressed form. The algorithm finds the
largest sequence of frames starting from the top of the stack that is
repeated and prints it just once with the repeat count.
Only the first thread that hits stack overflow gets its stack printed.
On Linux, others threads that hit stack overflow are held spinning until
the process exits. This enables having a single preallocated stack for
handling stack overflow. On Windows, we just don't print the stack
trace, as it would be messy anyways due to the interleaving of output
from multiple threads and the value of getting stack overflow traces of
multiple threads is questionable.
On debug / checked builds for Windows, I had to bump the stack guarantee
by two pages and also enable setting the stack guarantee for all threads
in these build flavors.
At couple of places in the runtime, there were debug checks comparing
explicit frame and some other struct addresses to the current SP. These
were firing on Linux when we are running on an extra stack overflow
handling stack. I've fixed it by adding a flag to the Thread that
indicates that we are running on an alternate stack and these checks
should be skipped.
I've fixed the x86 Windows JIT_StackProbe to first probe at SP-4 and
then move the SP to leave more stack space for the handler.
I've fixed stack overflow check on Linux for some glibc / distros. The
stack limit returned by the pthread_attr_getstack returns the address of
the guard page in some of the glibc / distros and address of the last
accessible page before the guard page on others. So I've relaxed the
check to consider +/- 1 page around the stack limit to recognize sigsegv
as stack overflow.

This is a fixed version of the reverted commit.

This change enables printing stack trace at stack overflow to the
console. In case multiple threads fail with stack overflow in parallel,
only the trace of the first thread is printed.

There are couple of interesting details:

The stack trace is printed in a compressed form. The algorithm finds the
largest sequence of frames starting from the top of the stack that is
repeated and prints it just once with the repeat count.
Only the first thread that hits stack overflow gets its stack printed.
On Linux, others threads that hit stack overflow are held spinning until
the process exits. This enables having a single preallocated stack for
handling stack overflow. On Windows, we just don't print the stack
trace, as it would be messy anyways due to the interleaving of output
from multiple threads and the value of getting stack overflow traces of
multiple threads is questionable.
On debug / checked builds for Windows, I had to bump the stack guarantee
by two pages and also enable setting the stack guarantee for all threads
in these build flavors.
At couple of places in the runtime, there were debug checks comparing
explicit frame and some other struct addresses to the current SP. These
were firing on Linux when we are running on an extra stack overflow
handling stack. I've fixed it by adding a flag to the Thread that
indicates that we are running on an alternate stack and these checks
should be skipped.
I've fixed the x86 Windows JIT_StackProbe to first probe at SP-4 and
then move the SP to leave more stack space for the handler.
I've fixed stack overflow check on Linux for some glibc / distros. The
stack limit returned by the pthread_attr_getstack returns the address of
the guard page in some of the glibc / distros and address of the last
accessible page before the guard page on others. So I've relaxed the
check to consider +/- 1 page around the stack limit to recognize sigsegv
as stack overflow.
@janvorli janvorli added this to the 5.0 milestone Feb 12, 2020
@janvorli janvorli requested a review from jkotas February 12, 2020 11:39
@janvorli janvorli self-assigned this Feb 12, 2020
@janvorli
Copy link
Member Author

This is a cherry pick of the reverted commit with the issue fixed. There was a call to SaveCurrentExceptionInfo in the non-stack overflow code path that has accidentally leaked in from my original experiments with stack overflow handling in the past that was causing the problem.

With this fix, I've been running the System.Threading.Tasks.Tests for several hundreds of iterations without any failures. Without this fix, it was failing on my machine more than every second run of the test on average.

@jkotas
Copy link
Member

jkotas commented Feb 12, 2020

I should have asked on the original PR: Can we add some tests for this? This is the kind of feature that very likely to break in unexpected ways.

@janvorli
Copy link
Member Author

Can we add some tests for this?

I was thinking about it when I have created my stack overflow tests that I was using to drive the work. It seemed to me that our testing infra would only be able to verify that the exit code is the code for stack overflow on Windows. But now I have realized that I can launch the stack overflowing test as a child process, capture its console output and try to parse it to detect whether it worked or not from it. So I'll prepare some tests.

@@ -2808,7 +2808,7 @@ PAL_InjectActivation(
palError = InjectActivationInternal(pTargetThread);
}

if (palError == NO_ERROR)
if (palError != NO_ERROR)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also found this in release/3.1. Probably it's not so important to backport? @janvorli

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't have any effect on the runtime and we backport only issues that are causing crashes that cannot be worked around or security fixes.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 12, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants