Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Display stack trace at stack overflow #31956

Merged
merged 4 commits into from
Feb 10, 2020

Conversation

janvorli
Copy link
Member

@janvorli janvorli commented Feb 8, 2020

This change enables printing stack trace at stack overflow to the
console. In case multiple threads fail with stack overflow in parallel,
only the trace of the first thread is printed.

There are couple of interesting details:

  • The stack trace is printed in a compressed form. The algorithm finds the largest sequence of frames starting from the top of the stack that is repeated and prints it just once with the repeat count.
  • Only the first thread that hits stack overflow gets its stack printed. On Linux, others threads that hit stack overflow are held spinning until the process exits. This enables having a single preallocated stack for handling stack overflow. On Windows, we just don't print the stack trace, as it would be messy anyways due to the interleaving of output from multiple threads and the value of getting stack overflow traces of multiple threads is questionable.
  • On debug / checked builds for Windows, I had to bump the stack guarantee by two pages and also enable setting the stack guarantee for all threads in these build flavors.
  • At couple of places in the runtime, there were debug checks comparing explicit frame and some other struct addresses to the current SP. These were firing on Linux when we are running on an extra stack overflow handling stack. I've fixed it by adding a flag to the Thread that indicates that we are running on an alternate stack and these checks should be skipped.
  • I've fixed the x86 Windows JIT_StackProbe to first probe at SP-4 and then move the SP to leave more stack space for the handler.
  • I've fixed stack overflow check on Linux for some glibc / distros. The stack limit returned by the pthread_attr_getstack returns the address of the guard page in some of the glibc / distros and address of the last accessible page before the guard page on others. So I've relaxed the check to consider +/- 1 page around the stack limit to recognize sigsegv as stack overflow.

This change enables printing stack trace at stack overflow to the
console. In case multiple threads fail with stack overflow in parallel,
only the trace of the first thread is printed.
@janvorli janvorli added this to the 5.0 milestone Feb 8, 2020
@janvorli janvorli requested a review from jkotas February 8, 2020 00:55
@janvorli janvorli self-assigned this Feb 8, 2020
@janvorli
Copy link
Member Author

janvorli commented Feb 8, 2020

cc: @stephentoub, @echesakovMSFT

@janvorli
Copy link
Member Author

janvorli commented Feb 8, 2020

Here are examples of stack traces printed:
First, a case where there are three functions with no locals calling each other in an infinite loop:

Stack overflow.
Repeat 174584 times:
--------------------------------
   at TestStackOverflow1.Program.InfiniteRecursionB()
   at TestStackOverflow1.Program.InfiniteRecursionA()
   at TestStackOverflow1.Program.InfiniteRecursionC()
--------------------------------
   at TestStackOverflow1.Program.InfiniteRecursionB()
   at TestStackOverflow1.Program.InfiniteRecursionA()
   at TestStackOverflow1.Program.Main(System.String[])

And here is a second example where each of the three functions has a local of a struct type where the struct is 64kB large. This tests the stack probing that JIT emits for frames larger than certain size.

Stack overflow.
Repeat 42 times:
--------------------------------
   at TestStackOverflow2.Program.InfiniteRecursionB2()
   at TestStackOverflow2.Program.InfiniteRecursionA2()
   at TestStackOverflow2.Program.InfiniteRecursionC2()
--------------------------------
   at TestStackOverflow2.Program.InfiniteRecursionB2()
   at TestStackOverflow2.Program.InfiniteRecursionA2()
   at TestStackOverflow2.Program.Main(System.String[])

@jkotas
Copy link
Member

jkotas commented Feb 8, 2020

Repeat 174584 times

Should we have a cut-off for this?

@janvorli
Copy link
Member Author

janvorli commented Feb 8, 2020

Should we have a cut-off for this?

I am not sure I understand what you mean.

@@ -6703,6 +6707,9 @@ bool IsIPInMarkedJitHelper(UINT_PTR uControlPc)
CHECK_RANGE(JIT_WriteBarrier)
CHECK_RANGE(JIT_CheckedWriteBarrier)
CHECK_RANGE(JIT_ByRefWriteBarrier)
#if !defined(TARGET_ARM64)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's special about ARM64 that it is not in this list?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ARM64 doesn't have the JIT_StackProbe (yet). See #13519.


static volatile LONG g_stackOverflowCallStackLogged = 0;

// Dump stack trace only for the first thread failing with stack overflow to prevent mixing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the other threads going to wait for this one to finish printing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I've verified that. I've tried to put a 10 second sleep after printing the stack trace and it was still printed while the other threads were passing by.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if the other threads hit stackoverflow exception as well? It may be useful to note in the comment where the other threads are going to wait.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was testing it with many threads hitting stack overflow at the same time. But I've just found that the threads wait only when running under a debugger. When running without a debugger, nothing is waiting for this thread to complete and so we get partial or even no stack trace dump here. So I'll need to fix it.

@jkotas
Copy link
Member

jkotas commented Feb 8, 2020

Should we have a cut-off for this?

I am not sure I understand what you mean.

I mean that getting a log message with 174584 lines is not very useful. It would be nice to log e.g. first 1000 lines and last 1000 lines and skip the 172,000 lines in between with ....

@jkotas
Copy link
Member

jkotas commented Feb 8, 2020

Ah ok, I see - you have implemented even something better.

@yyjdelete
Copy link

Will it also work if the loop is not begin from the top of the stack?
For something like the below:

Stack overflow.
   at TestStackOverflow1.Program.InfiniteRecursionE()
   at TestStackOverflow1.Program.InfiniteRecursionD()//`B` calls both `D` and then `C`, and overflow finally happen in `C`
Repeat 174584 times:
--------------------------------
   at TestStackOverflow1.Program.InfiniteRecursionB()
   at TestStackOverflow1.Program.InfiniteRecursionA()
   at TestStackOverflow1.Program.InfiniteRecursionC()
--------------------------------
   at TestStackOverflow1.Program.InfiniteRecursionB()
   at TestStackOverflow1.Program.InfiniteRecursionA()
   at TestStackOverflow1.Program.Main(System.String[])

@GSPP
Copy link

GSPP commented Feb 10, 2020

Does this only search for repeats starting at the top frame? This strategy could get unlucky and not find any repeats if there is a non-recursive function call at the top. For example:

A => B => A => B => ... => F

void A() => B();
void F() { };
void B()
{
 F(); //non-recursive
 A();
}

If I got this right, here's an idea on how to efficiently fix this: Take a histogram of functions involved in the stack. 99% of that histogram will be in the recursive part. That way it should be possible to efficiently determine the repeating section. It would be a heuristic that would essentially always work in practical cases. So first guess the repeating part from the histogram, then validate this assumption. Most of the time the validation succeeds and a shortened stack can be printed.

@janvorli
Copy link
Member Author

@yyjdelete, @GSPP - My primary goal was to get any stack trace at stack overflow since without my change, all you'd get is a "Stack overflow" message. The secondary goal was to add some form of compaction, initially something that would work for cases of an infinite recursion and that would have reasonable performance. I believe this covers majority of the cases of bugs in the code flow.
Of course there is a possibility of improvements to handle other patterns. When this change gets in, it would be easy to experiment with more involved methods of stack trace compaction. I was thinking about using Tarjan's strongly connected components algorithm for the cycle detection combined with compaction of the loops the algorithm detects e.g. the way the current change does it. The call stack can be viewed as a directed graph where functions represent the nodes and calls the edges. This would still be an O(n) algorithm. One could even think about running the same algorithm on the output with already detected loops so that loops within other loops could be detected. For example "a->b->c->a->b->c->x->d->e->d->e>->y->f->g->h->i->f->g->h->i->f->g->h->i->a->b->c->a->b->c->x->d->e->d->e>->y->f->g->h->i->f->g->h->i->f->g->h->i" could be visualized as

  • 2 *
    • 3 * (i, h, g, f)
    • y
    • 2 * (e, d)
    • x
    • 2 * (c, b, a)

@GSPP the histogram idea is interesting, thank you for the suggestion. It seems that there could still be cases where the functions on the top of the stack would occur somewhere else in the stack trace and mislead the detection. I guess you could still mitigate that somehow. It seems though that the Tarjan algorithm would solve it reliably for 100% of the cases.

* Update the stack overflow stack size to use fixed size as a basis rather than
basing it on the virtual page size.
* Rename the AdjustContextForWriteBarrier to AdjustContextForJITHelpers
* Add waiting for the stack overflow stack trace printing completion to
the `EEPolicy::HandleFatalStackOverflow` so that it gets printed
completely even in case of multiple threads failing with stack overflow.

// Dump stack trace only for the first thread failing with stack overflow to prevent mixing
// multiple stack traces together.
CLREvent* event = new CLREvent();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can throw exceptions, etc. It may be better to poll with Sleep(50) here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I've added a new commit replacing the event by polling as you've suggested.

@janvorli janvorli merged commit 65587ba into dotnet:master Feb 10, 2020
@janvorli janvorli deleted the improve-stack-overflow-reporting branch February 10, 2020 23:39
@danmoseley
Copy link
Member

Hooray thank you for doing this @janvorli ! This will make it possible to diagnose these from a log file, which is a huge improvement.

safern added a commit that referenced this pull request Feb 12, 2020
@TYoung86
Copy link

Oof, I was looking forward to this.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants