Deadlock while trying to Suspend EE & acquire threadstore lock while starting a GC #37571
I couldn't figure out the best area label to add to this issue. Please help me learn by adding exactly one area label.
Thank you for capturing the dump!
Yes, that is the problem.
You should be able to look at the OwningThread of the critical section to see which thread holds the ThreadStore lock.
Yes, that would help too.
Hi Jan, thanks for the confirmation. (And after working on the BCL for 14 years, of course I know to get a dump.) I've looked at this more, but I'm not happy with what I'm seeing. The OwningThread parameter is 0x12c8, which is the CLR's debugger thread. The stack trace of that thread is pretty boring and typical.

I did do something weird here. When I first noticed the problem, I attached Visual Studio in managed-only mode to debug it and noticed all the threads blocked on the GC. That's when I detached Visual Studio and re-attached it in native-only mode. So it's vaguely possible that there was a completely unrelated GC problem, I attached a debugger, the debugger thread acquired the threadstore lock, attaching the managed debugger somehow cleared the original deadlock, and then the native debugger saw the process in a similar deadlock but in a different spot. That doesn't seem very likely, but I can't rule it out.

Fortunately, I got two dumps! I will start downloading and analyzing the other one later tonight.
Here's another thread trying to acquire the ThreadStore lock... This is a very weird thread that I don't quite understand - it looks like it's trying to run Git. Could the managed debugger have created this thread and/or started running stuff on it? It may not be part of the deadlock, but it's certainly odd to me.
This thread is the ETW control thread. It looks like something (the debugger?) is trying to turn on ETW events for Windows::Foundation::Diagnostics::AsyncCausalityTracer.
Tagging subscribers to this area: @tommcdon
@davmason do you have time to take a look?
I can take a look. @brianflex, in the past I've had people upload large files to OneDrive or Google Drive. Would that work for you?
Good idea. I collected two dumps. The DataExporterHungWhileAllocating one was the first dump, where I attached VS in managed-only mode. I later detached it and re-attached with both native and managed debugging. Most of my analysis was on the second dump.

That sequence of events shouldn't have affected anything, but as my response to Jan indicates above, it looks like in the second dump the debugger thread held the threadstore lock. Perhaps there's some missing code in the detach code path?

Let me know if you need more help - happy to drive over to Building 25 or find some way to collaborate if needed. Of course, I've been away from your code base for 8 years - at this point maybe assume I can't tell a GC hole from an OBJECTREF that wasn't GC protected.
I took a look at both dumps, and I don't have much to add beyond the investigation you've already done. In both, a thread is attempting to do a GC and believes the debugger thread owns the ThreadStore lock, but the debugger thread is sitting idle. The rest of the threads are what I would expect from a GC that never happens: they execute until they exhaust their alloc_context or do something else that requires a coop/preemptive transition, then wait for the GC to be done.

How often are you seeing this? Is it a one-time thing, or is it reproducible? If it's reproducible, I'd really like to see a dump of the hung process taken with "procdump -ma" or with ".dump /ma" in windbg, without ever attaching VS. Attaching a managed debugger can change the state of the process quite a bit.
I was unable to repro this again today, after several attempts. |
Thanks for trying again. I would like to get to the bottom of this, but I don't believe there is enough data in the dumps you provided to make it actionable. Whatever happened to orphan the ThreadStore lock is no longer present. If you get into that state again, please collect a native dump via procdump or windbg and share it with me.
Closing this issue since it is 5 months old and we have not been able to repro it or obtain a dump to diagnose it.
I've got a deadlock where the GC is blocked acquiring the threadstore lock, and almost all the other threads are blocked on the GC. My specific instance has 310 active threads when the process deadlocks.
I'm using .NET Core 3.1. The program reads a different JSON file for every day of the past year or so, using a different thread for each one for performance (a poor man's in-proc Big Data solution). The process usually completes in 10 minutes, but it hung for about 9 hours with no obvious forward progress. Attaching VS, most threads are in various helpers like FastAllocateString, all blocked on the GC. This behavior just showed up today, when the app would have used one more thread than yesterday - maybe I crossed some thread-count limit at 310 active threads?
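Purely to make the fan-out concrete, here is a minimal sketch of the pattern described above, assuming one work item per daily JSON file queued all at once; the "data" directory, the file naming, and ProcessDay are hypothetical placeholders, not the actual app:

```csharp
// Minimal sketch, not the actual app: fan out one work item per daily JSON file.
// When every item blocks on file I/O, thread-pool thread injection can grow the
// process to hundreds of worker threads, which matches the ~310 threads observed.
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

class DailyExporter
{
    static void Main()
    {
        // Hypothetical input layout: one JSON file per day under "data".
        Task[] work = Directory.EnumerateFiles("data", "*.json")
            .Select(file => Task.Run(() => ProcessDay(file)))
            .ToArray();
        Task.WaitAll(work);
    }

    static void ProcessDay(string path)
    {
        // Placeholder for the real per-file work (read, parse, aggregate).
        string json = File.ReadAllText(path);
        Console.WriteLine($"{Path.GetFileName(path)}: {json.Length} chars");
    }
}
```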
Stack traces from my crash dump (~*k):
DataExporterThreads.txt
Most threads are blocked in WKS::gc_heap::wait_for_gc_done. Thread 93 is doing a GC, but is blocked in SuspendEE trying to get the ThreadStore lock. In a cursory analysis, I was unable to determine which thread is holding the ThreadStore lock. There were several threads in JIT_PInvokeEndRarePath calling Thread::RareDisablePreemptiveGC, which oddly calls EnablePreemptiveGC and then blocks (I never got a good explanation from Vance or Chris Brumme on that one).
I was able to work around this at least once with the following code:
ThreadPool.SetMaxThreads(200, 50);
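For context, ThreadPool.SetMaxThreads returns a bool indicating whether the new limits were accepted, and a cap like this only helps if it runs before the pool has already fanned out. A hedged sketch of applying it at startup (the 200/50 values are just the ones from the workaround above, not a recommendation):

```csharp
// Hedged sketch: cap the thread pool before queuing any per-file work.
using System;
using System.Threading;

class Startup
{
    static void Main()
    {
        // SetMaxThreads returns false if the requested limits are not accepted
        // (for example, if they are below the current minimums).
        if (!ThreadPool.SetMaxThreads(workerThreads: 200, completionPortThreads: 50))
            Console.WriteLine("SetMaxThreads rejected the requested limits.");

        // ... queue the per-day work items here ...
    }
}
```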
I'm attaching the stack trace from each thread. I have a dump captured by VS, with the debugger attached as both native & managed. Happy to upload it somehow, somewhere. 400 MB compressed.
Thread 93's portion back in native code:
ntdll!NtWaitForAlertByThreadId+0x14
ntdll!RtlpWaitOnAddressWithTimeout+0x43
ntdll!RtlpWaitOnAddress+0xb2
ntdll!RtlpWaitOnCriticalSection+0xdb
ntdll!RtlpEnterCriticalSectionContended+0xcc
ntdll!RtlEnterCriticalSection+0x40
coreclr!CrstBase::Enter+0x49 [f:\workspace_work\1\s\src\vm\crst.cpp @ 319]
coreclr!ThreadStore::Enter+0x10 [f:\workspace_work\1\s\src\vm\threads.cpp @ 5255]
coreclr!ThreadSuspend::LockThreadStore+0x7d [f:\workspace_work\1\s\src\vm\threadsuspend.cpp @ 2361]
coreclr!ThreadSuspend::SuspendEE+0xf7 [f:\workspace_work\1\s\src\vm\threadsuspend.cpp @ 6443]
coreclr!GCToEEInterface::SuspendEE+0x21 [f:\workspace_work\1\s\src\vm\gcenv.ee.cpp @ 25]
coreclr!WKS::GCHeap::GarbageCollectGeneration+0xff [f:\workspace_work\1\s\src\gc\gc.cpp @ 36545]
coreclr!WKS::gc_heap::trigger_gc_for_alloc+0x12 [f:\workspace_work\1\s\src\gc\gc.cpp @ 13832]
coreclr!WKS::gc_heap::try_allocate_more_space+0x24e [f:\workspace_work\1\s\src\gc\gc.cpp @ 13934]
coreclr!WKS::gc_heap::allocate_more_space+0x11 [f:\workspace_work\1\s\src\gc\gc.cpp @ 14369]
coreclr!WKS::gc_heap::allocate+0x58 [f:\workspace_work\1\s\src\gc\gc.cpp @ 14400]
coreclr!WKS::GCHeap::Alloc+0x88 [f:\workspace_work\1\s\src\gc\gc.cpp @ 35827]
coreclr!Alloc+0x12f [f:\workspace_work\1\s\src\vm\gchelpers.cpp @ 240]
coreclr!AllocateString+0x195 [f:\workspace_work\1\s\src\vm\gchelpers.cpp @ 996]
coreclr!FramedAllocateString+0x82 [f:\workspace_work\1\s\src\vm\jithelpers.cpp @ 2884]
A set of threads was blocked on the GC-mode transition around a P/Invoke call to ReadFile (a sketch of the interop declaration follows the stack):
[0x1] KERNELBASE!WaitForSingleObjectEx + 0x8f
[0x2] coreclr!CLREventWaitHelper2 + 0x6
[0x3] coreclr!CLREventWaitHelper + 0x20
[0x4] coreclr!CLREventBase::WaitEx + 0x12
[0x5] coreclr!CLREventBase::Wait + 0x12
[0x6] coreclr!Thread::WaitSuspendEventsHelper + 0xa9
[0x7] coreclr!Thread::WaitSuspendEvents + 0x8
[0x8] coreclr!Thread::RareEnablePreemptiveGC + 0x1342ec
[0x9] coreclr!Thread::EnablePreemptiveGC + 0x116
[0xa] coreclr!Thread::RareDisablePreemptiveGC + 0x1c6
[0xb] coreclr!JIT_PInvokeEndRarePath + 0x69
[0xc] System_Private_CoreLib!Interop+Kernel32.ReadFile(System.Runtime.InteropServices.SafeHandle, Byte*, Int32, Int32 ByRef, IntPtr)$##6000044 + 0xbf
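Frame 0xc is the CoreLib interop wrapper for kernel32!ReadFile. Below is a hedged sketch of an equivalent DllImport declaration, written from the signature shown in that frame rather than taken from the CoreLib source; the relevant point is that the block happens on the return path from native code, when the thread tries to flip from preemptive back to cooperative GC mode while a suspension is pending:

```csharp
// Hedged sketch of an interop declaration with the same shape as the
// Interop+Kernel32.ReadFile frame above (not the actual CoreLib source).
// On return from the native call the thread must switch from preemptive back to
// cooperative GC mode; if a GC suspension is in flight, that transition blocks
// (JIT_PInvokeEndRarePath -> Thread::RareDisablePreemptiveGC), which is where
// these threads are parked.
using System;
using System.Runtime.InteropServices;

internal static class Kernel32
{
    [DllImport("kernel32.dll", SetLastError = true)]
    internal static extern unsafe bool ReadFile(
        SafeHandle hFile,             // handle to the open file
        byte* lpBuffer,               // destination buffer
        int nNumberOfBytesToRead,     // bytes requested
        out int lpNumberOfBytesRead,  // bytes actually read
        IntPtr lpOverlapped);         // IntPtr.Zero for synchronous reads
}
```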
Brian (who left the BCL team 8 years ago)