
Deadlock while trying to Suspend EE & acquire threadstore lock while starting a GC #37571

Closed · brianflex opened this issue Jun 7, 2020 · 13 comments
Labels: area-Diagnostics-coreclr, tenet-reliability (Reliability/stability related issue: stress, load problems, etc.)
Milestone: 5.0

brianflex commented Jun 7, 2020

I've got a deadlock where the GC is blocked acquiring the threadstore lock, and almost all the other threads are blocked on the GC. My specific instance has 310 active threads when the process deadlocks.

I'm using .NET Core 3.1. The program reads a different JSON file for each day of the past year or so, using a different thread for each one for performance (a poor man's in-proc Big Data solution). The process usually completes in 10 minutes, but it hung for about 9 hours with no obvious forward progress. After attaching VS, I saw most threads sitting in various helpers like FastAllocateString, all blocked on the GC. This behavior just showed up today, when the app used one more thread than before - maybe I crossed some thread-count limit at 310 active threads?
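
Roughly, the loading pattern looks like this (a minimal, illustrative sketch rather than the real exporter code; the data directory and the elided per-day processing are placeholders):

using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;
using System.Threading.Tasks;

class Exporter
{
    static void Main()
    {
        // One work item per daily JSON file, all queued at once.
        var tasks = new List<Task>();
        foreach (var file in Directory.EnumerateFiles(@"d:\data\daily", "*.json"))  // hypothetical data directory
        {
            tasks.Add(Task.Run(() =>
            {
                string json = File.ReadAllText(file);      // bottoms out in ReadFile via P/Invoke
                using var doc = JsonDocument.Parse(json);  // allocates heavily (strings, arrays)
                // ... per-day processing elided ...
            }));
        }
        Task.WaitAll(tasks.ToArray());
    }
}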

Stack traces from my crash dump (~*k):
DataExporterThreads.txt

Most threads are blocked in WKS::gc_heap::wait_for_gc_done. Thread 93 is doing a GC, but is blocked in SuspendEE trying to get the ThreadStore lock. In a cursory analysis, I was unable to determine which thread is holding the ThreadStore lock. There were several threads in JIT_PInvokeEndRarePath calling Thread::RareDisablePreemptiveGC, which oddly calls EnablePreemptiveGC and then blocks (I never got a good explanation from Vance or Chris Brumme on that one).

I was able to work around this at least once with the following code:
ThreadPool.SetMaxThreads(200, 50);
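
A slightly fuller version of that workaround, applied early in Main before any of the per-file work is queued (the 200/50 limits are just the values that happened to help here; ThreadPool.GetMaxThreads/SetMaxThreads are the actual APIs):

using System;
using System.Threading;

class Startup
{
    static void Main()
    {
        // Report the defaults, then cap thread-pool growth before queuing work.
        ThreadPool.GetMaxThreads(out int worker, out int io);
        Console.WriteLine($"default max threads: worker={worker}, io={io}");
        if (!ThreadPool.SetMaxThreads(200, 50))
            Console.WriteLine("SetMaxThreads rejected the requested limits");
    }
}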

I'm attaching the stack trace from each thread. I have a dump captured by VS, with the debugger attached as both native & managed. Happy to upload it somehow, somewhere. 400 MB compressed.

Thread 93's portion back in native code:

ntdll!NtWaitForAlertByThreadId+0x14
ntdll!RtlpWaitOnAddressWithTimeout+0x43
ntdll!RtlpWaitOnAddress+0xb2
ntdll!RtlpWaitOnCriticalSection+0xdb
ntdll!RtlpEnterCriticalSectionContended+0xcc
ntdll!RtlEnterCriticalSection+0x40
coreclr!CrstBase::Enter+0x49 [f:\workspace_work\1\s\src\vm\crst.cpp @ 319]
coreclr!ThreadStore::Enter+0x10 [f:\workspace_work\1\s\src\vm\threads.cpp @ 5255]
coreclr!ThreadSuspend::LockThreadStore+0x7d [f:\workspace_work\1\s\src\vm\threadsuspend.cpp @ 2361]
coreclr!ThreadSuspend::SuspendEE+0xf7 [f:\workspace_work\1\s\src\vm\threadsuspend.cpp @ 6443]
coreclr!GCToEEInterface::SuspendEE+0x21 [f:\workspace_work\1\s\src\vm\gcenv.ee.cpp @ 25]
coreclr!WKS::GCHeap::GarbageCollectGeneration+0xff [f:\workspace_work\1\s\src\gc\gc.cpp @ 36545]
coreclr!WKS::gc_heap::trigger_gc_for_alloc+0x12 [f:\workspace_work\1\s\src\gc\gc.cpp @ 13832]
coreclr!WKS::gc_heap::try_allocate_more_space+0x24e [f:\workspace_work\1\s\src\gc\gc.cpp @ 13934]
coreclr!WKS::gc_heap::allocate_more_space+0x11 [f:\workspace_work\1\s\src\gc\gc.cpp @ 14369]
coreclr!WKS::gc_heap::allocate+0x58 [f:\workspace_work\1\s\src\gc\gc.cpp @ 14400]
coreclr!WKS::GCHeap::Alloc+0x88 [f:\workspace_work\1\s\src\gc\gc.cpp @ 35827]
coreclr!Alloc+0x12f [f:\workspace_work\1\s\src\vm\gchelpers.cpp @ 240]
coreclr!AllocateString+0x195 [f:\workspace_work\1\s\src\vm\gchelpers.cpp @ 996]
coreclr!FramedAllocateString+0x82 [f:\workspace_work\1\s\src\vm\jithelpers.cpp @ 2884]

A set of threads was blocked on the GC mode transition around a P/Invoke call to ReadFile:

[0x1] KERNELBASE!WaitForSingleObjectEx + 0x8f
[0x2] coreclr!CLREventWaitHelper2 + 0x6
[0x3] coreclr!CLREventWaitHelper + 0x20
[0x4] coreclr!CLREventBase::WaitEx + 0x12
[0x5] coreclr!CLREventBase::Wait + 0x12
[0x6] coreclr!Thread::WaitSuspendEventsHelper + 0xa9
[0x7] coreclr!Thread::WaitSuspendEvents + 0x8
[0x8] coreclr!Thread::RareEnablePreemptiveGC + 0x1342ec
[0x9] coreclr!Thread::EnablePreemptiveGC + 0x116
[0xa] coreclr!Thread::RareDisablePreemptiveGC + 0x1c6
[0xb] coreclr!JIT_PInvokeEndRarePath + 0x69
[0xc] System_Private_CoreLib!Interop+Kernel32.ReadFile(System.Runtime.InteropServices.SafeHandle, Byte*, Int32, Int32 ByRef, IntPtr)$##6000044 + 0xbf

Brian (who left the BCL team 8 years ago)

Dotnet-GitSync-Bot (Collaborator) commented:

I couldn't figure out the best area label to add to this issue. Please help me learn by adding exactly one area label.

Dotnet-GitSync-Bot added the untriaged label Jun 7, 2020
jkotas added the tenet-reliability label Jun 8, 2020

jkotas (Member) commented Jun 8, 2020

Thank you for capturing the dump!

> Thread 93 is doing a GC, but is blocked in SuspendEE trying to get the ThreadStore lock.

Yes, that is the problem.

> In a cursory analysis, I was unable to determine which thread is holding the ThreadStore lock.

You should be able to look at the OwningThread field of the CRITICAL_SECTION structure passed into the RtlEnterCriticalSection API. It should point to the thread that owns the thread store lock.
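
For reference, the relevant part of the Win32 CRITICAL_SECTION layout, shown as a C# mirror of the native struct purely for illustration (field names and order per winnt.h; in the dump you would inspect the structure at the address passed to RtlEnterCriticalSection):

using System;
using System.Runtime.InteropServices;

// Illustrative mirror of the native RTL_CRITICAL_SECTION layout.
// OwningThread holds the OS thread id of the current owner (zero when unowned).
[StructLayout(LayoutKind.Sequential)]
struct RTL_CRITICAL_SECTION
{
    public IntPtr DebugInfo;      // PRTL_CRITICAL_SECTION_DEBUG
    public int LockCount;
    public int RecursionCount;
    public IntPtr OwningThread;   // thread id, stored handle-sized
    public IntPtr LockSemaphore;
    public UIntPtr SpinCount;
}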

> Happy to upload it somehow, somewhere. 400 MB compressed.

Yes, that would help too.

brianflex (Author) commented:

Hi Jan, thanks for the confirmation. (And after working on the BCL for 14 years, of course I know to get a dump.) I've looked at this more but I am not happy with what I'm seeing.

The OwningThread parameter is 0x12c8, which is the CLR's debugger thread. The stack trace of that thread is pretty boring and typical.

I did do something weird here. When I first noticed the problem, I attached Visual Studio in managed-only mode to debug it and saw all the threads blocked on the GC. I then detached Visual Studio and re-attached it in native-only mode. So it's vaguely possible that there was a completely unrelated GC problem, that attaching the managed debugger caused its thread to acquire the threadstore lock and somehow cleared the original deadlock, and that when I re-attached the native debugger it found the process in a similar deadlock in a different spot. That doesn't seem very likely, but I can't rule it out.

[three screenshots attached]

Fortunately, I got two dumps! I will start downloading and analyzing them later tonight.
Also, do you have a spot where I can put 400 MB dump files? GitHub won't let me upload one here, and I can't upload it to my web site. Should I email it to you? Worst case, I can set up some SFTP server...

briangru commented Jun 8, 2020

Here's another thread trying to acquire the ThreadStore lock... This is a very weird thread that I don't quite understand - looks like it's trying to run Git... Could the managed debugger have created this thread and/or started running stuff on it? It may not be part of a deadlock, but it's certainly odd to me.

ntdll.dll!NtWaitForAlertByThreadId() Unknown
ntdll.dll!RtlpWaitOnAddressWithTimeout() Unknown
ntdll.dll!RtlpWaitOnAddress() Unknown
ntdll.dll!RtlpWaitOnCriticalSection() Unknown
ntdll.dll!RtlpEnterCriticalSectionContended() Unknown
ntdll.dll!RtlEnterCriticalSection() Unknown
coreclr.dll!CrstBase::Enter() Line 319 C++
[Inline Frame] coreclr.dll!ThreadStore::Enter() Line 5255 C++
coreclr.dll!ThreadSuspend::LockThreadStore(ThreadSuspend::SUSPEND_REASON reason) Line 2361 C++
[Inline Frame] coreclr.dll!ThreadStore::LockThreadStore() Line 5272 C++
[Inline Frame] coreclr.dll!StateHolder<&ThreadStore::LockThreadStore,&ThreadStore::UnlockThreadStore>::Acquire() Line 358 C++
coreclr.dll!ThreadStore::AddThread(Thread * newThread, int bRequiresTSL) Line 5298 C++
coreclr.dll!SetupThread(int) Line 740 C++
[Inline Frame] coreclr.dll!SetupThread() Line 628 C++
coreclr.dll!SetupThreadNoThrow(HRESULT * pHR) Line 562 C++
coreclr.dll!Unknown_QueryInterface(IUnknown * pUnk, const _GUID & riid, void * * ppv) Line 375 C++
[Inline Frame] combase.dll!CGIPTable::GetRequestedInterface(IUnknown * pUnk, const _GUID & pVtableAddress, void *) Line 1564 C++
combase.dll!CGIPTable::GetInterfaceFromGlobal(unsigned long dwCookie, const _GUID & riid, void * * ppv) Line 1613 C++
[Inline Frame] combase.dll!Windows::Internal::GitPtrImpl<Windows::Internal::GitPtr>::Localize(const _GUID &) Line 239 C++
[Inline Frame] combase.dll!Windows::Internal::GitPtrImpl<Windows::Internal::GitPtr>::CopyLocal(Microsoft::WRL::Details::ComPtrRef<Microsoft::WRL::ComPtr<Windows::Foundation::IEventHandler<Windows::Foundation::Diagnostics::TracingStatusChangedEventArgs *>>>) Line 167 C++
combase.dll!Windows::Internal::Details::GitInvokeHelper<Windows::Foundation::IEventHandler<Windows::Foundation::Diagnostics::TracingStatusChangedEventArgs * __ptr64>,Windows::Internal::GitPtr,2>::Invoke(IInspectable * arg1, Windows::Foundation::Diagnostics::ITracingStatusChangedEventArgs * arg2) Line 102 C++
[Inline Frame] combase.dll!Microsoft::WRL::EventSource<Windows::Foundation::IEventHandler<Windows::Foundation::Diagnostics::TracingStatusChangedEventArgs *>,Microsoft::WRL::InvokeModeOptions<-2>>::InvokeAll::__l2::<lambda_4e24764d0eabc73f3d0d5e4e94f813bd>::operator()(Microsoft::WRL::ComPtr &) Line 1273 C++
combase.dll!Microsoft::WRL::InvokeTraits<-2>::InvokeDelegates<<lambda_4e24764d0eabc73f3d0d5e4e94f813bd>,Windows::Foundation::IEventHandler<Windows::Foundation::Diagnostics::TracingStatusChangedEventArgs * __ptr64>>(Microsoft::WRL::EventSource<Windows::Foundation::IEventHandler<Windows::Foundation::Diagnostics::TracingStatusChangedEventArgs *>,Microsoft::WRL::InvokeModeOptions<-2>>::InvokeAll::__l2::<lambda_4e24764d0eabc73f3d0d5e4e94f813bd> invokeOne, Microsoft::WRL::Details::EventTargetArray * targetArray, Microsoft::WRL::EventSource<Windows::Foundation::IEventHandler<Windows::Foundation::Diagnostics::TracingStatusChangedEventArgs *>,Microsoft::WRL::InvokeModeOptions<-2>> * pEvent) Line 118 C++
[Inline Frame] combase.dll!Microsoft::WRL::EventSource<Windows::Foundation::IEventHandler<Windows::Foundation::Diagnostics::TracingStatusChangedEventArgs *>,Microsoft::WRL::InvokeModeOptions<-2>>::DoInvoke(Microsoft::WRL::EventSource<Windows::Foundation::IEventHandler<Windows::Foundation::Diagnostics::TracingStatusChangedEventArgs *>,Microsoft::WRL::InvokeModeOptions<-2>>::InvokeAll::__l2::<lambda_4e24764d0eabc73f3d0d5e4e94f813bd>) Line 1251 C++
[Inline Frame] combase.dll!Microsoft::WRL::EventSource<Windows::Foundation::IEventHandler<Windows::Foundation::Diagnostics::TracingStatusChangedEventArgs *>,Microsoft::WRL::InvokeModeOptions<-2>>::InvokeAll(void *) Line 1272 C++
[Inline Frame] combase.dll!Windows::Foundation::Diagnostics::CausalityTraceState::TraceEnabled(Windows::Foundation::Diagnostics::ITracingStatusChangedEventArgs *) Line 705 C++
combase.dll!Windows::Foundation::Diagnostics::CausalityTraceState::ControlCallback(unsigned char enabled, Windows::Foundation::Diagnostics::CausalityTraceLevel enableLevel) Line 760 C++
combase.dll!CausalityControlCallback::InvokeAllCallbacks(unsigned char enabled, Windows::Foundation::Diagnostics::CausalityTraceLevel enableLevel) Line 179 C++
combase.dll!McGenControlCallbackV2(const _GUID * SourceId, unsigned long ControlCode, unsigned char Level, unsigned __int64 MatchAnyKeyword, unsigned __int64 MatchAllKeyword, _EVENT_FILTER_DESCRIPTOR * FilterData, void * CallbackContext) Line 201 C++
ntdll.dll!EtwpEventApiCallback() Unknown
ntdll.dll!EtwpUpdateEnableInfoAndCallback() Unknown
ntdll.dll!EtwpProcessNotification() Unknown
ntdll.dll!EtwDeliverDataBlock() Unknown
ntdll.dll!EtwpNotificationThread() Unknown
ntdll.dll!TppExecuteWaitCallback() Unknown
ntdll.dll!TppWorkerThread() Unknown
kernel32.dll!BaseThreadInitThunk() Unknown
ntdll.dll!RtlUserThreadStart() Unknown

jkotas (Member) commented Jun 8, 2020

> This is a very weird thread that I don't quite understand

This thread is the ETW control thread. It looks like something (the debugger?) is trying to turn on ETW events for Windows::Foundation::Diagnostics::AsyncCausalityTracer.

ghost commented Jun 8, 2020

Tagging subscribers to this area: @tommcdon
Notify danmosemsft if you want to be subscribed.

tommcdon (Member) commented Jun 8, 2020

@davmason do you have time to take a look?

tommcdon removed the untriaged label Jun 8, 2020
tommcdon added this to the 5.0 milestone Jun 8, 2020
tommcdon added the p1 label Jun 8, 2020

davmason (Member) commented Jun 8, 2020

I can take a look.

@brianflex in the past I've had people upload large files to OneDrive/Google Drive. Would that work for you?

briangru commented Jun 8, 2020

Good idea. I collected two dumps.
https://1drv.ms/u/s!Ar0vA-jO4bHnnco3PJvZMS3X7a2yQA?e=EBmwJu

The DataExporterHungWhileAllocating one was the first dump, where I attached VS in managed-only mode. I later detached it and re-attached with both native and managed debugging. Most of my analysis was on the second dump. That sequence of events shouldn't have affected anything, but as my response to Jan indicates above, it looks like the debugger thread held the threadstore lock in the second dump. Perhaps there's some missing code in the detach code path?

Let me know if you need more help - happy to drive over to Building 25 or find some way to collaborate if needed. Of course, I've been away from your code base for 8 years - at this point maybe assume I can't tell a GC hole from an OBJECTREF that wasn't GC protected.

davmason (Member) commented Jun 9, 2020

I took a look at both dumps, and I don't have much to add beyond the investigation you've already done. In both, a thread is attempting to do a GC and believes the debugger thread owns the ThreadStore lock, but the debugger thread is sitting idle. The rest of the threads are what I would expect from a GC that never happens: they execute until they need a new alloc_context or do something else that requires a coop/preemptive transition, and then wait for the GC to be done.

How often are you seeing this? Is it a one-time thing, or is it reproducible? If it's reproducible, I'd really like to see a dump of the hung process taken with "procdump -ma" or with ".dump /ma" in windbg, without ever attaching VS. Attaching a managed debugger can change the state of the process quite a bit.
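
For example, something along these lines against the hung process (the PID and output path here are only placeholders):

procdump -ma <pid> c:\dumps\DataExporter-hang.dmp

or, from a windbg session attached to the hung process:

.dump /ma c:\dumps\DataExporter-hang.dmp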

briangru commented:

I was unable to repro this again today, after several attempts.

davmason (Member) commented:

Thanks for trying again. I would like to get to the bottom of this, but I don't believe there is enough data in the dumps you provided for it to be actionable. Whatever happened to orphan the ThreadStore lock is no longer present.

If you get into that state again, please collect a native dump via procdump or windbg and share it with me.

tommcdon (Member) commented:

Closing this issue since it is 5 months old and we have not been able to repro it and do not have a dump to diagnose it.

ghost locked as resolved and limited conversation to collaborators Dec 8, 2020