
Deadlock while trying to Suspend EE & acquire threadstore lock while starting a GC #37571

Closed · brianflex opened this issue Jun 7, 2020 · 13 comments
Labels: area-Diagnostics-coreclr, tenet-reliability (Reliability/stability related issue: stress, load problems, etc.)
Milestone: 5.0

brianflex commented Jun 7, 2020

I've got a deadlock where the GC is blocked acquiring the threadstore lock, and almost all the other threads are blocked on the GC. My specific instance has 310 active threads when the process deadlocks.

I'm using .NET Core 3.1. The program reads a different JSON file for each day of the past year or so, using a different thread for each one for performance (a poor man's in-proc Big Data solution). The process usually completes in 10 minutes, but it hung for about 9 hours with no obvious forward progress. After attaching VS, I saw most threads sitting in various helpers like FastAllocateString, all blocked on the GC. This behavior just showed up today, when the app used one more thread than before - maybe I crossed some thread-count limit at 310 active threads?
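
Roughly, the loading pattern looks like this (a minimal, illustrative sketch rather than the real exporter code; the data directory and the elided per-day processing are placeholders):

using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;
using System.Threading.Tasks;

class Exporter
{
    static void Main()
    {
        // One work item per daily JSON file, all queued at once.
        var tasks = new List<Task>();
        foreach (var file in Directory.EnumerateFiles(@"d:\data\daily", "*.json"))  // hypothetical data directory
        {
            tasks.Add(Task.Run(() =>
            {
                string json = File.ReadAllText(file);      // bottoms out in ReadFile via P/Invoke
                using var doc = JsonDocument.Parse(json);  // allocates heavily (strings, arrays)
                // ... per-day processing elided ...
            }));
        }
        Task.WaitAll(tasks.ToArray());
    }
}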

Stack traces from my crash dump (~*k):
DataExporterThreads.txt

Most threads are blocked in WKS::gc_heap::wait_for_gc_done. Thread 93 is doing a GC, but is blocked in SuspendEE trying to get the ThreadStore lock. In a cursory analysis, I was unable to determine which thread is holding the ThreadStore lock. There were several threads in JIT_PInvokeEndRarePath calling Thread::RareDisablePreemptiveGC, which oddly calls EnablePreemptiveGC and then blocks (I never got a good explanation from Vance or Chris Brumme on that one).

I was able to work around this at least once with the following code:
ThreadPool.SetMaxThreads(200, 50);
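
A slightly fuller version of that workaround, applied early in Main before any of the per-file work is queued (the 200/50 limits are just the values that happened to help here; ThreadPool.GetMaxThreads/SetMaxThreads are the actual APIs):

using System;
using System.Threading;

class Startup
{
    static void Main()
    {
        // Report the defaults, then cap thread-pool growth before queuing work.
        ThreadPool.GetMaxThreads(out int worker, out int io);
        Console.WriteLine($"default max threads: worker={worker}, io={io}");
        if (!ThreadPool.SetMaxThreads(200, 50))
            Console.WriteLine("SetMaxThreads rejected the requested limits");
    }
}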

I'm attaching the stack trace from each thread. I have a dump captured by VS, with the debugger attached as both native & managed. Happy to upload it somehow, somewhere. 400 MB compressed.

Thread 93's portion back in native code:

ntdll!NtWaitForAlertByThreadId+0x14
ntdll!RtlpWaitOnAddressWithTimeout+0x43
ntdll!RtlpWaitOnAddress+0xb2
ntdll!RtlpWaitOnCriticalSection+0xdb
ntdll!RtlpEnterCriticalSectionContended+0xcc
ntdll!RtlEnterCriticalSection+0x40
coreclr!CrstBase::Enter+0x49 [f:\workspace_work\1\s\src\vm\crst.cpp @ 319]
coreclr!ThreadStore::Enter+0x10 [f:\workspace_work\1\s\src\vm\threads.cpp @ 5255]
coreclr!ThreadSuspend::LockThreadStore+0x7d [f:\workspace_work\1\s\src\vm\threadsuspend.cpp @ 2361]
coreclr!ThreadSuspend::SuspendEE+0xf7 [f:\workspace_work\1\s\src\vm\threadsuspend.cpp @ 6443]
coreclr!GCToEEInterface::SuspendEE+0x21 [f:\workspace_work\1\s\src\vm\gcenv.ee.cpp @ 25]
coreclr!WKS::GCHeap::GarbageCollectGeneration+0xff [f:\workspace_work\1\s\src\gc\gc.cpp @ 36545]
coreclr!WKS::gc_heap::trigger_gc_for_alloc+0x12 [f:\workspace_work\1\s\src\gc\gc.cpp @ 13832]
coreclr!WKS::gc_heap::try_allocate_more_space+0x24e [f:\workspace_work\1\s\src\gc\gc.cpp @ 13934]
coreclr!WKS::gc_heap::allocate_more_space+0x11 [f:\workspace_work\1\s\src\gc\gc.cpp @ 14369]
coreclr!WKS::gc_heap::allocate+0x58 [f:\workspace_work\1\s\src\gc\gc.cpp @ 14400]
coreclr!WKS::GCHeap::Alloc+0x88 [f:\workspace_work\1\s\src\gc\gc.cpp @ 35827]
coreclr!Alloc+0x12f [f:\workspace_work\1\s\src\vm\gchelpers.cpp @ 240]
coreclr!AllocateString+0x195 [f:\workspace_work\1\s\src\vm\gchelpers.cpp @ 996]
coreclr!FramedAllocateString+0x82 [f:\workspace_work\1\s\src\vm\jithelpers.cpp @ 2884]

A set of threads was blocked on the GC mode transition around a P/Invoke call to ReadFile:

[0x1] KERNELBASE!WaitForSingleObjectEx + 0x8f
[0x2] coreclr!CLREventWaitHelper2 + 0x6
[0x3] coreclr!CLREventWaitHelper + 0x20
[0x4] coreclr!CLREventBase::WaitEx + 0x12
[0x5] coreclr!CLREventBase::Wait + 0x12
[0x6] coreclr!Thread::WaitSuspendEventsHelper + 0xa9
[0x7] coreclr!Thread::WaitSuspendEvents + 0x8
[0x8] coreclr!Thread::RareEnablePreemptiveGC + 0x1342ec
[0x9] coreclr!Thread::EnablePreemptiveGC + 0x116
[0xa] coreclr!Thread::RareDisablePreemptiveGC + 0x1c6
[0xb] coreclr!JIT_PInvokeEndRarePath + 0x69
[0xc] System_Private_CoreLib!Interop+Kernel32.ReadFile(System.Runtime.InteropServices.SafeHandle, Byte*, Int32, Int32 ByRef, IntPtr)$##6000044 + 0xbf

Brian (who left the BCL team 8 years ago)

Dotnet-GitSync-Bot (Collaborator) commented:

I couldn't figure out the best area label to add to this issue. Please help me learn by adding exactly one area label.

Dotnet-GitSync-Bot added the untriaged label Jun 7, 2020
jkotas added the tenet-reliability label Jun 8, 2020

jkotas (Member) commented Jun 8, 2020

Thank you for capturing the dump!

> Thread 93 is doing a GC, but is blocked in SuspendEE trying to get the ThreadStore lock.

Yes, that is the problem.

> In a cursory analysis, I was unable to determine which thread is holding the ThreadStore lock.

You should be able to look at the OwningThread field of the CRITICAL_SECTION structure passed into the RtlEnterCriticalSection API. It should point to the thread that owns the thread store lock.
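
For reference, the relevant part of the Win32 CRITICAL_SECTION layout, shown as a C# mirror of the native struct purely for illustration (field names and order per winnt.h; in the dump you would inspect the structure at the address passed to RtlEnterCriticalSection):

using System;
using System.Runtime.InteropServices;

// Illustrative mirror of the native RTL_CRITICAL_SECTION layout.
// OwningThread holds the OS thread id of the current owner (zero when unowned).
[StructLayout(LayoutKind.Sequential)]
struct RTL_CRITICAL_SECTION
{
    public IntPtr DebugInfo;      // PRTL_CRITICAL_SECTION_DEBUG
    public int LockCount;
    public int RecursionCount;
    public IntPtr OwningThread;   // thread id, stored handle-sized
    public IntPtr LockSemaphore;
    public UIntPtr SpinCount;
}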

> Happy to upload it somehow, somewhere. 400 MB compressed.

Yes, that would help too.

brianflex (Author) commented:

Hi Jan, thanks for the confirmation. (And after working on the BCL for 14 years, of course I know to get a dump.) I've looked at this more but I am not happy with what I'm seeing.

The OwningThread parameter is 0x12c8, which is the CLR's debugger thread. The stack trace of that thread is pretty boring and typical.

I did do something weird here. When I first noticed the problem, I attached Visual Studio in managed-only mode to debug it and saw all the threads blocked on the GC. I then detached Visual Studio and re-attached it in native-only mode. So it's vaguely possible that there was a completely unrelated GC problem, that attaching the managed debugger caused its thread to acquire the threadstore lock and somehow cleared the original deadlock, and that when I re-attached the native debugger it found the process in a similar deadlock in a different spot. That doesn't seem very likely, but I can't rule it out.

[three screenshots attached]

Fortunately, I got two dumps! I will start downloading and analyzing them later tonight.
Also, do you have a spot where I can put 400 MB dump files? GitHub won't let me upload one here, and I can't upload it to my web site. Should I email it to you? Worst case, I can set up some SFTP server...

briangru commented Jun 8, 2020

Here's another thread trying to acquire the ThreadStore lock... This is a very weird thread that I don't quite understand - looks like it's trying to run Git... Could the managed debugger have created this thread and/or started running stuff on it? It may not be part of a deadlock, but it's certainly odd to me.

ntdll.dll!NtWaitForAlertByThreadId() Unknown
ntdll.dll!RtlpWaitOnAddressWithTimeout() Unknown
ntdll.dll!RtlpWaitOnAddress() Unknown
ntdll.dll!RtlpWaitOnCriticalSection() Unknown
ntdll.dll!RtlpEnterCriticalSectionContended() Unknown
ntdll.dll!RtlEnterCriticalSection() Unknown
coreclr.dll!CrstBase::Enter() Line 319 C++
[Inline Frame] coreclr.dll!ThreadStore::Enter() Line 5255 C++
coreclr.dll!ThreadSuspend::LockThreadStore(ThreadSuspend::SUSPEND_REASON reason) Line 2361 C++
[Inline Frame] coreclr.dll!ThreadStore::LockThreadStore() Line 5272 C++
[Inline Frame] coreclr.dll!StateHolder<&ThreadStore::LockThreadStore,&ThreadStore::UnlockThreadStore>::Acquire() Line 358 C++
coreclr.dll!ThreadStore::AddThread(Thread * newThread, int bRequiresTSL) Line 5298 C++
coreclr.dll!SetupThread(int) Line 740 C++
[Inline Frame] coreclr.dll!SetupThread() Line 628 C++
coreclr.dll!SetupThreadNoThrow(HRESULT * pHR) Line 562 C++
coreclr.dll!Unknown_QueryInterface(IUnknown * pUnk, const _GUID & riid, void * * ppv) Line 375 C++
[Inline Frame] combase.dll!CGIPTable::GetRequestedInterface(IUnknown * pUnk, const _GUID & pVtableAddress, void *) Line 1564 C++
combase.dll!CGIPTable::GetInterfaceFromGlobal(unsigned long dwCookie, const _GUID & riid, void * * ppv) Line 1613 C++
[Inline Frame] combase.dll!Windows::Internal::GitPtrImpl<Windows::Internal::GitPtr>::Localize(const _GUID &) Line 239 C++
[Inline Frame] combase.dll!Windows::Internal::GitPtrImpl<Windows::Internal::GitPtr>::CopyLocal(Microsoft::WRL::Details::ComPtrRef<Microsoft::WRL::ComPtr<Windows::Foundation::IEventHandler<Windows::Foundation::Diagnostics::TracingStatusChangedEventArgs *>>>) Line 167 C++
combase.dll!Windows::Internal::Details::GitInvokeHelper<Windows::Foundation::IEventHandler<Windows::Foundation::Diagnostics::TracingStatusChangedEventArgs * __ptr64>,Windows::Internal::GitPtr,2>::Invoke(IInspectable * arg1, Windows::Foundation::Diagnostics::ITracingStatusChangedEventArgs * arg2) Line 102 C++
[Inline Frame] combase.dll!Microsoft::WRL::EventSource<Windows::Foundation::IEventHandler<Windows::Foundation::Diagnostics::TracingStatusChangedEventArgs *>,Microsoft::WRL::InvokeModeOptions<-2>>::InvokeAll::__l2::<lambda_4e24764d0eabc73f3d0d5e4e94f813bd>::operator()(Microsoft::WRL::ComPtr &) Line 1273 C++
combase.dll!Microsoft::WRL::InvokeTraits<-2>::InvokeDelegates<<lambda_4e24764d0eabc73f3d0d5e4e94f813bd>,Windows::Foundation::IEventHandler<Windows::Foundation::Diagnostics::TracingStatusChangedEventArgs * __ptr64>>(Microsoft::WRL::EventSource<Windows::Foundation::IEventHandler<Windows::Foundation::Diagnostics::TracingStatusChangedEventArgs *>,Microsoft::WRL::InvokeModeOptions<-2>>::InvokeAll::__l2::<lambda_4e24764d0eabc73f3d0d5e4e94f813bd> invokeOne, Microsoft::WRL::Details::EventTargetArray * targetArray, Microsoft::WRL::EventSource<Windows::Foundation::IEventHandler<Windows::Foundation::Diagnostics::TracingStatusChangedEventArgs *>,Microsoft::WRL::InvokeModeOptions<-2>> * pEvent) Line 118 C++
[Inline Frame] combase.dll!Microsoft::WRL::EventSource<Windows::Foundation::IEventHandler<Windows::Foundation::Diagnostics::TracingStatusChangedEventArgs *>,Microsoft::WRL::InvokeModeOptions<-2>>::DoInvoke(Microsoft::WRL::EventSource<Windows::Foundation::IEventHandler<Windows::Foundation::Diagnostics::TracingStatusChangedEventArgs *>,Microsoft::WRL::InvokeModeOptions<-2>>::InvokeAll::__l2::<lambda_4e24764d0eabc73f3d0d5e4e94f813bd>) Line 1251 C++
[Inline Frame] combase.dll!Microsoft::WRL::EventSource<Windows::Foundation::IEventHandler<Windows::Foundation::Diagnostics::TracingStatusChangedEventArgs *>,Microsoft::WRL::InvokeModeOptions<-2>>::InvokeAll(void *) Line 1272 C++
[Inline Frame] combase.dll!Windows::Foundation::Diagnostics::CausalityTraceState::TraceEnabled(Windows::Foundation::Diagnostics::ITracingStatusChangedEventArgs *) Line 705 C++
combase.dll!Windows::Foundation::Diagnostics::CausalityTraceState::ControlCallback(unsigned char enabled, Windows::Foundation::Diagnostics::CausalityTraceLevel enableLevel) Line 760 C++
combase.dll!CausalityControlCallback::InvokeAllCallbacks(unsigned char enabled, Windows::Foundation::Diagnostics::CausalityTraceLevel enableLevel) Line 179 C++
combase.dll!McGenControlCallbackV2(const _GUID * SourceId, unsigned long ControlCode, unsigned char Level, unsigned __int64 MatchAnyKeyword, unsigned __int64 MatchAllKeyword, _EVENT_FILTER_DESCRIPTOR * FilterData, void * CallbackContext) Line 201 C++
ntdll.dll!EtwpEventApiCallback() Unknown
ntdll.dll!EtwpUpdateEnableInfoAndCallback() Unknown
ntdll.dll!EtwpProcessNotification() Unknown
ntdll.dll!EtwDeliverDataBlock() Unknown
ntdll.dll!EtwpNotificationThread() Unknown
ntdll.dll!TppExecuteWaitCallback() Unknown
ntdll.dll!TppWorkerThread() Unknown
kernel32.dll!BaseThreadInitThunk() Unknown
ntdll.dll!RtlUserThreadStart() Unknown

jkotas (Member) commented Jun 8, 2020

> This is a very weird thread that I don't quite understand

This thread is the ETW control thread. It looks like something (the debugger?) is trying to turn on ETW events for Windows::Foundation::Diagnostics::AsyncCausalityTracer.

ghost commented Jun 8, 2020

Tagging subscribers to this area: @tommcdon
Notify danmosemsft if you want to be subscribed.

tommcdon (Member) commented Jun 8, 2020

@davmason do you have time to take a look?

tommcdon removed the untriaged label Jun 8, 2020
tommcdon added this to the 5.0 milestone Jun 8, 2020
tommcdon added the p1 label Jun 8, 2020

davmason (Member) commented Jun 8, 2020

I can take a look.

@brianflex in the past I've had people upload large files to OneDrive/Google Drive. Would that work for you?

briangru commented Jun 8, 2020

Good idea. I collected two dumps.
https://1drv.ms/u/s!Ar0vA-jO4bHnnco3PJvZMS3X7a2yQA?e=EBmwJu

The DataExporterHungWhileAllocating one was the first dump, where I attached VS in managed-only mode. I later detached it and re-attached with both native and managed debugging. Most of my analysis was on the second dump. That sequence of events shouldn't have affected anything, but as my response to Jan indicates above, it looks like the debugger thread held the threadstore lock in the second dump. Perhaps there's some missing code in the detach code path?

Let me know if you need more help - happy to drive over to Building 25 or find some way to collaborate if needed. Of course, I've been away from your code base for 8 years - at this point maybe assume I can't tell a GC hole from an OBJECTREF that wasn't GC protected.

davmason (Member) commented Jun 9, 2020

I took a look at both dumps, and I don't have much to add beyond the investigation you've already done. In both, a thread is attempting to do a GC and believes the debugger thread owns the ThreadStore lock, but the debugger thread is sitting idle. The rest of the threads are what I would expect from a GC that never happens: they execute until they need a new alloc_context or do something else that requires a coop/preemptive transition, and then wait for the GC to be done.

How often are you seeing this? Is it a one-time thing, or is it reproducible? If it's reproducible, I'd really like to see a dump of the hung process taken with "procdump -ma" or with ".dump /ma" in windbg, without ever attaching VS. Attaching a managed debugger can change the state of the process quite a bit.
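
For example, something along these lines against the hung process (the PID and output path here are only placeholders):

procdump -ma <pid> c:\dumps\DataExporter-hang.dmp

or, from a windbg session attached to the hung process:

.dump /ma c:\dumps\DataExporter-hang.dmp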

briangru commented:

I was unable to repro this again today, after several attempts.

davmason (Member) commented:

Thanks for trying again. I would like to get to the bottom of this, but I don't believe there is enough data in the dumps you provided for it to be actionable. Whatever happened to orphan the ThreadStore lock is no longer present.

If you get into that state again, please collect a native dump via procdump or windbg and share it with me.

tommcdon (Member) commented:

Closing this issue since it is 5 months old and we have not been able to repro it and do not have a dump to diagnose it.

ghost locked as resolved and limited conversation to collaborators Dec 8, 2020