Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Access Violation on x86 #12113

Closed
ayende opened this issue Feb 24, 2019 · 91 comments · Fixed by dotnet/coreclr#23044
Closed

Access Violation on x86 #12113

ayende opened this issue Feb 24, 2019 · 91 comments · Fixed by dotnet/coreclr#23044
Assignees
Labels
arch-x86 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI area-ExceptionHandling-coreclr bug os-windows
Milestone

Comments

@ayende
Copy link
Contributor

ayende commented Feb 24, 2019

We have a scenario in which we are getting an access violation exception during a particular load in our system.
I have a full dump of the process when the crash happened, available here:
https://drive.google.com/file/d/11oqZaegxKcoNT8Xj1u9YDIcH7LcmMBsW/view?usp=sharing

The scenario we have is a few servers running and communicating with one another on the same process.
This is part of our test setup. We recently started seeing hard failures, such as this one: (d2c.3390): Access violation - code c0000005 (first chance)

The event log reports:

Application: dotnet.exe
CoreCLR Version: 4.6.27317.3
Description: The process was terminated due to an internal error in the .NET Runtime at IP 5A29AA9B (5A230000) with exit code 80131506.


Faulting application name: dotnet.exe, version: 2.1.27130.1, time stamp: 0x5c007ea0
Faulting module name: coreclr.dll, version: 4.6.27317.3, time stamp: 0x5c40c18e
Exception code: 0xc0000005
Fault offset: 0x0006aa9b
Faulting process id: 0x52d4
Faulting application start time: 0x01d4cc3f7d83fa59
Faulting application path: C:\Program Files (x86)\dotnet\dotnet.exe
Faulting module path: C:\Users\ayende\.nuget\packages\runtime.win-x86.microsoft.netcore.app\2.1.8\runtimes\win-x86\native\coreclr.dll
Report Id: ff99edfe-baf1-431a-9ce3-06e987219e6c
Faulting package full name: 
Faulting package-relative application ID: 

This machine has the following hot fixes applies:

PS C:\Windows\SysWOW64> Get-HotFix

Source        Description      HotFixID      InstalledBy          InstalledOn
------        -----------      --------      -----------          -----------
OREN-PC       Update           KB4100347     NT AUTHORITY\SYSTEM  2/18/2019 12:00:00 AM
OREN-PC       Update           KB4343669     NT AUTHORITY\SYSTEM  7/11/2018 12:00:00 AM
OREN-PC       Update           KB4456655     NT AUTHORITY\SYSTEM  9/13/2018 12:00:00 AM
OREN-PC       Security Update  KB4465663     NT AUTHORITY\SYSTEM  11/14/2018 12:00:00 AM
OREN-PC       Security Update  KB4471331     NT AUTHORITY\SYSTEM  12/6/2018 12:00:00 AM
OREN-PC       Security Update  KB4477137     NT AUTHORITY\SYSTEM  12/12/2018 12:00:00 AM
OREN-PC       Security Update  KB4480979     NT AUTHORITY\SYSTEM  1/9/2019 12:00:00 AM
OREN-PC       Security Update  KB4485449     NT AUTHORITY\SYSTEM  2/19/2019 12:00:00 AM
OREN-PC       Security Update  KB4487038     NT AUTHORITY\SYSTEM  2/19/2019 12:00:00 AM
OREN-PC       Security Update  KB4480966     HRHINOS\Ayende       2/21/2019 12:00:00 AM
OREN-PC       Security Update  KB4487017

The actual stack we are seeing is always something similar to:

0:043> kp
 # ChildEBP RetAddr  
00 (Inline) -------- coreclr!VolatileLoad+0x3 [e:\a\_work\335\s\src\inc\volatile.h @ 153] 
01 (Inline) -------- coreclr!Volatile<unsigned long>::Load+0x3 [e:\a\_work\335\s\src\inc\volatile.h @ 292] 
02 (Inline) -------- coreclr!Volatile<unsigned long>::operator unsigned long+0x3 [e:\a\_work\335\s\src\inc\volatile.h @ 346] 
03 (Inline) -------- coreclr!RelativePointer<Module *>::GetValue+0x3 [e:\a\_work\335\s\src\inc\fixuppointer.h @ 68] 
04 (Inline) -------- coreclr!RelativePointer<Module *>::GetValueAtPtr+0x3 [e:\a\_work\335\s\src\inc\fixuppointer.h @ 85] 
05 (Inline) -------- coreclr!ReadPointer+0x3 [e:\a\_work\335\s\src\inc\fixuppointer.h @ 954] 
06 (Inline) -------- coreclr!ReadPointer+0x3 [e:\a\_work\335\s\src\inc\fixuppointer.h @ 954] 
07 (Inline) -------- coreclr!MethodTable::GetLoaderModule+0x3 [e:\a\_work\335\s\src\vm\methodtable.inl @ 176] 
08 (Inline) -------- coreclr!MethodTable::GetLoaderAllocator+0x3 [e:\a\_work\335\s\src\vm\methodtable.inl @ 182] 
09 0e60d0a0 5a299c42 coreclr!VirtualCallStubManager::ResolveWorker(struct StubCallSite * pCallSite = 0x0e60d13c, class Object ** protectedObj = 0x0e60d180, struct DispatchToken token = struct DispatchToken, VirtualCallStubManager::StubKind stubKind = SK_DISPATCH (0n2))+0x6b [e:\a\_work\335\s\src\vm\virtualcallstub.cpp @ 1719] 
0a 0e60d168 5a34a26b coreclr!VSD_ResolveWorker(struct TransitionBlock * pTransitionBlock = 0x0e60d17c, unsigned long siteAddrForRegisterIndirect = 0, unsigned int token = 0xa0000)+0x24f [e:\a\_work\335\s\src\vm\virtualcallstub.cpp @ 1611] 
0b 0e60d190 0e995fca coreclr!ResolveWorkerAsmStub(void)+0x1b [e:\a\_work\335\s\src\vm\i386\virtualcallstubcpu.hpp @ 525] 
WARNING: Frame IP not in any known module. Following frames may be wrong.
0c 0e60d194 5a33afeb 0xe995fca
0d 0e60d1a8 5a47fc7f coreclr!CallJitEHFinallyHelper(void)+0x28 [E:\A\_work\335\s\src\vm\i386\asmhelpers.asm @ 390] 
0e 0e60d200 5a3a2acc coreclr!CallJitEHFinally(class CrawlFrame * pCf = 0x43a07fdc, unsigned char * startPC = <Value unavailable error>, struct EE_ILEXCEPTION_CLAUSE * EHClausePtr = 0x0e60d260, unsigned long nestingLevel = 0)+0xb8 [e:\a\_work\335\s\src\vm\i386\excepx86.cpp @ 3385] 
0f 0e60d2d0 5a249cf5 coreclr!COMPlusUnwindCallback+0x15a70c [e:\a\_work\335\s\src\vm\i386\excepx86.cpp @ 2996] 
10 (Inline) -------- coreclr!Thread::MakeStackwalkerCallback+0x151 [e:\a\_work\335\s\src\vm\stackwalk.cpp @ 877] 
11 0e60d59c 5a2525c1 coreclr!Thread::StackWalkFramesEx(struct REGDISPLAY * pRD = 0x0e60d5c8, <function> * pCallback = 0x0ddb401c, void * pData = 0x0e60d954, unsigned int flags = 4, class Frame * pStartFrame = 0x00000000)+0x1d4 [e:\a\_work\335\s\src\vm\stackwalk.cpp @ 958] 
12 0e60d8d0 5a251f60 coreclr!Thread::StackWalkFrames(<function> * pCallback = 0x5a2483c0, void * pData = 0x0e60d954, unsigned int flags = 4, class Frame * pStartFrame = 0x00000000)+0xa1 [e:\a\_work\335\s\src\vm\stackwalk.cpp @ 1042] 
13 0e60d8f0 5a252a52 coreclr!UnwindFrames(class Thread * pThread = 0x0ddb3ea8, struct ThrowCallbackType * tct = 0x0e60d954)+0x3e [e:\a\_work\335\s\src\vm\excep.cpp @ 2228] 
14 (Inline) -------- coreclr!COMPlusAfterUnwind+0x98 [e:\a\_work\335\s\src\vm\i386\excepx86.cpp @ 482] 
15 0e60db48 5a2512fa coreclr!CPFH_RealFirstPassHandler(struct _EXCEPTION_RECORD * pExceptionRecord = 0x0e60dcb0, struct _EXCEPTION_REGISTRATION_RECORD * pEstablisherFrame = <Value unavailable error>, struct _CONTEXT * pContext = 0x0e60dd00, int bAsynchronousThreadStop = 0n0)+0x459 [e:\a\_work\335\s\src\vm\i386\excepx86.cpp @ 1263] 
16 0e60db88 5a250ef3 coreclr!CPFH_FirstPassHandler(struct _EXCEPTION_RECORD * pExceptionRecord = 0x0e60dcb0, struct _EXCEPTION_REGISTRATION_RECORD * pEstablisherFrame = 0x0e60e650, struct _CONTEXT * pContext = 0x0e60dd00)+0xc3 [e:\a\_work\335\s\src\vm\i386\excepx86.cpp @ 1401] 
17 0e60dbac 7772f1a2 coreclr!COMPlusFrameHandler(struct _EXCEPTION_RECORD * pExceptionRecord = 0x0e60dcb0, struct _EXCEPTION_REGISTRATION_RECORD * pEstablisherFrame = 0x0e60e650, struct _CONTEXT * pContext = 0x0e60dd00, struct _DISPATCHER_CONTEXT * pDispatcherContext = 0x0e60dc38)+0x83 [e:\a\_work\335\s\src\vm\i386\excepx86.cpp @ 1821] 
18 0e60dbd0 7772f174 ntdll!ExecuteHandler2+0x26
19 0e60dc98 7771cd86 ntdll!ExecuteHandler+0x24
1a 0e60dc98 76f21812 ntdll!KiUserExceptionDispatcher+0x26
1b 0e60e1bc 5a252d94 KERNELBASE!RaiseException+0x62
1c 0e60e264 5a3248fb coreclr!RaiseTheExceptionInternalOnly(class Object * throwable = <Value unavailable error>, int rethrow = <Value unavailable error>, int fForStackOverflow = 0n0)+0x11d [e:\a\_work\335\s\src\vm\excep.cpp @ 3039] 
1d 0e60e32c 2ed2b3e4 coreclr!IL_Throw(class Object * obj = <Value unavailable error>)+0x11b [e:\a\_work\335\s\src\vm\jithelpers.cpp @ 4860] 
1e 0e60e33c 0e3bd604 0x2ed2b3e4
1f 0e60e354 0e99a747 0xe3bd604
20 0e60e384 0e999f92 0xe99a747
21 0e60e3b0 0eecd3c9 0xe999f92
22 0e60e400 0e995ad1 0xeecd3c9
23 0e60e578 0e9934e6 0xe995ad1
24 0e60e5b0 0c3c7770 0xe9934e6
25 0e60e5d8 593868dd 0xc3c7770
26 0e60e5f4 59cfa31d System_Threading_Thread+0x68dd
27 0e60e624 59cfc0cc System_Private_CoreLib+0x4ca31d
28 0e60e638 5a33b0ef System_Private_CoreLib+0x4cc0cc
29 0e60e644 5a26fbf1 coreclr!CallDescrWorkerInternal(unsigned long pParams = 0xe60e6b8)+0x34 [E:\A\_work\335\s\src\vm\i386\asmhelpers.asm @ 618] 
2a (Inline) -------- coreclr!CallDescrWorkerWithHandler+0x52 [e:\a\_work\335\s\src\vm\callhelpers.cpp @ 78] 
2b 0e60e6e4 5a32d4e4 coreclr!MethodDescCallSite::CallTargetWorker(unsigned int64 * pArguments = 0x0e60e730, unsigned int64 * pReturnValue = 0x00000000, int cbReturnValue = 0n0)+0x235 [e:\a\_work\335\s\src\vm\callhelpers.cpp @ 620] 
2c 0e60e7bc 5a342b06 coreclr!ThreadNative::KickOffThread_Worker(void * ptr = 0x0e60e948)+0x104 [e:\a\_work\335\s\src\vm\comsynchronizable.cpp @ 260] 
2d 0e60e7d4 5a26f86a coreclr!ManagedThreadBase_DispatchInner(struct ManagedThreadCallState * pCallState = <Value unavailable error>)+0x70 [e:\a\_work\335\s\src\vm\threads.cpp @ 8852] 
2e 0e60e880 5a26f7bb coreclr!ManagedThreadBase_DispatchMiddle(struct ManagedThreadCallState * pCallState = <Value unavailable error>)+0x65 [e:\a\_work\335\s\src\vm\threads.cpp @ 8902] 
2f 0e60e8e4 5a3352b9 coreclr!ManagedThreadBase_DispatchOuter(struct ManagedThreadCallState * pCallState = 0x0e60e8ec)+0x78 [e:\a\_work\335\s\src\vm\threads.cpp @ 9161] 
30 0e60e908 5a2d9e7c coreclr!ManagedThreadBase_FullTransitionWithAD(struct ADID pAppDomain = struct ADID, <function> * pTarget = <Value unavailable error>, void * args = <Value unavailable error>, UnhandledExceptionLocation filterType = ManagedThread (0n2))+0x2f [e:\a\_work\335\s\src\vm\threads.cpp @ 9200] 
31 (Inline) -------- coreclr!ManagedThreadBase::KickOff+0x15 [e:\a\_work\335\s\src\vm\threads.cpp @ 9234] 
32 0e60e984 5a2d9d90 coreclr!ThreadNative::KickOffThread(void * pass = 0x0ad364d0)+0xcc [e:\a\_work\335\s\src\vm\comsynchronizable.cpp @ 380] 
33 0e60f824 76388484 coreclr!Thread::intermediateThreadProc(void * arg = 0x0abac020)+0x50 [e:\a\_work\335\s\src\vm\threads.cpp @ 2255] 
34 0e60f838 77713ab8 KERNEL32!BaseThreadInitThunk+0x24
35 0e60f880 77713a88 ntdll!__RtlUserThreadStart+0x2f
36 0e60f890 00000000 ntdll!_RtlUserThreadStart+0x1b

The managed stack, FWIW, is:

0:043> !clrstack
OS Thread Id: 0x3390 (43)
Child SP       IP Call Site
0e60d0d8 5a29aa9b [GCFrame: 0e60d0d8] 
0e60d118 5a29aa9b [StubDispatchFrame: 0e60d118] System.IDisposable.Dispose()
0e60d198 0e995fca Raven.Server.Rachis.FollowerAmbassador.Run()
0e60e580 0e9934e6 Raven.Server.Rachis.FollowerAmbassador.b__58_0(System.Object)
0e60e584 0c3c8000 Raven.Server.Utils.PoolOfThreads+PooledThread.DoWork()
0e60e5b8 0c3c7770 Raven.Server.Utils.PoolOfThreads+PooledThread.Run()
0e60e5e0 593868dd System.Threading.Thread.ThreadMain_ThreadStart() [E:\A\_work\321\s\corefx\src\System.Threading.Thread\src\System\Threading\Thread.cs @ 93]
0e60e5e8 59cfc00c System.Threading.ThreadHelper.ThreadStart_Context(System.Object) [E:\A\_work\335\s\src\mscorlib\src\System\Threading\Thread.cs @ 62]
0e60e5fc 59cfa31d System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [E:\A\_work\335\s\src\mscorlib\shared\System\Threading\ExecutionContext.cs @ 167]
0e60e630 59cfc0cc System.Threading.ThreadHelper.ThreadStart() [E:\A\_work\335\s\src\mscorlib\src\System\Threading\Thread.cs @ 91]
0e60e718 5a33b0ef [GCFrame: 0e60e718] 
0e60e8a8 5a33b0ef [DebuggerU2MCatchHandlerFrame: 0e60e8a8] 

We are using unsafe code, but we are pretty sure that we aren't corrupting the heap in any manner (lots of tests cover that) and if we were, I would expect to see the failure in different locations.

From trying to figure out what is going on, a few really strange things seem to be happening here:

Here is the actual failure:

FAULTING_IP: 
KERNELBASE!RaiseException+62
76f21812 8b4c2454        mov     ecx,dword ptr [esp+54h]

And the full register usage is:

(d2c.3390): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=00000000 ebx=00e527a0 ecx=00000010 edx=15554140 esi=5a339280 edi=00e527a0
eip=5a29aa9b esp=0e60ce44 ebp=0e60d0a0 iopl=0         nv up ei pl zr na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010246

As you can see, the esp value has a non null value, but checking the memory location with the offset provided to the instruction shows just zeros.

While troubleshooting this, we found a NullReferenceException in our code. We fixed it, but that made the problem go away.
We suspect that this is some issue related to error handling inside the CoreCLR during JIT generation.
We have run into a different issue with KB4487017 (See: https://github.com/dotnet/coreclr/issues/22597), but we are reproducing this on different versions of Windows and without the KB in question.

We aren't able to reproduce this issue in 64 bits.

@ayende
Copy link
Contributor Author

ayende commented Feb 25, 2019

We are also seeing something similar with this stack trace:

15ddd310 503a91a4 15ddd3ac 15ddd3f0 000a0000 coreclr!coreclr_shutdown_2+0x347bb
15ddd3d8 5043057b 15ddd3ec 00000000 000a0000 coreclr!coreclr_shutdown_2+0x33924
15ddd400 0fac80c5 504207bb 03469698 503cee20 coreclr!MetaDataGetDispenser+0x4753b
15dde8f8 0fac49e6 0d58d430 00000000 05715594 0xfac80c5
15dde930 0d58ceb0 05715594 00000000 00000000 0xfac49e6
15dde958 544e68dd 0efc6344 30aadc5c 0efc6344 0xd58ceb0
15dde974 30aabf6d 00000000 00000000 05715640 System_Threading_Thread!System.Threading.Thread.ThreadMain_ThreadStart()$##600001C+0x35
15dde9a4 30aadd1c 05715668 15dde9d0 12781560 System_Private_CoreLib!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)$##600288D+0x65
15dde9b8 504208bf 15ddeac8 15ddea64 5033a6d0 System_Private_CoreLib!System.Threading.ThreadHelper.ThreadStart()$##6002636+0x2c
15dde9c4 5033a6d0 15ddea38 15ddebf4 50387b70 coreclr!MetaDataGetDispenser+0x3787f
15ddea64 504cbcd2 15ddeab0 00000000 00000000 coreclr+0x2a6d0
15ddeb3c 5042826d 15ddecc8 104f5b50 15ddec6c coreclr!MetaDataGetDispenser+0xe2c92
15ddeb54 5033a3a5 f762de38 00000001 15ddec6c coreclr!MetaDataGetDispenser+0x3f22d
15ddec00 5033a2f6 f762de5c 00000001 12781560 coreclr+0x2a3a5
15ddec64 5040b09b 00000001 00000000 00000001 coreclr+0x2a2f6
15ddec88 50375c0c 00000001 00000002 f762df3c coreclr!MetaDataGetDispenser+0x2205b
15dded04 50375b20 127fab48 00000000 00000000 coreclr!coreclr_shutdown_2+0x38c
15ddfc9c 74ee8484 1273fc78 74ee8460 42821e2e coreclr!coreclr_shutdown_2+0x2a0
15ddfcb0 77ec3ab8 1273fc78 2d5961f5 00000000 KERNEL32!BaseThreadInitThunk+0x24
15ddfcf8 77ec3a88 ffffffff 77edf31d 00000000 ntdll!__RtlUserThreadStart+0x2f
15ddfd08 00000000 50375ad0 1273fc78 00000000 ntdll!_RtlUserThreadStart+0x1b

@filipnavara
Copy link
Member

filipnavara commented Feb 25, 2019

I see quite a different managed stack trace in the dump you provided:

00b7e400 7771ae8c [HelperMethodFrame_1OBJ: 00b7e400] System.Threading.Monitor.ObjWait(Boolean, Int32, System.Object)
00b7e480 59db9604 System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken)
00b7e4d4 59d9062a System.Threading.Tasks.Task.SpinThenBlockingWait(Int32, System.Threading.CancellationToken)
00b7e510 59d90500 System.Threading.Tasks.Task.InternalWaitCore(Int32, System.Threading.CancellationToken)
00b7e53c 59dc76d9 System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
00b7e548 07011159 Tryouts.Program.(System.String[])

The unmanaged one is also quite different too and ends in:

 # ChildEBP RetAddr  Args to Child              
00 00b7e03c 76f15983 00000001 00e7ec30 00000001 ntdll!NtWaitForMultipleObjects+0xc
01 00b7e1d0 5a254f2f 00000001 00e7ec30 00000000 KERNELBASE!WaitForMultipleObjectsEx+0x133
02 (Inline) -------- -------- -------- -------- coreclr!WaitForMultipleObjectsEx_SO_TOLERANT+0x13 [e:\a\_work\335\s\src\vm\threads.cpp @ 3462] 
03 00b7e1f0 5a254ca8 00000001 00e7ec30 00000000 coreclr!Thread::DoAppropriateAptStateWait+0x39 [e:\a\_work\335\s\src\vm\threads.cpp @ 3496] 
04 00b7e280 5a254a07 00000001 00e7ec30 00000000 coreclr!Thread::DoAppropriateWaitWorker+0x248 [e:\a\_work\335\s\src\vm\threads.cpp @ 3633] 
05 00b7e300 5a25497c 00000001 00e7ec30 00000000 coreclr!Thread::DoAppropriateWait+0x79 [e:\a\_work\335\s\src\vm\threads.cpp @ 3307] 
06 00b7e324 5a2d8efb ffffffff 00000001 00b7e3a8 coreclr!CLREventBase::WaitEx+0x39 [e:\a\_work\335\s\src\vm\synch.cpp @ 486] 
07 00b7e33c 5a254226 ffffffff 00000001 00b7e3a8 coreclr!CLREventBase::Wait+0x1a [e:\a\_work\335\s\src\vm\synch.cpp @ 427] 
08 (Inline) -------- -------- -------- -------- coreclr!Thread::Wait+0x19 [e:\a\_work\335\s\src\vm\threads.cpp @ 4113] 
09 (Inline) -------- -------- -------- -------- coreclr!Thread::Block+0x22 [e:\a\_work\335\s\src\vm\threads.cpp @ 4070] 
0a 00b7e3d8 5a254079 ffffffff 00e7eb18 2c559a49 coreclr!SyncBlock::Wait+0x173 [e:\a\_work\335\s\src\vm\syncblk.cpp @ 3376] 
0b (Inline) -------- -------- -------- -------- coreclr!ObjHeader::Wait+0x23 [e:\a\_work\335\s\src\vm\syncblk.cpp @ 2789] 
0c (Inline) -------- -------- -------- -------- coreclr!Object::Wait+0x23 [e:\a\_work\335\s\src\vm\object.h @ 391] 
0d 00b7e474 59db9604 0f28f38c 051867d0 051867bc coreclr!ObjectNative::WaitTimeout+0x89 [e:\a\_work\335\s\src\classlibnative\bcltype\objectnative.cpp @ 297] 
0e 00b7e4c8 59d9062a 00000000 0f28f378 051867bc System_Private_CoreLib+0x589604
0f 00b7e504 59d90500 00000000 00b7e5c0 04e38b60 System_Private_CoreLib+0x56062a
10 00b7e530 59dc76d9 00000000 00b7e560 00b7e548 System_Private_CoreLib+0x560500
11 00b7e540 07011159 00b7e554 5a33b0ef 00b7e610 System_Private_CoreLib+0x5976d9

@ayende
Copy link
Contributor Author

ayende commented Feb 25, 2019

Check thread 35, which is what WinDBG stopped on when I was doing live debugging for this.

0:000> ~35s
eax=00000000 ebx=00e527a0 ecx=00000010 edx=15554140 esi=5a339280 edi=00e527a0
eip=5a29aa9b esp=0e60ce44 ebp=0e60d0a0 iopl=0         nv up ei pl zr na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010246
coreclr!VolatileLoad+0x3 [inlined in coreclr!VirtualCallStubManager::ResolveWorker+0x6b]:
5a29aa9b 8b4114          mov     eax,dword ptr [ecx+14h] ds:002b:00000024=????????
0:035> k
 # ChildEBP RetAddr  
00 (Inline) -------- coreclr!VolatileLoad+0x3 [e:\a\_work\335\s\src\inc\volatile.h @ 153] 
01 (Inline) -------- coreclr!Volatile<unsigned long>::Load+0x3 [e:\a\_work\335\s\src\inc\volatile.h @ 292] 
02 (Inline) -------- coreclr!Volatile<unsigned long>::operator unsigned long+0x3 [e:\a\_work\335\s\src\inc\volatile.h @ 346] 
03 (Inline) -------- coreclr!RelativePointer<Module *>::GetValue+0x3 [e:\a\_work\335\s\src\inc\fixuppointer.h @ 68] 
04 (Inline) -------- coreclr!RelativePointer<Module *>::GetValueAtPtr+0x3 [e:\a\_work\335\s\src\inc\fixuppointer.h @ 85] 
05 (Inline) -------- coreclr!ReadPointer+0x3 [e:\a\_work\335\s\src\inc\fixuppointer.h @ 954] 
06 (Inline) -------- coreclr!ReadPointer+0x3 [e:\a\_work\335\s\src\inc\fixuppointer.h @ 954] 
07 (Inline) -------- coreclr!MethodTable::GetLoaderModule+0x3 [e:\a\_work\335\s\src\vm\methodtable.inl @ 176] 
08 (Inline) -------- coreclr!MethodTable::GetLoaderAllocator+0x3 [e:\a\_work\335\s\src\vm\methodtable.inl @ 182] 
09 0e60d0a0 5a299c42 coreclr!VirtualCallStubManager::ResolveWorker+0x6b [e:\a\_work\335\s\src\vm\virtualcallstub.cpp @ 1719] 
0a 0e60d168 5a34a26b coreclr!VSD_ResolveWorker+0x24f [e:\a\_work\335\s\src\vm\virtualcallstub.cpp @ 1611] 
0b 0e60d190 0e995fca coreclr!ResolveWorkerAsmStub+0x1b [e:\a\_work\335\s\src\vm\i386\virtualcallstubcpu.hpp @ 525] 
WARNING: Frame IP not in any known module. Following frames may be wrong.
0c 0e60d194 5a33afeb 0xe995fca
0d 0e60d1a8 5a47fc7f coreclr!CallJitEHFinallyHelper+0x28 [E:\A\_work\335\s\src\vm\i386\asmhelpers.asm @ 390] 
0e 0e60d200 5a3a2acc coreclr!CallJitEHFinally+0xb8 [e:\a\_work\335\s\src\vm\i386\excepx86.cpp @ 3385] 
0f 0e60d2d0 5a249cf5 coreclr!COMPlusUnwindCallback+0x15a70c [e:\a\_work\335\s\src\vm\i386\excepx86.cpp @ 2996] 
10 (Inline) -------- coreclr!Thread::MakeStackwalkerCallback+0x151 [e:\a\_work\335\s\src\vm\stackwalk.cpp @ 877] 
11 0e60d59c 5a2525c1 coreclr!Thread::StackWalkFramesEx+0x1d4 [e:\a\_work\335\s\src\vm\stackwalk.cpp @ 958] 
12 0e60d8d0 5a251f60 coreclr!Thread::StackWalkFrames+0xa1 [e:\a\_work\335\s\src\vm\stackwalk.cpp @ 1042] 
13 0e60d8f0 5a252a52 coreclr!UnwindFrames+0x3e [e:\a\_work\335\s\src\vm\excep.cpp @ 2228] 
14 (Inline) -------- coreclr!COMPlusAfterUnwind+0x98 [e:\a\_work\335\s\src\vm\i386\excepx86.cpp @ 482] 
15 0e60db48 5a2512fa coreclr!CPFH_RealFirstPassHandler+0x459 [e:\a\_work\335\s\src\vm\i386\excepx86.cpp @ 1263] 
16 0e60db88 5a250ef3 coreclr!CPFH_FirstPassHandler+0xc3 [e:\a\_work\335\s\src\vm\i386\excepx86.cpp @ 1401] 

@filipnavara
Copy link
Member

Weird, usually the debugger switches me to correct thread after !analyze -v. Oh well, at least I see the same thing now.

@benaadams
Copy link
Member

/cc @kouvel @janvorli

@janvorli
Copy link
Member

I am taking a look.

@janvorli janvorli self-assigned this Feb 25, 2019
@janvorli
Copy link
Member

@ayende I assume you have used 64 bit windbg to open the dump, right? It you open it with the wow64 version of windbg, then the call stack shown is quite different. The frame with unknown IP address is gone and there is no EH. While it is strange that the call stack differs this way, the one showed with Wow64 (which is the right one to use for debugging 32 bit code) matches your managed stack dump, as you can see below.
Looking at locals in VSD_ResolveWorker, I can see that the ECX in the TransitionBlock, which should be the "this" reference, points to a garbage. The first 4 bytes should be a MethodTable pointer, but it contains 0x10. And that is causing the crash, as we try to extract LoaderAllocator from that MethodTable.
The caller of VSD_ResolveWorker, the ResolveWorkerAsmStub just stores the ECX passed in by the managed caller into the TransitionBlock.
So it looks like a GC hole to me. I am looking into it further.

 # ChildEBP RetAddr  
00 (Inline) -------- coreclr!VolatileLoad+0x3 [e:\a\_work\335\s\src\inc\volatile.h @ 153] 
01 (Inline) -------- coreclr!Volatile<unsigned long>::Load+0x3 [e:\a\_work\335\s\src\inc\volatile.h @ 292] 
02 (Inline) -------- coreclr!Volatile<unsigned long>::operator unsigned long+0x3 [e:\a\_work\335\s\src\inc\volatile.h @ 346] 
03 (Inline) -------- coreclr!RelativePointer<Module *>::GetValue+0x3 [e:\a\_work\335\s\src\inc\fixuppointer.h @ 68] 
04 (Inline) -------- coreclr!RelativePointer<Module *>::GetValueAtPtr+0x3 [e:\a\_work\335\s\src\inc\fixuppointer.h @ 85] 
05 (Inline) -------- coreclr!ReadPointer+0x3 [e:\a\_work\335\s\src\inc\fixuppointer.h @ 954] 
06 (Inline) -------- coreclr!ReadPointer+0x3 [e:\a\_work\335\s\src\inc\fixuppointer.h @ 954] 
07 (Inline) -------- coreclr!MethodTable::GetLoaderModule+0x3 [e:\a\_work\335\s\src\vm\methodtable.inl @ 176] 
08 (Inline) -------- coreclr!MethodTable::GetLoaderAllocator+0x3 [e:\a\_work\335\s\src\vm\methodtable.inl @ 182] 
09 0e60d0a0 5a299c42 coreclr!VirtualCallStubManager::ResolveWorker+0x6b [e:\a\_work\335\s\src\vm\virtualcallstub.cpp @ 1719] 
0a 0e60d168 5a34a26b coreclr!VSD_ResolveWorker+0x24f [e:\a\_work\335\s\src\vm\virtualcallstub.cpp @ 1611] 
0b 0e60d190 0e995fca coreclr!ResolveWorkerAsmStub+0x1b [e:\a\_work\335\s\src\vm\i386\virtualcallstubcpu.hpp @ 525] 
0c 0e60d190 0e9934e6 Raven_Server!Raven.Server.Rachis.FollowerAmbassador.Run()+0x11da
0d 0e60e5b0 0c3c8000 Raven_Server!Raven.Server.Rachis.FollowerAmbassador.<Start>b__58_0(System.Object)+0x6
0e 0e60e5b0 0c3c7770 Raven_Server!Raven.Server.Utils.PoolOfThreads+PooledThread.DoWork()+0xb8
0f 0e60e5d8 593868dd Raven_Server!Raven.Server.Utils.PoolOfThreads+PooledThread.Run()+0xc8
10 0e60e5f4 59cfc00c System_Threading_Thread!System.Threading.Thread.ThreadMain_ThreadStart()+0x35
11 0e60e5f4 59cfa31d System_Private_CoreLib!System.Threading.ThreadHelper.ThreadStart_Context(System.Object)+0x5c
12 0e60e624 59cfc0cc System_Private_CoreLib!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)+0x65
13 0e60e638 5a33b0ef System_Private_CoreLib!System.Threading.ThreadHelper.ThreadStart()+0x2c
14 0e60e644 5a26fbf1 coreclr!CallDescrWorkerInternal+0x34 [E:\A\_work\335\s\src\vm\i386\asmhelpers.asm @ 618] 
15 (Inline) -------- coreclr!CallDescrWorkerWithHandler+0x52 [e:\a\_work\335\s\src\vm\callhelpers.cpp @ 78] 
16 0e60e6e4 5a32d4e4 coreclr!MethodDescCallSite::CallTargetWorker+0x235 [e:\a\_work\335\s\src\vm\callhelpers.cpp @ 620] 
17 0e60e7bc 5a342b06 coreclr!ThreadNative::KickOffThread_Worker+0x104 [e:\a\_work\335\s\src\vm\comsynchronizable.cpp @ 260] 
18 0e60e7d4 5a26f86a coreclr!ManagedThreadBase_DispatchInner+0x70 [e:\a\_work\335\s\src\vm\threads.cpp @ 8852] 
19 0e60e880 5a26f7bb coreclr!ManagedThreadBase_DispatchMiddle+0x65 [e:\a\_work\335\s\src\vm\threads.cpp @ 8902] 
1a 0e60e8e4 5a3352b9 coreclr!ManagedThreadBase_DispatchOuter+0x78 [e:\a\_work\335\s\src\vm\threads.cpp @ 9161] 
1b 0e60e908 5a2d9e7c coreclr!ManagedThreadBase_FullTransitionWithAD+0x2f [e:\a\_work\335\s\src\vm\threads.cpp @ 9200] 
1c (Inline) -------- coreclr!ManagedThreadBase::KickOff+0x15 [e:\a\_work\335\s\src\vm\threads.cpp @ 9234] 
1d 0e60e984 5a2d9d90 coreclr!ThreadNative::KickOffThread+0xcc [e:\a\_work\335\s\src\vm\comsynchronizable.cpp @ 380] 
1e 0e60f824 76388484 coreclr!Thread::intermediateThreadProc+0x50 [e:\a\_work\335\s\src\vm\threads.cpp @ 2255] 
1f 0e60f838 77713ab8 kernel32!BaseThreadInitThunk+0x24 [base\win32\client\thread.c @ 64] 
20 0e60f880 77713a88 ntdll!__RtlUserThreadStart+0x2f [minkernel\ntdll\rtlstrt.c @ 998] 
21 0e60f890 00000000 ntdll!_RtlUserThreadStart+0x1b [minkernel\ntdll\rtlstrt.c @ 915]

@ayende
Copy link
Contributor Author

ayende commented Feb 25, 2019

If this helps, it is fairly easy to reproduce.
The code is here: https://github.com/ravendb/ravendb with commit hash 6aa5833c5c6470f714d5432b10299a02d8d11639.

You can go to tests/Tryout folder and just execute dotnet run in x86 and it will fail in the first 100 iterations or so in most cases

@janvorli
Copy link
Member

That's great, I was just about to ask if the code is open source.

@janvorli
Copy link
Member

I've collected the following additional details:

The call at offset 0x11D4 - address 0e995fc4 (return address at offset 0x11DA - 0e995fca) is passed an object reference in ECX which is extracted from [ebp-134h] right before the call. The call site is in a catch handler and the [ebp-134h] is initialized before the respective try.

@janvorli
Copy link
Member

@ayende I've tried to repro the issue, but the commit doesn't exist (I've found another one that was result of merging in the other commit) and so that's what I've checked out: ravendb/ravendb#8784.

The dotnet run fails to build with:

obj\Debug\netcoreapp2.1\Raven.Server.AssemblyInfo.cs(10,47): error CS0234: The type or namespace name 'UserSecrets' does not exist in the namespace 'Microsoft.Extensions.Configuration' (are you missing an assembly reference?) [D:\git\ravendb\src\Raven.Server\Raven.Server.csproj]

The build failed. Please fix the build errors and run again.

@ayende
Copy link
Contributor Author

ayende commented Feb 26, 2019

Can you try f04980d83c3cf83e5b8ac4306ea1c15765df0237 ?
The command I run is: & 'C:\Program Files (x86)\dotnet\dotnet.exe' run inside the test\Tryouts folder.
You might need to run the build.ps1 folder from the root folder

@janvorli
Copy link
Member

Somehow the 3.0 SDK was failing to build it even though it should be able to target older SDKs. Now I was able to build it successfuly, but running it fails:

Unhandled Exception: System.AggregateException: One or more errors occurred. (Failed to read license from 'D:\git\ravendb\test\Tryouts\bin\Debug\netcoreapp2.1\license.json' path.) ---> Raven.Client.Exceptions.Commercial.LicenseActivationException: Failed to read license from 'D:\git\ravendb\test\Tryouts\bin\Debug\netcoreapp2.1\license.json' path. ---> System.IO.FileNotFoundException: Could not find file 'D:\git\ravendb\test\Tryouts\bin\Debug\netcoreapp2.1\license.json'.

@ayende
Copy link
Contributor Author

ayende commented Feb 26, 2019

Sorry about that. Forgot that you need it.
You can fix this by first running this command:

$env:RAVEN_License='{"Id":"933260e7-a7f0-47da-981d-bdd5662ce3fb","Name":"Testing McTester","Keys":["A+8f8x3pAiwfk2xmk4F7DBqMa","B6hyAebFwYY1PmFMG5UXlINTd","WY/rIk9ex/eTQUpqo4xMCP8SE","1Axf9wywplVih6qPyYHewBw8v","PU+l4DTpVcqHhM3qKOWO+mfvQ","NrjBU6ahBt9xjOU9tog1HDs51","xYhq2MAVWD/Gm6E/DJZShABUE","oBSYoSQMqKywtLi8wJzFDJEQJ","Yl1Q"]}'

Please note that I'm running this on CoreCLR 2.1.8, not 3.0

We haven't verified this issue on other versions of the runtime.

@janvorli
Copy link
Member

Please note that I'm running this on CoreCLR 2.1.8, not 3.0

I know. But it should be possible to build using 3.0 targetting older versions - it should just honor the versions specified in the project files.

@janvorli
Copy link
Member

Ok, I was able to start the test now, but it is getting one exception at the beginning and then some timeout exceptions:

0
        To attach debugger to test process (x86), use proc-id: 6996.
Max number of concurrent tests is: 4

Unhandled Exception: System.AggregateException: One or more errors occurred. (Tried to send 'DeleteDatabaseCommand' request via `DELETE /admin/databases` to all configured nodes in the topology, none of the attempt succeeded.
I was able to fetch  topology from http://127.0.0.1:51592.
Topology:
[Url: http://127.0.0.1:51592, ClusterTag: A, ServerRole: None, Exception: System.TimeoutException: Waited for 00:00:15 but didn't get index notification for 36. Last commit index is: 34. Number of errors is: 0.
Index: 2. Type: PutLicenseCommand. ExecutionTime: 00:00:00.0048047. Term: 1. LeaderErrorCount: 0. LeaderShipDuration: 464. Exception:
Index: 3. Type: PutLicenseLimitsCommand. ExecutionTime: 00:00:00.1255138. Term: 1. LeaderErrorCount: 0. LeaderShipDuration: 634. Exception:
Index: 5. Type: PutLicenseLimitsCommand. ExecutionTime: 00:00:00.0029094. Term: 1. LeaderErrorCount: 0. LeaderShipDuration: 990. Exception:
Index: 8. Type: PutLicenseLimitsCommand. ExecutionTime: 00:00:00.0011608. Term: 1. LeaderErrorCount: 0. LeaderShipDuration: 1350. Exception:
Index: 11. Type: PutLicenseLimitsCommand. ExecutionTime: 00:00:00.0026476. Term: 1. LeaderErrorCount: 0. LeaderShipDuration: 1486. Exception:
Index: 14. Type: PutLicenseLimitsCommand. ExecutionTime: 00:00:00.0168839. Term: 1. LeaderErrorCount: 0. LeaderShipDuration: 1701. Exception:
Index: 16. Type: AddDatabaseCommand. ExecutionTime: 00:00:00.0157038. Term: 1. LeaderErrorCount: 0. LeaderShipDuration: 2234. Exception:
Index: 18. Type: ClusterTransactionCommand. ExecutionTime: 00:00:00.0054394. Term: 3. LeaderErrorCount: 0. LeaderShipDuration: 546. Exception:
Index: 20. Type: ClusterTransactionCommand. ExecutionTime: 00:00:00.0011850. Term: . LeaderErrorCount: . LeaderShipDuration: . Exception:
Index: 23. Type: ClusterTransactionCommand. ExecutionTime: 00:00:00.0012422. Term: . LeaderErrorCount: . LeaderShipDuration: . Exception:
Index: 24. Type: ClusterTransactionCommand. ExecutionTime: 00:00:00.0014008. Term: . LeaderErrorCount: . LeaderShipDuration: . Exception:
Index: 27. Type: ClusterTransactionCommand. ExecutionTime: 00:00:00.0012933. Term: . LeaderErrorCount: . LeaderShipDuration: . Exception:
Index: 29. Type: ClusterTransactionCommand. ExecutionTime: 00:00:00.0014345. Term: 11. LeaderErrorCount: 0. LeaderShipDuration: 250. Exception:
Index: 30. Type: ClusterTransactionCommand. ExecutionTime: 00:00:00.0015241. Term: 11. LeaderErrorCount: 0. LeaderShipDuration: 474. Exception:
Index: 32. Type: ClusterTransactionCommand. ExecutionTime: 00:00:00.0013954. Term: . LeaderErrorCount: . LeaderShipDuration: . Exception:
Index: 34. Type: ClusterTransactionCommand. ExecutionTime: 00:00:00.0012890. Term: . LeaderErrorCount: . LeaderShipDuration: . Exception:
Index: 36. Type: DeleteDatabaseCommand. ExecutionTime: 00:00:00.0016575. Term: . LeaderErrorCount: . LeaderShipDuration: . Exception:

   at Raven.Server.ServerWide.RachisLogIndexNotifications.ThrowTimeoutException(TimeSpan value, Int64 index, Int64 lastModifiedIndex) in D:\git\ravendb\src\Raven.Server\ServerWide\ClusterStateMachine.cs:line 1867
   at Raven.Server.ServerWide.RachisLogIndexNotifications.WaitForIndexNotification(Int64 index, TimeSpan timeout) in D:\git\ravendb\src\Raven.Server\ServerWide\ClusterStateMachine.cs:line 1851
   at Raven.Server.ServerWide.ClusterStateMachine.WaitForIndexNotification(Int64 index, Nullable`1 timeout) in D:\git\ravendb\src\Raven.Server\ServerWide\ClusterStateMachine.cs:line 677
   at Raven.Server.Web.System.AdminDatabasesHandler.Delete() in D:\git\ravendb\src\Raven.Server\Web\System\AdminDatabasesHandler.cs:line 628
   at Raven.Server.Routing.RequestRouter.HandlePath(RequestHandlerContext reqCtx) in D:\git\ravendb\src\Raven.Server\Routing\RequestRouter.cs:line 129
   at Raven.Server.RavenServerStartup.RequestHandler(HttpContext context) in D:\git\ravendb\src\Raven.Server\RavenServerStartup.cs:line 173
The server at /admin/databases responded with status code: RequestTimeout. -> System.TimeoutException: Waited for 00:00:15 but didn't get index notification for 36. Last commit index is: 34. Number of errors is: 0.
Index: 2. Type: PutLicenseCommand. ExecutionTime: 00:00:00.0048047. Term: 1. LeaderErrorCount: 0. LeaderShipDuration: 464. Exception:
Index: 3. Type: PutLicenseLimitsCommand. ExecutionTime: 00:00:00.1255138. Term: 1. LeaderErrorCount: 0. LeaderShipDuration: 634. Exception:
Index: 5. Type: PutLicenseLimitsCommand. ExecutionTime: 00:00:00.0029094. Term: 1. LeaderErrorCount: 0. LeaderShipDuration: 990. Exception:
Index: 8. Type: PutLicenseLimitsCommand. ExecutionTime: 00:00:00.0011608. Term: 1. LeaderErrorCount: 0. LeaderShipDuration: 1350. Exception:
Index: 11. Type: PutLicenseLimitsCommand. ExecutionTime: 00:00:00.0026476. Term: 1. LeaderErrorCount: 0. LeaderShipDuration: 1486. Exception:
Index: 14. Type: PutLicenseLimitsCommand. ExecutionTime: 00:00:00.0168839. Term: 1. LeaderErrorCount: 0. LeaderShipDuration: 1701. Exception:
Index: 16. Type: AddDatabaseCommand. ExecutionTime: 00:00:00.0157038. Term: 1. LeaderErrorCount: 0. LeaderShipDuration: 2234. Exception:
Index: 18. Type: ClusterTransactionCommand. ExecutionTime: 00:00:00.0054394. Term: 3. LeaderErrorCount: 0. LeaderShipDuration: 546. Exception:
Index: 20. Type: ClusterTransactionCommand. ExecutionTime: 00:00:00.0011850. Term: . LeaderErrorCount: . LeaderShipDuration: . Exception:
Index: 23. Type: ClusterTransactionCommand. ExecutionTime: 00:00:00.0012422. Term: . LeaderErrorCount: . LeaderShipDuration: . Exception:
Index: 24. Type: ClusterTransactionCommand. ExecutionTime: 00:00:00.0014008. Term: . LeaderErrorCount: . LeaderShipDuration: . Exception:
Index: 27. Type: ClusterTransactionCommand. ExecutionTime: 00:00:00.0012933. Term: . LeaderErrorCount: . LeaderShipDuration: . Exception:
Index: 29. Type: ClusterTransactionCommand. ExecutionTime: 00:00:00.0014345. Term: 11. LeaderErrorCount: 0. LeaderShipDuration: 250. Exception:
Index: 30. Type: ClusterTransactionCommand. ExecutionTime: 00:00:00.0015241. Term: 11. LeaderErrorCount: 0. LeaderShipDuration: 474. Exception:
Index: 32. Type: ClusterTransactionCommand. ExecutionTime: 00:00:00.0013954. Term: . LeaderErrorCount: . LeaderShipDuration: . Exception:
Index: 34. Type: ClusterTransactionCommand. ExecutionTime: 00:00:00.0012890. Term: . LeaderErrorCount: . LeaderShipDuration: . Exception:
Index: 36. Type: DeleteDatabaseCommand. ExecutionTime: 00:00:00.0016575. Term: . LeaderErrorCount: . LeaderShipDuration: . Exception:

   at Raven.Server.ServerWide.RachisLogIndexNotifications.ThrowTimeoutException(TimeSpan value, Int64 index, Int64 lastModifiedIndex) in D:\git\ravendb\src\Raven.Server\ServerWide\ClusterStateMachine.cs:line 1867
   at Raven.Server.ServerWide.RachisLogIndexNotifications.WaitForIndexNotification(Int64 index, TimeSpan timeout) in D:\git\ravendb\src\Raven.Server\ServerWide\ClusterStateMachine.cs:line 1851
   at Raven.Server.ServerWide.ClusterStateMachine.WaitForIndexNotification(Int64 index, Nullable`1 timeout) in D:\git\ravendb\src\Raven.Server\ServerWide\ClusterStateMachine.cs:line 677
   at Raven.Server.Web.System.AdminDatabasesHandler.Delete() in D:\git\ravendb\src\Raven.Server\Web\System\AdminDatabasesHandler.cs:line 628
   at Raven.Server.Routing.RequestRouter.HandlePath(RequestHandlerContext reqCtx) in D:\git\ravendb\src\Raven.Server\Routing\RequestRouter.cs:line 129
   at Raven.Server.RavenServerStartup.RequestHandler(HttpContext context) in D:\git\ravendb\src\Raven.Server\RavenServerStartup.cs:line 173
The server at /admin/databases responded with status code: RequestTimeout.]

.... etc

@ayende
Copy link
Contributor Author

ayende commented Feb 26, 2019

Are you running this inside a debugger, or on some really slow I/O?

@janvorli
Copy link
Member

No, without a debugger on my main dev box (Intel XEON E5530, 8 cores, 24GB RAM). Only the disk is not a SSD. Let me try to run it from a SSD drive.

@janvorli
Copy link
Member

Oh, I've found that I was running some heavy tests in the background, I've forgotten about it. Now it doesn't timeout anymore. I am sorry for the confusion.

@ayende
Copy link
Contributor Author

ayende commented Feb 26, 2019

This test is spinning 5 servers internally which may hit the disk. If you have high disk activity and non SSD disk, that can certainly explain it. Great that this works now.

On my end, in 64 bits, it just works. In 32 bits, it fails after about 50 runs

@janvorli
Copy link
Member

Hmm, it seems I was just lucky before. That test run has crashed after 4 iterations with stack overflow though. Attempt to re-run the test got the timeout again. So it seems I am really at an edge. Is there a way to increase that timeout? I would like to run it with GC stress enabled to trigger the issue more reliably and that will slow everything down a lot.

@ayende
Copy link
Contributor Author

ayende commented Feb 26, 2019

Try this:

$env:Raven.Cluster.OperationTimeoutInSec=300
$env:Raven.Cluster.ElectionTimeoutInMs=5000

By the way, nothing in this test should cause StackOverlow, I also run into that error (but couldn't reproduce in WinDBG so can't pull a stack trace).

I'm assuming that this is another instance of: https://github.com/dotnet/coreclr/issues/22597
but this this when you have the access violation on a stack pointer, which result in a different error.

@janvorli
Copy link
Member

I have forgotten to ask - are you hitting this issue with both Debug and Release configurations?

@benaadams
Copy link
Member

By the way, nothing in this test should cause StackOverlow, I also run into that error (but couldn't reproduce in WinDBG so can't pull a stack trace).

If you are throwing exceptions in (reasonably deep) async it could also be dotnet/roslyn#26567

@AndyAyersMS
Copy link
Member

My impression is that the issue behind #12038 will only cause spurious AVs in a short window just after calling VirtualAlloc (as the 2.1 jit does when it needs more memory) on that particular osver/kb combination. I would not expect it to lead to a stack overflow.

@ayende
Copy link
Contributor Author

ayende commented Feb 26, 2019

@janvorli I'm sorry, I forgot to mention that. I'm hitting that in Release configuration.
Now that I think about it, the stack overflow I got was when I run it in Debug configuration,not Release

@benaadams We do have a lot of async calls, yes.

@AndyAyersMS I'm not sure what the root cause is for this, but wouldn't calling VirtualAlloc to commit additional stack space and then getting AV on that cause SO?

@AndyAyersMS
Copy link
Member

wouldn't calling VirtualAlloc to commit additional stack space and then getting AV on that cause SO?

VirtualAlloc'd memory and stack are in different parts of virtual address space. Stack overflow is usually triggered by accessing a reserved guard page at end of stack. So, no, not likely they'd interfere with one another.

@janvorli
Copy link
Member

@ayende with COMPlus_GCStress=4, I've hit an interesting exception that is probably unrelated to the issue we are hunting, but that might indicate a bug in your code. So I am sharing it in case you'd like to double check that:

0
        To attach debugger to test process (x86), use proc-id: 25716.
Max number of concurrent tests is: 4

Unhandled Exception: System.AggregateException: One or more errors occurred. (Cannot access a disposed object.
Object name: 'Leader'.) ---> Sparrow.Utils.LockAlreadyDisposedException: Cannot access a disposed object.
Object name: 'Leader'.
   at Sparrow.Utils.DisposeLock.ThrowDisposed() in E:\issues\ayende\ravendb\src\Sparrow\Utils\DisposeLock.cs:line 75
   at Sparrow.Utils.DisposeLock.EnsureNotDisposed() in E:\issues\ayende\ravendb\src\Sparrow\Utils\DisposeLock.cs:line 60
   at Raven.Server.Rachis.Leader.TryModifyTopology(String nodeTag, String nodeUrl, TopologyModification modification, Task& task, Boolean validateNotInTopology, Action`1 beforeCommit) in E:\issues\ayende\ravendb\src\Raven.Server\Rachis\Leader.cs:line 890
   at Raven.Server.Rachis.RachisConsensus.ModifyTopologyAsync(String nodeTag, String nodeUrl, TopologyModification modification, Boolean validateNotInTopology) in E:\issues\ayende\ravendb\src\Raven.Server\Rachis\RachisConsensus.cs:line 1761
   at Raven.Client.Extensions.TaskExtensions.<>c.<WithCancellation>b__2_0(Task t) in E:\issues\ayende\ravendb\src\Raven.Client\Extensions\TaskExtensions.cs:line 28
   at System.Threading.Tasks.ContinuationTaskFromTask.InnerInvoke()
   at System.Threading.Tasks.Task.<>c.<.cctor>b__278_1(Object obj)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location where exception was thrown ---
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot)
--- End of stack trace from previous location where exception was thrown ---
   at Raven.Server.ServerWide.ServerStore.AddNodeToClusterAsync(String nodeUrl, String nodeTag, Boolean validateNotInTopology, Boolean asWatcher) in E:\issues\ayende\ravendb\src\Raven.Server\ServerWide\ServerStore.cs:line 418
   at Tests.Infrastructure.ClusterTestBase.CreateRaftCluster(Int32 numberOfNodes, Boolean shouldRunInMemory, Nullable`1 leaderIndex, Boolean useSsl, IDictionary`2 customSettings) in E:\issues\ayende\ravendb\test\Tests.Infrastructure\ClusterTestBase.cs:line 479
   at Tests.Infrastructure.ClusterTestBase.CreateRaftClusterAndGetLeader(Int32 numberOfNodes, Boolean shouldRunInMemory, Nullable`1 leaderIndex, Boolean useSsl, IDictionary`2 customSettings) in E:\issues\ayende\ravendb\test\Tests.Infrastructure\ClusterTestBase.cs:line 495
   at SlowTests.Cluster.ClusterTransactionTests.CanPreformSeveralClusterTransactions(Int32 numberOfNodes) in E:\issues\ayende\ravendb\test\SlowTests\Cluster\ClusterTransactionTests.cs:line 114
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at System.Threading.Tasks.Task.Wait()
   at Tryouts.Program.Main(String[] args) in E:\issues\ayende\ravendb\test\Tryouts\Program.cs:line 32

@janvorli
Copy link
Member

I keep getting the stack overflow even in Release builds. So I've tried to run it under WinDbg and I can see it getting constant stream of exceptions (I guess at least 10 per second or even more):

Exception type:   System.Net.Sockets.SocketException
Message:          A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
InnerException:   <none>

At the following call stack:
 # ChildEBP RetAddr  
00 1c0feafc 6da02d94 KERNELBASE!RaiseException+0x62 [minkernel\kernelbase\xcpt.c @ 938] 
01 1c0feba4 6dad48fb coreclr!RaiseTheExceptionInternalOnly+0x11d [e:\a\_work\335\s\src\vm\excep.cpp @ 3039] 
02 1c0fec6c 656c5954 coreclr!IL_Throw+0x11b [e:\a\_work\335\s\src\vm\jithelpers.cpp @ 4860] 
03 1c0fec9c 0c92a181 System_Net_Sockets!System.Net.Sockets.NetworkStream.Read(Byte[], Int32, Int32)+0xb0
04 1c0fed28 0c929a5e Sparrow!Sparrow.Json.JsonOperationContext.ParseToMemory(System.IO.Stream, System.String, UsageMode, ManagedPinnedBuffer, Sparrow.Json.IBlittableDocumentModifier)+0x1c1
05 1c0fed6c 0ca02b30 Sparrow!Sparrow.Json.JsonOperationContext.ParseToMemory(System.IO.Stream, System.String, UsageMode, Sparrow.Json.IBlittableDocumentModifier)+0x46
06 1c0fee40 0ca0262b Raven_Server!Raven.Server.ServerWide.Maintenance.ClusterMaintenanceSupervisor+ClusterNode.ListenToMaintenanceWorker()+0x490
07 1c0fee6c 0ca27390 Raven_Server!Raven.Server.ServerWide.Maintenance.ClusterMaintenanceSupervisor+ClusterNode.<Start>b__18_0(System.Object)+0x23
08 1c0feea0 0ca26e10 Raven_Server!Raven.Server.Utils.PoolOfThreads+PooledThread.DoWork()+0xb8
09 1c0feec8 6ae768dd Raven_Server!Raven.Server.Utils.PoolOfThreads+PooledThread.Run()+0xc8
0a 1c0feee4 6cedc00c System_Threading_Thread!System.Threading.Thread.ThreadMain_ThreadStart()+0x35
0b 1c0feee4 6ceda31d System_Private_CoreLib!System.Threading.ThreadHelper.ThreadStart_Context(System.Object)+0x5c
0c 1c0fef14 6cedc0cc System_Private_CoreLib!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)+0x65
0d 1c0fef28 6daeb0ef System_Private_CoreLib!System.Threading.ThreadHelper.ThreadStart()+0x2c
0e 1c0fef34 6da1fbf1 coreclr!CallDescrWorkerInternal+0x34 [E:\A\_work\335\s\src\vm\i386\asmhelpers.asm @ 618] 
0f (Inline) -------- coreclr!CallDescrWorkerWithHandler+0x52 [e:\a\_work\335\s\src\vm\callhelpers.cpp @ 78] 
10 1c0fefd4 6dadd4e4 coreclr!MethodDescCallSite::CallTargetWorker+0x235 [e:\a\_work\335\s\src\vm\callhelpers.cpp @ 620] 
11 1c0ff0ac 6daf2b06 coreclr!ThreadNative::KickOffThread_Worker+0x104 [e:\a\_work\335\s\src\vm\comsynchronizable.cpp @ 260] 
12 1c0ff0c4 6da1f86a coreclr!ManagedThreadBase_DispatchInner+0x70 [e:\a\_work\335\s\src\vm\threads.cpp @ 8852] 
13 1c0ff170 6da1f7bb coreclr!ManagedThreadBase_DispatchMiddle+0x65 [e:\a\_work\335\s\src\vm\threads.cpp @ 8902] 
14 1c0ff1d4 6dae52b9 coreclr!ManagedThreadBase_DispatchOuter+0x78 [e:\a\_work\335\s\src\vm\threads.cpp @ 9161] 
15 1c0ff1f8 6da89e7c coreclr!ManagedThreadBase_FullTransitionWithAD+0x2f [e:\a\_work\335\s\src\vm\threads.cpp @ 9200] 
16 (Inline) -------- coreclr!ManagedThreadBase::KickOff+0x15 [e:\a\_work\335\s\src\vm\threads.cpp @ 9234] 
17 1c0ff274 6da89d90 coreclr!ThreadNative::KickOffThread+0xcc [e:\a\_work\335\s\src\vm\comsynchronizable.cpp @ 380] 
18 1c0ffe8c 76f90179 coreclr!Thread::intermediateThreadProc+0x50 [e:\a\_work\335\s\src\vm\threads.cpp @ 2255] 
19 1c0ffe9c 771c662d KERNEL32!BaseThreadInitThunk+0x19 [base\win32\client\thread.c @ 64] 
1a 1c0ffef8 771c65fd ntdll!__RtlUserThreadStart+0x2f [minkernel\ntdll\rtlstrt.c @ 1163] 
1b 1c0fff08 00000000 ntdll!_RtlUserThreadStart+0x1b [minkernel\ntdll\rtlstrt.c @ 1080] 

@arekpalinski
Copy link

@janvorli
Copy link
Member

janvorli commented Mar 5, 2019

Ok, the VerifyHeap has shown there is quite a large scale heap corruption. Also, from the stress log, I can see that the last GC scan of the current thread was executed when the Voron.Data.Fixed.FixedSizeTree.AddEmbeddedEntry was not on the stack. So that rules out the GC hole in its GC info.
My guess is that something unsafe is corrupting memory.

We can get some more detail on that by running the app with env var COMPlus_HeapVerify=1. That results in checking the GC heap consistency before and after each GC. When it asserts due to the heap corruption, please share the stack trace and keep the dump or windbg session so that we can check whether the corruption happened between GCs or during the GC.

@ayende
Copy link
Contributor Author

ayende commented Mar 5, 2019

My main concern here is that this is running on Win 2019 server, and it crashes consistently.
The same code on a Win 2016 server works fine. There is this issue: https://github.com/dotnet/coreclr/issues/22597 , which I believe might be another manifestation of the problem.

@AndyAyersMS
Copy link
Member

My understanding is that #12038 will result in crashes but not heap corruption.

@AndyAyersMS
Copy link
Member

I'm evaluating a candidate fix now. It is not x86 specific, and while I haven't looked in depth yet, I think the issue it addresses would apply to all architectures... will update when I know more.

@janvorli
Copy link
Member

janvorli commented Mar 5, 2019

It is not x86 specific

The issue is x86 windows specific. We handle that differently on other architectures. The reason is that on other architectures / platforms, when we are running in the filter, we have the try block frame on the call stack too and so we walk it and report the locals from there. Thus we don't need to report them from the filter frame. It is not the case on x86 Windows due to the way how SEH works.

AndyAyersMS referenced this issue in AndyAyersMS/coreclr Mar 5, 2019
When a filter is finished executing, control can logically pass to the
associated handler, any enclosing handler or filter, or any finally or fault
handler nested within the associated try. This is a consequence of two-pass EH.

The jit was not propagating liveness from the nested handlers, which lead to a
live object being collected inadvertently.

This change updates `fgGetHandlerLiveVars` to find the nested handlers and
merge their live-in into the filter block live sets.

Because these implicit EH flow edges can create cycles in the liveness dataflow
equations, the jit will also now always iterate liveness when it sees there is
exception flow, to ensure livness reaches the appropriate fixed point.

Added test case.

Closes #22820.
AndyAyersMS referenced this issue in AndyAyersMS/coreclr Mar 5, 2019
When a filter is finished executing, control can logically pass to the
associated handler, any enclosing handler or filter, or any finally or fault
handler nested within the associated try. This is a consequence of two-pass EH.

The jit was not propagating liveness from the nested handlers, which lead to a
live object being collected inadvertently.

This change updates `fgGetHandlerLiveVars` to find the nested handlers and
merge their live-in into the filter block live sets.

Because these implicit EH flow edges can create cycles in the liveness dataflow
equations, the jit will also now always iterate liveness when it sees there is
exception flow, to ensure livness reaches the appropriate fixed point.

Added test case.

Closes #22820.
@AndyAyersMS
Copy link
Member

The issue is x86 windows specific.

Yes, I see. From the jit's internal standpoint the liveness computation was still wrong for filters on other architectures, but fixing that didn't impact jit codegen or gc info.

PR for the filter liveness fix is up: dotnet/coreclr#23044.

@AndyAyersMS
Copy link
Member

@janvorli @ayende @arekpalinski would be great if you could verify the fix if possible...

@janvorli
Copy link
Member

janvorli commented Mar 6, 2019

@AndyAyersMS I will.

@arekpalinski
Copy link

arekpalinski commented Mar 6, 2019

@AndyAyersMS
I've compiled CoreCLR 2.1.8 with dotnet/coreclr#23044 fix and tried it on our x64 instance (as I noticed this comment in your PR: dotnet/coreclr#23044 (comment)). It's still throwing AVE so it looks like what we experience on x64 is something different.

I've tried the same (CoreCLR 2.1.8 with dotnet/coreclr#23044) on x86 and run the original repro reported by @ayende here - no failure. Great!

@janvorli
The usage of COMPlus_HeapVerify=1 didn't make any difference - no assertion due to heap corruption. I got AVE - same stacktrace as before. Heap is corrupted. Any advice how to investigate further on x64?

@janvorli
Copy link
Member

janvorli commented Mar 6, 2019

@arekpalinski can you please try with COMPlus_HeapVerify=3?

@arekpalinski
Copy link

@janvorli We have a lead that heap corruption might be caused by change in our code. We're investigating that.

AndyAyersMS referenced this issue in dotnet/coreclr Mar 7, 2019
When a filter is finished executing, control can logically pass to the
associated handler, any enclosing handler or filter, or any finally or fault
handler nested within the associated try. This is a consequence of two-pass EH.

The jit was not propagating liveness from the nested handlers, which lead to a
live object being collected inadvertently.

This change updates `fgGetHandlerLiveVars` to find the nested handlers and
merge their live-in into the filter block live sets.

Because these implicit EH flow edges can create cycles in the liveness dataflow
equations, the jit will also now always iterate liveness when it sees there is
exception flow, to ensure livness reaches the appropriate fixed point.

Added test case.

Closes #22820.
@AndyAyersMS
Copy link
Member

x86 issue seems to be fixed.

Am going to re-open this until we've got more clarity on what is happening for x64.

@AndyAyersMS AndyAyersMS reopened this Mar 7, 2019
@ayende
Copy link
Contributor Author

ayende commented Mar 7, 2019

We are pretty sure that the x64 stuff is our fault and not related to this issue.
It just manifested in pretty much the same way and stack trace.

@arekpalinski
Copy link

We can confirm that we no longer experience the issue on x64. @janvorli Thanks for help in narrowing it down.

@AndyAyersMS
Copy link
Member

Thanks. Now keeping this open to track the (proposed) porting of this fix to 2.1.

AndyAyersMS referenced this issue in AndyAyersMS/coreclr Mar 8, 2019
Port of dotnet#23044 to release/2.1.

When a filter is finished executing, control can logically pass to the
associated handler, any enclosing handler or filter, or any finally or fault
handler nested within the associated try. This is a consequence of two-pass EH.

The jit was not propagating liveness from the nested handlers, which lead to a
live object being collected inadvertently.

This change updates `fgGetHandlerLiveVars` to find the nested handlers and
merge their live-in into the filter block live sets.

Because these implicit EH flow edges can create cycles in the liveness dataflow
equations, the jit will also now always iterate liveness when it sees there is
exception flow, to ensure livness reaches the appropriate fixed point.

Added test case.

Closes #22820.
@BruceForstall
Copy link
Member

@AndyAyersMS Should the milestone be changed to 2.1/2.2?

@AndyAyersMS
Copy link
Member

Makes sense, yes.

@RussKeldorph
Copy link
Contributor

@AndyAyersMS
Copy link
Member

Yes. Also merged to 2.2 via dotnet/coreclr#23256.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 2.1.x milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 14, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-x86 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI area-ExceptionHandling-coreclr bug os-windows
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants