Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tests hang due to MKL loading blocking all threads #1073

Closed
eerhardt opened this issue Sep 27, 2018 · 3 comments
Closed

Tests hang due to MKL loading blocking all threads #1073

eerhardt opened this issue Sep 27, 2018 · 3 comments
Labels
bug Something isn't working

Comments

@eerhardt
Copy link
Member

eerhardt commented Sep 27, 2018

We are seeing some tests hanging randomly in CI on Windows.

I was able to catch a few hangs (~10-15) on my machine as well. Every time the process was hung, it was during loading of MklImports, which was calling LoadLibraryA("libittnotify"), and then on a different thread MklImports was also on the stack. I am unable to get debugging symbols for MKL, but the two stacks look like this:

Thread 1, calling into MKL from ML.NET:

0:024> kn
 # Child-SP          RetAddr           Call Site
00 0000004f`828fb448 00007ffc`2af8c199 ntdll!NtWaitForSingleObject+0x14
01 0000004f`828fb450 00007ffc`2af80e3e ntdll!LdrpDrainWorkQueue+0x15d
02 0000004f`828fb490 00007ffc`2af84387 ntdll!LdrpLoadDllInternal+0xc2
03 0000004f`828fb510 00007ffc`2af8a724 ntdll!LdrpLoadDll+0x10b
04 0000004f`828fb6c0 00007ffc`27352a7b ntdll!LdrLoadDll+0xa4
05 0000004f`828fb7c0 00007ffc`273814f1 KERNELBASE!LoadLibraryExW+0x17b [minkernel\kernelbase\module.c @ 1207] 
06 0000004f`828fb830 00007ffc`273814a9 KERNELBASE!LoadLibraryExA+0x31 [minkernel\kernelbase\module.c @ 1249] 
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for MklImports.DLL - 
07 0000004f`828fb870 00007ffb`ba4b8854 KERNELBASE!LoadLibraryA+0x39 [minkernel\kernelbase\module.c @ 878] 
08 0000004f`828fb8a0 00007ffb`ba4b908b MklImports+0x8854
09 0000004f`828fb8e0 00007ffb`ba4c64f1 MklImports+0x908b
0a 0000004f`828fb9b0 00007ffb`ba4be343 MklImports!vslDeleteStream+0x4d91
0b 0000004f`828fbbc0 00007ffb`823903d3 MklImports!cblas_sgemm+0x213
0c 0000004f`828fbc60 00007ffb`8238f446 0x00007ffb`823903d3
0d 0000004f`828fbd80 00007ffb`8238d91a 0x00007ffb`8238f446
0e 0000004f`828fbf90 00007ffb`8238d330 0x00007ffb`8238d91a
0f 0000004f`828fc0a0 00007ffb`8238d0b3 0x00007ffb`8238d330
10 0000004f`828fc160 00007ffb`81e10837 0x00007ffb`8238d0b3
11 0000004f`828fc1e0 00007ffb`81e0f88e 0x00007ffb`81e10837
12 0000004f`828fc250 00007ffb`82389736 0x00007ffb`81e0f88e
13 0000004f`828fc520 00007ffb`e0063ba3 0x00007ffb`82389736
14 0000004f`828fc760 00007ffb`dff4f4fe coreclr!CallDescrWorkerInternal+0x83 [E:\A\_work\164\s\src\vm\amd64\CallDescrWorkerAMD64.asm @ 101] 
15 0000004f`828fc7a0 00007ffb`dff4f9fb coreclr!CallDescrWorkerReflectionWrapper+0x1a [e:\a\_work\164\s\src\vm\reflectioninvocation.cpp @ 740] 
16 0000004f`828fc7f0 00007ffb`d94ee791 coreclr!RuntimeMethodHandle::InvokeMethod+0x4bb [e:\a\_work\164\s\src\vm\reflectioninvocation.cpp @ 1355] 
17 0000004f`828fce30 00007ffb`816c173b System_Private_CoreLib!System.Reflection.RuntimeMethodInfo.Invoke(System.Object, System.Reflection.BindingFlags, System.Reflection.Binder, System.Object[], System.Globalization.CultureInfo)$##60034D3+0xb1
18 0000004f`828fcea0 00007ffb`816c1000 0x00007ffb`816c173b
19 0000004f`828fcee0 00007ffb`816c012f 0x00007ffb`816c1000
1a 0000004f`828fcf70 00007ffb`816bfdd7 0x00007ffb`816c012f
1b 0000004f`828fcff0 00007ffb`816bf9e4 0x00007ffb`816bfdd7
1c 0000004f`828fd060 00007ffb`816bf4f9 0x00007ffb`816bf9e4
1d 0000004f`828fd0d0 00007ffb`816bf47d 0x00007ffb`816bf4f9
1e 0000004f`828fd140 00007ffb`816bec15 0x00007ffb`816bf47d
1f 0000004f`828fd1b0 00007ffb`816be679 0x00007ffb`816bec15
20 0000004f`828fd210 00007ffb`816be31d 0x00007ffb`816be679
21 0000004f`828fd280 00007ffb`816bde59 0x00007ffb`816be31d
22 0000004f`828fd2e0 00007ffb`816bd01f 0x00007ffb`816bde59
23 0000004f`828fd3d0 00007ffb`816bccfc 0x00007ffb`816bd01f
24 0000004f`828fd450 00007ffb`816ba722 0x00007ffb`816bccfc
25 0000004f`828fd4d0 00007ffb`816ba2af 0x00007ffb`816ba722
26 0000004f`828fd5d0 00007ffb`816ba1b7 0x00007ffb`816ba2af
27 0000004f`828fd650 00007ffb`816b9f18 0x00007ffb`816ba1b7
28 0000004f`828fd6d0 00007ffb`816b9da9 0x00007ffb`816b9f18
29 0000004f`828fd780 00007ffb`816b9d2d 0x00007ffb`816b9da9
2a 0000004f`828fd7f0 00007ffb`816b91f6 0x00007ffb`816b9d2d
2b 0000004f`828fd850 00007ffb`816b8ff9 0x00007ffb`816b91f6
2c 0000004f`828fd8e0 00007ffb`816b8f7d 0x00007ffb`816b8ff9
2d 0000004f`828fd950 00007ffb`816b8cde 0x00007ffb`816b8f7d
2e 0000004f`828fd9c0 00007ffb`816b89bf 0x00007ffb`816b8cde
2f 0000004f`828fda30 00007ffb`816b88a9 0x00007ffb`816b89bf
30 0000004f`828fdab0 00007ffb`816b83d7 0x00007ffb`816b88a9
31 0000004f`828fdb30 00007ffb`816b78ef 0x00007ffb`816b83d7
32 0000004f`828fdc20 00007ffb`816b77e7 0x00007ffb`816b78ef
33 0000004f`828fdca0 00007ffb`816b629f 0x00007ffb`816b77e7
34 0000004f`828fdd20 00007ffb`816b5eff 0x00007ffb`816b629f
35 0000004f`828fdde0 00007ffb`816b5df7 0x00007ffb`816b5eff
36 0000004f`828fde60 00007ffb`816b58b5 0x00007ffb`816b5df7
37 0000004f`828fdee0 00007ffb`816b5518 0x00007ffb`816b58b5
38 0000004f`828fdf60 00007ffb`816b4eef 0x00007ffb`816b5518
39 0000004f`828fdfd0 00007ffb`816b4de7 0x00007ffb`816b4eef
3a 0000004f`828fe050 00007ffb`816b4879 0x00007ffb`816b4de7
3b 0000004f`828fe0d0 00007ffb`816b459f 0x00007ffb`816b4879
3c 0000004f`828fe180 00007ffb`816b4497 0x00007ffb`816b459f
3d 0000004f`828fe200 00007ffb`816b2917 0x00007ffb`816b4497
3e 0000004f`828fe270 00007ffb`816b211f 0x00007ffb`816b2917
3f 0000004f`828fe300 00007ffb`816b2017 0x00007ffb`816b211f
40 0000004f`828fe380 00007ffb`816b0b29 0x00007ffb`816b2017
41 0000004f`828fe410 00007ffb`816b039f 0x00007ffb`816b0b29
42 0000004f`828fe4c0 00007ffb`816b0297 0x00007ffb`816b039f
43 0000004f`828fe540 00007ffb`816afa66 0x00007ffb`816b0297
44 0000004f`828fe5c0 00007ffb`816af6ef 0x00007ffb`816afa66
45 0000004f`828fe650 00007ffb`816af5e7 0x00007ffb`816af6ef
46 0000004f`828fe6d0 00007ffb`816ade30 0x00007ffb`816af5e7
47 0000004f`828fe750 00007ffb`816ad61f 0x00007ffb`816ade30
48 0000004f`828fe800 00007ffb`816ad427 0x00007ffb`816ad61f
49 0000004f`828fe880 00007ffb`816acb85 0x00007ffb`816ad427
4a 0000004f`828fe900 00007ffb`d9549e3c 0x00007ffb`816acb85
4b 0000004f`828fe930 00007ffb`d945cda9 System_Private_CoreLib!System.Threading.Tasks.Task`1[System.__Canon].InnerInvoke()$##600286D+0x4c
4c 0000004f`828fe980 00007ffb`d950f8e7 System_Private_CoreLib!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)$##60027BC+0x89
4d 0000004f`828fea00 00007ffb`d950f674 System_Private_CoreLib!System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef)$##600294B+0x197
4e 0000004f`828feaa0 00007ffb`d957c4cb System_Private_CoreLib!System.Threading.Tasks.Task.ExecuteEntry()$##6002948+0x84
4f 0000004f`828feae0 00007ffb`816aca97 System_Private_CoreLib!System.Threading.Tasks.SynchronizationContextTaskScheduler+<>c.<.cctor>b__8_0(System.Object)$##6002AB3+0x2b
50 0000004f`828feb10 00007ffb`d945cda9 0x00007ffb`816aca97
51 0000004f`828feb60 00007ffb`816aabe7 System_Private_CoreLib!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)$##60027BC+0x89
52 0000004f`828febe0 00007ffb`8169bc34 0x00007ffb`816aabe7
53 0000004f`828fec30 00007ffb`d945cda9 0x00007ffb`8169bc34
54 0000004f`828fec80 00007ffb`d950f8e7 System_Private_CoreLib!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)$##60027BC+0x89
55 0000004f`828fed00 00007ffb`d945cda9 System_Private_CoreLib!System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef)$##600294B+0x197
56 0000004f`828feda0 00007ffb`e0063ba3 System_Private_CoreLib!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)$##60027BC+0x89
57 0000004f`828fee20 00007ffb`dff29839 coreclr!CallDescrWorkerInternal+0x83 [E:\A\_work\164\s\src\vm\amd64\CallDescrWorkerAMD64.asm @ 101] 
58 (Inline Function) --------`-------- coreclr!CallDescrWorkerWithHandler+0x53 [e:\a\_work\164\s\src\vm\callhelpers.cpp @ 78] 
59 0000004f`828fee60 00007ffb`e004fe03 coreclr!MethodDescCallSite::CallTargetWorker+0x2b5 [e:\a\_work\164\s\src\vm\callhelpers.cpp @ 628] 
5a (Inline Function) --------`-------- coreclr!StressLog::InlinedStressLogOn+0x5 [e:\a\_work\164\s\src\utilcode\stresslog.cpp @ 594] 
5b (Inline Function) --------`-------- coreclr!StressLog::LogOn+0x5 [e:\a\_work\164\s\src\utilcode\stresslog.cpp @ 628] 
5c 0000004f`828fefb0 00007ffb`dff293d1 coreclr!ThreadNative::KickOffThread_Worker+0x143 [e:\a\_work\164\s\src\vm\comsynchronizable.cpp @ 260] 
5d (Inline Function) --------`-------- coreclr!ManagedThreadBase_DispatchInner+0xf0 [e:\a\_work\164\s\src\vm\threads.cpp @ 8850] 
5e 0000004f`828ff120 00007ffb`dff291c3 coreclr!ManagedThreadBase_DispatchMiddle+0x179 [e:\a\_work\164\s\src\vm\threads.cpp @ 8901] 
5f 0000004f`828ff250 00007ffb`e005cbb7 coreclr!ManagedThreadBase_DispatchOuter+0xaf [e:\a\_work\164\s\src\vm\threads.cpp @ 9140] 
60 0000004f`828ff2f0 00007ffb`dff8c64a coreclr!ManagedThreadBase_FullTransitionWithAD+0x2f [e:\a\_work\164\s\src\vm\threads.cpp @ 9200] 
61 (Inline Function) --------`-------- coreclr!ManagedThreadBase::KickOff+0x20 [e:\a\_work\164\s\src\vm\threads.cpp @ 9234] 
62 0000004f`828ff350 00007ffb`dff8c516 coreclr!ThreadNative::KickOffThread+0x10a [e:\a\_work\164\s\src\vm\comsynchronizable.cpp @ 380] 
63 0000004f`828ff430 00007ffc`2acd3034 coreclr!Thread::intermediateThreadProc+0x86 [e:\a\_work\164\s\src\vm\threads.cpp @ 2255] 
64 0000004f`828ffdf0 00007ffc`2afc1461 kernel32!BaseThreadInitThunk+0x14
65 0000004f`828ffe20 00000000`00000000 ntdll!RtlUserThreadStart+0x21

Thread 2, what looks like a background thread running a DllMain (ntdll!LdrpCallInitRoutine is what invokes DllMain):

0:076> kn
 # Child-SP          RetAddr           Call Site
00 0000004f`8487f268 00007ffc`27372f1d ntdll!NtYieldExecution+0x14
01 0000004f`8487f270 00007ffb`ba4b882d KERNELBASE!SwitchToThread+0x1d [minkernel\kernelbase\thread.c @ 3327] 
02 0000004f`8487f2a0 00007ffb`ba4b69a8 MklImports+0x882d
03 0000004f`8487f2e0 00007ffb`ba4b1132 MklImports+0x69a8
04 0000004f`8487f7c0 00007ffb`be8679d4 MklImports+0x1132
05 0000004f`8487f7f0 00007ffc`2af84053 MklImports!vslDeleteStream+0x43a6274
06 0000004f`8487f850 00007ffc`2af8167f ntdll!LdrpCallInitRoutine+0x6b
07 0000004f`8487f8c0 00007ffc`2afc14ce ntdll!LdrShutdownThread+0x16f
08 0000004f`8487f9c0 00007ffc`2acd303c ntdll!RtlExitUserThread+0x3e
09 0000004f`8487fa00 00007ffc`2afc1461 kernel32!BaseThreadInitThunk+0x1c
0a 0000004f`8487fa30 00000000`00000000 ntdll!RtlUserThreadStart+0x21

Note that Thread 1 is calling MklImports!cblas_sgemm, and eventually MklImports is calling LoadLibraryA. Printing the variables at LoadLibraryA:

0:024> dv
  lpLibFileName = 0x00007ffb`bea0eb2c "libittnotify.dll"
      StrLength = <value unavailable>
      pszBuffer = 0x00007ffb`bee40148 ""
           hMod = 0x00000000`00000000

There are other threads running at this point (and some of those threads are spawning new threads as well). I assume there is some race condition happening with MklImports loading while other threads are doing other work.

To attempt to fix this, I am loading MklImports very early in the tests (in the base test class static initializer, I am calling into MklImports to ensure it is loaded). This appears to fix the issue - I've run the tests 30 times without it hanging on my machine.

I also have a few .dmp files, if anyone wants to investigate themselves, you can ping me.

Repro steps

  1. Ensure your Windows machine can build ML.NET - https://github.com/dotnet/machinelearning/blob/master/docs/building/windows-instructions.md
  2. git clone https://github.com/dotnet/machinelearning.git
  3. cd machinelearning
  4. git checkout 70b3c3b
  5. .\build.cmd
  6. cd test\Microsoft.ML.Tests
  7. ..\..\Tools\dotnetcli\dotnet.exe test
    a. The tests take a little over a minute. But it doesn’t repro every time, so you need to run it a few times. If the test hangs for over 2 minutes, you know you have a deadlock. Attach a debugger to investigate.
  8. Or if using powershell:
    a. for ($i=0; $i -lt 20;$i++) {..\..\Tools\dotnetcli\dotnet.exe test}
eerhardt added a commit to eerhardt/machinelearning that referenced this issue Sep 27, 2018
The workaround is to ensure the MKL library is loaded very early in the test process, so it doesn't cause the deadlock.

Also, undoing the previous test workarounds now that we've narrowed this hang down.

Workaround dotnet#1073
eerhardt added a commit to eerhardt/machinelearning that referenced this issue Sep 27, 2018
The workaround is to ensure the MKL library is loaded very early in the test process, so it doesn't cause the deadlock.

Also moving the test queue back to the hosted windows pool.

Workaround dotnet#1073
eerhardt added a commit that referenced this issue Oct 1, 2018
* Add a workaround for the tests hanging while loading MKL.

The workaround is to ensure the MKL library is loaded very early in the test process, so it doesn't cause the deadlock.

Workaround #1073

Another deadlock also occurs when running TestAutoInference and TestPipelineSweeper in parallel. Marking these tests to not run in parallel anymore.

Workaround #1095

Moving back to the Azure Hosted VS2017 pool to run the tests now that we've narrowed the deadlocks down.
@Ivanidzo4ka Ivanidzo4ka added the bug Something isn't working label Oct 19, 2018
@Zruty0
Copy link
Contributor

Zruty0 commented Oct 29, 2018

@eerhardt , are you tracking this bug for the proper solution (involving Intel)?

@eerhardt
Copy link
Member Author

Yes, this is tracking the Intel issue. Hopefully the underlying bug gets fixed by MKL and this will tracks pulling in that new version with the bug fix (if deemed necessary).

Here is the Intel support request: https://supporttickets.intel.com/requestdetail?id=5000P00000kDdkQQAS&lang=null

Request: 03677610

@eerhardt eerhardt removed their assignment Feb 27, 2019
@codemzs
Copy link
Member

codemzs commented Jun 30, 2019

closing as this not seem to be happening anymore.

@codemzs codemzs closed this as completed Jun 30, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Mar 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants