Tests hang due to MKL loading blocking all threads #1073
Comments
The workaround is to ensure the MKL library is loaded very early in the test process, so it doesn't cause the deadlock. Also, undoing the previous test workarounds now that we've narrowed this hang down. Workaround dotnet#1073
The workaround is to ensure the MKL library is loaded very early in the test process, so it doesn't cause the deadlock. Also moving the test queue back to the hosted Windows pool. Workaround dotnet#1073
* Add a workaround for the tests hanging while loading MKL. The workaround is to ensure the MKL library is loaded very early in the test process, so it doesn't cause the deadlock. Workaround #1073. Another deadlock also occurs when running TestAutoInference and TestPipelineSweeper in parallel. Marking these tests to not run in parallel anymore. Workaround #1095. Moving back to the Azure Hosted VS2017 pool to run the tests now that we've narrowed the deadlocks down.
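For the parallelism part of that workaround, here is a minimal sketch of one way to keep two xUnit test classes from running in parallel with each other; the class, method, and collection names below are placeholders, not the actual ML.NET test code:

```csharp
using Xunit;

// xUnit runs different test collections in parallel, but all classes that
// share a collection run serially with respect to each other. Placing both
// classes in one named collection keeps TestAutoInference and
// TestPipelineSweeper from executing at the same time.
[Collection("NonParallelMklTests")]
public class AutoInferenceTests
{
    [Fact]
    public void TestAutoInference()
    {
        // ... test body elided ...
    }
}

[Collection("NonParallelMklTests")]
public class PipelineSweeperTests
{
    [Fact]
    public void TestPipelineSweeper()
    {
        // ... test body elided ...
    }
}
```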
@eerhardt, are you tracking this bug for the proper solution (involving Intel)?
Yes, this is tracking the Intel issue. Hopefully the underlying bug gets fixed in MKL, and this issue will track pulling in the new version with the bug fix (if deemed necessary). Here is the Intel support request: https://supporttickets.intel.com/requestdetail?id=5000P00000kDdkQQAS&lang=null Request: 03677610
Closing, as this does not seem to be happening anymore.
We are seeing some tests hanging randomly in CI on Windows.
I was able to catch a few hangs (~10-15) on my machine as well. Every time the process was hung, it was during loading of MklImports, which was calling `LoadLibraryA("libittnotify")`, and then on a different thread MklImports was also on the stack. I am unable to get debugging symbols for MKL, but the two stacks look like this:

Thread 1, calling into MKL from ML.NET:

Thread 2, what looks like a background thread running a `DllMain` (`ntdll!LdrpCallInitRoutine` is what invokes `DllMain`):

Note that Thread 1 is calling `MklImports!cblas_sgemm`, and eventually MklImports is calling `LoadLibraryA`. Printing the variables at `LoadLibraryA`:

There are other threads running at this point (and some of those threads are spawning new threads as well). I assume there is some race condition happening with MklImports loading while other threads are doing other work.
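For context on where the native load comes from, the managed code reaches MKL through a P/Invoke along these lines; the exact declaration in ML.NET may differ, so treat this as an illustrative sketch rather than the real signature:

```csharp
using System.Runtime.InteropServices;

internal static class MklSketch
{
    // Illustrative declaration only. The first call through this P/Invoke is
    // what makes Windows load MklImports.dll, and MKL's own initialization
    // then calls LoadLibraryA("libittnotify") - the point where the hang was
    // observed while other test threads were running.
    [DllImport("MklImports", EntryPoint = "cblas_sgemm")]
    internal static extern void cblas_sgemm(
        int layout, int transA, int transB,
        int m, int n, int k,
        float alpha, float[] a, int lda,
        float[] b, int ldb,
        float beta, float[] c, int ldc);
}
```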
To attempt to fix this, I am loading `MklImports` very early in the tests (in the base test class static initializer, I am calling into MklImports to ensure it is loaded). This appears to fix the issue - I've run the tests 30 times without it hanging on my machine.

I also have a few .dmp files; if anyone wants to investigate themselves, you can ping me.
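A minimal sketch of that early-load workaround, assuming a shared base test class; the `cblas_sdot` entry point and member names here are illustrative choices, not necessarily what the actual fix calls:

```csharp
using System.Runtime.InteropServices;

public abstract class BaseTestClass
{
    // Illustrative P/Invoke; any export on MklImports would do - the goal is
    // only to touch the library so the OS loads it.
    [DllImport("MklImports", EntryPoint = "cblas_sdot")]
    private static extern float CblasSdot(int n, float[] x, int incX, float[] y, int incY);

    // The static constructor runs once, before any test in a derived class,
    // while the process is still effectively single-threaded. Forcing the
    // MklImports (and libittnotify) load here keeps it from racing with the
    // worker threads the tests spin up later.
    static BaseTestClass()
    {
        var x = new[] { 1f };
        var y = new[] { 1f };
        CblasSdot(1, x, 1, y, 1);
    }
}
```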
Repro steps
1. Run `..\..\Tools\dotnetcli\dotnet.exe test`
   a. The tests take a little over a minute. But it doesn't repro every time, so you need to run it a few times. If the test run hangs for over 2 minutes, you know you have a deadlock. Attach a debugger to investigate.
2. Or run the tests in a loop:
   a. `for ($i=0; $i -lt 20; $i++) { ..\..\Tools\dotnetcli\dotnet.exe test }`