
reboots on haswell system #2524

Closed
rgauges opened this issue Mar 21, 2020 · 24 comments

Comments

@rgauges

rgauges commented Mar 21, 2020

Hi,

for a few weeks I have been experiencing odd behaviour that I think stems from OpenBLAS.
When I run machine learning workloads using numpy or tensorflow, my Haswell system at home reboots on certain tasks. This seems to depend on the environment I am running in: when I set up a conda environment with numpy linked against MKL, the programs run just fine, but when I run them in an environment where numpy is linked against OpenBLAS, my machine reboots.
As the machine reboots immediately, I do not see any error messages, and running it in a debugger won't help either.
Running the code inside a KVM virtual machine on the same host also reboots the host.

All my other systems are not affected by this. So I first thought it was the CPU that is breaking down or overheating, but other CPU intensive workloads run just fine.

My system is:

Ubuntu 19.10 x64
Haswell i7-4790K
Different conda environments with numpy either from conda (mkl) or from pip (openblas). It's the pip installations that always make my machine reboot.

Any idea of how I could pinpoint this problem so I can send a useful bug report?
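(As a first step, a sketch of one way to confirm which BLAS backend each environment's numpy build is linked against; `np.show_config` is numpy's standard way to report this.)

```python
# Sketch: print the BLAS/LAPACK libraries this numpy build was compiled
# against, to tell an MKL-linked environment apart from an OpenBLAS one.
import numpy as np

np.show_config()
```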

Thanks

Ralph

@Diazonium
Contributor

This is extremely unlikely to be an OpenBLAS issue; you either have a hardware issue, a BIOS settings issue, or a kernel issue.
Try running Prime95 for a few hours, both with small and large FFTs, and make sure you enable AVX.
You can also try running the Intel LINPACK benchmark for a few hours; it uses MKL to perform an LU factorization and is a good CPU+RAM stress test.

@martin-frbg
Collaborator

Anything in the system logs? Normally a non-privileged userspace program should never be able to bring down the kernel, though things might get a bit "interesting" if you run out of memory. Perhaps run memtest to make sure the hardware is healthy, though I'd expect similar failures with other software if the memory or power supply had gone bad.

@rgauges
Author

rgauges commented Mar 21, 2020

Thanks a lot for the fast response.
I also thought about thermal and memory issues first, and I ran a memory check as well as several CPU stress tests, e.g. HandBrake transcoding and stress-ng. Even though CPU utilisation is at 100% on all cores for hours, the system runs stable.
Temperature goes up to 75°C after some time, but it takes a while and the system is not throttling or anything. So I guess cooling is working.
I haven't run prime95 yet, so I will try that next, with AVX enabled. But I guess HandBrake would have used AVX as well.

Yes, I also thought that user space programs shouldn't be able to crash the machine, but unfortunately this one does so anyway.
Unfortunately there is nothing in the logs either. In the case of overheating I would have expected some message, but I can't find any.
If this were a normal crash/segfault, it should be fairly easy to pinpoint.
My last guess was that it might use some CPU instructions that are not supported on my hardware, so I tried setting OPENBLAS_CORETYPE to Haswell, but this didn't help either.

Here is a rather short example that reproducibly reboots my machine:

import numpy as np
from sklearn.decomposition import PCA

# mnist.data is 70000 x 784
data = np.random.rand(70000, 784)

# not setting the solver, or setting it to "auto", crashes
pca = PCA(n_components=0.95, svd_solver="full")
pca.fit(data)

A few weeks back I could even trace it down to the SVD solver, i.e. running numpy.linalg.svd rebooted my machine. Trying it earlier today, however, didn't cause a crash.

Any ideas besides running more stress tests?

Thanks

Ralph

@martin-frbg
Collaborator

Can you check if smaller problem sizes do not crash the machine? Even an unsupported opcode should just get your program killed with a "bus error" or similar, and it is extremely unlikely that such a thing sneaked into the Haswell kernels.

@brada4
Contributor

brada4 commented Mar 21, 2020

It has nothing to do with OpenBLAS; it is the thermal setup failing.

  • Update the BIOS and reset settings to defaults.
    If it still fails, check the mainboard/CPU sensors for extreme temperatures. It is an overclockable CPU; if it trips the temperature limit, that is not a problem with pip, BLAS, a game, a solver, minesweeper, Windows 7 updates, etc. It is certainly in the hardware setup.

ref:

> Temperature goes up to 75°C after some time

https://forums.tomshardware.com/threads/i7-4790k-temperatures.1881378/post-13037492

@wjc404
Contributor

wjc404 commented Mar 22, 2020

@rgauges Have you tried using only 2 threads (export OPENBLAS_NUM_THREADS=2)?
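(For context, a sketch of how this takes effect: OPENBLAS_NUM_THREADS is read when the OpenBLAS library is first loaded, so it must be in the environment before numpy is imported.)

```python
# Sketch: limit OpenBLAS to 2 threads for this process. Setting the
# variable after numpy has been imported would have no effect, because
# OpenBLAS sizes its thread pool at load time.
import os
os.environ["OPENBLAS_NUM_THREADS"] = "2"

import numpy as np  # OpenBLAS picks up the setting when it is loaded here
```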

@rgauges
Author

rgauges commented Mar 22, 2020

Good morning everybody,

as suggested, I have been running prime95 overnight. It ran happily for more than 12 hours.
When I came to the PC this morning, sensors were reporting around 80°C on the cores and I could see that it was only running at 4 GHz instead of 4.4 GHz, so it was throttling a bit to keep the temperature down.
Prime95 was using almost all of the 32 GB main memory, so I guess this got tested as well.

@brada4 Yes, it is a K version of the CPU, but I am not doing any overclocking and the BIOS settings concerning the CPU are at their defaults. Thermal issues were also my first guess, but I can run all kinds of stress tests without any problems. And it's not as if it took a while to heat up the CPU and then crash: it usually crashes right when I fit the PCA. Sometimes it takes two to three seconds, but most of the time the reboot is immediate.
Second, on the same machine I do have conda environments where I can run this problem even at 70000x800 without issues.
So all in all, I am ready to dump the thermal problem theory, and I think it is related to the environment in which I run it. The most likely theory in my eyes right now is that it is a software problem, although I have no clue how a user space application is able to trash the whole machine.
The one thing that actually seems to change anything is whether numpy uses MKL or OpenBLAS.
Sorry, I know this is all wild guessing, but those are all the hints I have right now.

@wjc404 I tried with only two threads, but it didn't change anything; the machine still reboots.

@martin-frbg I could get the problem size down to 3000x300 and still crash reliably. I can get my machine to reboot with the following two-liner:

import numpy as np
np.linalg.svd(np.random.rand(3000,300))

The problem size seems to depend on whether I have run the code successfully before on a very small array.
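(The warm-up effect described above can be reproduced as a sketch; the sizes are the ones from this thread.)

```python
# Sketch of the warm-up effect: run the SVD once on a tiny array first,
# then on the 3000x300 size that otherwise triggers the reset here.
import numpy as np

np.linalg.svd(np.random.rand(10, 5))                 # small warm-up call
u, s, vt = np.linalg.svd(np.random.rand(3000, 300))  # previously crashing size
print(s.shape)  # singular values: (300,)
```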

Thanks for all the help.

Ralph

@wjc404
Contributor

wjc404 commented Mar 22, 2020

Then it should have nothing to do with thermal throttling. May I ask which LAPACK function was called in np.linalg.svd? Is it SGESDD?
Can you figure out which version of OpenBLAS is included in the numpy package?
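(One way to answer the version question from a running process, as a Linux-only sketch; the helper name here is made up, and the OpenBLAS version is usually visible in the mapped file name, e.g. libopenblasp-r0.3.9.so.)

```python
# Sketch (Linux-only): list the shared objects mapped into this process
# whose path mentions "blas"; importing numpy first loads its BLAS backend.
import numpy as np

def loaded_blas_libs():
    """Return the paths of mapped libraries whose name mentions 'blas'."""
    libs = set()
    with open("/proc/self/maps") as maps:
        for line in maps:
            path = line.split()[-1]
            if "blas" in path.lower():
                libs.add(path)
    return sorted(libs)

print(loaded_blas_libs())
```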

@brada4
Contributor

brada4 commented Mar 22, 2020

@martin-frbg
Collaborator

Maybe rhymes with #2526 that just came in... though I still do not get why there would be a reboot. Can you check which version of OpenBLAS you are calling, and would it be possible for you to replace it with a build from source (where you could remove the #define USE_SGEMM_KERNEL_DIRECT 1 in param.h) ?

@rgauges
Author

rgauges commented Mar 22, 2020

Hi,

sorry for the delay. I have been trying to single-step through the code to find what is going wrong.
Unfortunately pdb would reboot the machine at a certain step again, and setting a breakpoint in gdb proved too difficult for me, as the libraries don't seem to contain debug information.
So if anybody knows how to set a breakpoint on a function in a specific dynamic library, any hint would be welcome.
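(For reference, a sketch of one standard gdb approach: breaking on an exported symbol such as dgesdd_ needs no debug info, and pending breakpoints handle libraries that are loaded later; "repro.py" is a placeholder for the crashing script.)

```shell
# Sketch: break on an exported function in a shared library without
# debug information, using a pending breakpoint on the symbol name.
gdb -q \
    -ex 'set breakpoint pending on' \
    -ex 'break dgesdd_' \
    -ex run \
    --args python repro.py
```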

So the only thing I found was in the documentation of numpy's svd, which says:

> The decomposition is performed using LAPACK routine _gesdd.

which I guess is basically the same thing that brada4 has already mentioned.

Looking at the library names, I seem to be running OpenBLAS 0.3.9
(libopenblasp-r0.3.9.so). Conda confirms that libopenblas 0.3.9 is installed.
What I find confusing is that the installation also mentions libblas and libcblas
in version 3.8.0, but they also seem to come from the same OpenBLAS 0.3.9.

The one that gets pulled into python via numpy seems to be libcblas.so,
which is a symbolic link to libopenblasp-r0.3.9.so.

@martin-frbg Yes, I think I can try to compile my own version of OpenBLAS. The question is how much the compile environment / compiler settings influence this problem. Since I don't know which compiler and which flags were used to build the conda version, I can't exactly reproduce their build.

Thanks

Ralph

@rgauges
Author

rgauges commented Mar 22, 2020

Hi again,

just got the source code from GitHub and I am preparing to compile it.
Anything I should watch out for? Should I compile a debug version? Should I compile it for a certain CPU? By default it seems to detect the CPU and choose settings according to what it finds.
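(A minimal build sketch, assuming a checkout of the OpenBLAS source tree: DEBUG=1 and TARGET are documented OpenBLAS make variables; pinning TARGET avoids runtime CPU autodetection.)

```shell
# Sketch: build OpenBLAS with debug symbols for a fixed Haswell target,
# so the kernels used do not depend on CPU autodetection.
make DEBUG=1 TARGET=HASWELL
```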

Cheers

Ralph

@Diazonium
Contributor

@rgauges Have you tried running Prime95 with small in-place AVX FFTs? From the memory usage you reported, you have run P95 with very large FFTs, which is good, but that does not put maximum stress on the CPU cores.

@rgauges
Author

rgauges commented Mar 22, 2020

@Diazonium I did run the Blend stress test which uses very small, small and large FFTs.
There does not seem to be an option where I could specifically say that I want small in-place AVX FFTs. Could you please elaborate a bit on how I could test these specifically?

@martin-frbg I just did a standard compile with USE_SGEMM_KERNEL_DIRECT commented out, and I still get the rebooting crash. I can also set breakpoints now, and I can confirm that the function dgesdd_ is hit twice in the call to linalg.svd in numpy. The crash seems to occur after it is hit the second time: if I continue in gdb from there, my machine is gone.
As I didn't do a debug build, gdb does not let me single-step; pressing 's' after hitting the breakpoint the second time crashes immediately.

Gosh, I really hope I am not wasting everybody's time here and the problem is somewhere else entirely.

Cheers

Ralph

@wjc404
Contributor

wjc404 commented Mar 22, 2020

@rgauges Building OpenBLAS with "make DEBUG=1" may help.

@rgauges
Author

rgauges commented Mar 22, 2020

Hi again,

I have been single stepping with gdb all afternoon using a debug version of OpenBLAS 0.3.9.
Unfortunately this bug is quite elusive. I now have a set of breakpoints that get me close to the point where it crashes most of the time.

break dgesdd.f:892
break gemm.c:450
break level3_thread.c:770
break level3_thread.c:713
break blas_server.c:768
disa 2 3 4
ignore 5 3011

I enable these breakpoints one after the other, since breakpoints 2, 3, 4 and 5 are hit several times before the crash usually appears.

I think I could get rid of the first four breakpoints, as the number 3012 for breakpoint 5 seems to be quite stable. It's somewhere in this blas_server routine starting at line 768 where I get the crash
(actually close to line 809, which was the last line that didn't crash the last time I ran it).
What makes this hard to pinpoint is that with the debug version it doesn't crash every time. I actually managed to run through the code in gdb and get a result 3-4 times out of maybe 50 tries.
And sometimes it crashes before the first breakpoint above.

On the one hand, the fact that it does not crash every time, and that the crashes sometimes happen before the first breakpoint, makes a hardware problem likely again. Since it only happens at a certain problem size, I was wondering whether it might be a problem with the cache, either hardware or software.
But many other programs use the cache as well, and I have never experienced anything similar with them.
On the other hand, it could also be some kind of race condition in the parallel code.

Maybe these breakpoints above can give a clue to someone who is more familiar with the code.
For today, I give up.

Cheers

Ralph

@martin-frbg
Collaborator

Not reproduced so far on an i7-8700K that uses the same Haswell kernels. Need to see if I have enough bits and pieces of an actual i7-47xx floating around to reassemble a working system.

@wjc404
Contributor

wjc404 commented Mar 22, 2020

It sounds like some undefined behaviour in parallel execution.
How about a TARGET=SANDYBRIDGE build?

@martin-frbg
Collaborator

martin-frbg commented Mar 23, 2020

There must be hundreds of systems like that out there, given that all the refreshes, Kaby Lake etc. run the Haswell kernels. If anything, this would have to be a very recent bug, or numpy runs would be crashing systems all over the world. Perhaps building 0.3.7 for comparison would be an option. I still like to think of this as a hardware problem, a faulty power supply or whatever, although I can only offer some vague handwaving to "explain" why it only happens with OpenBLAS.
Edited to add: valgrind is not showing any anomalies so far.

@rgauges
Author

rgauges commented Mar 23, 2020

Hi,

it's me again. I just reapplied the thermal paste on my CPU, and to test whether it was OK, I ran prime95 again. This time I did not run all tests but only chose the small FFTs, as suggested by @Diazonium.
Guess what, he seems to be right. This resets my machine reliably as well.
I have no clue why the blend test, which according to the docs includes the small FFTs, didn't trigger it before. I guess it is time for a new PC then.

I am sorry, I did waste your time after all.

Greetings

Ralph

@Diazonium
Contributor

Diazonium commented Mar 23, 2020

> it's me again. I just reapplied the thermal paste on my CPU and to test if it was OK, I ran prime95 again. This time I did not run all tests, but only chose the small FFTs as suggested by @Diazonium.
> Guess what, he seems to be right. This resets my machine reliably as well.
> I have no clue why the blended test, which according to the docs includes the small FFTs, didn't trigger it before. I guess, it is time for a new PC then.
>
> I am sorry, I did waste your time after all.

Before you throw your PC in the dumpster, you could try to get a new power supply. It is a common component that slowly degrades over time in terms of output capacity, thus resulting in instability under peak load.

@brada4
Contributor

brada4 commented Mar 23, 2020

In principle, all the software mentioned uses AVX2 extensively, which means the CPU is pushed to its thermal and power margins.

@rgauges
Author

rgauges commented Mar 23, 2020

@brada4 Yes, I know, and I am using lots of software that stresses the CPU to 100% and uses AVX2, but so far OpenBLAS has been the only software to trigger the reset behaviour.
I am still a bit clueless as to why the blend run with prime did not trigger it and only directly choosing small FFTs did the trick. I could have tried that right away, and again I am sorry for wasting your time.

@Diazonium Yes, thanks, good idea. On the other hand, this was a good excuse to finally order some new parts. (-:
I guess the new CPU will be a good stress test for the PSU. And as a positive side effect, I won't have to wait as long for results in the future.

Thanks again for all the help and hints.

Should I close this thread?

Greetings

Ralph

@martin-frbg
Collaborator

Well, I'd have assumed that swapping the power supply would be the easiest and cheapest thing to do... anyway, we can always reopen this thread if there is again evidence for a software problem.
