reboots on haswell system #2524
Comments
This is extremely unlikely to be an OpenBLAS issue; you either have a hardware issue, a BIOS setting issue, or a kernel issue. |
Anything in the system logs? Normally a non-privileged userspace program should never be able to bring down the kernel, though things might get a bit "interesting" if you run out of memory. Perhaps run memtest to make sure the hardware is healthy, though I'd expect you'd see similar failures with other software if memory or power supply had gone bad. |
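As a rough illustration of the log check suggested above: on a systemd-based distribution, `journalctl` can show messages from the boot that preceded the crash. The sketch below only assembles and runs that command; the helper names are made up for this example, and it assumes persistent journaling is enabled (otherwise nothing from the previous boot is kept).

```python
import shutil
import subprocess


def previous_boot_command(max_lines=50):
    """Build a journalctl call for error-level messages of the previous boot."""
    return [
        "journalctl",
        "--boot=-1",        # the boot before the current one
        "--priority=err",   # error level and worse
        "--no-pager",
        f"--lines={max_lines}",
    ]


def previous_boot_errors(max_lines=50):
    """Return the matching journal lines, or None if journalctl is missing."""
    if shutil.which("journalctl") is None:
        return None
    result = subprocess.run(
        previous_boot_command(max_lines),
        capture_output=True,
        text=True,
        check=False,  # a missing previous boot should not raise here
    )
    return result.stdout.splitlines()
```

An abrupt power-related reset often leaves no kernel message at all, so an empty result would not rule out hardware.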
Thanks a lot for the fast response. Yes, I also thought that user space programs shouldn't be able to crash the machine, but unfortunately it does so anyway. Here is a rather short example that reproducibly reboots my machine:
import numpy as np
from sklearn.decomposition import PCA
# mnist.data is 70000 x 784
data = np.random.rand(70000,784)
# not setting solver or setting to auto crashes
pca = PCA(n_components=0.95,svd_solver="full")
pca.fit(data)
A few weeks back, I could even trace it down to the svd_solver, i.e. running numpy.linalg.svd rebooted my machine. Trying it earlier, however, didn't cause a crash. Any ideas besides running more stress tests? Thanks Ralph |
Can you check whether smaller problem sizes avoid the crash? Even an unsupported opcode should just get your program killed with "bus error" or similar, and it is extremely unlikely that such a thing sneaked into the Haswell kernels. |
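One way to narrow down the problem size on a machine that reboots without warning is to try growing shapes in separate processes and to force a log line to disk before each attempt. The sketch below is only an illustration under that assumption: the file name `svd_sweep.log` and all function names are hypothetical, and the largest default shape is the 70000 x 784 array from the report.

```python
import os
import subprocess
import sys


def candidate_shapes(rows=70000, cols=784, halvings=8):
    """Return shapes from smallest to largest, ending at (rows, cols)."""
    return [(max(1, rows >> k), max(1, cols >> k))
            for k in range(halvings, -1, -1)]


def log_line(path, text):
    """Append a line and force it to disk so it survives a sudden reboot."""
    with open(path, "a") as fh:
        fh.write(text + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def sweep(log_path="svd_sweep.log"):
    """Run np.linalg.svd on each shape in a child process, logging around it."""
    for n_rows, n_cols in candidate_shapes():
        log_line(log_path, f"start {n_rows}x{n_cols}")
        code = (f"import numpy as np; "
                f"np.linalg.svd(np.random.rand({n_rows}, {n_cols}))")
        done = subprocess.run([sys.executable, "-c", code])
        log_line(log_path, f"done {n_rows}x{n_cols} rc={done.returncode}")
```

Calling `sweep()` and, after a reboot, looking for the last `start` line without a matching `done` line would point at the first shape that brought the machine down.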
It has nothing to do with OpenBLAS; it is the thermal setup failing.
ref:
https://forums.tomshardware.com/threads/i7-4790k-temperatures.1881378/post-13037492 |
@rgauges Have you tried using only 2 threads (export OPENBLAS_NUM_THREADS=2)? |
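For anyone repeating the thread-count experiment from Python: OpenBLAS reads `OPENBLAS_NUM_THREADS` when the library is loaded, so the variable has to be in the environment before numpy is imported. A minimal sketch, assuming the test is launched as a child process (the helper name is made up for this example):

```python
import os
import subprocess
import sys


def run_with_openblas_threads(code, num_threads):
    """Run a Python snippet in a child process with OPENBLAS_NUM_THREADS set."""
    env = dict(os.environ, OPENBLAS_NUM_THREADS=str(num_threads))
    return subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
    )
```

For example, `run_with_openblas_threads("import numpy as np; np.linalg.svd(np.random.rand(3000, 300))", 2)` would repeat the failing call with two threads.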
Good morning everybody, as suggested, I have been running prime95 overnight. It ran happily for more than 12 hours. @brada4 Yes, it is a K version of the CPU, but I am not doing any overclocking and the BIOS settings concerning the CPU are at their defaults. Thermal issues were also my first guess, but I can run all kinds of stress tests without any problems. And it's not as if it took a while to heat up the CPU and then crash. It usually crashes right when I fit the PCA. Sometimes it takes two to three seconds, but most of the time, the reboot is immediate. @wjc404 I tried with only two threads, but that didn't change anything; it still reboots. @martin-frbg I could get the problem size down to 3000x300 and still crash reliably. I can get my machine to reboot with the following two-liner:
import numpy as np
np.linalg.svd(np.random.rand(3000,300))
The problem size seems to depend on whether I have run the code successfully before on a very small array. Thanks for all the help. Ralph |
Then it should have nothing to do with thermal throttling. May I ask you which LAPACK function was called in np.linalg.svd? Is it SGESDD? |
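As background to the question above: LAPACK names its divide-and-conquer SVD drivers by precision prefix, and numpy's documentation states that `np.linalg.svd` uses the gesdd family. The lookup below is just a sketch of that naming convention, with a hypothetical function name; it does not inspect what a given numpy build actually calls.

```python
# LAPACK precision prefixes: s = float32, d = float64,
# c = complex64, z = complex128.
_GESDD_BY_DTYPE = {
    "float32": "sgesdd",
    "float64": "dgesdd",
    "complex64": "cgesdd",
    "complex128": "zgesdd",
}


def expected_gesdd_routine(dtype_name):
    """Return the gesdd variant one would expect for a numpy dtype name."""
    return _GESDD_BY_DTYPE[dtype_name]
```

Since `np.random.rand` returns float64 arrays, `expected_gesdd_routine("float64")` gives the double-precision variant rather than SGESDD.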
Maybe rhymes with #2526 that just came in... though I still do not get why there would be a reboot. Can you check which version of OpenBLAS you are calling, and would it be possible for you to replace it with a build from source (where you could remove the |
Hi, sorry for the delay. I have been trying to single step through to find what is going wrong. So the only thing I found was in the documentation of numpy's svd which says: The decomposition is performed using LAPACK routine Which I guess is basically the same that brada4 has already mentioned. Looking at the library names, I seem to be running OpenBLAS 0.3.9. The one that gets pulled into Python via numpy seems to be libcblas.so. @martin-frbg Yes, I think I can try to compile my own version of OpenBLAS. The question is how far the compile environment / compiler settings will influence this problem. Since I don't know which compiler and which flags were used to build the conda version, I can't exactly reproduce their build. Thanks Ralph |
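Rather than inferring the BLAS backend from library file names, it can be queried from inside the Python environment. The sketch below assumes the third-party `threadpoolctl` package is installed next to numpy; `summarize_blas` and `report` are names invented for this example.

```python
def summarize_blas(pools):
    """Format the BLAS entries of a threadpoolctl.threadpool_info() result."""
    lines = []
    for pool in pools:
        if pool.get("user_api") != "blas":
            continue  # skip OpenMP runtimes and other thread pools
        lines.append("{} {} ({} threads) from {}".format(
            pool.get("internal_api", "?"),
            pool.get("version", "?"),
            pool.get("num_threads", "?"),
            pool.get("filepath", "?"),
        ))
    return lines


def report():
    """Print numpy's build configuration and the BLAS loaded at runtime."""
    import numpy as np  # must be imported before the thread pools are queried
    from threadpoolctl import threadpool_info

    np.show_config()
    for line in summarize_blas(threadpool_info()):
        print(line)
```

Running `report()` once in the conda (MKL) environment and once in the pip (OpenBLAS) environment should show which library and version each one actually loads.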
Hi again, just got the source code from github and preparing to compile it. Cheers Ralph |
@rgauges Have you tried running Prime95 with small in-place AVX FFTs? From the memory usage you reported, you have run P95 with very large FFTs, which is good, but does not put maximum stress onto the CPU cores. |
@Diazonium I did run the Blend stress test, which uses very small, small and large FFTs. @martin-frbg I just did a standard compile with USE_SGEMM_KERNEL_DIRECT commented out, and I still get the rebooting crash with that. I can also set breakpoints, and I can confirm that the function dgesdd_ is hit twice in the call to linalg.svd in numpy. The crash seems to occur after it is hit the second time. If I continue in gdb from there, my machine is gone. Gosh, I really hope I am not wasting everybody's time here and the problem is somewhere else entirely. Cheers Ralph |
@rgauges Building OpenBLAS with "make DEBUG=1" may help. |
Hi again, I have been single stepping with gdb all afternoon using a debug version of OpenBLAS 0.3.9. break dgesdd.f:892 I enable these breakpoints one after the other, since breakpoints 2, 3, 4 and 5 are hit several times before the crash usually appears. I think I could get rid of the first four breakpoints, as the number 3012 for breakpoint 5 seems to be quite stable. It's somewhere in this blas_server routine starting at line 768 where I get the crash. On the one hand, the fact that it does not crash every time and that the crashes sometimes happen before the first breakpoint makes a hardware problem likely again. On the other hand, since it only happens at a certain problem size, I was wondering whether it might be a problem with the cache, either hardware or software. Maybe the breakpoints above can give a clue to someone who is more familiar with the code. Cheers Ralph |
Not reproduced so far on i7-8700K that uses the same Haswell kernels. Need to see if I have enough bits and pieces of an actual i7-47xx floating around to reassemble a working system. |
It sounds like undefined behavior in parallel execution. |
There must be hundreds of systems like that out there, given that all the refreshes, Kaby Lake etc. run Haswell kernels. If anything this would have to be a very recent bug, or numpy runs would be crashing systems all over the world. Perhaps building 0.3.7 for comparison would be an option. I still like to think of this as a hardware problem, faulty power supply or whatever, although I can only offer some vague handwaving to "explain" why it only happens with OpenBLAS. |
Hi, it's me again. I just reapplied the thermal paste on my CPU and to test if it was OK, I ran prime95 again. This time I did not run all tests, but only chose the small FFTs as suggested by @Diazonium . I am sorry, I did waste your time after all. Greetings Ralph |
Before you throw your PC in the dumpster, you could try to get a new power supply. It is a common component that slowly degrades over time in terms of output capacity, thus resulting in instability under peak load. |
In principle, all of the software mentioned uses AVX2 extensively, which means that the CPU is pushed to its thermal and power limits. |
@brada4 Yes, I know, and I am using lots of software that stresses the CPU to 100% and uses AVX2, but so far OpenBLAS has been the only software to trigger the reset behavior. @Diazonium Yes thanks, good idea. On the other hand, this was a good excuse to finally order some new parts. (-: Thanks again for all the help and hints. Should I close this thread? Greetings Ralph |
Well, I'd have assumed that swapping the power supply would be the easiest and cheapest thing to do... anyway, we can always reopen this thread if there is again evidence for a software problem. |
Hi,
for a few weeks I have been experiencing funny behaviour which I think stems from OpenBLAS.
If I run machine learning workloads using numpy or tensorflow, my haswell system at home
reboots whenever I run certain tasks. This seems to depend on the environment I am running in.
When I set up a conda environment with numpy linked against mkl, the programs run just fine.
But when I run them in an environment where numpy is linked against openblas, my machine
reboots.
As the machine immediately reboots, I do not see any error messages, and running it in a debugger
won't help either.
Running the code within a kvm virtual machine on the same host also reboots the host upon running the code.
All my other systems are not affected by this. So I first thought it was the CPU that is breaking down or overheating, but other CPU intensive workloads run just fine.
My system is:
Ubuntu 19.10 x64
Haswell i7-4790K
Different conda environments with numpy either from conda (mkl) or from pip (openblas). It's the pip installations that always make my machine reboot.
Any idea of how I could pinpoint this problem so I can send a useful bug report?
Thanks
Ralph