OpenBLAS 4 times slower than MKL on DDOT() #530
The vector size I used is representative of DSP code that I run frequently and that takes hours to complete. Some people may suggest refactoring my code to use a matrix multiply instead of a dot product.
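A minimal sketch of that suggested refactor (illustrative names and sizes, not taken from this report): batching several dot products against the same vector into one matrix-vector product lets OpenBLAS's threaded dgemv do the work instead of repeated ddot calls.

```julia
# Illustrative only: m dot products against the same y, first as separate
# dot-product calls, then batched as a single matrix-vector product.
# Sizes and names here are assumptions, not from the original report.
using LinearAlgebra

n, m = 1_000_000, 8
X = rand(n, m)                              # m column vectors of length n
y = rand(n)

d1 = [dot(view(X, :, j), y) for j in 1:m]   # m separate dot-product calls
d2 = X' * y                                 # one matrix-vector product (dgemv)
@assert d1 ≈ d2
```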
Another user confirmed this problem.
@hiccup7, thank you for the feedback. I think OpenBLAS doesn't parallelize the ddot function. I will parallelize this function next week.
@hiccup7, could you try the develop branch?
@xianyi , thank you for your quick update to the code. I noticed that you improved the SDOT() function also, which is very helpful. Do I understand correctly that I would need to replace only the libopenblas.dll file in my WinPython environment with a new one from the develop branch? I would be glad to do the testing, but I don't know how to do the build. I always use binary distributions, such as WinPython, and I don't have time to learn the build process with my full-time job. Options I see are:
The Julia team built libopenblas.dll from the develop branch on April 16th, which is long after the changes were committed to fix this issue. As I documented in JuliaLang/julia#10780, I got the same performance results as with OpenBLAS v0.2.14. Thus, this issue remains unresolved because of problems in the OpenBLAS develop branch.
I tried this on openSUSE with the default Python 2.7 and Julia 0.4.1 on an i5-4430.
Ref JuliaLang/LinearAlgebra.jl#72 for some past performance numbers (I suspect just linking to OpenBLAS from C would show the same results; other than our build-time configuration of OpenBLAS we aren't doing anything unusual at runtime in Julia here), though that current test case is segfaulting due to JuliaLang/julia#697.
Surprisingly, in native Julia code with threading we get up to 8x faster dot products compared to OpenBLAS; ref https://discourse.julialang.org/t/innefficient-paralellization-need-some-help-optimizing-a-simple-dot-product/9723/20. I thought it'd be good to revive this issue, then!
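A minimal sketch (my own, not the exact code from that Discourse thread) of the kind of threaded dot product being discussed, assuming Julia is started with multiple threads (e.g. `JULIA_NUM_THREADS=4`):

```julia
# Sketch of a chunked, multithreaded dot product in plain Julia.
# Each thread accumulates a partial sum over a contiguous slice of the vectors.
function threaded_dot(x::Vector{Float64}, y::Vector{Float64})
    length(x) == length(y) || throw(DimensionMismatch("vectors must match"))
    nt = Threads.nthreads()
    partial = zeros(Float64, nt)
    Threads.@threads for t in 1:nt
        lo = div((t - 1) * length(x), nt) + 1
        hi = div(t * length(x), nt)
        s = 0.0
        @inbounds @simd for i in lo:hi
            s += x[i] * y[i]
        end
        partial[t] = s
    end
    return sum(partial)
end
```

Called as `threaded_dot(x, y)` on two equally sized `Vector{Float64}`s; with one thread it reduces to an ordinary SIMD loop.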
Concerning 1: the comment right after the one I linked to was run on a Linux machine. And 2: on Linux with an Intel Xeon E3-1230 v5 I get the results listed below for a threaded dot product on doubles (Float64). Measurements were done with BenchmarkTools.jl.

Size

| | 1 thread | 2 threads | 4 threads |
| --- | --- | --- | --- |
| OpenBLAS | 7.169 ms | 7.209 ms | 7.183 ms |
| Julia | 7.170 ms | 6.109 ms | 6.058 ms |
Size n = 1 000 000
Julia w/o threading: 551.457 μs
| | 1 thread | 2 threads | 4 threads |
| --- | --- | --- | --- |
| OpenBLAS | 565.068 μs | 565.750 μs | 574.344 μs |
| Julia | 552.558 μs | 397.631 μs | 371.471 μs |
Size n = 100 000
Julia w/o threading: 21.637 μs
| | 1 thread | 2 threads | 4 threads |
| --- | --- | --- | --- |
| OpenBLAS | 22.788 μs | 22.823 μs | 22.793 μs |
| Julia | 23.185 μs | 12.093 μs | 6.392 μs |
Size n = 10 000
Julia w/o threading: 1.649 μs
| | 1 thread | 2 threads | 4 threads |
| --- | --- | --- | --- |
| OpenBLAS | 1.679 μs | 1.570 μs | 1.533 μs |
| Julia | 1.884 μs | 1.250 μs | 1.080 μs |
So for vector sizes in the range 100 000 to 1 000 000 there's a lot to gain with threading on my architecture!
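As a rough guide (not the commenter's actual script), measurements like those in the tables above could be reproduced with BenchmarkTools.jl along these lines, using the hypothetical `threaded_dot` sketched in the earlier comment:

```julia
# Sketch of benchmarking OpenBLAS ddot against the threaded_dot sketch above.
# Vector sizes follow the tables in this comment; output formatting is illustrative.
using BenchmarkTools
using LinearAlgebra: BLAS

for n in (10_000, 100_000, 1_000_000)
    x = rand(n); y = rand(n)
    t_blas  = @belapsed BLAS.dot($n, $x, 1, $y, 1)   # ddot via OpenBLAS
    t_julia = @belapsed threaded_dot($x, $y)         # threaded native-Julia version
    println("n = $n: OpenBLAS ", round(t_blas * 1e6, digits = 3),
            " μs, Julia ", round(t_julia * 1e6, digits = 3), " μs")
end
```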
You can try the hack from my PR if you like :-)
28 seconds for OpenBLAS in Julia:
7.5 seconds for MKL in Python:
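The scripts behind these two numbers are not included above. Purely as a hypothetical reconstruction (the vector length, repetition count, and loop structure are all assumptions, not the reporter's code), the Julia side of such a comparison might look like:

```julia
# Hypothetical sketch only; the original benchmark script is not shown in this thread.
using LinearAlgebra: BLAS

function repeated_ddot(x, y, reps)
    n = length(x)
    s = 0.0
    for _ in 1:reps
        s += BLAS.dot(n, x, 1, y, 1)   # ddot via the BLAS Julia links against
    end
    return s
end

x = rand(100_000); y = rand(100_000)   # assumed size
@time repeated_ddot(x, y, 100_000)     # assumed repetition count
```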
Tested environment is WinPython-64bit-3.4.3.2FlavorJulia at http://sourceforge.net/projects/winpython/files/WinPython_3.4/3.4.3.2/flavors/
The same Python time was measured in 64-bit Anaconda3 v2.1.0.
From `versioninfo(true)` in Julia:

Julia Version 0.3.7
System: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-4700HQ CPU @ 2.40GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)