
Breaking compatibility with fortran-only BLAS interfaces #137

Closed
staticfloat opened this issue Aug 19, 2014 · 22 comments

Comments

@staticfloat
Member

We seem to have broken compatibility with fortran-interface-only BLAS libraries. Specifically, we use cblas_zdotc_sub and cblas_cdotc_sub, which don't exist in some BLAS implementations such as CentOS 7's default blas package. Is this something we want to work around, revert, or just live with?

@tkelman

tkelman commented Aug 19, 2014

Are there any cblas symbols in that package? I thought most distribution blas packages were actually cblas these days. The difficulty is the calling convention for the complex return type of the Fortran `{cz}dot{cu}_` functions. Can you get the right answer from a complex-returning dot product from that library?

Can we solve JuliaLang/julia#5283 in a portable way with only Fortran interfaces? Should we add fcall JuliaLang/julia#2167 to the 0.4 todo list?
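To make the calling-convention difference concrete, here is a sketch with pure-Julia stand-ins (the real symbols are `cblas_zdotc_sub` and `zdotc_` in libblas; these mocks only illustrate how the two interfaces differ, and the names are made up):

```julia
# CBLAS "sub" style: the result comes back through an output argument,
# which sidesteps the complex-return ABI question entirely.
function mock_cblas_zdotc_sub!(n, x, incx, y, incy, result::Ref{ComplexF64})
    result[] = sum(conj(x[1 + (i - 1) * incx]) * y[1 + (i - 1) * incy] for i in 1:n)
    return nothing
end

# Fortran style under the gfortran convention: the complex value is
# returned directly. Under the f2c convention the same routine would
# instead take a hidden result pointer as its first argument, which is
# why calling it with the wrong assumption crashes.
function mock_zdotc(n, x, incx, y, incy)
    return sum(conj(x[1 + (i - 1) * incx]) * y[1 + (i - 1) * incy] for i in 1:n)
end
```

Both conventions compute the same value; the incompatibility is purely in how that value travels back to the caller.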

@staticfloat
Member Author

There are no cblas symbols in the package, and there isn't a cblas package I can install or anything like that, as far as I can tell. (I'm definitely not an expert on CentOS however so perhaps there's some analogue to a PPA or something that I should be adding)

@andreasnoack
Member

I don't think this is something we need to consider a problem, because reference BLAS is so slow. How did you encounter the problem, @staticfloat?

(I also think we see many issues with CentOS.)

@tkelman

tkelman commented Aug 19, 2014

Looks like this actually works okay on my RHEL5 system:

|__/                   |  x86_64-redhat-linux6E

julia> z1 = complex(randn(3), randn(3))
3-element Array{Complex{Float64},1}:
  -1.37362-0.897327im
 -0.689447-0.362496im
  -1.12321+0.0945708im

julia> z2 = complex(randn(3), randn(3))
3-element Array{Complex{Float64},1}:
  0.456156+0.213202im
 -0.228481+1.52632im
 -0.307064+1.53459im

julia> n = 3; inc1 = 1; inc2 = 1;

julia> Base.LinAlg.BLAS.dotc(n, z1, inc1, z2, inc2)
-0.7236351977356769 - 2.7132943224059685im

julia> ccall((:zdotc_, :libblas), Complex128, (Ptr{Int32}, Ptr{Complex128}, Ptr{Int32}, Ptr{Complex128}, Ptr{Int32}), &n, z1, &inc1, z2, &inc2)
-0.7236351977356767 - 2.713294322405968im

So it might just be an issue with Accelerate using the f2c calling convention? Care to try the above on Mac?

@andreasnoack
Member

It crashes. This also used to be a problem with MKL when we built with gfortran; however, if everything is built with the Intel compilers, it shouldn't be a problem to use the Fortran version.

@tkelman

tkelman commented Aug 20, 2014

Isn't this what gfortblas.c is supposed to fix? Or we could have a look at @mcg1969's https://github.com/mcg1969/vecLibFort if what we have now isn't sufficient.

@andreasnoack
Member

I'm not sure if it covers this completely. For a long time, complex `dot` didn't call BLAS at all.

I just remembered this comment JuliaLang/julia#5283 (comment) saying that we still cannot rely on complex return values on 32-bit systems. If you look here, you can see that the 32-bit case is left out.

@mcg1969

mcg1969 commented Aug 20, 2014

If there is anything useful within vecLibFort, I would say it would be preferable to duplicate it using Julia's native C interface capability rather than linking to my code. And I would be happy to help with that.

@mcg1969

mcg1969 commented Aug 20, 2014

The core issues are these: 1) Accelerate returns double-precision results from single-precision functions; and 2) Accelerate uses the f2c calling convention for complex values (the first argument is a pointer to the result).
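Issue (1) can be illustrated with a rough Julia-side mock (this is a simplification of what actually happens in the return register on x86-64, and both function names are made up for illustration):

```julia
# Under the f2c convention, single-precision routines like sdot return a
# C double. Mock of such a routine, accumulating and returning Float64:
f2c_style_sdot(x, y) = Float64(sum(Float64(a) * Float64(b) for (a, b) in zip(x, y)))

# A caller expecting the gfortran convention (a true Float32 return) would
# reinterpret the low 32 bits of that Float64's bit pattern as a Float32,
# reading garbage rather than the intended value:
misread(v::Float64) = reinterpret(Float32, reinterpret(UInt64, v) % UInt32)
```

For example, `misread` of a small integer-valued double yields `0.0f0`, because the low half of the double's mantissa is all zeros; for most values the misread result is simply noise.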

@vtjnash
Member

vtjnash commented Aug 21, 2014

gfortblas.c doesn't care about LAPACK functions, since we always compile our own updated version of LAPACK anyway, so it is a bit smaller than vecLibFort but otherwise essentially identical.

The 32-bit issue should be fixed fairly soon, when JuliaLang/julia#7906 merges

@andreasnoack
Member

Thank you for the clarifications. Then it appears that we can avoid cblas almost for free. The only casualty is gfortran+MKL, which, I think, we don't recommend anyway.

@tkelman

tkelman commented Aug 21, 2014

I tried gfortran+MKL somewhat recently when Viral was cleaning things up for icc/ifort and it's pretty broken right now anyway. Probably because we're setting things up to use libmkl_rt instead of libmkl_gf_...

@tkelman

tkelman commented Sep 14, 2016

Maybe we should revisit whether it's worth calling blas for dot?

@ViralBShah
Member

Pretty sure we can avoid calling blas for dot.

@andreasnoack
Member

On my machine, a Julia dot seems to be uniformly faster for Float64 relative to OpenBLAS but slightly slower for Complex{Float64}.

julia> x, y = complex(randn(n), randn(n)), complex(randn(n), randn(n));

julia> @benchmark BLAS.dotc($x, $y)
BenchmarkTools.Trial:
  samples:          7014
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  96.00 bytes
  allocs estimate:  1
  minimum time:     685.01 μs (0.00% GC)
  median time:      698.54 μs (0.00% GC)
  mean time:        711.15 μs (0.00% GC)
  maximum time:     2.35 ms (0.00% GC)

julia> @benchmark mydot($x, $y)
BenchmarkTools.Trial:
  samples:          5424
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     867.69 μs (0.00% GC)
  median time:      884.09 μs (0.00% GC)
  mean time:        919.84 μs (0.00% GC)
  maximum time:     2.27 ms (0.00% GC)

It's not much, and hopefully we can improve the Julia version a bit, so I think it's fine to switch.
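The `mydot` benchmarked above isn't shown in the thread; a minimal sketch of what such a native Julia complex dot product might look like (the name, annotations, and body here are assumptions, not the code actually benchmarked):

```julia
# Hypothetical native complex dot product, conjugating the first argument
# as BLAS dotc does.
function mydot(x::AbstractVector{T}, y::AbstractVector{T}) where {T<:Complex}
    length(x) == length(y) || throw(DimensionMismatch("vectors must have the same length"))
    s = zero(T)
    @inbounds @simd for i in eachindex(x)
        s += conj(x[i]) * y[i]
    end
    return s
end
```

The zero allocations in the benchmark are consistent with a plain loop like this, versus the one small allocation the `ccall`-based `BLAS.dotc` makes.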

@mcg1969

mcg1969 commented Sep 14, 2016

Is the native Julia dot implementation multithreaded?

@KristofferC
Member

No, but neither is OpenBLAS's, IIRC.

@mcg1969

mcg1969 commented Sep 14, 2016

Ah, fair enough. You might want to ask the Numba developers, though; I seem to remember they found circumstances where BLAS dot had some advantages, at least for certain vector sizes. I can't recall why.

@andreasnoack
Member

andreasnoack commented Sep 15, 2016

I don't think the complex Julia version vectorizes as well as it should.

@eschnett
Contributor

I looked for a manually vectorized version in OpenBLAS, but couldn't find any. I assume that it just uses a plain Fortran loop, and that that Fortran code is then well vectorized. Is that a difference between GCC and LLVM? Or is there something in our complex multiplication or complex conjugation definitions that prevents vectorization? Does it vectorize if we use @fastmath?
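One way to probe that question is to annotate the loop and inspect the generated code (the function name is illustrative, not from the thread):

```julia
# Complex dot product with @fastmath, which licenses the reassociation
# LLVM needs to vectorize the complex accumulation.
function dotc_fast(x, y)
    s = zero(eltype(x))
    @fastmath @inbounds for i in eachindex(x, y)
        s += conj(x[i]) * y[i]
    end
    return s
end
# Inspecting with (from InteractiveUtils):
#   code_llvm(dotc_fast, Tuple{Vector{ComplexF64}, Vector{ComplexF64}})
# shows whether LLVM emitted <2 x double> / <4 x double> vector operations.
```

Since `@fastmath` permits reordering, results may differ from the strict loop in the last bits, so comparisons should use `isapprox`.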

@Sacha0
Member

Sacha0 commented Dec 22, 2016

Unlikely to receive attention prior to 0.6. Best!

@ViralBShah
Member

Referring to the original post: I think we are going to keep what we have, and it hasn't been an issue for a while. Closing; please reopen if necessary.

@KristofferC KristofferC transferred this issue from JuliaLang/julia Nov 26, 2024
9 participants