ABI conflicts due to 64-bit libopenblas.so #4923

Closed
stevengj opened this issue Nov 25, 2013 · 106 comments · Fixed by #8734
Labels
building Build system, or building Julia or its dependencies needs decision A decision on this change is needed

Comments

@stevengj
Member

Julia compiles OpenBLAS to libopenblas.so. This may be a problem for calling libraries that link to a system libopenblas.so, because the runtime linker may substitute Julia's version instead. The problem is that Julia's version is compiled with a 64-bit interface, which is not the default, and so if an external library calls it expecting a 32-bit interface, a crash may result.

We encountered what appears to have been this problem on @alanedelman's machine (julia.mit.edu). He recently started experiencing crashes in PyPlot.plot that, with the help of valgrind, I tracked down to what appears to be the following:

==17855== Use of uninitialised value of size 8
==17855==    at 0xA8B6890: dgemm_beta_NEHALEM (in /home/edelman/julia/usr/lib/libopenblas.so)
==17855==    by 0xA082D72: dgemm_nn (in /home/edelman/julia/usr/lib/libopenblas.so)
==17855==    by 0x9F558C8: cblas_dgemm (in /home/edelman/julia/usr/lib/libopenblas.so)
==17855==    by 0x16430CA5: dotblas_matrixproduct (_dotblas.c:809)
==17855==    by 0x14BAB5D4: PyEval_EvalFrameEx (in /usr/lib/libpython2.7.so.1.0)

Apparently, Matplotlib is calling OpenBLAS (via NumPy: _dotblas.c is a NumPy file) with the 32-bit interface, but is getting linked at runtime into Julia's openblas library, which is compiled with a 64-bit interface. Recompiling Julia and openblas with USE_BLAS64=0 worked around the problem, but it would be better to avoid the conflict.
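
To make the failure mode concrete, here is a minimal sketch (not from the thread) of what can happen when a caller built for the 32-bit interface passes an integer by reference, as the Fortran BLAS convention does, and a library built for the 64-bit interface reads it. The values and the memory layout are assumptions for illustration: little-endian bytes, with an unrelated 32-bit value sitting right after the argument.

```python
import struct

# Caller (32-bit interface): stores n = 3 as a 4-byte integer.
# Hypothetical neighbour: an unrelated 4-byte value (7) right after it in memory.
memory = struct.pack("<ii", 3, 7)

# A callee built for the 32-bit interface reads 4 bytes and sees the intended value.
(n_as_int32,) = struct.unpack_from("<i", memory, 0)

# A callee built for the 64-bit interface reads 8 bytes from the same address,
# so the neighbouring value leaks into the high half of the dimension.
(n_as_int64,) = struct.unpack_from("<q", memory, 0)

print(n_as_int32)  # 3
print(n_as_int64)  # 3 + 7 * 2**32 = 30064771075
```

A dimension that large would send the routine far outside the caller's arrays. For by-value C entry points such as cblas_dgemm the mechanism differs, but the effect is similar in spirit: the callee uses 64 bits of which the caller only set 32.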

Can we just rename our libopenblas.so file to avoid any possible conflict in the runtime linker?

@stevengj
Member Author

Or is the problem worse than that? If I ccall a library that in turn calls cblas_dgemm, will it end up calling our OpenBLAS version even if it was originally linked to a completely different BLAS library (e.g. libblas.so)?

In that case, we might have to hack OpenBLAS to rename its exported functions (e.g. cblas_dgemm64 etcetera) since we changed the ABI.

@stevengj
Member Author

@xianyi, is there a way to tell OpenBLAS to add a prefix or suffix (e.g. 64) to all its exported symbols, to make it possible to link both the 32-bit and 64-bit ABI in the same executable?

@stevengj
Member Author

See also numpy/numpy#3916

@StefanKarpinski
Member

Wouldn't it make more sense to put the 64 after the cblas part – as in cblas64_dgemm?

@ViralBShah
Member

The ideal solution would be to have a separate 64-bit ABI and build both 32 and 64 bit versions in the same library.

@staticfloat
Member

@ViralBShah that is actually the best solution here. That would be wonderful!

@stevengj
Member Author

@StefanKarpinski, note that there is a Fortran dgemm ABI too, and to avoid conflicts you need to rename both C and Fortran (unless we are not linking the Fortran ABI?). But I don't think it really matters what the name looks like, as long as there is a simple deterministic rule and it can be implemented as automatically as possible in the openblas source code. I was just thinking that a suffix might be easier to automate for both C and Fortran ABIs.
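
As a sketch of what such a "simple deterministic rule" could look like, the snippet below appends one suffix to every exported name, whether it belongs to the Fortran or the C interface. The suffix `64_` and the helper name `ilp64_name` are hypothetical choices for illustration; the thread had not settled on a convention at this point.

```python
# Hypothetical suffix for the 64-bit-integer symbols (an assumption, not a decision).
ILP64_SUFFIX = "64_"


def ilp64_name(symbol: str, suffix: str = ILP64_SUFFIX) -> str:
    """Return the renamed export for a BLAS symbol by appending the suffix."""
    return symbol + suffix


originals = ["dgemm_", "dcopy_", "cblas_dgemm"]
renamed = [ilp64_name(name) for name in originals]

for old, new in zip(originals, renamed):
    print(old, "->", new)

# The renamed set does not overlap the original names, so both could coexist.
print(set(originals).isdisjoint(renamed))
```

Because the rule is a plain append, it treats both interfaces identically and needs no per-function table.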

@ViralBShah
Member

Currently we use the fortran abi only.

@ViralBShah
Member

I wonder if we can somehow make matplotlib use its own blas. While we may be able to do all sorts of gymnastics with openblas, it will be difficult to do the same with vendor provided BLAS.

@stevengj
Member Author

The other alternative would be to recompile our own numpy, but that makes installing PyCall much more of a pain.

@stevengj
Member Author

@ViralBShah, does MKL provide the 64-bit ABI?

@StefanKarpinski
Member

The other alternative would be to recompile our own numpy, but that makes installing PyCall much more of a pain.

Not to mention that the amount of stuff we compile ourselves is getting slightly ridiculous. But it's hard to avoid.

@ViralBShah
Member

I believe MKL does have a 64-bit ABI - but not 100% sure. @andreasnoackjensen ?

@ViralBShah
Member

I thought about recompiling numpy, but that is even more inconvenient.

@andreasnoack
Member

I am not sure what exactly ABI means, but MKL has 32-bit integers in the *lp64 libraries and 64-bit integers in the *ilp64 libraries. The symbols have the same names.

@xianyi

xianyi commented Nov 27, 2013

It's easy to add a prefix or suffix for 64-bit (ilp64) ABI. However, I am not sure OpenBLAS can support lp64 and ilp64 in one binary.

For MKL, you need to link the application with a different interface layer library, e.g. libmkl_intel_lp64.so or libmkl_intel_ilp64.so.

@stevengj
Member Author

I think adding a prefix or suffix to the ilp64 OpenBLAS interface would already be a big help. @xianyi, assuming that such a suffix were added, what would go wrong if both the 32- and 64-bit OpenBLAS libraries were linked simultaneously?

@stevengj
Member Author

stevengj commented Jan 3, 2014

@xianyi, is there any hope of progress on this?

@ViralBShah
Member

Would naming the 64-bit version something like libopenblas_ilp64.so solve this?

@stevengj
Member Author

stevengj commented Jan 4, 2014

@ViralBShah, I'm not sure, but I doubt it. If you load two shared libraries which export the same symbol (e.g. dgemm_) but with a different ABI, aren't there still going to be conflicts even if the libraries have different names? (At least if the libraries are loaded with RTLD_GLOBAL?)
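
The concern can be illustrated with a deliberately simplified toy model of a single global symbol scope, in which the first loaded library that exports a name wins. This is only a sketch: real dynamic linkers have more rules (per-library scopes, RTLD_LOCAL, and so on), and the library and symbol tables below are hypothetical.

```python
def resolve(load_order, symbol):
    """Toy model of one global symbol scope: the first loaded library that
    exports `symbol` wins, regardless of the library's file name."""
    for library, exports in load_order:
        if symbol in exports:
            return library, exports[symbol]
    raise LookupError(symbol)


# Hypothetical libraries: same symbol name, different integer width.
julia_blas = ("libopenblas_ilp64.so", {"dgemm_": "64-bit integers"})
system_blas = ("libopenblas.so", {"dgemm_": "32-bit integers"})

# Julia's BLAS is loaded first, so a later lookup of dgemm_ still finds it,
# even though the file has a different name.
print(resolve([julia_blas, system_blas], "dgemm_"))

# With a renamed export, the two definitions no longer collide.
julia_renamed = ("libopenblas_ilp64.so", {"dgemm_64_": "64-bit integers"})
print(resolve([julia_renamed, system_blas], "dgemm_"))
```

In this model, renaming the file changes nothing, while renaming the symbol removes the collision.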

@ViralBShah
Member

The easier thing then for now would be to just use the 32-bit version of openblas with IJulia, if that works.

@stevengj
Member Author

stevengj commented Jan 4, 2014

Nassty 32-bit limits, we hates them forever!

Anyway, it's not just IJulia, since PyCall and Numpy can be used anywhere. And 32-bit vector size limits cause their own problems.
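
For a sense of scale, a short arithmetic sketch of the 32-bit limit follows. The vector length is a hypothetical example, chosen to be one past the largest signed 32-bit value.

```python
import ctypes

INT32_MAX = 2**31 - 1
n = 2**31  # hypothetical vector length, one past the 32-bit limit

# The length does not fit in a signed 32-bit integer ...
print(n > INT32_MAX)  # True

# ... and truncating it to 32 bits wraps around to a negative value.
wrapped = ctypes.c_int32(n).value
print(wrapped)  # -2147483648

# A vector of that many 8-byte floats occupies 16 GiB.
size_gib = n * 8 / 2**30
print(size_gib)  # 16.0
```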

@tkelman
Contributor

tkelman commented Mar 16, 2014

+1, we ran into a very similar issue here too: jump-dev/Ipopt.jl#1 (comment)
This was an instance of the following (here dcopy_ instead of cblas_dgemm, but same idea):

If I ccall a library that in turn calls cblas_dgemm, will it end up calling our OpenBLAS version even if it was originally linked to a completely different BLAS library (e.g. libblas.so)?

Any library linking to any LP64 shared BLAS/LAPACK/etc. library can run into name shadowing, and then segfaults or other incorrect behavior, when ccalled by Julia, because of the ILP64 OpenBLAS. Statically linking LP64 reference BLAS/LAPACK into the dependency library solves the issue in the case of Ipopt, but is not an ideal solution.

Since #5291 was merged there are now a handful of calls to cblas functions; otherwise I was going to suggest we could try co-opting OpenBLAS's mechanism for handling trailing underscores as a potential way of attempting this.

@stevengj
Member Author

We could always just patch the openblas source with a global s/cblas/jl_cblas/ substitution.

@mlubin
Member

mlubin commented Mar 17, 2014

Isn't this mostly a visibility issue? Can we restrict openblas's symbols to not be visible to dlopen'ed shared libraries?

@stevengj
Member Author

@mlubin, you're right that this would be the simplest option, if we can do it on all the relevant platforms. Is there a magic linker flag for this (analogous to RTLD_LOCAL in dlopen)?

@pao
Member

pao commented Mar 17, 2014

Looks like if you want to avoid patching you need to use a linker script.

@nalimilan
Member

Yeah, weird idea... It's also true that searching for e.g. dgemv_64_ doesn't give results other than the SuiteSparse file, so it doesn't look so popular.

@tkelman
Contributor

tkelman commented Oct 19, 2014

And people using ILP64 BLAS libraries from Fortran have to worry about compiler-dependent name mangling.

Anyway, I've got step 2 from #4923 (comment) mostly done, I think I'll post a WIP PR soon so people can look at it.

@stevengj
Member Author

Since there is no technical reason to prefer one suffix over another as far as I can see, any little thing tips the scales, so I would go with the SUN64 convention.

@nbecker

nbecker commented Oct 22, 2014

A good read on usage of versioned elf shared libs:

http://www.akkadia.org/drepper/dsohowto.pdf


@tkelman
Contributor

tkelman commented Jan 21, 2015

Footnote here - for a similar setup on Linux, apparently RTLD_LAZY | RTLD_LOCAL | RTLD_DEEPBIND works.

http://list.coin-or.org/pipermail/ipopt/2015-January/003931.html
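
A minimal sketch of passing that flag combination to dlopen follows, using Python's ctypes on a Unix-like system. The C math library is only a stand-in for the dependency being loaded (an assumption for illustration). RTLD_DEEPBIND is a glibc extension, so the sketch falls back to 0 where Python does not expose it.

```python
import ctypes
import ctypes.util
import os

# RTLD_DEEPBIND is glibc-specific; use 0 where the os module does not define it.
flags = os.RTLD_LAZY | os.RTLD_LOCAL | getattr(os, "RTLD_DEEPBIND", 0)

# Stand-in library: the C math library. If find_library returns None,
# CDLL(None) opens the running process itself, which also provides cos.
libm = ctypes.CDLL(ctypes.util.find_library("m"), mode=flags)

# Declare the signature before calling, since ctypes assumes int by default.
libm.cos.restype = ctypes.c_double
libm.cos.argtypes = [ctypes.c_double]

result = libm.cos(0.0)
print(result)  # 1.0
```

RTLD_LOCAL keeps the loaded library's symbols out of the global scope, and RTLD_DEEPBIND makes the library prefer its own symbols over global ones with the same name.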

@stevengj
Member Author

Does anyone have contacts at Intel who they can bug about offering a similar renamed-symbol option for MKL?

I'm tired of people having crashes when using numpy or PyPlot from Julia MKL builds.

@ViralBShah
Member

I do. Not sure how much it will help.

@stevengj
Member Author

At least they should be made aware of it — it's a huge flaw in their current offerings because it makes MKL non-composable.

@stevengj
Member Author

Also, maybe MKL builds of Julia should default to using the 32-bit interface. The number of people who need the 64-bit interface is probably dwarfed by the number of people who will be bitten by library conflicts if they link other code that uses BLAS.

@ViralBShah
Member

That is a good idea.
