BUG: GESDD fails when GESVD succeeds, depends on number of threads #3044

Closed
larsoner opened this issue Dec 18, 2020 · 15 comments

@larsoner
Contributor

On latest master:

$ make
...
 OpenBLAS build complete. (BLAS CBLAS LAPACK LAPACKE)

  OS               ... Linux             
  Architecture     ... x86_64               
  BINARY           ... 64bit                 
  C compiler       ... GCC  (cmd & version : cc (Ubuntu 10.2.0-13ubuntu1) 10.2.0)
  Fortran compiler ... GFORTRAN  (cmd & version : GNU Fortran (Ubuntu 10.2.0-13ubuntu1) 10.2.0)
  Library Name     ... libopenblas_haswellp-r0.3.13.dev.a (Multi-threading; Max num-threads is 8)

Then

$ export OPENBLAS_NUM_THREADS=2
$ python
>>> import numpy as np, scipy.linalg as linalg
>>> linalg.svd(np.loadtxt('204.txt', delimiter=','), lapack_driver='gesdd')
...  # works
$ export OPENBLAS_NUM_THREADS=1
$ python
>>> import numpy as np, scipy.linalg as linalg
>>> linalg.svd(np.loadtxt('204.txt', delimiter=','), lapack_driver='gesdd')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/larsoner/python/scipy/scipy/linalg/decomp_svd.py", line 129, in svd
    raise LinAlgError("SVD did not converge")
numpy.linalg.LinAlgError: SVD did not converge

And also:

>>> linalg.svd(np.loadtxt('204.txt', delimiter=','), lapack_driver='gesvd')
...  # works

So there is something about the combination of using 1 thread with the GESDD driver on this 204x204 array that causes it to fail.
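The four combinations above (two drivers, two thread counts) can be run in one go. The following is a minimal sketch, assuming that OpenBLAS reads `OPENBLAS_NUM_THREADS` when the library is loaded, which is why each case gets a fresh interpreter. The helper names `build_env`, `run_case`, and `run_matrix` are made up for illustration.

```python
import os
import subprocess
import sys

# Child snippet: loads the matrix and runs one SVD driver. The path and the
# driver name arrive via argv so the same snippet serves every case.
CHILD = (
    "import sys, numpy as np, scipy.linalg as linalg\n"
    "a = np.loadtxt(sys.argv[1], delimiter=',')\n"
    "linalg.svd(a, lapack_driver=sys.argv[2])\n"
)


def build_env(num_threads):
    """Copy the current environment with OPENBLAS_NUM_THREADS overridden."""
    env = dict(os.environ)
    env["OPENBLAS_NUM_THREADS"] = str(num_threads)
    return env


def run_case(path, driver, num_threads):
    """Return True if the child interpreter finished the SVD without error."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD, path, driver],
        env=build_env(num_threads),
        capture_output=True,
        text=True,
    )
    return proc.returncode == 0


def run_matrix(path, drivers=("gesdd", "gesvd"), thread_counts=(1, 2)):
    """Map each (driver, threads) pair to whether its SVD succeeded."""
    return {(d, n): run_case(path, d, n) for d in drivers for n in thread_counts}
```

Calling `run_matrix('204.txt')` would return a dict keyed by `(driver, threads)`. Note that a `False` entry only means the child exited with an error; it does not distinguish "SVD did not converge" from, say, a missing file, so the child's stderr would need to be inspected for that.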

@martin-frbg
Collaborator

Unusual that a bug would appear in single-threaded mode, when there should be less to go wrong. Is "this 204x204 array" available somewhere? (And are you sure that your numpy actually picked up the 0.3.13 you built, rather than whatever Ubuntu has linked through its alternatives mechanism?)

@larsoner
Contributor Author

larsoner commented Dec 18, 2020

Agreed it's weird. Sorry I forgot to upload the problematic file:

204.txt

I ran make and then sudo make install for OpenBLAS at commit b26e32c, installing to the default /opt/OpenBLAS, then built NumPy from source with site.cfg as:

[openblas]
libraries = openblas
library_dirs = /opt/OpenBLAS/lib
include_dirs = /opt/OpenBLAS/include
runtime_library_dirs = /opt/OpenBLAS/lib

And it looks like everything is in order:

$ ldd numpy/linalg/lapack_lite.cpython-38-x86_64-linux-gnu.so 
...
	libopenblas.so.0 => /opt/OpenBLAS/lib/libopenblas.so.0 (0x00007fbd980fc000)
...
$ ls -al /opt/OpenBLAS/lib/libopenblas.so.0
lrwxrwxrwx 1 root root 35 Dec 18 08:38 /opt/OpenBLAS/lib/libopenblas.so.0 -> libopenblas_haswellp-r0.3.13.dev.so
$ ls -al /opt/OpenBLAS/lib/libopenblas_haswellp-r0.3.13.dev.so 
-rwxr-xr-x 1 root root 14140208 Dec 18 08:37 /opt/OpenBLAS/lib/libopenblas_haswellp-r0.3.13.dev.so
$ python -c "import numpy; numpy.show_config()"
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/OpenBLAS/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/opt/OpenBLAS/lib']
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/OpenBLAS/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/opt/OpenBLAS/lib']
    extra_compile_args = ['-march=native']
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/OpenBLAS/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/opt/OpenBLAS/lib']
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/OpenBLAS/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/opt/OpenBLAS/lib']
    extra_compile_args = ['-march=native']
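The `ldd` and `show_config()` output above describes what NumPy was linked against. A runtime cross-check on Linux could look at which shared objects the running process has actually mapped. This is a minimal sketch, not something from the thread; `parse_maps` and `mapped_libraries` are hypothetical helper names, and the approach assumes a Linux `/proc` filesystem.

```python
import re


def parse_maps(lines, pattern="openblas"):
    """Return sorted unique file paths in /proc/<pid>/maps lines matching pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    found = set()
    for line in lines:
        # File-backed mappings have six fields:
        # address, perms, offset, dev, inode, pathname.
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if regex.search(path):
                found.add(path)
    return sorted(found)


def mapped_libraries(pattern="openblas", maps_path="/proc/self/maps"):
    """Inspect the current process; call this after importing numpy."""
    with open(maps_path) as maps:
        return parse_maps(maps, pattern)
```

After `import numpy`, `mapped_libraries()` should list the OpenBLAS file that was really loaded, which helps rule out a system copy being picked up instead of the one in /opt/OpenBLAS/lib.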

@larsoner
Contributor Author

And note that I have the same problem with NumPy's linalg calls directly (I just showed SciPy's linalg because it allows switching between GESDD and GESVD backends, whereas NumPy just uses GESDD I think):

>>> import numpy as np
>>> np.linalg.svd(np.loadtxt('204.txt', delimiter=','))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<__array_function__ internals>", line 5, in svd
  File "/home/larsoner/python/numpy/numpy/linalg/linalg.py", line 1660, in svd
    u, s, vh = gufunc(a, signature=signature, extobj=extobj)
  File "/home/larsoner/python/numpy/numpy/linalg/linalg.py", line 97, in _raise_linalgerror_svd_nonconvergence
    raise LinAlgError("SVD did not converge")
numpy.linalg.LinAlgError: SVD did not converge

@brada4
Contributor

brada4 commented Dec 18, 2020

Try to import with dtype="double"; at least it converges like a charm in R (where all floats are doubles) with something like svd(read.csv("file.txt", header=FALSE))$d
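For reference, `np.loadtxt` already defaults to `dtype=float`, which is double precision, so the calls earlier in the thread were most likely operating on doubles already. A small sketch to confirm this, using a made-up 2x2 sample in place of 204.txt:

```python
import io

import numpy as np

# Tiny stand-in for 204.txt (hypothetical data, same comma-separated layout).
sample = io.StringIO("1.0,2.0\n3.0,4.0\n")

default = np.loadtxt(sample, delimiter=",")
sample.seek(0)
explicit = np.loadtxt(sample, delimiter=",", dtype=np.float64)

# Both arrays should report the same double-precision dtype.
print(default.dtype, explicit.dtype)

# Singular values only, mirroring the `$d` component of R's svd().
print(np.linalg.svd(explicit, compute_uv=False))
```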

@larsoner
Contributor Author

@brada4 what CPU do you have, in case it matters? Mine is:

$ cat /proc/cpuinfo | grep 'model name' | uniq 
model name	: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
$ cat /sys/devices/cpu/caps/pmu_name
skylake

Maybe it's a problem that I end up with a libopenblas_haswellp build given that I'm on Skylake?

@brada4
Contributor

brada4 commented Dec 18, 2020

I tried on a Broadwell E5-2620.
Skylake v1 is served by the same HASWELL kernels for sure; it has no AVX512 to qualify as SkylakeX.
Only thing you need to upgrade microcode at least once in a lifetime on those CPUs to get hyperthreading right:
https://lists.debian.org/debian-devel/2017/06/msg00351.html
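The AVX512 point above can be checked from `/proc/cpuinfo`. This is a rough sketch, assuming that the `avx512f` flag is a reasonable proxy for "would qualify as SkylakeX"; it is not OpenBLAS's own detection logic, and the function names are hypothetical.

```python
def cpu_flags(cpuinfo_text):
    """Return the flag set from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_avx512(cpuinfo_text):
    """True if the AVX-512 Foundation flag (avx512f) is reported."""
    return "avx512f" in cpu_flags(cpuinfo_text)


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the raw cpuinfo text (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

On a Linux machine, `has_avx512(read_cpuinfo())` returning `False` would be consistent with the library being built as `libopenblas_haswellp`.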

@larsoner
Contributor Author

Only thing you need to upgrade microcode at least once in a lifetime on those CPUs to get hyperthreading right:

I have never (manually) upgraded CPU microcode -- are you suggesting that this might be causing the bug? From a quick naive search I already have intel-microcode installed and see:

$ sudo dmesg | grep microcode
[    0.000000] microcode: microcode updated early to revision 0xde, date = 2020-05-26
[    0.844975] microcode: sig=0x906e9, pf=0x2, revision=0xde
[    0.870779] microcode: Microcode Update Driver: v2.2.

So I think it might be up to date already?

@brada4
Contributor

brada4 commented Dec 18, 2020

date = 2020-05-26

It is perfect, let's deal with the software bug ;-)

@martin-frbg
Collaborator

Cannot reproduce this so far with older versions of gcc

@martin-frbg
Collaborator

Not reproducible here with gcc 10.2 either (Haswell target, python 3.6, numpy 1.14 in a CentOS 8.2 VM on i7-7500U)

@larsoner
Contributor Author

numpy 1.14 in a CentOS 8.2 VM

How did you get 1.14 to use latest OpenBLAS master? Did you build 1.14 from scratch for some reason? Or did you pip install it in which case it's probably using the version it shipped with/vendored?

I could try to figure out what version of OpenBLAS 1.14 used (it's almost 2 years old) and try using that version of OpenBLAS. If I also can't replicate there, it would give me at least some path to git bisecting the issue. Is that worth pursuing?
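If bisecting were pursued, the per-commit check could be automated with `git bisect run`, which treats exit code 0 as good, 125 as "skip this revision", and other codes from 1 to 127 as bad. Below is a minimal sketch of the check itself. It assumes NumPy is already linked against the freshly built OpenBLAS; rebuilding and reinstalling OpenBLAS at each step is not shown. The function name `svd_exit_code` is hypothetical.

```python
GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`


def svd_exit_code(path):
    """Classify one SVD attempt on the matrix at `path` for git bisect."""
    try:
        import numpy as np

        matrix = np.loadtxt(path, delimiter=",")
    except Exception:
        # This revision cannot be tested (missing NumPy, unreadable file, ...).
        return SKIP
    try:
        np.linalg.svd(matrix)
    except np.linalg.LinAlgError:
        # "SVD did not converge": the behaviour being hunted.
        return BAD
    return GOOD
```

A wrapper script would end with `sys.exit(svd_exit_code('204.txt'))` and would need to run with `OPENBLAS_NUM_THREADS=1`, since the failure only showed up single-threaded.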

@martin-frbg
Collaborator

I took their packaged numpy and overwrote the /usr/lib64/libopenblas.so it linked to (0.3.3 or something equally ancient) with my own build of the current develop branch. Will try to find time to retry with a more recent numpy on physical hardware later today

@martin-frbg
Collaborator

martin-frbg commented Dec 21, 2020

Not reproduced either with a recent snapshot of numpy (1.20.0.dev0+08edcad) built from source against current develop, now on a Ryzen 5 4600H. (Tried with both TARGET=ZEN and TARGET=HASWELL, of course.)

@martin-frbg
Collaborator

Can you retry with a more recent snapshot of OpenBLAS, please? In particular, the next commit after the one you built, c73d8ee, worked around an oddity seen with gcc -mfma.

@larsoner
Contributor Author

larsoner commented Jan 4, 2021

On c73d8ee I don't get any error! I also tried latest master and it's fixed there, too. And on 723776d which preceded c73d8ee, it's broken.

Any sense in adding this matrix as a test case? Either way I'll go ahead and close since it's fixed.
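One possible shape for such a test case, written from the NumPy side rather than as part of OpenBLAS's own test suite, is sketched below. It only checks that the SVD converges and reconstructs the input; the helper name `check_svd_roundtrip` and the tolerance are assumptions for illustration.

```python
import numpy as np


def check_svd_roundtrip(matrix, tol=1e-10):
    """Raise if np.linalg.svd fails to converge or to reconstruct `matrix`."""
    # Raises numpy.linalg.LinAlgError if the underlying LAPACK call fails.
    u, s, vh = np.linalg.svd(matrix, full_matrices=False)
    # Singular values must come back non-negative and sorted descending.
    assert np.all(s >= 0) and np.all(np.diff(s) <= 0)
    # Scale the absolute tolerance by the largest singular value.
    scale = s[0] if s.size and s[0] > 0 else 1.0
    np.testing.assert_allclose((u * s) @ vh, matrix, atol=tol * scale)
    return s
```

In a regression test this would be called as `check_svd_roundtrip(np.loadtxt('204.txt', delimiter=','))` with `OPENBLAS_NUM_THREADS=1` set before the interpreter starts.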
