BUG: GESDD fails when GESVD succeeds, depends on number of threads #3044

larsoner · 2020-12-18T13:47:39Z

On latest master:

$ make
...
 OpenBLAS build complete. (BLAS CBLAS LAPACK LAPACKE)

  OS               ... Linux             
  Architecture     ... x86_64               
  BINARY           ... 64bit                 
  C compiler       ... GCC  (cmd & version : cc (Ubuntu 10.2.0-13ubuntu1) 10.2.0)
  Fortran compiler ... GFORTRAN  (cmd & version : GNU Fortran (Ubuntu 10.2.0-13ubuntu1) 10.2.0)
  Library Name     ... libopenblas_haswellp-r0.3.13.dev.a (Multi-threading; Max num-threads is 8)

Then

$ export OPENBLAS_NUM_THREADS=2
$ python
>>> import numpy as np, scipy.linalg as linalg
>>> linalg.svd(np.loadtxt('204.txt', delimiter=','), lapack_driver='gesdd')
...  # works
$ export OPENBLAS_NUM_THREADS=1
$ python
>>> import numpy as np, scipy.linalg as linalg
>>> linalg.svd(np.loadtxt('204.txt', delimiter=','), lapack_driver='gesdd')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/larsoner/python/scipy/scipy/linalg/decomp_svd.py", line 129, in svd
    raise LinAlgError("SVD did not converge")
numpy.linalg.LinAlgError: SVD did not converge

And also:

>>> linalg.svd(np.loadtxt('204.txt', delimiter=','), lapack_driver='gesvd')
...  # works

So there is something about the combination of using 1 thread with the GESDD driver on this 204x204 array that causes it to fail.
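For anyone hitting the same failure, here is a minimal sketch of a workaround, assuming SciPy is installed: try the GESDD driver first and fall back to GESVD if it raises `LinAlgError`. The helper name `robust_svd` is hypothetical and not part of SciPy, and a small random matrix stands in for the failing 204x204 one.

```python
import numpy as np
from scipy import linalg


def robust_svd(a):
    """Try GESDD first; fall back to GESVD if GESDD does not converge.

    `robust_svd` is a hypothetical helper name, not a SciPy function.
    """
    try:
        return linalg.svd(a, full_matrices=False, lapack_driver='gesdd')
    except linalg.LinAlgError:
        # GESVD is usually slower but succeeded on the matrix in this report.
        return linalg.svd(a, full_matrices=False, lapack_driver='gesvd')


# Stand-in input: a small random matrix, NOT the matrix from 204.txt.
rng = np.random.default_rng(0)
a = rng.standard_normal((6, 4))
u, s, vh = robust_svd(a)

# Check that the factors reproduce the input.
reconstruction_ok = np.allclose((u * s) @ vh, a)
```

This only hides the symptom; it does not explain why GESDD fails with one thread.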

martin-frbg · 2020-12-18T14:21:08Z

Unusual that a bug would appear in single-threaded mode, when there should be less to go wrong. Is "this 204x204 array" available somewhere? (And are you sure that your numpy actually picked up the 0.3.13 you built, rather than whatever Ubuntu has linked through its alternatives mechanism?)

larsoner · 2020-12-18T14:42:38Z

Agreed it's weird. Sorry I forgot to upload the problematic file:

204.txt

I ran make and then sudo make install for OpenBLAS on commit b26e32c, installing to the default /opt/OpenBLAS, then built NumPy from source with site.cfg as:

[openblas]
libraries = openblas
library_dirs = /opt/OpenBLAS/lib
include_dirs = /opt/OpenBLAS/include
runtime_library_dirs = /opt/OpenBLAS/lib

And it looks like everything is in order:

$ ldd numpy/linalg/lapack_lite.cpython-38-x86_64-linux-gnu.so 
...
	libopenblas.so.0 => /opt/OpenBLAS/lib/libopenblas.so.0 (0x00007fbd980fc000)
...
$ ls -al /opt/OpenBLAS/lib/libopenblas.so.0
lrwxrwxrwx 1 root root 35 Dec 18 08:38 /opt/OpenBLAS/lib/libopenblas.so.0 -> libopenblas_haswellp-r0.3.13.dev.so
$ ls -al /opt/OpenBLAS/lib/libopenblas_haswellp-r0.3.13.dev.so 
-rwxr-xr-x 1 root root 14140208 Dec 18 08:37 /opt/OpenBLAS/lib/libopenblas_haswellp-r0.3.13.dev.so
$ python -c "import numpy; numpy.show_config()"
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/OpenBLAS/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/opt/OpenBLAS/lib']
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/OpenBLAS/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/opt/OpenBLAS/lib']
    extra_compile_args = ['-march=native']
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/OpenBLAS/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/opt/OpenBLAS/lib']
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/OpenBLAS/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/opt/OpenBLAS/lib']
    extra_compile_args = ['-march=native']
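The linkage check above can also be scripted. The following is a rough sketch, assuming a Linux system where `ldd` output has the usual `name => path (address)` form; `linked_openblas` is a hypothetical helper name, and the sample line is copied from the `ldd` output above.

```python
import os
import re


def linked_openblas(ldd_output):
    """Return the path that ldd reports for libopenblas, or None if absent."""
    for line in ldd_output.splitlines():
        match = re.search(r'(libopenblas\S*)\s+=>\s+(\S+)', line)
        if match:
            return match.group(2)
    return None


# Sample line copied from the ldd output in this comment.
sample = "\tlibopenblas.so.0 => /opt/OpenBLAS/lib/libopenblas.so.0 (0x00007fbd980fc000)"
path = linked_openblas(sample)

# os.path.realpath follows symlinks, so on a machine where the file exists
# it would resolve to the versioned library that the symlink points at.
resolved = os.path.realpath(path)
```

In practice the input would come from something like `subprocess.run(['ldd', so_path], capture_output=True, text=True).stdout`, where `so_path` is the extension module being inspected.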

larsoner · 2020-12-18T14:44:30Z

And note that I have the same problem with NumPy's linalg calls directly (I only showed SciPy's linalg because it allows switching between the GESDD and GESVD backends, whereas NumPy just uses GESDD, I think):

>>> import numpy as np
>>> np.linalg.svd(np.loadtxt('204.txt', delimiter=','))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<__array_function__ internals>", line 5, in svd
  File "/home/larsoner/python/numpy/numpy/linalg/linalg.py", line 1660, in svd
    u, s, vh = gufunc(a, signature=signature, extobj=extobj)
  File "/home/larsoner/python/numpy/numpy/linalg/linalg.py", line 97, in _raise_linalgerror_svd_nonconvergence
    raise LinAlgError("SVD did not converge")
numpy.linalg.LinAlgError: SVD did not converge

brada4 · 2020-12-18T20:28:58Z

Try importing with dtype="double"; at least it converges like a charm in R (where all floats are doubles), e.g. svd(read.csv("file.txt", header=FALSE))$d

larsoner · 2020-12-18T20:35:48Z

@brada4 what CPU do you have, in case it matters? Mine is:

$ cat /proc/cpuinfo | grep 'model name' | uniq 
model name	: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
$ cat /sys/devices/cpu/caps/pmu_name
skylake

Maybe it's a problem that I end up with a libopenblas_haswellp build given that I'm on Skylake?

brada4 · 2020-12-18T20:46:16Z

I tried on a Broadwell E5-2620.
Skylake v1 is served by the same HASWELL kernels for sure; it has no AVX512 to qualify as SkylakeX.
The only thing is that you need to upgrade microcode at least once in a lifetime on those CPUs to get hyperthreading right:
https://lists.debian.org/debian-devel/2017/06/msg00351.html

larsoner · 2020-12-18T20:57:55Z

The only thing is that you need to upgrade microcode at least once in a lifetime on those CPUs to get hyperthreading right:

I have never (manually) upgraded CPU microcode -- are you suggesting that this might be causing the bug? From a quick naive search I already have intel-microcode installed and see:

$ sudo dmesg | grep microcode
[    0.000000] microcode: microcode updated early to revision 0xde, date = 2020-05-26
[    0.844975] microcode: sig=0x906e9, pf=0x2, revision=0xde
[    0.870779] microcode: Microcode Update Driver: v2.2.

So I think it might be up to date already?

brada4 · 2020-12-18T21:32:43Z

date = 2020-05-26

It is perfect, let's deal with the software bug ;-)

martin-frbg · 2020-12-20T19:02:04Z

Cannot reproduce this so far with older versions of gcc

martin-frbg · 2020-12-20T21:25:47Z

Not reproducible here with gcc 10.2 either (Haswell target, python 3.6, numpy 1.14 in a CentOS 8.2 VM on i7-7500U)

larsoner · 2020-12-21T12:52:58Z

numpy 1.14 in a CentOS 8.2 VM

How did you get 1.14 to use latest OpenBLAS master? Did you build 1.14 from scratch for some reason? Or did you pip install it in which case it's probably using the version it shipped with/vendored?

I could try to figure out what version of OpenBLAS 1.14 used (it's almost 2 years old) and try using that version of OpenBLAS. If I also can't replicate there, it would give me at least some path to git bisecting the issue. Is that worth pursuing?
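For reference, the search that bisection performs can be sketched as a binary search over an ordered commit list. This is only an illustration under stated assumptions: the commit names are placeholders, `first_bad` and `is_bad` are hypothetical names, the oldest commit is assumed good and the newest bad, and the bug is assumed to stay present once introduced.

```python
def first_bad(commits, is_bad):
    """Return the first commit for which is_bad(commit) is True.

    Assumes commits are ordered oldest to newest, commits[0] is good,
    commits[-1] is bad, and every commit after the first bad one is bad.
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid          # first bad commit is at mid or earlier
        else:
            lo = mid + 1      # first bad commit is after mid
    return commits[lo]


# Placeholder history: ten commits, with the bug introduced at index 6.
commits = [f"commit-{i}" for i in range(10)]
tested = []


def is_bad(commit):
    # In practice this would build OpenBLAS at `commit` and run the failing SVD.
    tested.append(commit)
    return int(commit.split("-")[1]) >= 6


culprit = first_bad(commits, is_bad)
```

`git bisect run` automates the same idea when the test can be expressed as a script that exits non-zero on failure.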

martin-frbg · 2020-12-21T13:21:09Z

I took their packaged numpy and overwrote the /usr/lib64/libopenblas.so it linked to (0.3.3 or something equally ancient) with my own build of the current develop branch. Will try to find time to retry with a more recent numpy on physical hardware later today

martin-frbg · 2020-12-21T19:03:52Z

Not reproduced with a recent snapshot of numpy (1.20.0.dev0+08edcad) built from source against current develop on Ryzen5-4600H either. (Tried with both TARGET=ZEN and TARGET=HASWELL, of course.)

martin-frbg · 2020-12-22T18:17:19Z

Can you retry with a more recent snapshot of OpenBLAS, please? In particular, the next commit after the one you built, c73d8ee, worked around an oddity seen with gcc -mfma.

larsoner · 2021-01-04T17:31:40Z

On c73d8ee I don't get any error! I also tried latest master and it's fixed there, too. And on 723776d which preceded c73d8ee, it's broken.

Any sense in adding this matrix as a test case? Either way I'll go ahead and close since it's fixed.

larsoner closed this as completed Jan 4, 2021

larsoner mentioned this issue Jan 25, 2021

LinAlgError: SVD did not converge when plotting data (raw.plot()) mne-tools/mne-python#8784

Closed

larsoner mentioned this issue May 17, 2024

Project CSP patterns to source mne-tools/mne-bids-pipeline#950

Draft

mlxd mentioned this issue Jun 10, 2024

Bugfix: SVD fallback exception from numba compiled function jcmgray/quimb#238

Merged
