segfault of matrix dot product when use openblas on windows 10 #1110

Closed
zhongyi51 opened this issue Nov 9, 2021 · 10 comments

@zhongyi51

zhongyi51 commented Nov 9, 2021

Hello... I am trying to statically link openblas into my Rust project, which uses ndarray. However, I found that when the matrix size is larger than 64, the program segfaults.

The source code is:

extern crate blas_src;
use ndarray::Array2;
use std::time::Instant;

fn main() {
    // Two 5000x5000 matrices filled with constant values.
    let mut aa = Array2::<f64>::zeros((5000, 5000));
    let mut ab = Array2::<f64>::zeros((5000, 5000));
    for i in 0..5000_usize {
        for j in 0..5000_usize {
            aa[(i, j)] = 0.14_f64;
            ab[(j, i)] = 0.21_f64;
        }
    }

    // Time ten matrix products; with the "blas" feature, dot goes through BLAS.
    let t0 = Instant::now();
    for _ in 0..10 {
        let res = aa.dot(&ab);
        println!("{}", res[(2, 3)]);
    }
    println!("spend {} ms", t0.elapsed().as_millis());
}

When I reduce the matrix size below 64, everything works; however, the following error is thrown when the matrix size is larger than 64.
The error message is:
error: process didn't exit successfully: target\release\classgameground.exe (exit code: 0xc0000005, STATUS_ACCESS_VIOLATION)

I would be grateful for any help!

=============================================================
os: windows 10 professional 19042.1288
vcpkg version: 2021-11-02-af04ebf6274fd6f7a941bff4662b3955c64f6f42 (newest from github)
openblas-src version: 0.10 (vcpkg: openblas_x64-windows-static-md)
dependencies of project:
[dependencies]
rand = "0.8.4"
ndarray = { version = "0.15.3", features = ["blas"] }
blas-src = { version = "0.8", features = ["openblas"] }
openblas-src = { version = "0.10", features = ["cblas","lapacke","system","static"] }
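
Because both matrices in the program above hold constant values, every entry of the product should be roughly n * 0.14 * 0.21, which gives a way to check a result without trusting any BLAS backend. The following is a minimal sketch of such a cross-check using only the standard library. The function name `naive_matmul` and the choice of n = 65 are assumptions made for illustration and are not part of the original report.

```rust
/// Naive row-major product c = a * b for n x n matrices stored as flat slices.
/// Hypothetical helper for cross-checking; it does not call BLAS.
fn naive_matmul(a: &[f64], b: &[f64], n: usize) -> Vec<f64> {
    assert_eq!(a.len(), n * n);
    assert_eq!(b.len(), n * n);
    let mut c = vec![0.0; n * n];
    for i in 0..n {
        for k in 0..n {
            let aik = a[i * n + k];
            for j in 0..n {
                // Element (i, j) lives at offset i * n + j in row-major order.
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
    c
}

fn main() {
    // Assumed size: just above the threshold at which the crash was reported.
    let n = 65;
    let a = vec![0.14_f64; n * n];
    let b = vec![0.21_f64; n * n];
    let c = naive_matmul(&a, &b, n);

    // Every entry is a sum of n identical terms.
    let expected = n as f64 * 0.14 * 0.21;
    assert!((c[2 * n + 3] - expected).abs() < 1e-9);
    println!("c[(2,3)] = {}", c[2 * n + 3]);
}
```

A value printed by a correctly linked BLAS build can then be compared against this reference with a small tolerance.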

@bluss
Member

bluss commented Nov 9, 2021

Not sure. A backtrace might help. Does it work with ndarray 0.14.x? We have made a few significant changes since then. (See the readme for versions.)

@zhongyi51
Author

zhongyi51 commented Nov 13, 2021

Hi, sorry for the delay. I used a debugger to inspect the error and found that the crash happens in the function exec_blas in blas_server_win32.c, at line 475, shown below:
/* Execute Threads */
int exec_blas(BLASLONG num, blas_queue_t *queue){

#if defined(SMP_SERVER) && defined(OS_CYGWIN_NT)
  // Handle lazy re-init of the thread-pool after a POSIX fork
  if (unlikely(blas_server_avail == 0)) blas_thread_init();
#endif

#ifndef ALL_THREADED
  int (*routine)(blas_arg_t *, void *, void *, double *, double *, BLASLONG);
#endif

  if ((num <= 0) || (queue == NULL)) return 0;

  if ((num > 1) && queue -> next) exec_blas_async(1, queue -> next);

  routine = queue -> routine;

  if (queue -> mode & BLAS_LEGACY) {
    legacy_exec(routine, queue -> mode, queue -> args, queue -> sb);
  } else
    if (queue -> mode & BLAS_PTHREAD) {
      void (*pthreadcompat)(void *) = queue -> routine;
      (pthreadcompat)(queue -> args);
    } else
      (routine)(queue -> args, queue -> range_m, queue -> range_n,
                queue -> sa, queue -> sb, 0);

  if ((num > 1) && queue -> next) exec_blas_async_wait(num - 1, queue -> next);

  return 0;
}

The error message is:
Exception 0xc0000005 encountered at address 0x7ffa0f493416: Access violation writing location 0x00000024.

Because this error happened in the function exec_blas_async, I tried running the calculation on a single thread by setting the environment variable OPENBLAS_NUM_THREADS=1, but the error still occurs.

I am wondering whether this is an error in openblas itself... I would be grateful for any help.

@bluss
Member

bluss commented Nov 13, 2021

Hi, the error is certainly being raised inside openblas, but it could be that we are passing it wrong information for some reason, or that some other configuration is wrong. Does it work with other blas backends? What about ndarray 0.14?

@zhongyi51
Author

Hi, I ran the test on another Windows 10 computer with ndarray 0.14.0, and the error still occurred.
I also compiled and ran the same code on Linux (Ubuntu 20.04), where it works fine.
I suspect the problem may come from the Windows-specific implementation of blas_server.c or from the queue structure. If you could run this code with openblas on Windows 7 or another Windows system, that would help tell whether this is a problem in openblas...

@bluss
Member

bluss commented Nov 13, 2021

Thanks for testing. Going from ndarray 0.14 to 0.15 changed how we link to blas-src, so we can probably rule out that change.

Is this issue fixed? blas-lapack-rs/openblas-src#80

I would look into two ideas, but don't really know:

@emmatyping
Contributor

emmatyping commented Nov 16, 2021

Okay, so I was able to reproduce this. I'm fairly certain this is not an ndarray issue: I made the following minimal repro, which calls directly into cblas_dgemm (correctly, AFAIK) and also crashes:

extern crate blas_src;

const SIZE: usize = 65;

use cblas_sys as blas_sys;
use cblas_sys::{CblasNoTrans, CBLAS_LAYOUT};

fn main() {
    let aa: Vec<f64> = vec![0.; SIZE * SIZE];
    let ab: Vec<f64> = vec![0.; SIZE * SIZE];
    let mut res = vec![0.; SIZE * SIZE];
    let m = SIZE;
    let n = SIZE;
    let k = SIZE;
    unsafe {
        blas_sys::cblas_dgemm(
            CBLAS_LAYOUT::CblasRowMajor,
            CblasNoTrans,
            CblasNoTrans,
            m as i32,
            n as i32,
            k as i32,
            1.0,
            aa.as_ptr(),
            SIZE as i32,
            ab.as_ptr(),
            SIZE as i32,
            0.0,
            res.as_mut_ptr(),
            SIZE as i32,
        )
    };
    println!("{}", res[2 * SIZE + 3]);
}

Setting SIZE to 63 is ok, 65 leads to access violation. I'll look a bit closer at the stack to figure out what is going on but I think the bug is in either vcpkg's port, openblas-src, or openblas itself.

EDIT: Further progress, if I link to the official binaries https://github.com/xianyi/OpenBLAS/releases it works fine, so it is almost certainly a bug in vcpkg or openblas-src/cargo-vcpkg. Next up I'll be building a C project against both vcpkg and the official binaries to check where the actual issue is, but I am going to sleep for now :)

@bluss
Member

bluss commented Nov 16, 2021

Great. So the question is what we can do and where to send this :) I guess workarounds can be recommended: avoid this configuration? Build some blas src?

@emmatyping
Contributor

emmatyping commented Nov 17, 2021

It turns out that openblas compiled with vcpkg is built with the ILP64 interface (64-bit integers), whereas cblas-sys uses LP64 (32-bit integers), so the integer arguments passed across the FFI boundary do not match what the library expects. I think the *-src packages probably should add support for ilp64, and the patches to cblas-sys around ilp64 should be merged too. But I think in the end this is pretty much #133, and maybe some work in openblas-src.

Edit: @zhongyi51 as a temporary workaround, you can add the following hack to your Cargo.toml

[target.'cfg(target_os = "windows")'.patch.crates-io]
cblas-sys = { git = "https://github.com/steabert/cblas-sys.git", features = ["ilp64"] }

of course, I make no promises about this as a solution :)
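
To make the integer-width mismatch concrete: the repro above casts m, n, and k to i32, while an ILP64 build reads those arguments as 64-bit integers. The following is a rough model of what a 64-bit read of an 8-byte argument slot could yield when the caller only wrote the low 32 bits. It is a sketch under stated assumptions, not a reproduction of the actual crash: `read_as_i64` and `stale_high` are hypothetical names, the code does not call OpenBLAS, and the real contents of the upper bits depend on the calling convention and on whatever was left in the register or stack slot.

```rust
use std::mem::size_of;

/// Model of an 8-byte argument slot as a callee expecting a 64-bit integer
/// would read it when the caller wrote only a 32-bit value.
/// `stale_high` is a stand-in for whatever the upper half happened to hold.
fn read_as_i64(written: i32, stale_high: u32) -> i64 {
    let low = written as u32 as u64; // the 32 bits the caller actually set
    let high = (stale_high as u64) << 32; // leftover bits the caller never set
    (high | low) as i64
}

fn main() {
    // The two interfaces disagree on the width of every integer argument.
    assert_eq!(size_of::<i32>(), 4);
    assert_eq!(size_of::<i64>(), 8);

    let m: i32 = 65;
    // If the upper half happens to be zero, the callee sees the intended value.
    assert_eq!(read_as_i64(m, 0), 65);
    // Any stale bit in the upper half turns it into a huge dimension.
    assert_eq!(read_as_i64(m, 1), 4_294_967_361);

    println!("clean upper half: {}", read_as_i64(m, 0));
    println!("stale upper half: {}", read_as_i64(m, 1));
}
```

A dimension or leading stride misread this way would send the library far outside the buffers it was given, which is consistent with an access violation, although the thread does not establish the exact failure path.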

@bluss
Member

bluss commented Nov 17, 2021

Nice find! I think we should close this. It is one of the challenges of setting up blas: compiling blas from blas-src gives a more predictable result, while using system packages as a shortcut is useful but has its own pitfalls. Good luck. As noted, cblas-sys supports only LP64, so that's what we support.

@bluss bluss closed this as completed Nov 17, 2021
@bluss bluss added the blas label Nov 17, 2021
@emmatyping
Contributor

Compiling blas from blas-src will give a more predictable result

Unfortunately this is incredibly difficult due to a lack of production Fortran compilers on Windows that are not ifort :/

I had openblas-src use vcpkg on Windows by default for exactly this reason. I will look into adding compiling from source though, I think flang can work.

I do agree this is not an issue for ndarray, and I will follow up on openblas-src.
