Tpetra: cusparse misalignment error in spmv on subvectors #11926
Comments
@trilinos/tpetra |
@trilinos/kokkos-kernels |
Hi @maartenarnst Thanks for reporting this. I'm having trouble replicating on CUDA 11.6 and 12.0 though, could you say which version you are using? |
Also, in the cudaErrorMisalignedAddress error message, I assume the pointer it gives is actually aligned to 8 bytes and not something smaller? Just to rule out any incorrect casting from a smaller type like float somewhere. |
Hi @brian-kelley, We're using Cuda 12.1. The environment variable with the Cusparse version is The error message begins like
It's referring to
I'm not sure what you mean by the alignment of the pointer in the message. It seems there is no address in the message. Do you know what I should do to get this address, so we can rule out the incorrect casting? |
Hi @maartenarnst Never mind about the offending pointer - I thought the CUDA misaligned error message included it, but it doesn't. I'm talking to a cuSPARSE expert to try to understand the alignment requirements of SpMV, and why I'm not replicating it on CUDA 12.0. I think a likely fix for this will have 2 parts though:
Together, these changes should make it so the error never happens, but you still get the performance of cuSPARSE in the vast majority of cases. |
After trying some things, I think the first change (add padding in Tpetra) will only be possible after Kokkos core addresses kokkos/kokkos#2995. Right now, you can create a mirror of a padded view, but the mirror will be contiguous (meaning you can't deep copy between it and the original, as the strides are different). Several things in KokkosKernels require copying data between host and device, so currently those would break with padded views. |
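To make the stride mismatch concrete, here is a minimal sketch of what padding the leading dimension means for a column-major block of doubles. It assumes 8-byte entries and a 16-byte alignment target. The helper names `paddedStride` and `flatIndex` are hypothetical, and this is not meant to describe the padding policy Kokkos actually applies with `Kokkos::AllowPadding`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: smallest column stride (in entries) that is >= numRows
// and makes every column of doubles start on a multiple of alignmentBytes,
// assuming the base pointer itself is aligned to alignmentBytes.
inline std::size_t paddedStride(std::size_t numRows,
                                std::size_t alignmentBytes = 16) {
  assert(alignmentBytes % sizeof(double) == 0);
  const std::size_t entriesPerUnit = alignmentBytes / sizeof(double);
  return (numRows + entriesPerUnit - 1) / entriesPerUnit * entriesPerUnit;
}

// Position of entry (i, j) in flat column-major storage with the given stride.
inline std::size_t flatIndex(std::size_t i, std::size_t j, std::size_t stride) {
  return i + j * stride;
}
```

With 101 rows, the padded stride is 102 while a contiguous mirror has stride 101, so entry (0, 1) sits at flat position 102 in one and 101 in the other. A plain byte-for-byte copy between the two would therefore misplace every column after the first.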
Hi @maartenarnst I was never able to replicate this in the end, but I did talk back and forth with the cuSPARSE developers to understand when 16-byte alignment should be necessary, and put in the patch #12004. Would you mind checking that this fixes the original issue in Anasazi, or your smaller reproducer? Thanks! |
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity. |
Fixed via #12004 |
Bug Report
@tpetra, @kokkos-kernels, @csiefer2
Description
We're using Anasazi's GeneralizedDavidson solver to solve an eigenproblem with an odd number of rows.
On CPU, it works. However, in a cuda build with cusparse, the solver aborts with a "cudaErrorMisalignedAddress" error.
We tracked down the issue to an spmv call on the following line:
Trilinos/packages/anasazi/src/AnasaziGeneralizedDavidson.hpp, line 1147 in 0a376ba
What appears to be happening is that Anasazi creates two multivectors with multiple columns (d_V and d_AV). The first multivector stores certain vectors. The second multivector stores the result of multiplying the matrix with these vectors. The issue appears to arise because Anasazi wants to do such an spmv on subsets of the vectors (V_new and AV_new). In particular, we see the abort when it extracts the second vector from each multivector and then does the spmv on those (V_new contains the second vector from d_V and AV_new contains the second vector from d_AV). It appears that when the eigenproblem has an odd number of rows, V_new and AV_new are not aligned in the way that cusparse expects. I wasn't able to find in the cusparse documentation what exactly cusparse expects. But certain other cusparse functions expect alignment to 16 bytes. If that is also the case here, then with an odd number of rows, even if the multivector itself is aligned to 16 bytes, its second vector will only be aligned to 8 bytes.
We're able to reproduce the issue using just functions from Tpetra (getVectorNonConst and apply). So even though the issue arises in a computation with Anasazi, it appears this is not really an Anasazi issue, but rather an issue concerning Tpetra's multivector alignment in connection with cusparse spmv.
We tried passing Kokkos::AllowPadding to the constructor of the (dual) view, but this didn't solve the problem.
We are unsure how to analyze/solve the issue further.
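The alignment arithmetic behind this suspicion can be sketched in a few lines. This assumes 8-byte double entries stored contiguously in column-major order and a base address that is 16-byte aligned. The helper names are hypothetical and only illustrate the address computation. They are not Tpetra or cusparse functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Byte offset of column j in a contiguous column-major block of doubles
// with numRows rows (the storage layout assumed for this sketch).
inline std::size_t columnByteOffset(std::size_t numRows, std::size_t j) {
  return j * numRows * sizeof(double);
}

// True if the address is a multiple of `alignment` bytes.
inline bool isAligned(std::uintptr_t address, std::size_t alignment) {
  return address % alignment == 0;
}

// Hypothetical check: does column j start on a 16-byte boundary, given a
// base address that is itself 16-byte aligned?
inline bool columnIs16ByteAligned(std::uintptr_t base, std::size_t numRows,
                                  std::size_t j) {
  assert(isAligned(base, 16));
  return isAligned(base + columnByteOffset(numRows, j), 16);
}
```

For example, with 101 rows the second column starts 808 bytes after the base, which is a multiple of 8 but not of 16. With 100 rows it starts at 800 bytes, which is a multiple of 16.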