CUDA error: an illegal memory access was encountered #61
Comments
Getting the same. Any fix? @FactoDeepLearning @HawkAaron
After some debugging, I think there might be a bug in this library @HawkAaron. I am printing the cost at this line: https://github.com/HawkAaron/warp-transducer/blob/master/pytorch_binding/warprnnt_pytorch/__init__.py#L37 and the … Also, this issue is fixed when running on …
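For reference, a minimal sketch of that kind of check is below. It assumes the `RNNTLoss` module from the linked pytorch_binding; the shapes, blank index, and devices are illustrative assumptions rather than values taken from this thread.

```python
# Minimal sketch (not from the thread): feed random inputs through the PyTorch
# binding and check the returned costs for non-finite values.
# Shapes, blank index and devices below are illustrative assumptions.
import torch
from warprnnt_pytorch import RNNTLoss  # from pytorch_binding/warprnnt_pytorch/__init__.py

B, T, U, V = 4, 150, 30, 500                   # batch, frames, label length, vocab (assumed)
acts = torch.randn(B, T, U + 1, V, device="cuda", requires_grad=True)
labels = torch.randint(1, V, (B, U), dtype=torch.int32)   # 0 is reserved for blank
act_lens = torch.full((B,), T, dtype=torch.int32)
label_lens = torch.full((B,), U, dtype=torch.int32)
# Depending on the binding version, labels/lengths may also need to be moved to the GPU.

rnnt_loss = RNNTLoss()
costs = rnnt_loss(acts, labels, act_lens, label_lens)
print(costs)                                   # analogous to printing the cost around __init__.py#L37
assert torch.isfinite(costs).all(), "non-finite RNNT cost"
```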
Same issue.
I think #64 will fix this issue.
My version is the latest. When using warp-transducer in espnet, the error still occurs as "CUDA error: an illegal memory access was encountered". I discussed it in the espnet project, but they think it is a problem with the transducer. My warp-transducer version is as follows: …
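A note for anyone reporting versions here: since the binding is normally built from source, a quick way to pin down which build is actually installed is something like the sketch below (my suggestion, not from this repo's docs).

```python
# Sketch: identify which build of the binding is actually installed.
import warprnnt_pytorch
print(warprnnt_pytorch.__file__)   # install location of the binding
# The git commit of the warp-transducer checkout (git rev-parse HEAD in the
# repo directory) is usually the most precise version identifier to report.
```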
@housebaby which kind of GPU did you use?
Tesla V100. It does not always fail; in some cases, whether using 4 or 8 cards, it works.
Same issue.
Oh, right, there's an overflow issue here:

    // 0 <= col < batch * T * U
    int col = blockIdx.x;
    // col * alphabet_size can be > 2**31 - 1 = INT_MAX, but its type is int
    Tp logpk = denom[col] + acts[col * alphabet_size + idx];

cuda-memcheck seems to catch such a problem.
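To make the overflow concrete, here is a small back-of-the-envelope check (illustrative only, not code from the repo) of when that flattened offset exceeds INT_MAX; the sizes are made up.

```python
# Rough overflow check (illustrative, not library code): the kernel above
# computes the offset col * alphabet_size + idx in 32-bit int arithmetic,
# with 0 <= col < batch * T * U.
INT_MAX = 2**31 - 1

def max_flat_offset(batch, max_t, max_u, alphabet_size):
    """Largest offset the kernel can form for the given (assumed) sizes."""
    return (batch * max_t * max_u - 1) * alphabet_size + (alphabet_size - 1)

# Made-up but realistic sizes: long utterances and a moderate vocabulary.
offset = max_flat_offset(batch=16, max_t=1600, max_u=120, alphabet_size=1000)
print(offset, offset > INT_MAX)   # 3071999999 True -> the int32 offset wraps around
```

If the product exceeds INT_MAX, the wrapped offset indexes outside `acts`, which is exactly the kind of out-of-bounds access that running the training script under `cuda-memcheck python <your training script>` reports (slowly). Promoting the offset computation to a 64-bit type is the usual remedy, presumably what the fix mentioned above addresses.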
Cool.
I don't know if this is related, but after upgrading to TensorFlow 2.5.0 (and therefore to CUDA 11.1) I am seeing this when training RNN-based transducer models. The loss either gets …
It's possible that this has nothing to do with https://github.com/HawkAaron/warp-transducer, but it's the only external library I am using in combination with TensorFlow. See also tensorflow/tensorflow#50326
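For narrowing this down on the TensorFlow side, one option is to fail fast when the loss turns non-finite. The sketch below is a generic pattern, not this repo's API; `compute_rnnt_loss` is a hypothetical placeholder for however the transducer loss is wired into the model.

```python
# Sketch for localizing a bad loss value in TF; compute_rnnt_loss is a
# hypothetical placeholder for the actual transducer loss call.
import tensorflow as tf

@tf.function
def train_step(model, optimizer, features, labels):
    with tf.GradientTape() as tape:
        logits = model(features, training=True)
        loss = compute_rnnt_loss(logits, labels)          # placeholder (assumption)
        # Raise a readable error if the loss turns NaN/inf instead of silently
        # propagating a bad value into the gradients.
        loss = tf.debugging.check_numerics(loss, "rnnt loss is not finite")
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```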
Hi @stefan-falk,
Hello,
I'm facing the following error when using your package. It appears randomly after some epochs. Do you have an idea of where it could come from?
CentOS 7
CUDA 10.0
Python 3.6.9
torch 1.2
gcc 7.3.0
GPU: Tesla P100-PCIE-12GB
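Since CUDA reports errors asynchronously, the Python stack trace for an "illegal memory access" often points at an unrelated line. A common first debugging step (generic CUDA advice, not specific to this package) is to force synchronous launches:

```python
# Sketch: make kernel launches synchronous so the "illegal memory access" is
# raised at the call that actually triggers it, not at a later, unrelated line.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"   # must be set before CUDA is initialized

import torch  # import after setting the variable (or export it in the shell instead)

# ... run the failing training step here, then force pending errors to surface:
torch.cuda.synchronize()
```

With that set, the error should be raised at the kernel launch that actually goes out of bounds, which makes it easier to tell whether it originates in the transducer loss kernel or elsewhere.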