CUDA error: an illegal memory access was encountered #61

Open
FactoDeepLearning opened this issue Mar 26, 2020 · 12 comments

@FactoDeepLearning

Hello,
I'm facing the following error when using your package. It appears randomly after some epochs. Do you have an idea where it could come from?

File "main_rnnt.py", line 86, in <module>
    model.train()
  File "/gpfs1/dlocal/run/7027505/pytorch/rnnt/RNNT.py", line 174, in train
    batch_metrics = self.train_batch(x, y)
  File "/gpfs1/dlocal/run/7027505/pytorch/rnnt/RNNT.py", line 286, in train_batch
    loss = loss_func(pred, y.permute(1, 0).contiguous(), x_len, y_len)
  File "/gpfs1/home/2017018/dcoque01/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/gpfs1/home/2017018/dcoque01/pytorch/lib/python3.6/site-packages/warprnnt_pytorch-0.1-py3.6-linux-x86_64.egg/warprnnt_pytorch/__init__.py", line 100, in forward
    return self.loss(acts, labels, act_lens, label_lens, self.blank, self.reduction)
  File "/gpfs1/home/2017018/dcoque01/pytorch/lib/python3.6/site-packages/warprnnt_pytorch-0.1-py3.6-linux-x86_64.egg/warprnnt_pytorch/__init__.py", line 40, in forward
    grads /= minibatch_size
RuntimeError: CUDA error: an illegal memory access was encountered

CentOS 7
CUDA 10.0
Python 3.6.9
torch 1.2
gcc 7.3.0
GPU: Tesla P100-PCIE-12GB

@LearnedVector

LearnedVector commented Mar 31, 2020

Getting the same. Any fix? @FactoDeepLearning @HawkAaron

EDIT
This was due to me not moving acts, labels, input_len, and label_len to .cuda() in PyTorch. Fixed now.

EDIT2
I'm still getting it now. It trains at first, then hits this error after X iterations.
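
For reference, here is a minimal sketch of what the first EDIT describes, with all four loss inputs moved onto the GPU before the call. The shapes and the int32 dtypes are assumptions for illustration, not something verified in this thread:

    import torch
    from warprnnt_pytorch import RNNTLoss

    device = torch.device("cuda")
    rnnt_loss = RNNTLoss(blank=0)

    # Assumed layout: acts is (B, T, U+1, vocab) float32, labels is (B, U),
    # act_lens / label_lens hold per-example lengths; everything on one GPU.
    B, T, U, vocab = 2, 50, 10, 30
    acts = torch.randn(B, T, U + 1, vocab, device=device, requires_grad=True)
    labels = torch.randint(1, vocab, (B, U), dtype=torch.int32, device=device)
    act_lens = torch.full((B,), T, dtype=torch.int32, device=device)
    label_lens = torch.full((B,), U, dtype=torch.int32, device=device)

    loss = rnnt_loss(acts, labels, act_lens, label_lens)
    loss.backward()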

@LearnedVector

LearnedVector commented Apr 1, 2020

After some debugging, I think there might be a bug in this library @HawkAaron. I am printing the cost at this line https://github.com/HawkAaron/warp-transducer/blob/master/pytorch_binding/warprnnt_pytorch/__init__.py#L37, and the RuntimeError: CUDA error: an illegal memory access was encountered only happens when the cost prints as 0. I am assuming that the loss function https://github.com/HawkAaron/warp-transducer/blob/master/pytorch_binding/warprnnt_pytorch/__init__.py#L27 is not updating the costs or gradients, causing it to error out. Any ideas?

Also, the issue does not occur when running on CPU, and there are no 0 costs in that case.
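
A sketch of that kind of check at the training-loop level, using the same call as in the traceback above (loss_func, pred, y, x_len and y_len are assumed to be the objects from that loop):

    import torch

    # Compute the loss exactly as in the traceback above, then flag suspicious
    # values (zero or non-finite) before backward(), so the offending batch
    # (shapes, lengths, labels) can be inspected instead of crashing later.
    loss = loss_func(pred, y.permute(1, 0).contiguous(), x_len, y_len)
    if loss.item() == 0 or not torch.isfinite(loss).all():
        print("suspicious RNNT loss:", loss.item(),
              "x_len:", x_len.tolist(), "y_len:", y_len.tolist())
    loss.backward()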

@funcwj

funcwj commented Apr 20, 2020

Same issues.

@jaesong
Contributor

jaesong commented Apr 22, 2020

I think #64 will fix this issue.

@housebaby

My version is the latest. When using warp-transducer in ESPnet, the error still occurs: "CUDA error: an illegal memory access was encountered". I discussed it in the ESPnet project, but they think it is a problem with the transducer.

espnet/espnet#1860 (comment)

My warp-transducer version is as follows.
Merge: c1a265f 5098002
Author: Mingkun Huang mingkunhuang95@gmail.com
Date: Mon Apr 27 23:07:35 2020 +0800

Merge pull request #66 from kamo-naoyuki/pt1.5

Support pytorch1.5

@HawkAaron
Owner

@housebaby which kind of GPU did you use?

@housebaby

housebaby commented Jul 2, 2020

@housebaby which kind of GPU did you use?

Tesla V100

It does not always fail. In some cases, using either 4 or 8 cards, it works.
But when I just change the batch size (or learning rate) of a successful case, it fails.
It is confusing.

@oshindow

oshindow commented Jul 3, 2020

Same issue.
When the batch size is 3, it passes. When the batch size is set higher, it fails.

@jaesong
Contributor

jaesong commented Jul 6, 2020

Oh, right, there's an overflow issue at compute_grad_kernel:

    // 0 <= col < batch * T * U
    int col = blockIdx.x;

    // col * alphabet_size can be > 2**31 - 1 = INT_MAX, but its type is int
    Tp logpk = denom[col] + acts[col * alphabet_size + idx];

cuda-memcheck seems to catch such a problem with batch=1, src=53688, tgt=1+1, vocab=20000 (53688 * 2 * 20000 > INT_MAX).
I also suspect that there are similar overflow issues in ReduceHelper, but I haven't checked them properly.
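
Spelling out the arithmetic for that case (a quick check using the dimensions above):

    # With batch=1, T=53688, U=2 (tgt=1+1) and vocab=20000, the flat index
    # col * alphabet_size is computed as a 32-bit int inside the kernel but
    # exceeds INT_MAX, so it wraps negative and acts[] is read out of bounds.
    INT_MAX = 2**31 - 1                          # 2147483647

    batch, T, U, alphabet_size = 1, 53688, 2, 20000
    max_col = batch * T * U - 1                  # largest blockIdx.x value
    max_flat_index = max_col * alphabet_size     # 2147500000

    print(max_flat_index > INT_MAX)              # True -> 32-bit overflow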

@housebaby

housebaby commented Jul 7, 2020

Oh, right, there's an overflow issue at compute_grad_kernel:

    // 0 <= col < batch * T * U
    int col = blockIdx.x;

    // col * alphabet_size can be > 2**31 - 1 = INT_MAX, but its type is int
    Tp logpk = denom[col] + acts[col * alphabet_size + idx];

cuda-memcheck seems to catch such problem with batch=1, src=53688, tgt=1+1, vocab=20000 (53688 * 2 * 20000 > INT_MAX).
I also suspect that there are similar overflow issues at ReduceHelper, but I haven't checked them properly.

Cool.
Then how should we solve this overflow problem? And will a fix for it be merged into warp-transducer soon? @HawkAaron @jaesong
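
Until the kernel index arithmetic is widened, one workaround consistent with the reports above (failures showing up as batch size grows) is to keep the flattened acts size under INT_MAX. A rough sketch of such a guard; the function name and arguments are hypothetical:

    INT_MAX = 2**31 - 1

    def rnnt_batch_fits_int32(batch, max_t, max_u, vocab):
        """Return True if the flattened (B, T, U+1, V) acts index stays
        within signed 32-bit range, i.e. the overflow above cannot occur."""
        return batch * max_t * (max_u + 1) * vocab <= INT_MAX

    # Example: the configuration from the comment above does not fit.
    print(rnnt_batch_fits_int32(1, 53688, 1, 20000))   # False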

@stefan-falk

stefan-falk commented Jun 17, 2021

I don't know if this is related, but after upgrading to TensorFlow 2.5.0 (and therefore to CUDA 11.1) I am seeing this when training RNN-based transducer models. The loss either becomes NaN or I see the following error:

2021-06-17 17:23:44.905116: E tensorflow/stream_executor/dnn.cc:729] CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1990): 'cudnnRNNForwardTraining( cudnn.handle(), rnn_desc.handle(), model_dims.max_seq_length, input_desc.handles(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.handles(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2021-06-17 17:23:44.905169: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at cudnn_rnn_ops.cc:1560 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 768, 768, 1, 29, 41, 768]
2021-06-17 17:23:44.906664: I tensorflow/stream_executor/stream.cc:1404] [stream=0x55774c2eb680,impl=0x5577394acab0] did not wait for [stream=0x55774c2eb410,impl=0x5577266661f0]
2021-06-17 17:23:44.906810: E tensorflow/stream_executor/cuda/cuda_driver.cc:1085] could not wait stream on event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2021-06-17 17:23:44.906826: E tensorflow/stream_executor/cuda/cuda_driver.cc:1085] could not wait stream on event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2021-06-17 17:23:44.906841: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2021-06-17 17:23:44.906859: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:721] failed to record completion event; therefore, failed to create inter-stream dependency
2021-06-17 17:23:44.906872: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2021-06-17 17:23:44.906888: E tensorflow/stream_executor/stream.cc:334] Error recording event in stream: Error recording CUDA event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2021-06-17 17:23:44.906903: F tensorflow/core/common_runtime/device/device_event_mgr.cc:221] Unexpected Event status: 1
2021-06-17 17:23:44.906911: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2021-06-17 17:23:44.906920: E tensorflow/stream_executor/cuda/cuda_driver.cc:1202] failed to enqueue async memcpy from host to device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; GPU dst: 0x7fec7589a700; host src: 0x7fec55458200; size: 4=0x4
2021-06-17 17:23:44.906934: F tensorflow/core/common_runtime/device/device_event_mgr.cc:221] Unexpected Event status: 1
Fatal Python error: Aborted2021-06-17 17:23:44.906946: E tensorflow/stream_executor/cuda/cuda_driver.cc:1202] failed to enqueue async memcpy from host to device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; GPU dst: 0x7fed1b838100; host src: 0x7fe28e26b040; size: 24531156=0x17650d4


Thread 0x00007fec57a63700 (most recent call first):
  File "2021-06-17 17:23:44.906960: E tensorflow/stream_executor/cuda/cuda_driver.cc:1202] failed to enqueue async memcpy from host to device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; GPU dst: 0x7fecaa6b1b00; host src: 0x7fec55457a00; size: 164=0xa4
/home/sfalk2021-06-17 17:23:44.906974: E tensorflow/stream_executor/cuda/cuda_driver.cc:1182] failed to enqueue async memcpy from device to host: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; host dst: 0x7fec5545af00; GPU src: 0x7fe75f100d00; size: 31980=0x7cec
/minicon2021-06-17 17:23:44.906987: F tensorflow/core/common_runtime/device/device_event_mgr.cc:221] Unexpected Event status: 1
da3/Fatal Python error: eAborted nvs/asr2/lib/python3.8/multiprocessingFatal Python error: /Abortedpool.py"Aborted (core dumped)

It's possible that this has nothing to do with https://github.com/HawkAaron/warp-transducer, but it's the only external library I am using in combination with TensorFlow.

See also tensorflow/tensorflow#50326

@yufang67

yufang67 commented Aug 2, 2022

Hi @stefan-falk,
Did you resolve the issue?
I have a similar problem with TF 2.8.2 + CUDA 11.2 + warp-rnnt. The issue occurs only on multi-GPU.
