CUDA error: an illegal memory access was encountered #61

Open
FactoDeepLearning opened this issue Mar 26, 2020 · 12 comments

@FactoDeepLearning

Hello,
I'm facing the following error when using your package. It appears randomly after some epochs. Do you have an idea where it could come from?

File "main_rnnt.py", line 86, in <module>
    model.train()
  File "/gpfs1/dlocal/run/7027505/pytorch/rnnt/RNNT.py", line 174, in train
    batch_metrics = self.train_batch(x, y)
  File "/gpfs1/dlocal/run/7027505/pytorch/rnnt/RNNT.py", line 286, in train_batch
    loss = loss_func(pred, y.permute(1, 0).contiguous(), x_len, y_len)
  File "/gpfs1/home/2017018/dcoque01/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/gpfs1/home/2017018/dcoque01/pytorch/lib/python3.6/site-packages/warprnnt_pytorch-0.1-py3.6-linux-x86_64.egg/warprnnt_pytorch/__init__.py", line 100, in forward
    return self.loss(acts, labels, act_lens, label_lens, self.blank, self.reduction)
  File "/gpfs1/home/2017018/dcoque01/pytorch/lib/python3.6/site-packages/warprnnt_pytorch-0.1-py3.6-linux-x86_64.egg/warprnnt_pytorch/__init__.py", line 40, in forward
    grads /= minibatch_size
RuntimeError: CUDA error: an illegal memory access was encountered

CentOS 7
CUDA 10.0
Python 3.6.9
torch 1.2
gcc 7.3.0
GPU: Tesla P100-PCIE-12GB

@LearnedVector

LearnedVector commented Mar 31, 2020

Getting the same. Any fix? @FactoDeepLearning @HawkAaron

EDIT
This was due to me not moving acts, labels, input_len, and label_len to .cuda() in PyTorch. Fixed now.

EDIT2
I'm still getting it now. It trains at first, then hits this error after X iterations.
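
For reference, here is a minimal sketch of what the first EDIT describes, with all four loss inputs moved onto the GPU before the call. The shapes and the int32 dtypes are assumptions for illustration, not something verified in this thread:

    import torch
    from warprnnt_pytorch import RNNTLoss

    device = torch.device("cuda")
    rnnt_loss = RNNTLoss(blank=0)

    # Assumed layout: acts is (B, T, U+1, vocab) float32, labels is (B, U),
    # act_lens / label_lens hold per-example lengths; everything on one GPU.
    B, T, U, vocab = 2, 50, 10, 30
    acts = torch.randn(B, T, U + 1, vocab, device=device, requires_grad=True)
    labels = torch.randint(1, vocab, (B, U), dtype=torch.int32, device=device)
    act_lens = torch.full((B,), T, dtype=torch.int32, device=device)
    label_lens = torch.full((B,), U, dtype=torch.int32, device=device)

    loss = rnnt_loss(acts, labels, act_lens, label_lens)
    loss.backward()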

@LearnedVector

LearnedVector commented Apr 1, 2020

After some debugging, I think there might be a bug in this library @HawkAaron. I am printing the cost at this line https://github.com/HawkAaron/warp-transducer/blob/master/pytorch_binding/warprnnt_pytorch/__init__.py#L37, and the RuntimeError: CUDA error: an illegal memory access was encountered only happens when the cost prints as 0. I am assuming that the loss function https://github.com/HawkAaron/warp-transducer/blob/master/pytorch_binding/warprnnt_pytorch/__init__.py#L27 is not updating the costs or gradients, causing it to error out. Any ideas?

Also, the issue does not occur when running on CPU, and there are no 0 costs in that case.
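
A sketch of that kind of check at the training-loop level, using the same call as in the traceback above (loss_func, pred, y, x_len and y_len are assumed to be the objects from that loop):

    import torch

    # Compute the loss exactly as in the traceback above, then flag suspicious
    # values (zero or non-finite) before backward(), so the offending batch
    # (shapes, lengths, labels) can be inspected instead of crashing later.
    loss = loss_func(pred, y.permute(1, 0).contiguous(), x_len, y_len)
    if loss.item() == 0 or not torch.isfinite(loss).all():
        print("suspicious RNNT loss:", loss.item(),
              "x_len:", x_len.tolist(), "y_len:", y_len.tolist())
    loss.backward()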

@funcwj

funcwj commented Apr 20, 2020

Same issues.

@jaesong
Contributor

jaesong commented Apr 22, 2020

I think #64 will fix this issue.

@housebaby

My version is the latest. When using warp-transducer in ESPnet, the error still occurs: "CUDA error: an illegal memory access was encountered". I discussed it in the ESPnet project, but they think it is a problem with the transducer.

espnet/espnet#1860 (comment)

My warp-transducer version is as follows.
Merge: c1a265f 5098002
Author: Mingkun Huang mingkunhuang95@gmail.com
Date: Mon Apr 27 23:07:35 2020 +0800

Merge pull request #66 from kamo-naoyuki/pt1.5

Support pytorch1.5

@HawkAaron
Owner

@housebaby which kind of GPU did you use?

@housebaby

housebaby commented Jul 2, 2020

@housebaby which kind of GPU did you use?

Tesla V100

It does not always fail. In some cases, using either 4 or 8 cards, it works.
But when I just change the batch size (or learning rate) of a successful case, it fails.
It is confusing.

@oshindow

oshindow commented Jul 3, 2020

Same issue.
When the batch size is 3, it passes. When the batch size is set higher, it fails.

@jaesong
Contributor

jaesong commented Jul 6, 2020

Oh, right, there's an overflow issue at compute_grad_kernel:

    // 0 <= col < batch * T * U
    int col = blockIdx.x;

    // col * alphabet_size can be > 2**31 - 1 = INT_MAX, but its type is int
    Tp logpk = denom[col] + acts[col * alphabet_size + idx];

cuda-memcheck seems to catch such a problem with batch=1, src=53688, tgt=1+1, vocab=20000 (53688 * 2 * 20000 > INT_MAX).
I also suspect that there are similar overflow issues in ReduceHelper, but I haven't checked them properly.
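
Spelling out the arithmetic for that case (a quick check using the dimensions above):

    # With batch=1, T=53688, U=2 (tgt=1+1) and vocab=20000, the flat index
    # col * alphabet_size is computed as a 32-bit int inside the kernel but
    # exceeds INT_MAX, so it wraps negative and acts[] is read out of bounds.
    INT_MAX = 2**31 - 1                          # 2147483647

    batch, T, U, alphabet_size = 1, 53688, 2, 20000
    max_col = batch * T * U - 1                  # largest blockIdx.x value
    max_flat_index = max_col * alphabet_size     # 2147500000

    print(max_flat_index > INT_MAX)              # True -> 32-bit overflow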

@housebaby

housebaby commented Jul 7, 2020

Oh, right, there's an overflow issue at compute_grad_kernel:

    // 0 <= col < batch * T * U
    int col = blockIdx.x;

    // col * alphabet_size can be > 2**31 - 1 = INT_MAX, but its type is int
    Tp logpk = denom[col] + acts[col * alphabet_size + idx];

cuda-memcheck seems to catch such problem with batch=1, src=53688, tgt=1+1, vocab=20000 (53688 * 2 * 20000 > INT_MAX).
I also suspect that there are similar overflow issues at ReduceHelper, but I haven't checked them properly.

Cool.
Then how should we solve this overflow problem? And will a fix for it be merged into warp-transducer soon? @HawkAaron @jaesong
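
Until the kernel index arithmetic is widened, one workaround consistent with the reports above (failures showing up as batch size grows) is to keep the flattened acts size under INT_MAX. A rough sketch of such a guard; the function name and arguments are hypothetical:

    INT_MAX = 2**31 - 1

    def rnnt_batch_fits_int32(batch, max_t, max_u, vocab):
        """Return True if the flattened (B, T, U+1, V) acts index stays
        within signed 32-bit range, i.e. the overflow above cannot occur."""
        return batch * max_t * (max_u + 1) * vocab <= INT_MAX

    # Example: the configuration from the comment above does not fit.
    print(rnnt_batch_fits_int32(1, 53688, 1, 20000))   # False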

@stefan-falk

stefan-falk commented Jun 17, 2021

I don't know if this is related, but after upgrading to TensorFlow 2.5.0 (and therefore to CUDA 11.1) I am seeing this when training RNN-based transducer models. The loss either becomes NaN or I see the following error:

2021-06-17 17:23:44.905116: E tensorflow/stream_executor/dnn.cc:729] CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1990): 'cudnnRNNForwardTraining( cudnn.handle(), rnn_desc.handle(), model_dims.max_seq_length, input_desc.handles(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.handles(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2021-06-17 17:23:44.905169: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at cudnn_rnn_ops.cc:1560 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 768, 768, 1, 29, 41, 768]
2021-06-17 17:23:44.906664: I tensorflow/stream_executor/stream.cc:1404] [stream=0x55774c2eb680,impl=0x5577394acab0] did not wait for [stream=0x55774c2eb410,impl=0x5577266661f0]
2021-06-17 17:23:44.906810: E tensorflow/stream_executor/cuda/cuda_driver.cc:1085] could not wait stream on event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2021-06-17 17:23:44.906826: E tensorflow/stream_executor/cuda/cuda_driver.cc:1085] could not wait stream on event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2021-06-17 17:23:44.906841: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2021-06-17 17:23:44.906859: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:721] failed to record completion event; therefore, failed to create inter-stream dependency
2021-06-17 17:23:44.906872: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2021-06-17 17:23:44.906888: E tensorflow/stream_executor/stream.cc:334] Error recording event in stream: Error recording CUDA event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2021-06-17 17:23:44.906903: F tensorflow/core/common_runtime/device/device_event_mgr.cc:221] Unexpected Event status: 1
2021-06-17 17:23:44.906911: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2021-06-17 17:23:44.906920: E tensorflow/stream_executor/cuda/cuda_driver.cc:1202] failed to enqueue async memcpy from host to device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; GPU dst: 0x7fec7589a700; host src: 0x7fec55458200; size: 4=0x4
2021-06-17 17:23:44.906934: F tensorflow/core/common_runtime/device/device_event_mgr.cc:221] Unexpected Event status: 1
Fatal Python error: Aborted2021-06-17 17:23:44.906946: E tensorflow/stream_executor/cuda/cuda_driver.cc:1202] failed to enqueue async memcpy from host to device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; GPU dst: 0x7fed1b838100; host src: 0x7fe28e26b040; size: 24531156=0x17650d4


Thread 0x00007fec57a63700 (most recent call first):
  File "2021-06-17 17:23:44.906960: E tensorflow/stream_executor/cuda/cuda_driver.cc:1202] failed to enqueue async memcpy from host to device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; GPU dst: 0x7fecaa6b1b00; host src: 0x7fec55457a00; size: 164=0xa4
/home/sfalk2021-06-17 17:23:44.906974: E tensorflow/stream_executor/cuda/cuda_driver.cc:1182] failed to enqueue async memcpy from device to host: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; host dst: 0x7fec5545af00; GPU src: 0x7fe75f100d00; size: 31980=0x7cec
/minicon2021-06-17 17:23:44.906987: F tensorflow/core/common_runtime/device/device_event_mgr.cc:221] Unexpected Event status: 1
da3/Fatal Python error: eAborted nvs/asr2/lib/python3.8/multiprocessingFatal Python error: /Abortedpool.py"Aborted (core dumped)

It's possible that this has nothing to do with https://github.com/HawkAaron/warp-transducer, but it's the only external library I am using in combination with TensorFlow.

See also tensorflow/tensorflow#50326

@yufang67

yufang67 commented Aug 2, 2022

Hi @stefan-falk,
Did you resolve the issue?
I have a similar problem with TF 2.8.2 + CUDA 11.2 + warp-rnnt. The issue occurs only on multi-GPU.
