RuntimeError: CUDA error: misaligned address #88

Open
xiongsiheng opened this issue Jun 11, 2020 · 0 comments

Hi everyone! I ran into a strange bug that has confused me for several days. Sometimes the model hits this error after dozens of epochs (e.g. 40, 80, or 100); sometimes the error does not occur at all. When the model is resumed from a checkpoint saved before the error, the error may or may not appear again. Does anyone know what causes this? Any reply would be appreciated.
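Because the error comes and goes between runs and after resuming, it may help to first pin down every source of randomness so that a failing run can be replayed. The following is only a minimal sketch under that assumption; `seed_everything` is a hypothetical helper name, not something from this project, and fixing seeds does not by itself guarantee the error reproduces.

```python
import random


def seed_everything(seed):
    """Seed the RNGs a typical training script uses (hypothetical helper)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # no-op when CUDA is unavailable
        # Prefer deterministic cuDNN kernels over auto-tuned ones.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


seed_everything(1234)
```

Calling it once at the start of `main()` and again with the same seed right after loading a checkpoint keeps resumed runs comparable, although data loader workers and some CUDA kernels can still introduce nondeterminism.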

With Python 3.6 + PyTorch 1.4 + CUDA 10.0, I get:
Traceback (most recent call last):
File "/xxx/xxx/xxx/anaconda3/envs/ASL3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/xxx/xxx/xxx/anaconda3/envs/ASL3/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
input = module(input)
File "/xxx/xxx/xxx/anaconda3/envs/ASL3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/xxx/xxx/xxx/anaconda3/envs/ASL3/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 87, in forward
return F.linear(input, self.weight, self.bias)
File "/xxx/xxx/xxx/anaconda3/envs/ASL3/lib/python3.6/site-packages/torch/nn/functional.py", line 1370, in linear
ret = torch.addmm(bias, input, weight.t())
RuntimeError: CUDA error: misaligned address

terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: misaligned address (insert_events at /opt/conda/conda-bld/pytorch_1579027003190/work/c10/cuda/CUDACachingAllocator.cpp:764)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7f172f1a1627 in /xxx/xxx/xxx/anaconda3/envs/ASL3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1ab04 (0x7f172f3e1b04 in /xxx/xxx/xxx/anaconda3/envs/ASL3/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x1cbd1 (0x7f172f3e3bd1 in /xxx/xxx/xxx/anaconda3/envs/ASL3/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x4d (0x7f172f18eb9d in /xxx/xxx/xxx/anaconda3/envs/ASL3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x6871fa (0x7f17606161fa in /xxx/xxx/xxx/anaconda3/envs/ASL3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

frame #20: __libc_start_main + 0xe7 (0x7f1772122b97 in /lib/x86_64-linux-gnu/libc.so.6)
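CUDA kernels are launched asynchronously, so the Python line named in a traceback like the one above (here `torch.addmm`) is not necessarily the operation that actually faulted; the C++ part of the trace shows the error only surfacing later, in the caching allocator. One way to get a more trustworthy stack trace is to force synchronous launches with the `CUDA_LAUNCH_BLOCKING` environment variable. The sketch below is a debugging aid only, and `enable_blocking_cuda_launches` is a hypothetical helper name; it slows training down and does not fix anything.

```python
import os


def enable_blocking_cuda_launches(env=None):
    """Set CUDA_LAUNCH_BLOCKING=1 so each kernel launch is synchronous.

    Hypothetical helper. The variable has to be set before CUDA is
    initialised, so the safest place is before `import torch`.
    """
    env = os.environ if env is None else env
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Put this at the very top of main.py, above `import torch`.
enable_blocking_cuda_launches()
```

Setting the variable in the shell when starting the script (`CUDA_LAUNCH_BLOCKING=1 python main.py`) has the same effect without touching the code.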

With Python 3.5 + PyTorch 0.4 + CUDA 9.0, I get:
Traceback (most recent call last):
File "main.py", line 329, in <module>
main()
File "main.py", line 128, in main
train(train_loader, model, criterion, optimizer, epoch, log_training)
File "main.py", line 170, in train
output = model(input_var)
File "/xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 123, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 133, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 77, in parallel_apply
raise output
File "/xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 53, in _worker
output = module(*input, **kwargs)
File "/xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/xxx/xxx/xxx/project/TRN-pytorch/models.py", line 220, in forward
base_out = self.base_model(input.view((-1, sample_len) + input.size()[-2:]))
File "/xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/xxx/xxx/xxx/project/TRN-pytorch/model_zoo/bninception/pytorch_load.py", line 57, in forward
data_dict[op[2]] = torch.cat(tuple(data_dict[x] for x in op[-1]), 1)
RuntimeError: cuda runtime error (74) : misaligned address at /opt/conda/conda-bld/pytorch_1533672544752/work/aten/src/THC/THCCachingHostAllocator.cpp:271
terminate called after throwing an instance of 'at::Error'
what(): CUDA error: invalid device pointer (CudaCachingDeleter at /opt/conda/conda-bld/pytorch_1533672544752/work/aten/src/THC/THCCachingAllocator.cpp:498)
frame #0: THStorage_free + 0x44 (0x7fc7bba51a04 in /xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #1: THTensor_free + 0x2f (0x7fc7bbaff66f in /xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/lib/libcaffe2.so)
frame #2: at::CUDAFloatTensor::~CUDAFloatTensor() + 0x9 (0x7fc7a64ac609 in /xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/lib/libcaffe2_gpu.so)
frame #3: torch::autograd::Variable::Impl::~Impl() + 0x1f7 (0x7fc7bd6c62d7 in /xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so)
frame #4: torch::autograd::Variable::Impl::~Impl() + 0x9 (0x7fc7bd6c6429 in /xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x6e8a44 (0x7fc7bd6dda44 in /xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so)
frame #6: <unknown function> + 0x6e8b24 (0x7fc7bd6ddb24 in /xxx/xxx/xxx/anaconda3/envs/TRN/lib/python3.5/site-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so)

frame #23: __libc_start_main + 0xe7 (0x7fc7cec3bb97 in /lib/x86_64-linux-gnu/libc.so.6)
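In the traceback above the failure is raised from `output = model(input_var)` inside `train`, but the log does not say which epoch or batch was being processed. To check whether the crash correlates with a particular batch, one option is to wrap the forward call so the error message carries that context. This is only a sketch: `run_step_with_context` and its arguments are hypothetical names and are not part of `main.py`.

```python
def run_step_with_context(step_fn, batch, epoch, batch_idx):
    """Call step_fn(batch); on RuntimeError, re-raise with epoch/batch info."""
    try:
        return step_fn(batch)
    except RuntimeError as err:
        message = "step failed at epoch {}, batch {}: {}".format(
            epoch, batch_idx, err)
        # Chain the original exception so its traceback is preserved.
        raise RuntimeError(message) from err


# Hypothetical usage inside the training loop:
#     output = run_step_with_context(model, input_var, epoch, i)
```

If the reported epoch and batch differ from run to run under a fixed seed, that would suggest the input data is not the trigger.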
