torch classify many task failed with error code -11 #1173

Closed
lukeyeager opened this issue Oct 17, 2016 · 12 comments · Fixed by #1332

@lukeyeager
Member

Torch jobs on TravisCI have been failing for a while now (>1 week, <2 weeks):

https://travis-ci.org/NVIDIA/DIGITS/jobs/167788079
https://travis-ci.org/lukeyeager/DIGITS/jobs/167736840
https://travis-ci.org/lukeyeager/DIGITS/jobs/167736180

Things all the test failures have in common:

  1. Framework is Torch
  2. Test class is *LeNet*
    • These are the only tests which use a "real" network (i.e. one with convolution layers)
  3. Test name is *classify_many* or *top_n*
    • These are the only tests which do inference on multiple inputs
  4. They all fail with this message: torch classify many task failed with error code -11
    • In Python, a negative return code from a subprocess means the process was killed by a signal. Signal 11 is SIGSEGV ("invalid memory reference"); see the sketch below.
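
As a quick illustration (a minimal sketch, not DIGITS code; the command is just a placeholder), this is how a negative return code maps back to the killing signal in Python:

```python
import subprocess

# Launch the Torch script the way a wrapper might (command is illustrative).
proc = subprocess.Popen(["th", "some_script.lua"])
proc.wait()

if proc.returncode < 0:
    # Negative return code: the child was killed by signal -returncode.
    # -11 corresponds to SIGSEGV (segmentation fault).
    print("killed by signal %d" % -proc.returncode)
else:
    print("exited with code %d" % proc.returncode)
```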
@lukeyeager
Member Author

Trying to find the offending change is difficult. Here's what has changed in the last two weeks:

torch/distro@5c7c762...bd5e664
torch/nn@c40e59e...a8e63f2
torch/torch7@b1ce165...4f7843e

What's worse, I can't reproduce it locally yet.

@soumith does the numbered list above make you think of any potential breaking changes that may have gone into Torch recently?

@soumith

soumith commented Oct 17, 2016

The bug seems to be around not being able to load libcudnn.so.

I'm digging into it a bit, but I wonder if anything about how cuDNN is deployed onto these build boxes has changed recently.

@lukeyeager
Member Author

I think these messages are just warnings:

Failed to load cudnn backend (is libcudnn.so in your library path?)
Failed to load cunn backend (is CUDA installed?)
Falling back to legacy nn backend

Thanks for taking a look! I was just hoping you'd think of something off the top of your head - I wasn't asking you to actually dig into it.

@soumith

soumith commented Oct 17, 2016

@lukeyeager I could probably think of the offending change if I knew what error message is emitted from the Torch side. It looks like that isn't being logged right now?

@lukeyeager
Member Author

No, not right now. Once I figure out how to reproduce it, I'll post back here.

@gheinrich
Contributor

Since it's a segmentation fault, there is probably no error message from Torch. I'll see if I can isolate the particular line of code that is causing the crash.

@gheinrich
Contributor

Hi @soumith, it seems that setting the number of threads to 1 (torch.setnumthreads(1)) as in #1179, as opposed to the default 8, "solves" this issue. Are you aware of any concurrency issues when running convolutions on CPU-only systems? Thanks for your offer to help, by the way!

@soumith

soumith commented Oct 18, 2016

@gheinrich I'm definitely not aware of any concurrency issues with convolutions. The logic that computes convolution layers simply delegates to the underlying BLAS library.

What I suspect is that you might be using OpenBLAS built without NO_AFFINITY=1, which is a known OpenBLAS concurrency issue, I think: https://github.com/torch/distro/blob/master/install-deps#L18

If you link Torch against MKL, or if OpenBLAS was properly compiled, no such issues should occur.

@gheinrich
Contributor

Thanks for the pointer, Soumith. We're installing the libopenblas-dev package with apt, so it is impractical for us to rebuild OpenBLAS. However, setting export OPENBLAS_MAIN_FREE=1 at run time also seems to address the issue. Cheers!
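
For reference, here is a minimal sketch (not the actual DIGITS change; the script name and command are placeholders) of applying the same workaround from the Python side when the Torch worker process is spawned:

```python
import os
import subprocess

# Copy the current environment and disable OpenBLAS thread affinity before
# launching the Torch worker. OPENBLAS_MAIN_FREE=1 tells OpenBLAS not to pin
# its threads to specific cores.
env = dict(os.environ)
env["OPENBLAS_MAIN_FREE"] = "1"

proc = subprocess.Popen(["th", "wrapper.lua"], env=env)  # placeholder command
proc.wait()
print("return code: %d" % proc.returncode)
```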

@lukeyeager
Member Author

Thanks for investigating @gheinrich!

@lukeyeager
Member Author

lukeyeager commented Oct 19, 2016

Hmm...

Torch installs OpenBLAS to its default location of /opt/OpenBLAS (this line tells Torch where to find it).

We cache the Torch build in DIGITS, but not the OpenBLAS build. Since tip-of-tree (ToT) OpenBLAS has the same SONAME as the version installed with deb packages on 14.04, Torch loads the library just fine.

But it's possible that a Torch build compiled against ToT OpenBLAS can't safely use the older version that ships with 14.04, and that's why we're seeing memory corruption issues.

Need to rewrite some stuff tomorrow to test this theory...
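
One way to check this theory (a sketch; the library path is an assumption that will differ per install) is to see which libopenblas the cached Torch build actually resolves at run time:

```python
import subprocess

# Hypothetical path to Torch's cached core library; adjust for the install.
torch_lib = "/home/travis/torch/install/lib/libTH.so"

# ldd shows which libopenblas the dynamic linker resolves for this binary:
# the apt copy under /usr/lib, or the /opt/OpenBLAS build Torch was compiled
# against.
output = subprocess.check_output(["ldd", torch_lib]).decode()
for line in output.splitlines():
    if "openblas" in line.lower():
        print(line.strip())
```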
