torch classify many task failed with error code -11 #1173

Closed
lukeyeager opened this issue Oct 17, 2016 · 12 comments · Fixed by #1332

@lukeyeager
Member

Torch jobs on TravisCI have been failing for a while now (>1 week, <2 weeks):

https://travis-ci.org/NVIDIA/DIGITS/jobs/167788079
https://travis-ci.org/lukeyeager/DIGITS/jobs/167736840
https://travis-ci.org/lukeyeager/DIGITS/jobs/167736180

Things all the test failures have in common:

  1. Framework is Torch
  2. Test class is *LeNet*
    • These are the only tests which use a "real" network (i.e. one with convolution layers)
  3. Test name is *classify_many* or *top_n*
    • These are the only tests which do inference on multiple inputs
  4. They all fail with this message: torch classify many task failed with error code -11
    • In Python, a negative return code from a subprocess means the process was killed by a signal. Signal 11 is SIGSEGV ("invalid memory reference"); see the sketch below.
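
As a quick illustration (a minimal sketch, not DIGITS code; the command is just a placeholder), this is how a negative return code maps back to the killing signal in Python:

```python
import subprocess

# Launch the Torch script the way a wrapper might (command is illustrative).
proc = subprocess.Popen(["th", "some_script.lua"])
proc.wait()

if proc.returncode < 0:
    # Negative return code: the child was killed by signal -returncode.
    # -11 corresponds to SIGSEGV (segmentation fault).
    print("killed by signal %d" % -proc.returncode)
else:
    print("exited with code %d" % proc.returncode)
```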
@lukeyeager
Member Author

Trying to find the offending change is difficult. Here's what has changed in the last two weeks:

torch/distro@5c7c762...bd5e664
torch/nn@c40e59e...a8e63f2
torch/torch7@b1ce165...4f7843e

What's worse, I can't reproduce it locally yet.

@soumith does the numbered list above make you think of any potential breaking changes that may have gone into Torch recently?

@soumith

soumith commented Oct 17, 2016

The bug seems to be around not being able to load libcudnn.so.

I'm digging into it a bit, but I wonder if anything about how cuDNN is deployed onto these build boxes has changed recently.

@lukeyeager
Member Author

I think these messages are just warnings:

Failed to load cudnn backend (is libcudnn.so in your library path?)
Failed to load cunn backend (is CUDA installed?)
Falling back to legacy nn backend

Thanks for taking a look! I was just hoping you'd think of something off the top of your head - I wasn't asking you to actually dig into it.

@soumith

soumith commented Oct 17, 2016

@lukeyeager I could probably think of the offending change if I knew what error message is emitted from the Torch side. It looks like that isn't being logged right now?

@lukeyeager
Member Author

No, not right now. Once I figure out how to reproduce it, I'll post back here.

@gheinrich
Contributor

Since it's a segmentation fault, there is probably no error message from Torch. I'll see if I can isolate the particular line of code that is causing the crash.

@gheinrich
Contributor

Hi @soumith, it seems that setting the number of threads to 1 (torch.setnumthreads(1)) as in #1179, as opposed to the default 8, "solves" this issue. Are you aware of any concurrency issues when running convolutions on CPU-only systems? Thanks for your offer to help, by the way!

@soumith

soumith commented Oct 18, 2016

@gheinrich I'm definitely not aware of any concurrency issues with convolutions. The logic that computes convolution layers simply delegates to the underlying BLAS library.

What I suspect is that you might be using OpenBLAS built without NO_AFFINITY=1, which is a known OpenBLAS concurrency issue, I think: https://github.com/torch/distro/blob/master/install-deps#L18

If you link Torch against MKL, or if OpenBLAS was properly compiled, no such issues should occur.

@gheinrich
Contributor

Thanks for the pointer, Soumith. We're installing the libopenblas-dev package with apt, so it is impractical for us to rebuild OpenBLAS. However, setting export OPENBLAS_MAIN_FREE=1 at run time also seems to address the issue. Cheers!
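
For reference, here is a minimal sketch (not the actual DIGITS change; the script name and command are placeholders) of applying the same workaround from the Python side when the Torch worker process is spawned:

```python
import os
import subprocess

# Copy the current environment and disable OpenBLAS thread affinity before
# launching the Torch worker. OPENBLAS_MAIN_FREE=1 tells OpenBLAS not to pin
# its threads to specific cores.
env = dict(os.environ)
env["OPENBLAS_MAIN_FREE"] = "1"

proc = subprocess.Popen(["th", "wrapper.lua"], env=env)  # placeholder command
proc.wait()
print("return code: %d" % proc.returncode)
```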

@lukeyeager
Member Author

Thanks for investigating @gheinrich!

@lukeyeager
Member Author

lukeyeager commented Oct 19, 2016

Hmm...

Torch installs OpenBLAS to its default location of /opt/OpenBLAS (this line tells Torch where to find it).

We cache the Torch build in DIGITS, but not the OpenBLAS build. Since tip-of-tree (ToT) OpenBLAS has the same SONAME as the version installed with deb packages on 14.04, Torch loads the library just fine.

But it's possible that a Torch build compiled against ToT OpenBLAS can't safely use the older version that ships with 14.04, and that's why we're seeing memory corruption issues.

Need to rewrite some stuff tomorrow to test this theory...
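
One way to check this theory (a sketch; the library path is an assumption that will differ per install) is to see which libopenblas the cached Torch build actually resolves at run time:

```python
import subprocess

# Hypothetical path to Torch's cached core library; adjust for the install.
torch_lib = "/home/travis/torch/install/lib/libTH.so"

# ldd shows which libopenblas the dynamic linker resolves for this binary:
# the apt copy under /usr/lib, or the /opt/OpenBLAS build Torch was compiled
# against.
output = subprocess.check_output(["ldd", torch_lib]).decode()
for line in output.splitlines():
    if "openblas" in line.lower():
        print(line.strip())
```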
