torch classify many task failed with error code -11 #1173
Comments
Trying to find the offending change is difficult. Here's what has changed in the last two weeks: torch/distro@5c7c762...bd5e664. What's worse is I can't reproduce it locally yet. @soumith does the numbered list above make you think of any potential breaking changes that may have gone into Torch recently? |
The bug seems to be about not being able to load libcudnn.so. I'm digging into it a bit, but I wonder if anything wrt deploying cuDNN onto these build boxes has changed recently. |
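For anyone who wants to check the libcudnn.so theory on a build box, here is a minimal sketch using Python's `ctypes` to ask the dynamic loader whether a library can be opened. The helper name `can_load` is made up for illustration, and the sketch only shows whether the loader can find and open the file, not whether Torch itself would load it successfully.

```python
import ctypes


def can_load(library_name):
    """Return (True, None) if the dynamic loader can open the library,
    otherwise (False, <loader error message>)."""
    try:
        ctypes.CDLL(library_name)
    except OSError as exc:
        # The loader message usually says whether the file is missing
        # or whether one of its own dependencies could not be resolved.
        return False, str(exc)
    return True, None


if __name__ == "__main__":
    ok, err = can_load("libcudnn.so")
    print("libcudnn.so:", "OK" if ok else "FAILED: %s" % err)
```

Running it on a box where the tests fail and on one where they pass should show whether the cuDNN deployment differs between the two.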
I think these messages are just warnings:
Thanks for taking a look! I was just hoping you'd think of something off the top of your head - I wasn't asking you to actually dig into it. |
@lukeyeager I could probably think of the offending change if I knew what error message is emitted from the Torch side. Looks like that isn't being logged right now? |
No, not right now. Once I figure out how to reproduce it, I'll post back here. |
Since it's a segmentation fault there is probably no error message from Torch. I'll see if I can isolate the particular line of code that is causing the crash. |
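As background on why "error code -11" points at a segmentation fault: on POSIX, Python's `subprocess` reports a negative return code `-N` when the child was terminated by signal `N`, and signal 11 is SIGSEGV on Linux. The sketch below assumes the reported code is such a `subprocess` return code; `describe_returncode` is a hypothetical helper, not part of DIGITS.

```python
import signal


def describe_returncode(returncode):
    """Turn a subprocess return code into a human-readable description."""
    if returncode is None:
        return "still running"
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # On POSIX, -N means the child was terminated by signal N.
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return "killed by signal %d (%s)" % (signum, name)
    return "exited with status %d" % returncode


if __name__ == "__main__":
    print(describe_returncode(-11))
```

Logging the signal name next to the raw code would make failures like this one easier to recognise in CI output.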
@gheinrich I'm definitely not aware of convolutions having concurrency issues. What I suspect is that you might be using OpenBLAS without setting NO_AFFINITY=1, which I think is a known OpenBLAS issue with concurrency: https://github.com/torch/distro/blob/master/install-deps#L18 If you link Torch against MKL, or if OpenBLAS was properly compiled, no such issues should occur. |
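One rough way to look for an affinity side effect is to compare the CPUs the current process may run on with the total CPU count. This is only a sketch under assumptions: `affinity_summary` is a made-up helper, `os.sched_getaffinity` exists only on Linux, and a restricted set can also come from `taskset`, cgroups, or a container, so it is a hint rather than proof that OpenBLAS was built with affinity enabled.

```python
import os


def affinity_summary():
    """Report how many CPUs this process is allowed to run on."""
    total = os.cpu_count() or 1
    if hasattr(os, "sched_getaffinity"):
        # Linux-only: the set of CPUs the current process may be scheduled on.
        allowed = len(os.sched_getaffinity(0))
    else:
        # Assumption: without the call, treat the process as unrestricted.
        allowed = total
    return {"allowed": allowed, "total": total, "restricted": allowed < total}


if __name__ == "__main__":
    print(affinity_summary())
```

Calling it before and after the BLAS library is loaded in the same process would show whether loading the library changed the allowed set.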
Thanks for the pointer Soumith. We're installing |
Thanks for investigating @gheinrich! |
It looks like #1179 may not have fixed this entirely. |
Hmm... Torch installs OpenBLAS to its default location. We cache the Torch build in DIGITS, but not the OpenBLAS build. Since ToT OpenBLAS has the same SONAME as the version installed with deb packages on 14.04, Torch loads the library just fine. But it's possible that a build of Torch that was compiled against ToT OpenBLAS is unable to use the old version that comes with 14.04, and that's why we're seeing memory corruption issues. Need to rewrite some stuff tomorrow to test this theory... |
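To see which OpenBLAS file a running process actually mapped, one option on Linux is to read `/proc/<pid>/maps`. The sketch below is illustrative only: both function names are hypothetical, and the path in the sample text is invented for the example rather than taken from the build boxes.

```python
def loaded_libraries(maps_text, needle):
    """Return the sorted, de-duplicated mapped file paths containing `needle`.

    `maps_text` is the content of /proc/<pid>/maps, whose lines look like:
    address perms offset dev inode pathname
    """
    paths = set()
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        # Anonymous mappings have no pathname, so they yield fewer fields.
        if len(fields) == 6 and needle in fields[5]:
            paths.add(fields[5].strip())
    return sorted(paths)


def libraries_in_process(pid, needle):
    """Linux-only: inspect a live process by PID (or the string 'self')."""
    with open("/proc/%s/maps" % pid) as handle:
        return loaded_libraries(handle.read(), needle)


if __name__ == "__main__":
    # Invented sample, for illustration only.
    sample = (
        "7f0000000000-7f0000001000 r-xp 00000000 08:01 123 /usr/lib/libopenblas.so.0\n"
        "7f0000001000-7f0000002000 rw-p 00001000 08:01 123 /usr/lib/libopenblas.so.0\n"
        "7f0000003000-7f0000004000 rw-p 00000000 00:00 0 \n"
    )
    print(loaded_libraries(sample, "openblas"))
```

Comparing the reported path on a cached build against a fresh one would show whether the process picked up the system package or the source build.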
Torch jobs on TravisCI have been failing for a while now (>1 week, <2 weeks):
https://travis-ci.org/NVIDIA/DIGITS/jobs/167788079
https://travis-ci.org/lukeyeager/DIGITS/jobs/167736840
https://travis-ci.org/lukeyeager/DIGITS/jobs/167736180
Things all the test failures have in common:
*LeNet*
*classify_many* or *top_n*
torch classify many task failed with error code -11