This repository has been archived by the owner on Nov 17, 2023. It is now read-only.
Environment info
Operating System: Linux
Compiler: gcc 5.4.0
Package used (Python/R/Scala/Julia): C++
MXNet version: 703e8ee (current master)
Error Message:
segmentation fault or free: invalid pointer
Steps to reproduce
1. Set the CUDA, cuDNN, OpenBLAS, and OpenCV flags, and USE_CPP_PACKAGE = 1, in config.mk.
2. Run make -j4.
3. Run LD_LIBRARY_PATH=$LD_LIBRARY_PATH:lib valgrind ./build/cpp-package/example/lenet
4. Valgrind reports a memory error shortly after training starts.
Without Valgrind, the segmentation fault appears randomly, so I ran the example under Valgrind to catch the offending code as soon as the fault occurs.
After running Valgrind I found that the segmentation fault is caused by ThreadedOpr::Delete(threaded_opr) being called multiple times on the same object, which makes the destructor run twice and (possibly) corrupts the object pool.
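To illustrate why a second Delete on the same object can corrupt a pool, here is a minimal sketch. ToyPool and DoubleReleaseCausesAliasing are hypothetical names I made up for this example; this is not MXNet's object pool, only a simple free list that behaves the way I suspect the real one does when a slot is returned twice.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy pool (NOT MXNet's object pool): slots are identified by
// index, and released slots are pushed onto a free list for reuse.
class ToyPool {
 public:
  explicit ToyPool(std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) free_list_.push_back(n - 1 - i);
  }

  // Hands out the slot on top of the free list.
  std::size_t Acquire() {
    assert(!free_list_.empty());
    std::size_t slot = free_list_.back();
    free_list_.pop_back();
    return slot;
  }

  // Returns a slot to the pool. Nothing stops a caller from releasing twice.
  void Release(std::size_t slot) { free_list_.push_back(slot); }

 private:
  std::vector<std::size_t> free_list_;
};

// Releases one slot twice, then reports whether two later acquisitions
// were handed the same slot, i.e. two "live" objects alias each other.
bool DoubleReleaseCausesAliasing() {
  ToyPool pool(4);
  std::size_t slot = pool.Acquire();
  pool.Release(slot);
  pool.Release(slot);  // the bug: second release of the same slot
  std::size_t a = pool.Acquire();
  std::size_t b = pool.Acquire();
  return a == b;
}
```

In this toy version the double release leaves the same slot on the free list twice, so two later allocations share storage. With a real pool that also runs the destructor on release, the damage could show up as a segmentation fault or an invalid free at some unrelated later point, which would match the random crashes I see.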
I also observed that placing a mutex in ThreadedEngine::OnComplete(ThreadedOpr* threaded_opr) prevents the segmentation fault: Valgrind reports no errors and MXNet no longer crashes, although I believe this is not the correct solution.
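For reference, the workaround amounts to something like the sketch below. ToyEngine and RunCallbacks are hypothetical stand-ins, not MXNet code; the sketch only shows the pattern of serializing a completion callback with a std::mutex, under the assumption that the callback touches shared state.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for an engine (NOT MXNet's ThreadedEngine).
class ToyEngine {
 public:
  // Completion callback; the lock_guard serializes concurrent callers.
  void OnComplete() {
    std::lock_guard<std::mutex> lock(mutex_);
    ++completed_;  // shared bookkeeping touched by every callback
  }

  long completed() {
    std::lock_guard<std::mutex> lock(mutex_);
    return completed_;
  }

 private:
  std::mutex mutex_;
  long completed_ = 0;
};

// Fires the callback from several worker threads and returns the total
// number of completions the engine recorded.
long RunCallbacks(int num_threads, int calls_per_thread) {
  ToyEngine engine;
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&engine, calls_per_thread] {
      for (int i = 0; i < calls_per_thread; ++i) engine.OnComplete();
    });
  }
  for (auto& w : workers) w.join();
  return engine.completed();
}
```

A lock like this makes the callbacks run one at a time, which may hide a race without explaining it. It also would not stop the same operator from being completed twice, so it probably masks the problem rather than fixing the root cause.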
I traced the code and still have no idea why the OnComplete callback is called more than once with the same threaded_opr, so I decided to open this issue in the hope that someone can help resolve the problem. Thank you very much.
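One way to catch the second call at the moment it happens would be a once-only flag on the operator, sketched below. ToyOpr, its completed flag, and MarkCompletedOnce are all assumptions of this sketch; I am not claiming ThreadedOpr has such a member.

```cpp
#include <atomic>

// Hypothetical operator record. The `completed` flag is an assumption of
// this sketch, not a known member of MXNet's ThreadedOpr.
struct ToyOpr {
  std::atomic<bool> completed{false};
};

// Returns true only for the first completion of `opr`. Any later call
// returns false, which is where a debug build could log or abort.
bool MarkCompletedOnce(ToyOpr* opr) {
  // exchange() atomically sets the flag and returns the previous value,
  // so exactly one caller observes `false` even under concurrency.
  return !opr->completed.exchange(true);
}
```

Checking the return value at the top of the completion callback and aborting when it is false would give a stack trace for the duplicate call itself instead of for the later crash.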
@lx75249 could you try to stress the machine using something like stress -i 10000 when training the network?
I have tested this issue on two machines. One of them does not show the segmentation fault under normal conditions, but MXNet crashes immediately when the system is under heavy load.
Thank you for testing this out.
Valgrind log for this report: https://gist.github.com/sifmelcara/cef9f8a4d7e4d7f8de1520419476f4b0#file-log-L127