Apparent data race in onnxruntime on aarch64 #32899

Closed
dan131riley opened this issue Feb 12, 2021 · 16 comments

@dan131riley

As mentioned in #31123, we see occasional thread-related crashes in onnxruntime on aarch64. Unlike many threading crashes, this one reproduces under gdb in a few percent of runs. The stack traces show that the other threads in the same routine are blocked on a mutex, so there may be a bug in the mutex logic. Unfortunately, valgrind on aarch64 doesn't understand the instruction used by nsync_mu_semaphore_p (it appears to use standard C++ atomics, so there may be a general issue with valgrind and C++ atomics).

Thread 5 (Thread 0x3ff4ece83c0 (LWP 15980)):
#0  0x000003ffb5d9a6e0 in syscall () from /lib64/libc.so.6
#1  0x000003fe09ec9540 in nsync::nsync_mu_semaphore_p(nsync::nsync_semaphore_s_*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#2  0x000003fe09ec88fc in nsync::nsync_mu_lock_slow_(nsync::nsync_mu_s_*, nsync::waiter*, unsigned int, nsync::lock_type_s*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#3  0x000003fe09ec8a1c in nsync::nsync_mu_lock(nsync::nsync_mu_s_*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#4  0x000003fe09d01e44 in onnxruntime::SessionState::UpdateMemoryPatternGroupCache(std::vector<std::reference_wrapper<onnxruntime::TensorShape const>, std::allocator<std::reference_wrapper<onnxruntime::TensorShape const> > > const&, std::unique_ptr<onnxruntime::MemoryPatternGroup, std::default_delete<onnxruntime::MemoryPatternGroup> >) const () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#5  0x000003fe09d31280 in onnxruntime::SequentialExecutor::Execute(onnxruntime::SessionState const&, std::vector<int, std::allocator<int> > const&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<int, std::allocator<int> > const&, std::vector<OrtValue, std::allocator<OrtValue> >&, std::unordered_map<unsigned long, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)>, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)> > > > const&, onnxruntime::logging::Logger const&) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#6  0x000003fe09d1f638 in onnxruntime::utils::ExecuteGraphImpl(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager const&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<OrtValue, std::allocator<OrtValue> >&, std::unordered_map<unsigned long, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)>, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)> > > > const&, ExecutionMode, bool const&, onnxruntime::logging::Logger const&, bool) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#7  0x000003fe09d20d24 in onnxruntime::utils::ExecuteGraph(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<OrtValue, std::allocator<OrtValue> >&, ExecutionMode, bool const&, onnxruntime::logging::Logger const&, bool) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#8  0x000003fe09922380 in onnxruntime::InferenceSession::Run(OrtRunOptions const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<OrtValue, std::allocator<OrtValue> >*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#9  0x000003fe098f3478 in OrtApis::Run(OrtSession*, OrtRunOptions const*, char const* const*, OrtValue const* const*, unsigned long, char const* const*, unsigned long, OrtValue**) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#10 0x000003fe0a093728 in cms::Ort::ONNXRuntime::run(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::vector<float, std::allocator<float> >, std::allocator<std::vector<float, std::allocator<float> > > >&, std::vector<std::vector<long, std::allocator<long> >, std::allocator<std::vector<long, std::allocator<long> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, long) const () from /cvmfs/cms-ib.cern.ch/nweek-02666/slc7_aarch64_gcc9/cms/cmssw/CMSSW_11_3_X_2021-02-04-2300/lib/slc7_aarch64_gcc9/libPhysicsToolsONNXRuntime.so
#11 0x000003fe0a0f48e0 in BoostedJetONNXJetTagsProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02666/slc7_aarch64_gcc9/cms/cmssw/CMSSW_11_3_X_2021-02-04-2300/lib/slc7_aarch64_gcc9/pluginRecoBTagONNXRuntimePlugins.so

Thread 4 (Thread 0x3ff4f6f83c0 (LWP 15979)):
#0  0x000003ffb5d9a6e0 in syscall () from /lib64/libc.so.6
#1  0x000003fe09ec9540 in nsync::nsync_mu_semaphore_p(nsync::nsync_semaphore_s_*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#2  0x000003fe09ec88fc in nsync::nsync_mu_lock_slow_(nsync::nsync_mu_s_*, nsync::waiter*, unsigned int, nsync::lock_type_s*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#3  0x000003fe09ec8a1c in nsync::nsync_mu_lock(nsync::nsync_mu_s_*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#4  0x000003fe09d01e44 in onnxruntime::SessionState::UpdateMemoryPatternGroupCache(std::vector<std::reference_wrapper<onnxruntime::TensorShape const>, std::allocator<std::reference_wrapper<onnxruntime::TensorShape const> > > const&, std::unique_ptr<onnxruntime::MemoryPatternGroup, std::default_delete<onnxruntime::MemoryPatternGroup> >) const () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#5  0x000003fe09d31280 in onnxruntime::SequentialExecutor::Execute(onnxruntime::SessionState const&, std::vector<int, std::allocator<int> > const&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<int, std::allocator<int> > const&, std::vector<OrtValue, std::allocator<OrtValue> >&, std::unordered_map<unsigned long, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)>, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)> > > > const&, onnxruntime::logging::Logger const&) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#6  0x000003fe09d1f638 in onnxruntime::utils::ExecuteGraphImpl(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager const&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<OrtValue, std::allocator<OrtValue> >&, std::unordered_map<unsigned long, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)>, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)> > > > const&, ExecutionMode, bool const&, onnxruntime::logging::Logger const&, bool) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#7  0x000003fe09d20d24 in onnxruntime::utils::ExecuteGraph(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<OrtValue, std::allocator<OrtValue> >&, ExecutionMode, bool const&, onnxruntime::logging::Logger const&, bool) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#8  0x000003fe09922380 in onnxruntime::InferenceSession::Run(OrtRunOptions const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<OrtValue, std::allocator<OrtValue> >*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#9  0x000003fe098f3478 in OrtApis::Run(OrtSession*, OrtRunOptions const*, char const* const*, OrtValue const* const*, unsigned long, char const* const*, unsigned long, OrtValue**) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#10 0x000003fe0a093728 in cms::Ort::ONNXRuntime::run(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::vector<float, std::allocator<float> >, std::allocator<std::vector<float, std::allocator<float> > > >&, std::vector<std::vector<long, std::allocator<long> >, std::allocator<std::vector<long, std::allocator<long> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, long) const () from /cvmfs/cms-ib.cern.ch/nweek-02666/slc7_aarch64_gcc9/cms/cmssw/CMSSW_11_3_X_2021-02-04-2300/lib/slc7_aarch64_gcc9/libPhysicsToolsONNXRuntime.so
#11 0x000003fe0a0f48e0 in BoostedJetONNXJetTagsProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02666/slc7_aarch64_gcc9/cms/cmssw/CMSSW_11_3_X_2021-02-04-2300/lib/slc7_aarch64_gcc9/pluginRecoBTagONNXRuntimePlugins.so

Thread 3 (Thread 0x3ff501083c0 (LWP 15978)):
#0  std::local_Rb_tree_decrement (__x=0x3ffffff29d0) at ../../../../../libstdc++-v3/src/c++98/tree.cc:110
#1  std::local_Rb_tree_decrement (__x=0x3ffffff29d0) at ../../../../../libstdc++-v3/src/c++98/tree.cc:95
#2  0x000003fe09d01950 in std::_Rb_tree<long, std::pair<long const, std::unique_ptr<onnxruntime::MemoryPatternGroup, std::default_delete<onnxruntime::MemoryPatternGroup> > >, std::_Select1st<std::pair<long const, std::unique_ptr<onnxruntime::MemoryPatternGroup, std::default_delete<onnxruntime::MemoryPatternGroup> > > >, std::less<long>, std::allocator<std::pair<long const, std::unique_ptr<onnxruntime::MemoryPatternGroup, std::default_delete<onnxruntime::MemoryPatternGroup> > > > >::_M_get_insert_hint_unique_pos(std::_Rb_tree_const_iterator<std::pair<long const, std::unique_ptr<onnxruntime::MemoryPatternGroup, std::default_delete<onnxruntime::MemoryPatternGroup> > > >, long const&) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#3  0x000003fe09d01c1c in std::_Rb_tree_iterator<std::pair<long const, std::unique_ptr<onnxruntime::MemoryPatternGroup, std::default_delete<onnxruntime::MemoryPatternGroup> > > > std::_Rb_tree<long, std::pair<long const, std::unique_ptr<onnxruntime::MemoryPatternGroup, std::default_delete<onnxruntime::MemoryPatternGroup> > >, std::_Select1st<std::pair<long const, std::unique_ptr<onnxruntime::MemoryPatternGroup, std::default_delete<onnxruntime::MemoryPatternGroup> > > >, std::less<long>, std::allocator<std::pair<long const, std::unique_ptr<onnxruntime::MemoryPatternGroup, std::default_delete<onnxruntime::MemoryPatternGroup> > > > >::_M_emplace_hint_unique<std::piecewise_construct_t const&, std::tuple<long const&>, std::tuple<> >(std::_Rb_tree_const_iterator<std::pair<long const, std::unique_ptr<onnxruntime::MemoryPatternGroup, std::default_delete<onnxruntime::MemoryPatternGroup> > > >, std::piecewise_construct_t const&, std::tuple<long const&>&&, std::tuple<>&&) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#4  0x000003fe09d01ee4 in onnxruntime::SessionState::UpdateMemoryPatternGroupCache(std::vector<std::reference_wrapper<onnxruntime::TensorShape const>, std::allocator<std::reference_wrapper<onnxruntime::TensorShape const> > > const&, std::unique_ptr<onnxruntime::MemoryPatternGroup, std::default_delete<onnxruntime::MemoryPatternGroup> >) const () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#5  0x000003fe09d31280 in onnxruntime::SequentialExecutor::Execute(onnxruntime::SessionState const&, std::vector<int, std::allocator<int> > const&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<int, std::allocator<int> > const&, std::vector<OrtValue, std::allocator<OrtValue> >&, std::unordered_map<unsigned long, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)>, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)> > > > const&, onnxruntime::logging::Logger const&) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#6  0x000003fe09d1f638 in onnxruntime::utils::ExecuteGraphImpl(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager const&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<OrtValue, std::allocator<OrtValue> >&, std::unordered_map<unsigned long, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)>, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)> > > > const&, ExecutionMode, bool const&, onnxruntime::logging::Logger const&, bool) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#7  0x000003fe09d20d24 in onnxruntime::utils::ExecuteGraph(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<OrtValue, std::allocator<OrtValue> >&, ExecutionMode, bool const&, onnxruntime::logging::Logger const&, bool) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#8  0x000003fe09922380 in onnxruntime::InferenceSession::Run(OrtRunOptions const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<OrtValue, std::allocator<OrtValue> >*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#9  0x000003fe098f3478 in OrtApis::Run(OrtSession*, OrtRunOptions const*, char const* const*, OrtValue const* const*, unsigned long, char const* const*, unsigned long, OrtValue**) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#10 0x000003fe0a093728 in cms::Ort::ONNXRuntime::run(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::vector<float, std::allocator<float> >, std::allocator<std::vector<float, std::allocator<float> > > >&, std::vector<std::vector<long, std::allocator<long> >, std::allocator<std::vector<long, std::allocator<long> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, long) const () from /cvmfs/cms-ib.cern.ch/nweek-02666/slc7_aarch64_gcc9/cms/cmssw/CMSSW_11_3_X_2021-02-04-2300/lib/slc7_aarch64_gcc9/libPhysicsToolsONNXRuntime.so
#11 0x000003fe0a0f48e0 in BoostedJetONNXJetTagsProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02666/slc7_aarch64_gcc9/cms/cmssw/CMSSW_11_3_X_2021-02-04-2300/lib/slc7_aarch64_gcc9/pluginRecoBTagONNXRuntimePlugins.so

Thread 1 (Thread 0x3ffb55d0000 (LWP 15679)):
#0  0x000003ffb5d9a6e0 in syscall () from /lib64/libc.so.6
#1  0x000003fe09ec9540 in nsync::nsync_mu_semaphore_p(nsync::nsync_semaphore_s_*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#2  0x000003fe09ec88fc in nsync::nsync_mu_lock_slow_(nsync::nsync_mu_s_*, nsync::waiter*, unsigned int, nsync::lock_type_s*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#3  0x000003fe09ec8a1c in nsync::nsync_mu_lock(nsync::nsync_mu_s_*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#4  0x000003fe09cffe7c in onnxruntime::SessionState::GetMemoryPatternGroup(std::vector<std::reference_wrapper<onnxruntime::TensorShape const>, std::allocator<std::reference_wrapper<onnxruntime::TensorShape const> > > const&, std::vector<int, std::allocator<int> > const&) const () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#5  0x000003fe09cea760 in onnxruntime::ExecutionFrame::ExecutionFrame(std::vector<int, std::allocator<int> > const&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<int, std::allocator<int> > const&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::unordered_map<unsigned long, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)>, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)> > > > const&, onnxruntime::SessionState const&) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#6  0x000003fe09d2f478 in onnxruntime::SequentialExecutor::Execute(onnxruntime::SessionState const&, std::vector<int, std::allocator<int> > const&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<int, std::allocator<int> > const&, std::vector<OrtValue, std::allocator<OrtValue> >&, std::unordered_map<unsigned long, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)>, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)> > > > const&, onnxruntime::logging::Logger const&) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#7  0x000003fe09d1f638 in onnxruntime::utils::ExecuteGraphImpl(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager const&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<OrtValue, std::allocator<OrtValue> >&, std::unordered_map<unsigned long, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)>, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)> > > > const&, ExecutionMode, bool const&, onnxruntime::logging::Logger const&, bool) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#8  0x000003fe09d20d24 in onnxruntime::utils::ExecuteGraph(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<OrtValue, std::allocator<OrtValue> >&, ExecutionMode, bool const&, onnxruntime::logging::Logger const&, bool) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#9  0x000003fe09922380 in onnxruntime::InferenceSession::Run(OrtRunOptions const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<OrtValue, std::allocator<OrtValue> >*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#10 0x000003fe098f3478 in OrtApis::Run(OrtSession*, OrtRunOptions const*, char const* const*, OrtValue const* const*, unsigned long, char const* const*, unsigned long, OrtValue**) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-05-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.3.0
#11 0x000003fe0a093728 in cms::Ort::ONNXRuntime::run(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::vector<float, std::allocator<float> >, std::allocator<std::vector<float, std::allocator<float> > > >&, std::vector<std::vector<long, std::allocator<long> >, std::allocator<std::vector<long, std::allocator<long> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, long) const () from /cvmfs/cms-ib.cern.ch/nweek-02666/slc7_aarch64_gcc9/cms/cmssw/CMSSW_11_3_X_2021-02-04-2300/lib/slc7_aarch64_gcc9/libPhysicsToolsONNXRuntime.so
#12 0x000003fe0a0f48e0 in BoostedJetONNXJetTagsProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02666/slc7_aarch64_gcc9/cms/cmssw/CMSSW_11_3_X_2021-02-04-2300/lib/slc7_aarch64_gcc9/pluginRecoBTagONNXRuntimePlugins.so
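
For readers following the traces: Thread 3 is inside the `std::map` insertion in `SessionState::UpdateMemoryPatternGroupCache`, Threads 4 and 5 are waiting in `nsync_mu_lock` in the same routine, and Thread 1 is waiting in `GetMemoryPatternGroup`. The access pattern this implies can be sketched roughly as below. This is only an illustration under stated assumptions, not the onnxruntime code: `Pattern`, `PatternCache`, and `RunConcurrentUpdates` are hypothetical names, and a plain `std::mutex` stands in for the nsync-based mutex seen in the traces.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for onnxruntime::MemoryPatternGroup.
struct Pattern {
  long key = 0;
};

// Hypothetical cache keyed by long, mirroring the std::map type in the trace.
// Not the onnxruntime implementation.
class PatternCache {
 public:
  // Const like the method in the trace, hence the mutable members.
  // Returns true if this call inserted the entry.
  bool Update(long key, std::unique_ptr<Pattern> pattern) const {
    std::lock_guard<std::mutex> lock(mutex_);
    if (cache_.find(key) != cache_.end()) return false;
    cache_.emplace(key, std::move(pattern));
    return true;
  }

  // Lookup takes the same mutex as Update, so a reader never walks the
  // tree while another thread is rebalancing it.
  const Pattern* Get(long key) const {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = cache_.find(key);
    return it == cache_.end() ? nullptr : it->second.get();
  }

  std::size_t Size() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return cache_.size();
  }

 private:
  mutable std::mutex mutex_;
  mutable std::map<long, std::unique_ptr<Pattern>> cache_;
};

// Starts num_threads threads that each try to insert keys 0..num_keys-1
// and returns the final number of cached entries.
std::size_t RunConcurrentUpdates(int num_threads, long num_keys) {
  PatternCache cache;
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&cache, num_keys] {
      for (long k = 0; k < num_keys; ++k) {
        auto p = std::make_unique<Pattern>();
        p->key = k;
        cache.Update(k, std::move(p));
      }
    });
  }
  for (auto& w : workers) w.join();
  return cache.Size();
}
```

With a working mutex, `RunConcurrentUpdates(8, 200)` should leave exactly 200 entries regardless of interleaving. A crash inside the red-black tree code, as in Thread 3, would be expected only if two threads reached the map at the same time or the map was touched somewhere outside the lock.
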
@cmsbuild
Contributor

A new Issue was created by @dan131riley Dan Riley.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor

assign core, reconstruction

FYI @riga @mialiu149

@cmsbuild
Contributor

New categories assigned: core,reconstruction

@Dr15Jones,@smuzaffar,@slava77,@perrotta,@makortel,@jpata you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

Should we inform ONNXRuntime developers?

@davidlange6
Contributor

davidlange6 commented Feb 12, 2021 via email

@riga
Contributor

riga commented Feb 12, 2021

We can start upgrading ONNXRuntime from 1.3.0 to 1.6.0 next week and see if the error persists. If the current ONNX models in CMSSW are compatible with the new version, the upgrade could also improve inference performance (see e.g. #32883).

@dan131riley
Author

With the updated ONNXRuntime, aarch64 is now getting assertion failures and exceptions:

terminate called after throwing an instance of 'onnxruntime::OnnxRuntimeException'
terminate called recursively
what():  /home/cmsbld/jenkins_b/workspace/build-any-ib/w/BUILD/slc7_aarch64_gcc9/external/onnxruntime/1.6.0-95dd9e8bf3d79a46d82a996c30428bf2/onnxruntime-1.6.0/onnxruntime/core/framework/bfc_arena.cc:473 void onnxruntime::BFCArena::RemoveFreeChunkFromBin(onnxruntime::BFCArena::ChunkHandle) BinFromIndex(c->bin_num)->free_chunks.erase(h) > 0 was false. Could not find chunk in bin

with stack traces:

#13 0x000001009820d518 in onnxruntime::BFCArena::Free(void*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-28-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#14 0x000001009826516c in onnxruntime::Tensor::~Tensor() () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-28-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#15 0x0000010098215278 in void onnxruntime::Delete<onnxruntime::Tensor>(void*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-28-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#16 0x0000010097daf988 in std::_Sp_counted_deleter<void*, void (*)(void*), std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-28-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#17 0x0000010098223fcc in onnxruntime::IExecutionFrame::ReleaseMLValueImpl(int) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-28-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#18 0x0000010098224200 in onnxruntime::ExecutionFrame::ReleaseMLValueImpl(int) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-28-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#19 0x0000010098222b3c in onnxruntime::IExecutionFrame::ReleaseMLValue(int) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-28-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#20 0x000001009828c458 in onnxruntime::SequentialExecutor::Execute(onnxruntime::SessionState const&, std::vector<int, std::allocator<int> > const&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<int, std::allocator<int> > const&, std::vector<OrtValue, std::allocator<OrtValue> >&, std::unordered_map<unsigned long, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)>, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)> > > > const&, onnxruntime::logging::Logger const&) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-28-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#21 0x000001009827b010 in onnxruntime::utils::ExecuteGraphImpl(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager const&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<OrtValue, std::allocator<OrtValue> >&, std::unordered_map<unsigned long, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)>, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)> > > > const&, ExecutionMode, bool const&, onnxruntime::logging::Logger const&, bool) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-28-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#22 0x000001009827c6f0 in onnxruntime::utils::ExecuteGraph(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<OrtValue, std::allocator<OrtValue> >&, ExecutionMode, bool const&, onnxruntime::logging::Logger const&, bool) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-28-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#23 0x0000010097df179c in onnxruntime::InferenceSession::Run(OrtRunOptions const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<OrtValue, std::allocator<OrtValue> >*, std::vector<OrtDevice, std::allocator<OrtDevice> > const*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-28-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#24 0x0000010097db8e34 in OrtApis::Run(OrtSession*, OrtRunOptions const*, char const* const*, OrtValue const* const*, unsigned long, char const* const*, unsigned long, OrtValue**) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-28-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#25 0x0000010097d235e0 in cms::Ort::ONNXRuntime::run(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::vector<float, std::allocator<float> >, std::allocator<std::vector<float, std::allocator<float> > > >&, std::vector<std::vector<long, std::allocator<long> >, std::allocator<std::vector<long, std::allocator<long> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, long) const () from /cvmfs/cms-ib.cern.ch/nweek-02670/slc7_aarch64_gcc9/cms/cmssw/CMSSW_11_3_X_2021-02-28-0000/lib/slc7_aarch64_gcc9/libPhysicsToolsONNXRuntime.so
#26 0x0000010097ca48e0 in BoostedJetONNXJetTagsProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02670/slc7_aarch64_gcc9/cms/cmssw/CMSSW_11_3_X_2021-02-28-0000/lib/slc7_aarch64_gcc9/pluginRecoBTagONNXRuntimePlugins.so

or just a segmentation fault:

#5  0x000001003d735b38 in std::_Rb_tree_insert_and_rebalance (__insert_left=<optimized out>, __x=<optimized out>, __p=<optimized out>, __header=...) at ../../../../../libstdc++-v3/src/c++98/tree.cc:282
#6  0x00000101e87cb410 in std::pair<std::_Rb_tree_iterator<unsigned long>, bool> std::_Rb_tree<unsigned long, unsigned long, std::_Identity<unsigned long>, onnxruntime::BFCArena::Bin::ChunkComparator, std::allocator<unsigned long> >::_M_insert_unique<unsigned long const&>(unsigned long const&) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-28-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#7  0x00000101e87cb53c in onnxruntime::BFCArena::InsertFreeChunkIntoBin(unsigned long) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-28-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#8  0x00000101e87cc994 in onnxruntime::BFCArena::FreeAndMaybeCoalesce(unsigned long) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-28-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#9  0x00000101e87ccbd0 in onnxruntime::BFCArena::DeallocateRawInternal(void*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-28-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#10 0x00000101e87cd4e8 in onnxruntime::BFCArena::Free(void*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-28-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#11 0x00000101e882516c in onnxruntime::Tensor::~Tensor() () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-28-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#12 0x00000101e87d5278 in void onnxruntime::Delete<onnxruntime::Tensor>(void*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-28-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#13 0x00000101e836f988 in std::_Sp_counted_deleter<void*, void (*)(void*), std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-28-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#14 0x00000101e87e3fcc in onnxruntime::IExecutionFrame::ReleaseMLValueImpl(int) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-28-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#15 0x00000101e87e4200 in onnxruntime::ExecutionFrame::ReleaseMLValueImpl(int) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-28-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#16 0x00000101e87e2b3c in onnxruntime::IExecutionFrame::ReleaseMLValue(int) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-28-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#17 0x00000101e884c458 in onnxruntime::SequentialExecutor::Execute(onnxruntime::SessionState const&, std::vector<int, std::allocator<int> > const&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<int, std::allocator<int> > const&, std::vector<OrtValue, std::allocator<OrtValue> >&, std::unordered_map<unsigned long, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)>, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)> > > > const&, onnxruntime::logging::Logger const&) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-28-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#18 0x00000101e883b010 in onnxruntime::utils::ExecuteGraphImpl(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager const&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<OrtValue, std::allocator<OrtValue> >&, std::unordered_map<unsigned long, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)>, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)> > > > const&, ExecutionMode, bool const&, onnxruntime::logging::Logger const&, bool) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-28-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#19 0x00000101e883c6f0 in onnxruntime::utils::ExecuteGraph(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<OrtValue, std::allocator<OrtValue> >&, ExecutionMode, bool const&, onnxruntime::logging::Logger const&, bool) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-28-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#20 0x00000101e83b179c in onnxruntime::InferenceSession::Run(OrtRunOptions const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<OrtValue, std::allocator<OrtValue> >*, std::vector<OrtDevice, std::allocator<OrtDevice> > const*) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-28-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#21 0x00000101e8378e34 in OrtApis::Run(OrtSession*, OrtRunOptions const*, char const* const*, OrtValue const* const*, unsigned long, char const* const*, unsigned long, OrtValue**) () from /cvmfs/cms-ib.cern.ch/week0/slc7_aarch64_gcc9/cms/cmssw-patch/CMSSW_11_3_X_2021-02-28-2300/external/slc7_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#22 0x00000101e82e35e0 in cms::Ort::ONNXRuntime::run(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::vector<float, std::allocator<float> >, std::allocator<std::vector<float, std::allocator<float> > > >&, std::vector<std::vector<long, std::allocator<long> >, std::allocator<std::vector<long, std::allocator<long> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, long) const () from /cvmfs/cms-ib.cern.ch/nweek-02670/slc7_aarch64_gcc9/cms/cmssw/CMSSW_11_3_X_2021-02-28-0000/lib/slc7_aarch64_gcc9/libPhysicsToolsONNXRuntime.so
#23 0x00000101e82648e0 in BoostedJetONNXJetTagsProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02670/slc7_aarch64_gcc9/cms/cmssw/CMSSW_11_3_X_2021-02-28-0000/lib/slc7_aarch64_gcc9/pluginRecoBTagONNXRuntimePlugins.so

In these there are frequently multiple threads raising the same exception.

@dan131riley
Author

I have a theory about the onnxruntime crashes, which I'm going to throw out here for comments while I try to reproduce the problem (as the first step to implementing a fix). Many of the crashes and assertion failures point back to IExecutionFrame::ReleaseMLValueImpl:

Status IExecutionFrame::ReleaseMLValueImpl(int ort_value_idx) {
  if (ort_value_idx == NodeIndexInfo::kInvalidEntry || static_cast<size_t>(ort_value_idx) >= all_values_size_) {
    return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT, "invalid index ", ort_value_idx);
  }


  // If fence is available, check whether async read has completed or not.
  Fence_t fence = GetMLValue(ort_value_idx).Fence();
  if (fence && !fence->CanRelease()) {
    // Async data reading is not done yet, defer mem release until Session.run() end.
    return Status::OK();
  }


  all_values_[ort_value_idx] = OrtValue();
  return Status::OK();
}

As I understand the use of onnxruntime (which is not very well), all the threads are sharing an execution context (not sure if that's the right term), so all_values_ can be modified by multiple threads. all_values_ is a (fixed size)

std::vector<OrtValue> all_values_;

and now the critical point: OrtValue has a std::shared_ptr<void>. Modifying the same smart pointer object from different threads is not thread safe, which is why there are std::shared_ptr overloads of std::atomic_store (or, in C++20, std::atomic<std::shared_ptr<T>>).

So my guess is that OrtValue needs copy constructors/assignment operators etc. that use atomic_store.

Plausible?
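
To make the idea concrete, here is a minimal sketch, not the onnxruntime code or any actual patch. `Slot` is a hypothetical stand-in for OrtValue, and `ReleaseSlot`, `ReadSlot` and `ReleaseFromManyThreads` are names invented for illustration. It uses the std::shared_ptr overloads of std::atomic_store and std::atomic_load (available since C++11, deprecated in C++20 in favour of std::atomic<std::shared_ptr<T>>), so that several threads can reset the same slot without racing on the pointer itself.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical stand-in for OrtValue: a slot that owns its payload
// through a std::shared_ptr<void>.
struct Slot {
  std::shared_ptr<void> data;
};

// Reset the slot via the shared_ptr overload of std::atomic_store, so that
// concurrent resets of the same slot do not race on the pointer object.
inline void ReleaseSlot(Slot& slot) {
  std::atomic_store(&slot.data, std::shared_ptr<void>());
}

// Read the slot via the matching std::atomic_load overload.
inline std::shared_ptr<void> ReadSlot(const Slot& slot) {
  return std::atomic_load(&slot.data);
}

// Release one slot from several threads at once and report how many times
// the payload's deleter ran.
int ReleaseFromManyThreads(int num_threads) {
  std::atomic<int> deleter_calls{0};
  Slot slot;
  slot.data = std::shared_ptr<void>(new int(42), [&deleter_calls](int* p) {
    delete p;
    ++deleter_calls;
  });

  std::vector<std::thread> threads;
  for (int i = 0; i < num_threads; ++i) {
    threads.emplace_back([&slot] { ReleaseSlot(slot); });
  }
  for (auto& t : threads) {
    t.join();
  }
  return deleter_calls.load();
}
```

With the atomic overloads, the payload is destroyed exactly once however many threads reset the slot. With a plain assignment such as `slot.data = std::shared_ptr<void>()` from several threads, the behavior is undefined.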

@dan131riley
Author

It looks like I was correct that the problem is at least partially simultaneous access to the OrtValue from different threads, which means simultaneous access to the same shared pointer from different threads. I've opened cms-externals/onnxruntime#5 to address the issue.

@dan131riley
Author

It looks like cms-externals/onnxruntime#6 helps, but I'm still seeing assertion failures, stack trace below. This stack trace should be impossible. There are three threads in BFCArena::Free() according to the stack trace; the assertion failure is in BFCArena::RemoveFreeChunkFromBin(), so either functions have been inlined or the compiler has done tail-call optimization to remove stack frames. The problem is that BFCArena::Free() ought to be holding a locked mutex:

void BFCArena::Free(void* p) {
  if (p == nullptr) {
    return;
  }
  std::lock_guard<OrtMutex> lock(lock_);
  auto it = reserved_chunks_.find(p);
  if (it != reserved_chunks_.end()) {
    device_allocator_->Free(it->first);
    stats_.bytes_in_use -= it->second;
    stats_.total_allocated_bytes -= it->second;
    reserved_chunks_.erase(it);
  } else {
    DeallocateRawInternal(p);
  }
}

and I have other stack traces from before the shared pointer fixes where there are threads clearly obeying the lock:

Thread 5 (Thread 0x3ff33af84a0 (LWP 40891)):
#0  0x000003ffa55be310 in syscall () from /lib64/libc.so.6
#1  0x000003ff23541a98 in nsync::nsync_mu_semaphore_p(nsync::nsync_semaphore_s_*) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#2  0x000003ff23540e54 in nsync::nsync_mu_lock_slow_(nsync::nsync_mu_s_*, nsync::waiter*, unsigned int, nsync::lock_type_s*) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#3  0x000003ff23540f74 in nsync::nsync_mu_lock(nsync::nsync_mu_s_*) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#4  0x000003ff232b5de4 in onnxruntime::BFCArena::Free(void*) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0

so there's some code path where the lock is getting ignored or released early.

Stack trace:

terminate called after throwing an instance of 'onnxruntime::OnnxRuntimeException'
terminate called recursively
  what():  /tmp/dsr/onnxruntime/onnxruntime/core/framework/bfc_arena.cc:472 void onnxruntime::BFCArena::RemoveFreeChunkFromBin(onnxruntime::BFCArena::ChunkHandle) !c->in_use() && (c->bin_num != kInvalidBinNum) was false. 


Thread 5 (Thread 0x3ff1c4984a0 (LWP 25261)):
#0  0x000003ff8df82c1c in raise () from /lib64/libc.so.6
#1  0x000003ff8df707a8 in abort () from /lib64/libc.so.6
#2  0x000003ff8e2b06f8 in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#3  0x000003ff8e2ae41c in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#4  0x000003ff8e2ad434 in __cxa_call_terminate (ue_header=ue_header@entry=0x3fe6945eda0) at ../../../../libstdc++-v3/libsupc++/eh_call.cc:54
#5  0x000003ff8e2adbd0 in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=6, exception_class=<optimized out>, ue_header=0x3fe6945eda0, context=0x3ff1c495770) at ../../../../libstdc++-v3/libsupc++/eh_personality.cc:677
#6  0x000003ff8e11d704 in _Unwind_RaiseException_Phase2 (exc=exc@entry=0x3fe6945eda0, context=context@entry=0x3ff1c495770, frames_p=frames_p@entry=0x3ff1c4953a8) at ../../../libgcc/unwind.inc:64
#7  0x000003ff8e11dcc4 in _Unwind_Resume (exc=0x3fe6945eda0) at ../../../libgcc/unwind.inc:241
#8  0x000003ff0bc11d48 in onnxruntime::BFCArena::Free(void*) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#9  0x000003ff0bc6c404 in onnxruntime::Tensor::~Tensor() () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#10 0x000003ff0bc19bd8 in void onnxruntime::Delete<onnxruntime::Tensor>(void*) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#11 0x000003ff0b7eecc0 in std::_Sp_counted_deleter<void*, void (*)(void*), std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#12 0x000003ff0bc28f84 in onnxruntime::IExecutionFrame::ReleaseMLValueImpl(int) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#13 0x000003ff0bc294a0 in onnxruntime::ExecutionFrame::ReleaseMLValueImpl(int) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#14 0x000003ff0bc27a5c in onnxruntime::IExecutionFrame::ReleaseMLValue(int) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#15 0x000003ff0bc94f98 in onnxruntime::SequentialExecutor::Execute(onnxruntime::SessionState const&, std::vector<int, std::allocator<int> > const&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<int, std::allocator<int> > const&, std::vector<OrtValue, std::allocator<OrtValue> >&, std::unordered_map<unsigned long, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)>, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)> > > > const&, onnxruntime::logging::Logger const&) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#16 0x000003ff0bc82a00 in onnxruntime::utils::ExecuteGraphImpl(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager const&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<OrtValue, std::allocator<OrtValue> >&, std::unordered_map<unsigned long, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)>, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)> > > > const&, ExecutionMode, bool const&, onnxruntime::logging::Logger const&, bool) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#17 0x000003ff0bc84d90 in onnxruntime::utils::ExecuteGraph(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<OrtValue, std::allocator<OrtValue> >&, ExecutionMode, bool const&, onnxruntime::logging::Logger const&, bool) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#18 0x000003ff0b830ed0 in onnxruntime::InferenceSession::Run(OrtRunOptions const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<OrtValue, std::allocator<OrtValue> >*, std::vector<OrtDevice, std::allocator<OrtDevice> > const*) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#19 0x000003ff0b7f82bc in OrtApis::Run(OrtSession*, OrtRunOptions const*, char const* const*, OrtValue const* const*, unsigned long, char const* const*, unsigned long, OrtValue**) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#20 0x000003ff0c0a3590 in cms::Ort::ONNXRuntime::run(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::vector<float, std::allocator<float> >, std::allocator<std::vector<float, std::allocator<float> > > >&, std::vector<std::vector<long, std::allocator<long> >, std::allocator<std::vector<long, std::allocator<long> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, long) const () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/lib/cc8_aarch64_gcc9/libPhysicsToolsONNXRuntime.so
#21 0x000003ff0c104780 in BoostedJetONNXJetTagsProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/cc8_aarch64_gcc9/cms/cmssw/CMSSW_11_3_X_2021-04-09-2300/lib/cc8_aarch64_gcc9/pluginRecoBTagONNXRuntimePlugins.so

Thread 4 (Thread 0x3ff1cea84a0 (LWP 25260)):
#0  0x000003ff8df82c1c in raise () from /lib64/libc.so.6
#1  0x000003ff8df707a8 in abort () from /lib64/libc.so.6
#2  0x000003ff8e2b0694 in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:50
#3  0x000003ff8e2ae41c in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#4  0x000003ff8e2ad434 in __cxa_call_terminate (ue_header=ue_header@entry=0x3fe59815160) at ../../../../libstdc++-v3/libsupc++/eh_call.cc:54
#5  0x000003ff8e2adbd0 in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=6, exception_class=<optimized out>, ue_header=0x3fe59815160, context=0x3ff1cea5770) at ../../../../libstdc++-v3/libsupc++/eh_personality.cc:677
#6  0x000003ff8e11d704 in _Unwind_RaiseException_Phase2 (exc=exc@entry=0x3fe59815160, context=context@entry=0x3ff1cea5770, frames_p=frames_p@entry=0x3ff1cea53a8) at ../../../libgcc/unwind.inc:64
#7  0x000003ff8e11dcc4 in _Unwind_Resume (exc=0x3fe59815160) at ../../../libgcc/unwind.inc:241
#8  0x000003ff0bc11d48 in onnxruntime::BFCArena::Free(void*) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#9  0x000003ff0bc6c404 in onnxruntime::Tensor::~Tensor() () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#10 0x000003ff0bc19bd8 in void onnxruntime::Delete<onnxruntime::Tensor>(void*) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#11 0x000003ff0b7eecc0 in std::_Sp_counted_deleter<void*, void (*)(void*), std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#12 0x000003ff0bc28f84 in onnxruntime::IExecutionFrame::ReleaseMLValueImpl(int) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#13 0x000003ff0bc294a0 in onnxruntime::ExecutionFrame::ReleaseMLValueImpl(int) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#14 0x000003ff0bc27a5c in onnxruntime::IExecutionFrame::ReleaseMLValue(int) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#15 0x000003ff0bc94f98 in onnxruntime::SequentialExecutor::Execute(onnxruntime::SessionState const&, std::vector<int, std::allocator<int> > const&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<int, std::allocator<int> > const&, std::vector<OrtValue, std::allocator<OrtValue> >&, std::unordered_map<unsigned long, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)>, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)> > > > const&, onnxruntime::logging::Logger const&) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#16 0x000003ff0bc82a00 in onnxruntime::utils::ExecuteGraphImpl(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager const&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<OrtValue, std::allocator<OrtValue> >&, std::unordered_map<unsigned long, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)>, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)> > > > const&, ExecutionMode, bool const&, onnxruntime::logging::Logger const&, bool) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#17 0x000003ff0bc84d90 in onnxruntime::utils::ExecuteGraph(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<OrtValue, std::allocator<OrtValue> >&, ExecutionMode, bool const&, onnxruntime::logging::Logger const&, bool) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#18 0x000003ff0b830ed0 in onnxruntime::InferenceSession::Run(OrtRunOptions const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<OrtValue, std::allocator<OrtValue> >*, std::vector<OrtDevice, std::allocator<OrtDevice> > const*) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#19 0x000003ff0b7f82bc in OrtApis::Run(OrtSession*, OrtRunOptions const*, char const* const*, OrtValue const* const*, unsigned long, char const* const*, unsigned long, OrtValue**) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#20 0x000003ff0c0a3590 in cms::Ort::ONNXRuntime::run(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::vector<float, std::allocator<float> >, std::allocator<std::vector<float, std::allocator<float> > > >&, std::vector<std::vector<long, std::allocator<long> >, std::allocator<std::vector<long, std::allocator<long> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, long) const () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/lib/cc8_aarch64_gcc9/libPhysicsToolsONNXRuntime.so
#21 0x000003ff0c104780 in BoostedJetONNXJetTagsProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/cc8_aarch64_gcc9/cms/cmssw/CMSSW_11_3_X_2021-04-09-2300/lib/cc8_aarch64_gcc9/pluginRecoBTagONNXRuntimePlugins.so

Thread 3 (Thread 0x3ff1d8b84a0 (LWP 25259)):
#0  0x000003ff8df82c1c in raise () from /lib64/libc.so.6
#1  0x000003ff8df707a8 in abort () from /lib64/libc.so.6
#2  0x000003ff8e2b0694 in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:50
#3  0x000003ff8e2ae41c in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#4  0x000003ff8e2ad434 in __cxa_call_terminate (ue_header=ue_header@entry=0x3fed6de7c20) at ../../../../libstdc++-v3/libsupc++/eh_call.cc:54
#5  0x000003ff8e2adbd0 in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=6, exception_class=<optimized out>, ue_header=0x3fed6de7c20, context=0x3ff1d8b5770) at ../../../../libstdc++-v3/libsupc++/eh_personality.cc:677
#6  0x000003ff8e11d704 in _Unwind_RaiseException_Phase2 (exc=exc@entry=0x3fed6de7c20, context=context@entry=0x3ff1d8b5770, frames_p=frames_p@entry=0x3ff1d8b53a8) at ../../../libgcc/unwind.inc:64
#7  0x000003ff8e11dcc4 in _Unwind_Resume (exc=0x3fed6de7c20) at ../../../libgcc/unwind.inc:241
#8  0x000003ff0bc11d48 in onnxruntime::BFCArena::Free(void*) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#9  0x000003ff0bc6c404 in onnxruntime::Tensor::~Tensor() () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#10 0x000003ff0bc19bd8 in void onnxruntime::Delete<onnxruntime::Tensor>(void*) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#11 0x000003ff0b7eecc0 in std::_Sp_counted_deleter<void*, void (*)(void*), std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#12 0x000003ff0bc28f84 in onnxruntime::IExecutionFrame::ReleaseMLValueImpl(int) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#13 0x000003ff0bc294a0 in onnxruntime::ExecutionFrame::ReleaseMLValueImpl(int) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#14 0x000003ff0bc27a5c in onnxruntime::IExecutionFrame::ReleaseMLValue(int) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#15 0x000003ff0bc94f98 in onnxruntime::SequentialExecutor::Execute(onnxruntime::SessionState const&, std::vector<int, std::allocator<int> > const&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<int, std::allocator<int> > const&, std::vector<OrtValue, std::allocator<OrtValue> >&, std::unordered_map<unsigned long, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)>, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)> > > > const&, onnxruntime::logging::Logger const&) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#16 0x000003ff0bc82a00 in onnxruntime::utils::ExecuteGraphImpl(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager const&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<OrtValue, std::allocator<OrtValue> >&, std::unordered_map<unsigned long, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)>, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, std::function<onnxruntime::common::Status (onnxruntime::TensorShape const&, OrtMemoryInfo const&, OrtValue&, bool&)> > > > const&, ExecutionMode, bool const&, onnxruntime::logging::Logger const&, bool) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#17 0x000003ff0bc84d90 in onnxruntime::utils::ExecuteGraph(onnxruntime::SessionState const&, onnxruntime::FeedsFetchesManager&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<OrtValue, std::allocator<OrtValue> >&, ExecutionMode, bool const&, onnxruntime::logging::Logger const&, bool) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#18 0x000003ff0b830ed0 in onnxruntime::InferenceSession::Run(OrtRunOptions const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<OrtValue, std::allocator<OrtValue> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<OrtValue, std::allocator<OrtValue> >*, std::vector<OrtDevice, std::allocator<OrtDevice> > const*) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#19 0x000003ff0b7f82bc in OrtApis::Run(OrtSession*, OrtRunOptions const*, char const* const*, OrtValue const* const*, unsigned long, char const* const*, unsigned long, OrtValue**) () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/external/cc8_aarch64_gcc9/lib/libonnxruntime.so.1.6.0
#20 0x000003ff0c0a3590 in cms::Ort::ONNXRuntime::run(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::vector<std::vector<float, std::allocator<float> >, std::allocator<std::vector<float, std::allocator<float> > > >&, std::vector<std::vector<long, std::allocator<long> >, std::allocator<std::vector<long, std::allocator<long> > > > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, long) const () from /tmp/dsr/CMSSW_11_3_X_2021-04-09-2300/lib/cc8_aarch64_gcc9/libPhysicsToolsONNXRuntime.so
#21 0x000003ff0c104780 in BoostedJetONNXJetTagsProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/cc8_aarch64_gcc9/cms/cmssw/CMSSW_11_3_X_2021-04-09-2300/lib/cc8_aarch64_gcc9/pluginRecoBTagONNXRuntimePlugins.so

@dan131riley
Author

dan131riley commented Apr 15, 2021

New theory: onnxruntime uses the Google nsync library for locks, not the C++ standard library implementations. For arm64 they have an assembler implementation of their synchronization primitives. The ARM synchronization primitives leave a lot of behavior implementation defined:

Other events can clear a global exclusive monitor, but they are implementation defined and portable code must not rely on them

The Cavium ThunderX cores in the techlab systems are not based on any standard design (e.g., Cortex) and are known to have made some unconventional choices with respect to implementation-defined behavior. So my current guess is that the nsync primitives are relying on common choices for implementation-defined behavior, and thus aren't reliable on the ThunderX. This could explain the intermittent lock failures with nominally impossible stack traces, and also why the crashes don't replicate at all on my Apple Silicon M1.
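To illustrate what "must not rely on them" means for portable code, here is a minimal sketch of a lock built only on standard C++ atomics. It is not nsync's implementation; the class name `RetryingSpinLock` is hypothetical. The point is that `compare_exchange_weak` is allowed to fail spuriously (on AArch64 without the newer atomic instructions it is typically compiled to a load-exclusive/store-exclusive pair), so the acquire path has to retry rather than assume when the exclusive monitor is or is not cleared.

```cpp
#include <atomic>
#include <thread>

// Hypothetical illustration only; this is not how nsync implements its locks.
class RetryingSpinLock {
 public:
  void lock() {
    bool expected = false;
    // compare_exchange_weak may fail even when locked_ is false (for example,
    // when a store-exclusive loses its monitor), so it must sit in a loop.
    while (!locked_.compare_exchange_weak(expected, true,
                                          std::memory_order_acquire,
                                          std::memory_order_relaxed)) {
      expected = false;  // a failed exchange overwrites expected with the observed value
      std::this_thread::yield();
    }
  }

  // Release ordering publishes the writes made inside the critical section.
  void unlock() { locked_.store(false, std::memory_order_release); }

 private:
  std::atomic<bool> locked_{false};
};
```

A lock written this way makes no assumption about which events clear the monitor; a spurious failure only costs an extra loop iteration.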

Currently testing with

#include <mutex>
#include <condition_variable>
namespace onnxruntime {
using OrtMutex = std::mutex;
using OrtCondVar = std::condition_variable;
}  // namespace onnxruntime

which replaces the nsync implementations with the ones from the standard library.
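A rough way to sanity-check whichever mutex type ends up behind `OrtMutex` is a counter stress test along these lines. This is only a sketch: `run_counter_stress` is a hypothetical helper, not part of onnxruntime or CMSSW, and passing it does not prove a lock is correct; a wrong final count would only show that mutual exclusion failed at least once.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: hammer one mutex from several threads and return the
// final value of the counter it protects.
template <typename Mutex>
long run_counter_stress(int num_threads, long iterations_per_thread) {
  Mutex mu;
  long counter = 0;  // deliberately non-atomic: only the mutex protects it
  std::vector<std::thread> workers;
  workers.reserve(static_cast<std::size_t>(num_threads));
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&] {
      for (long i = 0; i < iterations_per_thread; ++i) {
        std::lock_guard<Mutex> guard(mu);
        ++counter;
      }
    });
  }
  for (auto& w : workers) w.join();
  // With a working lock this equals num_threads * iterations_per_thread.
  return counter;
}
```

For example, `run_counter_stress<std::mutex>(8, 1000000)` should return 8000000; the template parameter can be any type that provides `lock()` and `unlock()`.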

@dan131riley
Author

cms-externals/onnxruntime#7 should fix the underlying problem that the Google nsync library is not reliable on the Cavium ThunderX techlab systems. cms-externals/onnxruntime#6 only addressed the symptoms, which fortunately narrowed the problem enough to make it obvious that it was the underlying synchronization library that was broken.

@mrodozov
Contributor

The new theory seems to be correct (🤞).
The latest Arm build, which is finishing the 5of5 job (12 hours so far),
https://cmssdt.cern.ch/SDT/html/cmssdt-ib/#/relVal/CMSSW_12_0/2021-04-20-2300?selectedArchs=slc7_aarch64_gcc9&selectedFlavors=X&selectedStatus=failed
has only 1 relval failing, 280, as reported in #33452.
Should we notify the MS people by opening an issue about this?

@makortel
Contributor

I understood it was @dan131riley's plan to make a PR (or issue) to upstream ONNXruntime. In the meantime we can probably close this issue?

@dan131riley
Author

Yes, I think we've seen enough IBs to declare this fixed.

@makortel
Contributor

+1
