OS X witness node crash when placing an order #582

Closed
crazybits opened this issue Jan 17, 2018 · 34 comments

@crazybits
Contributor

Segmentation fault: 11

The OS X witness node crashes when placing an order; the line above is the only output in the console. If any further information is required, please let me know.

@abitmore
Member

If this were Ubuntu, I would run the node in gdb, trigger the segmentation fault, then get the output of thread apply all bt. Not sure whether the same applies to OS X.
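
For reference, a minimal sketch of that gdb procedure; the binary path and data directory are placeholders, not taken from this issue:

gdb --args ./witness_node --data-dir /path/to/data-dir
(gdb) run
...trigger the segfault, e.g. by placing an order against this node...
(gdb) thread apply all bt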

@crazybits
Contributor Author

crazybits commented Jan 28, 2018

Please find the log below for further investigation, thanks.

Thread 12 (Thread 0x2903 of process 517):
#0 0x00007fff5c58dec2 in ?? ()
#1 0x00000001006b563d in boost::asio::detail::kqueue_reactor::run(bool, boost::asio::detail::op_queue<boost::asio::detail::task_io_service_operation>&) ()
#2 0x00000001006b5048 in boost::asio::detail::task_io_service::do_run_one(boost::asio::detail::scoped_lock<boost::asio::detail::posix_mutex>&, boost::asio::detail::task_io_service_thread_info&, boost::system::error_code const&) ()
#3 0x00000001006b4d15 in boost::asio::detail::task_io_service::run(boost::system::error_code&) ()
#4 0x00000001006b456e in fc::asio::default_io_service_scope::default_io_service_scope()::{lambda()#1}::operator()() const ()
#5 0x0000000100958664 in boost::(anonymous namespace)::thread_proxy (param=) at libs/thread/src/pthread/thread.cpp:167
#6 0x00007fff5c6c86c1 in ?? ()
#7 0x0000000000000001 in ?? ()
#8 0x0000700000676000 in ?? ()
#9 0x00000000180008ff in ?? ()
#10 0x0000000103e24180 in ?? ()
#11 0x0000700000675f50 in ?? ()
#12 0x00007fff5c6c856d in ?? ()
#13 0x0000000000000000 in ?? ()

Thread 11 (Thread 0x2803 of process 517):
#0 0x00007fff5c58ccee in ?? ()
#1 0x00007fff5c6c9662 in ?? ()
#2 0x0000000000000000 in ?? ()

Thread 10 (Thread 0x2703 of process 517):
#0 0x00007fff5c58ccee in ?? ()
#1 0x00007fff5c6c9662 in ?? ()
#2 0x0000000000000000 in ?? ()

Thread 9 (Thread 0x2603 of process 517):
#0 0x00007fff5c58ccee in ?? ()
#1 0x00007fff5c6c9662 in ?? ()
#2 0x0000000000000000 in ?? ()

Thread 8 (Thread 0x2503 of process 517):
#0 0x00007fff5c58ccee in ?? ()
#1 0x00007fff5c6c9662 in ?? ()
#2 0x0000000000000000 in ?? ()

Thread 7 (Thread 0x1703 of process 517):
#0 0x00007fff5c58ccee in ?? ()
#1 0x00007fff5c6c9662 in ?? ()
#2 0x0000000000000000 in ?? ()

Thread 6 (Thread 0x1603 of process 517):
---Type <return> to continue, or q <return> to quit---
#0 0x00007fff5c58ccee in ?? ()
#1 0x00007fff5c6c9662 in ?? ()
#2 0x0000000000000000 in ?? ()

Thread 5 (Thread 0x1503 of process 517):
#0 0x00007fff5c58ccee in ?? ()
#1 0x00007fff5c6c9662 in ?? ()
#2 0x0000000000000000 in ?? ()

Thread 4 (Thread 0x1107 of process 517):
#0 0x00007fff5c58ccee in ?? ()
#1 0x00007fff5c6c9662 in ?? ()
#2 0x0000000000000000 in ?? ()

Thread 3 (Thread 0x1003 of process 517):
#0 0x00007fff5c58ccee in ?? ()
#1 0x00007fff5c6c9662 in ?? ()
#2 0x0000000000000000 in ?? ()

Thread 2 (Thread 0xf03 of process 517):
#0 0x0000000100023f1d in graphene::app::network_broadcast_api::on_applied_block(graphene::chain::signed_block const&) ()
#1 0x0000000100430be3 in boost::signals2::detail::slot_call_iterator_t<boost::signals2::detail::variadic_slot_invoker<boost::signals2::detail::void_type, graphene::chain::signed_block const&>, std::__1::__list_iterator<boost::shared_ptr<boost::signals2::detail::connection_body<std::__1::pair<boost::signals2::detail::slot_meta_group, boost::optional >, boost::signals2::slot<void (graphene::chain::signed_block const&), boost::function<void (graphene::chain::signed_block const&)> >, boost::signals2::mutex> >, void*>, boost::signals2::detail::connection_body<std::__1::pair<boost::signals2::detail::slot_meta_group, boost::optional >, boost::signals2::slot<void (graphene::chain::signed_block const&), boost::function<void (graphene::chain::signed_block const&)> >, boost::signals2::mutex> >::dereference() const ()
#2 0x0000000100430868 in boost::signals2::detail::signal_impl<void (graphene::chain::signed_block const&), boost::signals2::optional_last_value, int, std::__1::less, boost::function<void (graphene::chain::signed_block const&)>, boost::function<void (boost::signals2::connection const&, graphene::chain::signed_block const&)>, boost::signals2::mutex>::operator()(graphene::chain::signed_block const&) ()
#3 0x00000001003ebb27 in graphene::chain::database::_apply_block(graphene::chain::signed_block const&) ()
#4 0x00000001003e02a1 in graphene::chain::database::apply_block(graphene::chain::signed_block const&, unsigned int) ()
#5 0x00000001003dd23f in graphene::chain::database::_push_block(graphene::chain::signed_block const&) ()
#6 0x00000001003dcce1 in graphene::chain::database::push_block(graphene::chain::signed_block const&, unsigned int) ()
#7 0x0000000100043677 in graphene::app::detail::application_impl::handle_block(graphene::net::block_message const&, bool, std::__1::vector<fc::ripemd160, std::__1::allocator<fc::ripemd160> >&) ()
#8 0x000000010093440b in fc::detail::functor_run<graphene::net::detail::statistics_gathering_node_delegate_wrapper::handle_block(graphene::net::block_message const&, bool, std::__1::vector<fc::ripemd160, std::__1::allocator<fc::ripemd160> >&)::$_56>::run(void*, void*) ()
#9 0x00000001006ae670 in fc::task_base::run_impl() ()
#10 0x00000001006aaa31 in fc::thread_d::run_next_task() ()
#11 0x00000001006a6978 in fc::thread_d::process_tasks() ()
#12 0x00000001006a9dea in fc::thread_d::start_process_tasks(long) ()
#13 0x000000010098432a in make_fcontext () at libs/context/src/asm/make_x86_64_sysv_macho_gas.S:64
Backtrace stopped: Cannot access memory at address 0x10b340000

@abitmore
Member

Thanks for the report. Still not sure why it happened, but hopefully it won't happen again after #468 is done.

abitmore added this to the Next Non-Consensus-Changing Release milestone on Feb 2, 2018
abitmore added the bug and build (About build process) labels on Feb 2, 2018
@abitmore
Member

abitmore commented Feb 2, 2018

Perhaps related to steemit/steem#2076

@abitmore
Member

abitmore commented Feb 2, 2018

@crazybits can you check if the PR steemit/steem#2077 will fix this issue?

@cphrmky

cphrmky commented Feb 7, 2018

I'm also seeing this issue on OS X.

Any time an order is created or canceled, or a withdrawal is initiated using the crypto-bridge client when it's pointed at my local node, I get a segfault...

@abitmore
Member

@cphrmky @crazybits can you check whether #661 fixes or works around this issue? Thanks.

@jmjatlanta
Contributor

I tested this on my mac with the develop and main branches, pointing to my testnet, and could not recreate the problem. Either (1) the problem has been fixed, or (2) my configuration is different, or (3) it only happens on the mainnet (which in my mind is improbable and would take a long time for me to test). My guess is (1).

If this is still happening for someone, please provide me with more information and I'll take another look. I need the versions of the OS, Boost, and OpenSSL, and the bitshares-core commit.

@abitmore
Member

@jmjatlanta try mainnet please. Thanks.

@jmjatlanta
Contributor

jmjatlanta commented Feb 27, 2018

Macbook Pro running High Sierra (10.13.3)
clang-900.0.39.2
OpenSSL 1.0.2n
Boost 1.60
Latest code from master branch (commit 96c0c27)
pointing to mainnet

The witness_node is running locally, started as:
witness_node --data-dir /my/data/dir --rpc-endpoint 127.0.0.1:8090 --max-ops-per-account 1000 --partial-operations true
and the cli_wallet running locally and pointing to my witness node.

I was able to create an order (limit order, STEALTH to BTS) without causing the witness node to crash. Please post your setup, and any details on how to recreate the issue, and I will attempt to figure out the cause. Thanks.
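
For completeness, a rough sketch of the wallet side of such a test; the account name, password, and amounts are made up for illustration, and the commands are the standard graphene cli_wallet ones rather than anything specific to this report:

cli_wallet -s ws://127.0.0.1:8090
unlock my-wallet-password
sell_asset my-test-account 100 STEALTH 1000 BTS 3600 false true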

@abitmore
Member

abitmore commented Feb 27, 2018

@jmjatlanta can you publish your macOS binaries of the latest release? Then we can get some people to help test, for example by uploading them to your GitHub repo.

@jmjatlanta
Contributor

The binaries can be found here:
https://github.com/jmjatlanta/bitshares-core/releases/tag/mac-2.0.180226

@abitmore
Member

@jmjatlanta I got a report from someone running the witness_node binary:

dyld: Library not loaded: /usr/local/opt/gperftools/lib/libtcmalloc.4.dylib
Reason: image not found

Seems like those libs need to be statically linked?
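
A quick way to see which dynamic libraries a binary expects, and to satisfy the missing one locally, assuming Homebrew is available (the exact dylib version depends on the installed gperftools):

otool -L ./witness_node
brew install gperftools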

@jmjatlanta
Contributor

jmjatlanta commented Feb 28, 2018

Looking at the cmake files, it looks as if tcmalloc is optional. I have compiled without it. See the appropriate binaries at https://github.com/jmjatlanta/bitshares-core/releases/tag/mac-2.0.180226

@abitmore
Member

abitmore commented Mar 1, 2018

@jmjatlanta I got a report that the replay speed of the new without-tcmalloc binaries is extremely slow. Any idea?

@jmjatlanta
Contributor

jmjatlanta commented Mar 2, 2018

I'll attempt a static link of tcmalloc libraries, and then they can do a comparison. A comparison on my mac would be close to meaningless, as its resources are strained. I'll update here when the libraries are uploaded.

@abitmore
Member

abitmore commented Mar 2, 2018

It's not likely, but worth mentioning: make sure the binaries are not built in Debug mode.

@jmjatlanta
Contributor

jmjatlanta commented Mar 2, 2018

The statically linked binaries are available at https://github.com/jmjatlanta/bitshares-core/releases/tag/mac-2.0.180226. Let me know how it goes.

BTW: none of these were with the DEBUG flag set.

@abitmore
Member

abitmore commented Mar 2, 2018

BTW: none of these were with the DEBUG flag set.

IIRC debug mode is the default. I always run cmake with
cmake -DCMAKE_BUILD_TYPE=Release ..
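
For reference, a typical out-of-source Release build would look something like this (generic CMake usage, not instructions taken from this thread):

mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j4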

@jmjatlanta
Contributor

OK. I'll recompile, explicitly naming the build type. Coming soon...

@abitmore
Member

abitmore commented Mar 3, 2018

@jmjatlanta I got a report that your binaries crash as well, exactly as @cphrmky said:

Any time an order is created or canceled, or a withdrawal is initiated using the crypto-bridge client when it's pointed at my local node, I get a segfault...

By the way, I believe the develop branch won't crash due to #661. So it would help if you could provide binaries built from the develop branch.

Thank you very much.

@abitmore
Member

abitmore commented Mar 3, 2018

@jmjatlanta I just noticed one thing: others crashed witness_node via the light wallet, not the cli_wallet.

@jmjatlanta
Contributor

jmjatlanta commented Mar 3, 2018

Ahhh. I was testing with the cli_wallet. I've provided binaries with tcmalloc statically linked for both the master and develop branches, and will now test them myself with the light wallet. Sorry for the slow response; I'm fighting the flu with all I've got.

@cphrmky

cphrmky commented Mar 3, 2018

Ping me when you've got the statically linked binaries up; I'm happy to download them and see if it's still happening.

@jmjatlanta
Contributor

They're there: https://github.com/jmjatlanta/bitshares-core/releases/tag/mac-2.0.180226
I'm having trouble syncing, and I'm unsure why. It may be something unrelated, but it is preventing me from testing the original issue.

@cphrmky

cphrmky commented Mar 3, 2018

My first attempt to sync on OS X ran fine up to a point, then at some block or other it just hung forever. By "forever" I mean multiple days of not moving beyond that one block, despite politely killing the process and restarting, brutally killing the process, restarting the computer, switching to multiple different network connections, etc.

Eventually I just did rm -rf ./* on the data directory, started it again, and it synced up without error, albeit slowish (but it's a pretty big blockchain, so no huge surprise there).

I've seen basically this exact same behavior with unrelated chains, always on OS X. I've had it happen when syncing a full Ethereum node on a Mac using geth (more than one Mac).

I've had it happen when syncing Electroneum (which is a fork of Monero) on a Mac.

If this syncing behavior is actually the same issue across these different projects (it could be purely coincidental), it would point to an issue with the Mac implementation of some C++ library they all have in common.

@abitmore
Member

abitmore commented Mar 4, 2018

@jmjatlanta the binaries you provided are not actually statically linked with tcmalloc, as they still report the "dyld: Library not loaded" error. I think the without-tcmalloc ones built in release mode would work.

@jmjatlanta
Contributor

jmjatlanta commented Mar 4, 2018

No matter what I tell (or don't tell) cmake, it always seems to want to use the dynamic libraries. I'm going to delete the shared libraries and see what it decides to do then (fly, meet sledgehammer)!

In the meantime, I compiled the develop branch without tcmalloc and in Release mode. Those files are available at the usual spot.

https://github.com/jmjatlanta/bitshares-core/releases/tag/mac-2.0.180226

I was able to replicate the issue placing a simple order using the light client and the master branch.

Using the build without tcmalloc and the develop branch, my mac slowed to a crawl before the chain was sync'd. Now trying the tcmalloc version...

@jmjatlanta
Contributor

Develop branch, using the light client, canceled an order, segfault.

582958ms th_a       application.cpp:549           handle_transaction   ] Got 6 transactions from network
585033ms th_a       application.cpp:549           handle_transaction   ] Got 4 transactions from network
585601ms th_a       application.cpp:499           handle_block         ] Got block: #24978226 time: 2018-03-06T02:09:45 latency: 601 ms from: witness.hiblockchain  irreversible: 24978205 (-21)
Segmentation fault: 11

abitmore modified the milestones: 201803 Non-Consensus-Changing Release, Next Non-Consensus-Changing Release on Mar 6, 2018
@abitmore
Member

abitmore commented Mar 6, 2018

So #661 didn't help. That's strange. Can you try to get a backtrace? Check the 2nd and 3rd comments in this issue.

By the way, I think the priority of this issue is not very high (removed from the next milestone), so perhaps put it aside if it's too hard or takes too much time.
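
On macOS the equivalent of the earlier gdb procedure would use lldb; a minimal sketch, with the paths again as placeholders:

lldb -- ./witness_node --data-dir /path/to/data-dir
(lldb) run
...reproduce the crash from the light client...
(lldb) bt all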

@abitmore
Member

abitmore commented Mar 8, 2018

Perhaps due to the unsafe threading mentioned by @pmconrad in #703.

@abitmore
Member

abitmore commented Apr 4, 2018

@jmjatlanta I guess #813 will fix this issue. Can you help check? Perhaps make a new binary with the patch so other people can help test.

abitmore removed the build (About build process) label on May 21, 2018
@abitmore
Member

@jmjatlanta please close this issue if you can confirm it's fixed by #813.

@abitmore
Member

Assuming that #813 fixed this issue. Closing. Please feel free to reopen if it's not the case.
