
Crash regression from #10362 #10901

Closed
rgs1 opened this issue Apr 22, 2020 · 12 comments
Labels: bug, no stalebot (Disables stalebot from closing an issue)
rgs1 commented Apr 22, 2020

We synced with master yesterday and are now getting a crash on startup (after a hot restart).

Here's the stack trace (not super useful):

StreamRuntime gRPC config stream closed: 14, no healthy upstream
Unable to establish new stream
StreamRuntime gRPC config stream closed: 14, no healthy upstream
Unable to establish new stream
Caught Segmentation fault, suspect faulting address 0x10
Backtrace (use tools/stack_decode.py to get line numbers):
Envoy version: 0/1.15.0-dev//RELEASE/BoringSSL
#0: __restore_rt [0x7f7cf5486890] ??:0
#1: Envoy::ThreadLocal::InstanceImpl::Bookkeeper::get() [0x562156ddfa63] ??:0

I suspect either #10362 or #10842.

The full list of changes that we picked up:

Remove hardcoded type urls Part.2 #10848
upstream: fix panic on grpc unknown_service status on healthchecks #10863
Fix Windows compilation of test sources #10822
conn_pool: unifying status codes #10854
Windows compilation: enable compiling expanded list of extensions in envoy-static #10542
logger: Make log prefix configurable #10693
stream_info: Collapse constructors #10691
coverage: revert workarounds that are no longer neccessary #10837
Update LuaJIT patch - remove MAP_32BIT #10867
filter: postgres statistics network filter #10642
api/faq: add initial API versioning FAQ entries. #10829
Catch exception and return false in cases where std::regex_match throws. #10861
redis: Fix stack-use-after-scope in test #10840
http: downstream connect support #10720
init: order dynamic resource initialization to make RTDS always be first #10362
[test] fix fuzz tests that might crash on duplicate settings params #10779
Fix clang-tidy in source/common/http/conn_manager_config.h #10860
fix: upstream grpc stats on trailers only #10842
ip tagging: remember tags as builtins #10856
Remove vendor specific dynamo filter use from HCM config test #10858
router: allow retry of streaming/incomplete requests #10725
Update filter_chain_benchmark_test.cc #10850
[admin] extract stats handlers to separate file #10750

rgs1 commented Apr 22, 2020

Maybe #10362 :

[2020-04-22 17:42:30.844][20][debug][init] [external/envoy/source/common/init/watcher_impl.cc:14] init manager RTDS initialized, notifying RDTS
[2020-04-22 17:42:30.844][20][info][runtime] [external/envoy/source/common/runtime/runtime_impl.cc:519] RTDS has finished initialization
[2020-04-22 17:42:30.844][20][debug][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:187] continue initializing secondary clusters
[2020-04-22 17:42:30.844][20][debug][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:125] maybe finish initialize state: 4
[2020-04-22 17:42:30.844][20][debug][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:133] maybe finish initialize primary init clusters empty: true
[2020-04-22 17:42:30.844][20][debug][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:141] maybe finish initialize secondary init clusters empty: true
[2020-04-22 17:42:30.844][20][debug][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:164] maybe finish initialize cds api ready: true
[2020-04-22 17:42:30.844][20][info][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:170] cm init: all clusters initialized
[2020-04-22 17:42:30.844][20][info][main] [external/envoy/source/server/server.cc:557] all clusters initialized. initializing init manager
[2020-04-22 17:42:30.844][20][debug][init] [external/envoy/source/common/init/manager_impl.cc:45] init manager Server contains no targets
[2020-04-22 17:42:30.844][20][debug][init] [external/envoy/source/common/init/watcher_impl.cc:14] init manager Server initialized, notifying RunHelper
[2020-04-22 17:42:30.844][20][info][config] [external/envoy/source/server/listener_manager_impl.cc:796] all dependencies initialized. starting workers
[2020-04-22 17:42:30.844][20][debug][config] [external/envoy/source/server/listener_manager_impl.cc:807] starting worker 0
[2020-04-22 17:42:30.844][20][debug][config] [external/envoy/source/server/listener_manager_impl.cc:807] starting worker 1
[2020-04-22 17:42:30.844][32][debug][main] [external/envoy/source/server/worker_impl.cc:111] worker entering dispatch loop
[2020-04-22 17:42:30.844][32][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:104] Caught Segmentation fault, suspect faulting address 0x10
[2020-04-22 17:42:30.844][32][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[2020-04-22 17:42:30.844][32][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:92] Envoy version: 0/1.15.0-dev//RELEASE/BoringSSL
[2020-04-22 17:42:30.844][33][debug][main] [external/envoy/source/server/worker_impl.cc:111] worker entering dispatch loop
[2020-04-22 17:42:30.844][32][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #0: __restore_rt [0x7f918b769890] ??:0

cc: @yanavlasov

mattklein123 added this to the 1.15.0 milestone Apr 22, 2020

mattklein123 commented:
@rgs1 if possible can you get a proper stack trace with debug symbols?

rgs1 commented Apr 22, 2020

Reverting #10362 indeed made it go away. I might not have time to debug this further today, so a better stack trace may need to wait unless someone else can reproduce it first (if not, I'll follow up later/tomorrow).

mattklein123 commented:
OK we should probably revert #10362 in the meantime. cc @yanavlasov

mattklein123 changed the title from "new crash" to "Crash regression from #10362" Apr 22, 2020
rgs1 commented Apr 23, 2020

Is the plan still to revert this? I might have time to look into this again later today, but in parallel we'd still like to stay close to master (without carrying a revert patch), so reverting would be nice.

mattklein123 commented:
@rgs1 can you file the revert PR? If not I can do it later.

mattklein123 added the "no stalebot" label (disables stalebot from closing an issue) Apr 23, 2020
rgs1 commented Apr 23, 2020

Yup -- coming up.

rgs1 commented Apr 23, 2020

@mattklein123 revert here: #10919.

yanavlasov commented:
@rgs1 stack trace would be very helpful or some pointers on how to reproduce this.

rgs1 commented Apr 24, 2020

For others following along:

#0: __restore_rt [0x7f840a2ec890] ??:0
#1: Envoy::ThreadLocal::InstanceImpl::Bookkeeper::get() [0x55f3e9a71a63] ??:0
#2: Envoy::ThreadLocal::Slot::getTyped<>() [0x55f3e9b2ddac] ??:0
#3: Envoy::Stats::ThreadLocalStoreImpl::ScopeImpl::counterFromStatNameWithTags() [0x55f3e9b2dae8] ??:0
#4: Envoy::Server::GuardDogImpl::WatchedDog::WatchedDog() [0x55f3e9a8b9a6] ??:0
#5: Envoy::Server::GuardDogImpl::createWatchDog() [0x55f3e9a8b548] ??:0
#6: std::_Function_handler<>::_M_invoke() [0x55f3e9af129a] ??:0
#7: Envoy::Event::DispatcherImpl::runPostCallbacks() [0x55f3e9afb0fd] ??:0
#8: Envoy::Event::DispatcherImpl::run() [0x55f3e9afafbc] ??:0
#9: Envoy::Server::WorkerImpl::threadRoutine() [0x55f3e9af04a0] ??:0
#10: Envoy::Thread::ThreadImplPosix::ThreadImplPosix()::$_0::__invoke() [0x55f3ea034d35] ??:0
#11: start_thread [0x7f840a2e16db] ??:0

That's from a clean start, not a hot-restart.

yanavlasov commented:
I was able to recreate it. The conditions are:

  1. RTDS initialized via gRPC
  2. The cluster used for RTDS needs to have health checks enabled
  3. No healthy upstream hosts

This violates an invariant of the new split initialization of the cluster manager: it fires an ASSERT in the cluster manager in debug builds, while in release builds it continues and later crashes with the stack trace @rgs1 attached, which I think is a red herring at this point. I will debug this further later on to figure out how the state machine in the cluster manager gets out of whack.
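For anyone trying to reproduce, a minimal bootstrap fragment matching those three conditions might look like the following. This is an illustrative, untested sketch: the cluster name, address, and timeouts are made up, and condition 3 is simulated by pointing at an unresolvable host.

```yaml
# Sketch of the repro conditions (hypothetical names/values):
layered_runtime:
  layers:
  - name: rtds_layer
    rtds_layer:
      name: runtime_layer_0
      rtds_config:
        api_config_source:
          api_type: GRPC
          grpc_services:
          - envoy_grpc:
              cluster_name: rtds_cluster   # condition 1: RTDS via gRPC
static_resources:
  clusters:
  - name: rtds_cluster
    connect_timeout: 1s
    type: STRICT_DNS
    http2_protocol_options: {}
    load_assignment:
      cluster_name: rtds_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: rtds.invalid      # condition 3: no healthy host
                port_value: 18000
    health_checks:                         # condition 2: health checks enabled
    - timeout: 1s
      interval: 5s
      unhealthy_threshold: 1
      healthy_threshold: 1
      grpc_health_check: {}
```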

mattklein123 commented:
Fixed
