
Crash regression from #10362 #10901

Closed
rgs1 opened this issue Apr 22, 2020 · 12 comments
Labels: bug, no stalebot (Disables stalebot from closing an issue)
rgs1 commented Apr 22, 2020

We synced with master yesterday and are now getting a crash on startup (after a hot restart).

Here's the stack trace (not super useful):

StreamRuntime gRPC config stream closed: 14, no healthy upstream
Unable to establish new stream
StreamRuntime gRPC config stream closed: 14, no healthy upstream
Unable to establish new stream
Caught Segmentation fault, suspect faulting address 0x10
Backtrace (use tools/stack_decode.py to get line numbers):
Envoy version: 0/1.15.0-dev//RELEASE/BoringSSL
#0: __restore_rt [0x7f7cf5486890] ??:0
#1: Envoy::ThreadLocal::InstanceImpl::Bookkeeper::get() [0x562156ddfa63] ??:0

I suspect either #10362 or #10842.

The full list of changes that we picked up:

Remove hardcoded type urls Part.2 #10848
upstream: fix panic on grpc unknown_service status on healthchecks #10863
Fix Windows compilation of test sources #10822
conn_pool: unifying status codes #10854
Windows compilation: enable compiling expanded list of extensions in envoy-static #10542
logger: Make log prefix configurable #10693
stream_info: Collapse constructors #10691
coverage: revert workarounds that are no longer neccessary #10837
Update LuaJIT patch - remove MAP_32BIT #10867
filter: postgres statistics network filter #10642
api/faq: add initial API versioning FAQ entries. #10829
Catch exception and return false in cases where std::regex_match throws. #10861
redis: Fix stack-use-after-scope in test #10840
http: downstream connect support #10720
init: order dynamic resource initialization to make RTDS always be first #10362
[test] fix fuzz tests that might crash on duplicate settings params #10779
Fix clang-tidy in source/common/http/conn_manager_config.h #10860
fix: upstream grpc stats on trailers only #10842
ip tagging: remember tags as builtins #10856
Remove vendor specific dynamo filter use from HCM config test #10858
router: allow retry of streaming/incomplete requests #10725
Update filter_chain_benchmark_test.cc #10850
[admin] extract stats handlers to separate file #10750

rgs1 commented Apr 22, 2020

Maybe #10362 :

[2020-04-22 17:42:30.844][20][debug][init] [external/envoy/source/common/init/watcher_impl.cc:14] init manager RTDS initialized, notifying RDTS
[2020-04-22 17:42:30.844][20][info][runtime] [external/envoy/source/common/runtime/runtime_impl.cc:519] RTDS has finished initialization
[2020-04-22 17:42:30.844][20][debug][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:187] continue initializing secondary clusters
[2020-04-22 17:42:30.844][20][debug][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:125] maybe finish initialize state: 4
[2020-04-22 17:42:30.844][20][debug][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:133] maybe finish initialize primary init clusters empty: true
[2020-04-22 17:42:30.844][20][debug][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:141] maybe finish initialize secondary init clusters empty: true
[2020-04-22 17:42:30.844][20][debug][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:164] maybe finish initialize cds api ready: true
[2020-04-22 17:42:30.844][20][info][upstream] [external/envoy/source/common/upstream/cluster_manager_impl.cc:170] cm init: all clusters initialized
[2020-04-22 17:42:30.844][20][info][main] [external/envoy/source/server/server.cc:557] all clusters initialized. initializing init manager
[2020-04-22 17:42:30.844][20][debug][init] [external/envoy/source/common/init/manager_impl.cc:45] init manager Server contains no targets
[2020-04-22 17:42:30.844][20][debug][init] [external/envoy/source/common/init/watcher_impl.cc:14] init manager Server initialized, notifying RunHelper
[2020-04-22 17:42:30.844][20][info][config] [external/envoy/source/server/listener_manager_impl.cc:796] all dependencies initialized. starting workers
[2020-04-22 17:42:30.844][20][debug][config] [external/envoy/source/server/listener_manager_impl.cc:807] starting worker 0
[2020-04-22 17:42:30.844][20][debug][config] [external/envoy/source/server/listener_manager_impl.cc:807] starting worker 1
[2020-04-22 17:42:30.844][32][debug][main] [external/envoy/source/server/worker_impl.cc:111] worker entering dispatch loop
[2020-04-22 17:42:30.844][32][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:104] Caught Segmentation fault, suspect faulting address 0x10
[2020-04-22 17:42:30.844][32][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[2020-04-22 17:42:30.844][32][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:92] Envoy version: 0/1.15.0-dev//RELEASE/BoringSSL
[2020-04-22 17:42:30.844][33][debug][main] [external/envoy/source/server/worker_impl.cc:111] worker entering dispatch loop
[2020-04-22 17:42:30.844][32][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:96] #0: __restore_rt [0x7f918b769890] ??:0

cc: @yanavlasov

mattklein123 added this to the 1.15.0 milestone Apr 22, 2020

mattklein123 commented:
@rgs1 if possible can you get a proper stack trace with debug symbols?

rgs1 commented Apr 22, 2020

Reverting #10362 indeed made it go away. I might not have time to debug this further today, so a better stack trace may need to wait unless someone else can reproduce it first (if not, I'll follow up later/tomorrow).

mattklein123 commented:
OK we should probably revert #10362 in the meantime. cc @yanavlasov

mattklein123 changed the title from "new crash" to "Crash regression from #10362" Apr 22, 2020
rgs1 commented Apr 23, 2020

Is the plan still to revert this? I might have time to look into this again later today, but in parallel we'd still like to stay close to master (without carrying a revert patch), so reverting would be nice.

mattklein123 commented:
@rgs1 can you file the revert PR? If not I can do it later.

mattklein123 added the "no stalebot" label (disables stalebot from closing an issue) Apr 23, 2020
rgs1 commented Apr 23, 2020

Yup -- coming up.

rgs1 commented Apr 23, 2020

@mattklein123 revert here: #10919.

yanavlasov commented:
@rgs1 stack trace would be very helpful or some pointers on how to reproduce this.

rgs1 commented Apr 24, 2020

For others following along:

#0: __restore_rt [0x7f840a2ec890] ??:0
#1: Envoy::ThreadLocal::InstanceImpl::Bookkeeper::get() [0x55f3e9a71a63] ??:0
#2: Envoy::ThreadLocal::Slot::getTyped<>() [0x55f3e9b2ddac] ??:0
#3: Envoy::Stats::ThreadLocalStoreImpl::ScopeImpl::counterFromStatNameWithTags() [0x55f3e9b2dae8] ??:0
#4: Envoy::Server::GuardDogImpl::WatchedDog::WatchedDog() [0x55f3e9a8b9a6] ??:0
#5: Envoy::Server::GuardDogImpl::createWatchDog() [0x55f3e9a8b548] ??:0
#6: std::_Function_handler<>::_M_invoke() [0x55f3e9af129a] ??:0
#7: Envoy::Event::DispatcherImpl::runPostCallbacks() [0x55f3e9afb0fd] ??:0
#8: Envoy::Event::DispatcherImpl::run() [0x55f3e9afafbc] ??:0
#9: Envoy::Server::WorkerImpl::threadRoutine() [0x55f3e9af04a0] ??:0
#10: Envoy::Thread::ThreadImplPosix::ThreadImplPosix()::$_0::__invoke() [0x55f3ea034d35] ??:0
#11: start_thread [0x7f840a2e16db] ??:0

That's from a clean start, not a hot-restart.

yanavlasov commented:
I was able to recreate it. The conditions are:

  1. RTDS initialized via gRPC
  2. The cluster used for RTDS needs to have health checks enabled
  3. No healthy upstream hosts

This violates an invariant of the new split initialization of the cluster manager: it fires an ASSERT in the cluster manager in debug builds, while in release builds it continues and later crashes with the stack trace @rgs1 attached, which I think is a red herring at this point. I will debug this further later on to figure out how the state machine in the cluster manager gets out of whack.
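For anyone trying to reproduce, a minimal bootstrap fragment matching those three conditions might look like the following. This is an illustrative, untested sketch: the cluster name, address, and timeouts are made up, and condition 3 is simulated by pointing at an unresolvable host.

```yaml
# Sketch of the repro conditions (hypothetical names/values):
layered_runtime:
  layers:
  - name: rtds_layer
    rtds_layer:
      name: runtime_layer_0
      rtds_config:
        api_config_source:
          api_type: GRPC
          grpc_services:
          - envoy_grpc:
              cluster_name: rtds_cluster   # condition 1: RTDS via gRPC
static_resources:
  clusters:
  - name: rtds_cluster
    connect_timeout: 1s
    type: STRICT_DNS
    http2_protocol_options: {}
    load_assignment:
      cluster_name: rtds_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: rtds.invalid      # condition 3: no healthy host
                port_value: 18000
    health_checks:                         # condition 2: health checks enabled
    - timeout: 1s
      interval: 5s
      unhealthy_threshold: 1
      healthy_threshold: 1
      grpc_health_check: {}
```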

mattklein123 commented:
Fixed
