Bad SDS Host URL causes seg fault #903

Closed
bndw opened this issue May 5, 2017 · 5 comments · Fixed by #922

@bndw (Contributor) commented May 5, 2017

If an unresolvable URL is specified as the Cluster Manager SDS hosts URL, Envoy crashes with a segmentation fault and no helpful error text, even at the debug log level.

Below are an example Dockerfile and front-envoy.json for reproducing the issue. Replacing tcp://does-not-exist.amazonaws.com:80 with a valid SDS host resolves it.

Dockerfile

FROM lyft/envoy:latest

COPY ./front-envoy.json /etc/front-envoy.json

CMD /usr/local/bin/envoy -c /etc/front-envoy.json -l debug

front-envoy.json

{
  "listeners": [
    {
      "address": "tcp://0.0.0.0:80",
      "filters": [
        {
          "type": "read",
          "name": "http_connection_manager",
          "config": {
            "codec_type": "auto",
            "stat_prefix": "ingress_http",
            "route_config": {
              "virtual_hosts": [
                {
                  "name": "backend",
                  "domains": ["*"],
                  "routes": [
                    {
                      "timeout_ms": 0,
                      "prefix": "/",
                      "cluster": "service_echo"
                    }
                  ]
                }
              ]
            },
            "filters": [
              { "type": "both", "name": "health_check",
                "config": {
                  "pass_through_mode": false, "endpoint": "/healthcheck"
                }
              },
              {
                "type": "decoder",
                "name": "router",
                "config": {}
              }
            ]
          }
        }
      ]
    }
  ],
  "admin": {
    "access_log_path": "/dev/null",
    "address": "tcp://0.0.0.0:8001"
  },
  "cluster_manager": {
    "sds": {
      "cluster": {
        "name": "sds",
        "connect_timeout_ms": 250,
        "type": "strict_dns",
        "lb_type": "round_robin",
        "hosts": [{"url": "tcp://does-not-exist.amazonaws.com:80"}]
      },
      "refresh_delay_ms": 3000
    },
    "clusters": [
      {
        "name": "service_echo",
        "connect_timeout_ms": 250,
        "type": "sds",
        "service_name": "echoer",
        "lb_type": "round_robin",
        "health_check":  {
          "type": "http",
          "timeout_ms": 2000,
          "interval_ms": 5000,
          "interval_jitter_ms": 5000,
          "unhealthy_threshold": 2,
          "healthy_threshold": 2,
          "path": "/",
          "service_name": "echoer"
        }
      }
    ]
  }
}
@dnoe (Contributor) commented May 5, 2017

Output from a debug build with backtrace decoded:

[2017-05-05 16:29:42.099][107762][warning][main] initializing epoch 0 (hot restart version=7.2490552)
[2017-05-05 16:29:42.100][107762][info][main] admin address: 0.0.0.0:8001
[2017-05-05 16:29:42.104][107762][debug][upstream] starting async DNS resolution for does-not-exist.amazonaws.com
[2017-05-05 16:29:42.104][107762][info][upstream] cm init: adding: cluster=sds primary=1 secondary=0
[2017-05-05 16:29:42.107][107762][info][upstream] cm init: adding: cluster=service_echo primary=1 secondary=1
[2017-05-05 16:29:42.107][107762][info][config] loading 1 listener(s)
[2017-05-05 16:29:42.107][107762][info][config] listener #0:
[2017-05-05 16:29:42.107][107762][info][config]   address=0.0.0.0:80
[2017-05-05 16:29:42.108][107762][info][config]   filter #0:
[2017-05-05 16:29:42.108][107762][info][config]     type: read
[2017-05-05 16:29:42.108][107762][info][config]     name: http_connection_manager
[2017-05-05 16:29:42.112][107762][info][config]     filter #0
[2017-05-05 16:29:42.112][107762][info][config]       type: both
[2017-05-05 16:29:42.112][107762][info][config]       name: health_check
[2017-05-05 16:29:42.112][107762][info][config]     filter #1
[2017-05-05 16:29:42.112][107762][info][config]       type: decoder
[2017-05-05 16:29:42.112][107762][info][config]       name: router
[2017-05-05 16:29:42.112][107762][info][config] loading tracing configuration
[2017-05-05 16:29:42.113][107762][warning][main] starting main dispatch loop
[2017-05-05 16:29:42.218][107762][debug][upstream] async DNS resolution complete for does-not-exist.amazonaws.com
[2017-05-05 16:29:42.218][107762][info][upstream] cm init: removing: cluster=sds primary=0 secondary=1
[2017-05-05 16:29:42.218][107762][info][upstream] cm init: initializing secondary clusters
[2017-05-05 16:29:42.218][107762][debug][upstream] starting sds refresh for cluster: service_echo
[2017-05-05 16:29:42.218][107762][debug][router] [C0][S276000705182267] cluster 'sds' match for URL '/v1/registration/echoer'
[2017-05-05 16:29:42.218][107762][debug][http] async http request response headers (end_stream=false):
[2017-05-05 16:29:42.218][107762][debug][http]   ':status':'503'
[2017-05-05 16:29:42.218][107762][debug][http]   'content-length':'19'
[2017-05-05 16:29:42.218][107762][debug][http]   'content-type':'text/plain'
[2017-05-05 16:29:42.218][107762][debug][upstream] sds refresh failure for cluster: service_echo
[2017-05-05 16:29:42.218][107762][debug][upstream] sds refresh complete for cluster: service_echo
[2017-05-05 16:29:42.218][107762][info][upstream] cm init: removing: cluster=service_echo primary=0 secondary=0
[2017-05-05 16:29:42.218][107762][warning][main] all clusters initialized. initializing init manager
[2017-05-05 16:29:42.218][107762][warning][main] all dependencies initialized. starting workers
[2017-05-05 16:29:42.219][107764][info][main] worker entering dispatch loop
[2017-05-05 16:29:42.219][107765][info][main] worker entering dispatch loop
[2017-05-05 16:29:42.220][107766][info][main] worker entering dispatch loop
[2017-05-05 16:29:42.220][107768][info][main] worker entering dispatch loop
[2017-05-05 16:29:42.220][107767][info][main] worker entering dispatch loop
[2017-05-05 16:29:42.220][107769][info][main] worker entering dispatch loop
[2017-05-05 16:29:42.220][107770][info][main] worker entering dispatch loop
[2017-05-05 16:29:42.220][107771][info][main] worker entering dispatch loop
[2017-05-05 16:29:42.220][107772][info][main] worker entering dispatch loop
[2017-05-05 16:29:42.220][107773][info][main] worker entering dispatch loop
[2017-05-05 16:29:42.220][107774][info][main] worker entering dispatch loop
[2017-05-05 16:29:42.220][107775][info][main] worker entering dispatch loop
[2017-05-05 16:29:42.220][107762][critical][backtrace] Caught Segmentation fault, suspect faulting address 0x0
[2017-05-05 16:29:42.221][107762][critical] Backtrace (most recent call first) from thread 0:
  #1 Upstream::ClusterManagerInitHelper::maybeFinishInitialize() at cluster_manager_impl.cc:99
  #2 Upstream::ClusterManagerInitHelper::removeCluster(Upstream::Cluster&) at cluster_manager_impl.cc:76
  #3 operator() at cluster_manager_impl.cc:52
  #4 std::_Function_handler<void (), Upstream::ClusterManagerInitHelper::addCluster(Upstream::Cluster&)::$_0>::_M_invoke(std::_Any_data const&) at functional:2039
  #5 std::function<void ()>::operator()() const at functional:2440
  #6 operator() at upstream_impl.cc:454
  #7 std::_Function_handler<void (std::list<std::shared_ptr<Network::Address::Instance const>, std::allocator<std::shared_ptr<Network::Address::Instance const> > >&&), Upstream::StrictDnsClusterImpl::ResolveTarget::startResolve()::$_3>::_M_invoke(std::_Any_data const&, std::list<std::shared_ptr<Network::Address::Instance const>, std::allocator<std::shared_ptr<Network::Address::Instance const> > >&&) at functional:2039
  #8 std::function<void (std::list<std::shared_ptr<Network::Address::Instance const>, std::allocator<std::shared_ptr<Network::Address::Instance const> > >&&)>::operator()(std::list<std::shared_ptr<Network::Address::Instance const>, std::allocator<std::shared_ptr<Network::Address::Instance const> > >&&) const at functional:2440
  #9 Network::DnsResolverImpl::PendingResolution::onAresHostCallback(int, hostent*) at dns_impl.cc:70
  #10 operator() at dns_impl.cc:123
  #11 __invoke at dns_impl.cc:122
  #12 next_lookup at ares_gethostbyname.c:?
  #13 host_callback at ares_gethostbyname.c:?
  #14 search_callback at ares_search.c:?
  #15 qcallback at ares_query.c:?
  #16 end_query at ares_process.c:?
  #17 process_answer at ares_process.c:?
  #18 processfds at ares_process.c:?
  #19 Network::DnsResolverImpl::onEventCallback(int, unsigned int) at dns_impl.cc:92
  #20 operator() at dns_impl.cc:110
  #21 std::_Function_handler<void (unsigned int), Network::DnsResolverImpl::onAresSocketStateChange(int, int, int)::$_2>::_M_invoke(std::_Any_data const&, unsigned int) at functional:2039
  #22 std::function<void (unsigned int)>::operator()(unsigned int) const at functional:2440
  #23 operator() at file_event_impl.cc:59
  #24 __invoke at file_event_impl.cc:43
  #25 event_persist_closure at event.c:1580
  #26  (inlined by) event_process_active_single_queue at event.c:1639
  #27 event_process_active at event.c:?
  #28  (inlined by) event_base_loop at event.c:1961
  #29 Event::DispatcherImpl::run(Event::Dispatcher::RunType) at dispatcher_impl.cc:142
  #30 Server::InstanceImpl::run() at server.cc:366
  #31 main at main.cc:65
  #32 
  #33 ?? ??:0
  #34 
  #35 _start at start.S:118
  #36 

@dnoe (Contributor) commented May 5, 2017

The segfault is caused by the secondary_init_clusters_ list being modified within the for loop that traverses it inside ClusterManagerInitHelper::maybeFinishInitialize().

[2017-05-05 17:26:02.076][124413][critical] Backtrace (most recent call first) from thread 0:
  #1 unsigned long backward::details::unwind<backward::StackTraceImpl<backward::system_tag::linux_tag>::callback>(backward::StackTraceImpl<backward::system_tag::linux_tag>::callback, unsigned long) at backward.hpp:601
  #2 backward::StackTraceImpl<backward::system_tag::linux_tag>::load_here(unsigned long) at backward.hpp:617
  #3 BackwardsTrace::capture() at backtrace.h:53
  #4 Upstream::ClusterManagerInitHelper::removeCluster(Upstream::Cluster&) at cluster_manager_impl.cc:58
  #5 operator() at cluster_manager_impl.cc:53
  #6 std::_Function_handler<void (), Upstream::ClusterManagerInitHelper::addCluster(Upstream::Cluster&)::$_0>::_M_invoke(std::_Any_data const&) at functional:2039
  #7 std::function<void ()>::operator()() const at functional:2440
  #8 Upstream::SdsClusterImpl::onFetchComplete() at sds.cc:111
  #9 Http::RestApiFetcher::requestComplete() at rest_api_fetcher.cc:59
  #10 Http::RestApiFetcher::onFailure(Http::AsyncClient::FailureReason) at rest_api_fetcher.cc:46
  #11 Http::RestApiFetcher::onSuccess(std::unique_ptr<Http::Message, std::default_delete<Http::Message> >&&) at rest_api_fetcher.cc:31
  #12 Http::AsyncRequestImpl::onComplete() at async_client_impl.cc:169
  #13 Http::AsyncRequestImpl::onData(Buffer::Instance&, bool) at async_client_impl.cc:186
  #14 Http::AsyncStreamImpl::encodeData(Buffer::Instance&, bool) at async_client_impl.cc:88
  #15 Http::Utility::sendLocalReply(Http::StreamDecoderFilterCallbacks&, Http::Code, std::string const&) at utility.cc:155
  #16 Router::Filter::sendNoHealthyUpstreamResponse() at router.cc:263
  #17 Router::Filter::decodeHeaders(Http::HeaderMap&, bool) at router.cc:213
  #18 Http::AsyncStreamImpl::sendHeaders(Http::HeaderMap&, bool) at async_client_impl.cc:107
  #19 AsyncRequestImpl at async_client_impl.cc:162
  #20 Http::AsyncClientImpl::send(std::unique_ptr<Http::Message, std::default_delete<Http::Message> >&&, Http::AsyncClient::Callbacks&, Optional<std::chrono::duration<long, std::ratio<1l, 1000l> > > const&) at async_client_impl.cc:41
  #21 Http::RestApiFetcher::refresh() at rest_api_fetcher.cc:54
  #22 Http::RestApiFetcher::initialize() at rest_api_fetcher.cc:26
  #23 Upstream::SdsClusterImpl::initialize() at sds.h:24
  #24 Upstream::ClusterManagerInitHelper::maybeFinishInitialize() at cluster_manager_impl.cc:105
  #25 Upstream::ClusterManagerInitHelper::removeCluster(Upstream::Cluster&) at cluster_manager_impl.cc:80
  #26 operator() at cluster_manager_impl.cc:53
  #27 std::_Function_handler<void (), Upstream::ClusterManagerInitHelper::addCluster(Upstream::Cluster&)::$_0>::_M_invoke(std::_Any_data const&) at functional:2039
  #28 std::function<void ()>::operator()() const at functional:2440
  #29 operator() at upstream_impl.cc:454
  #30 std::_Function_handler<void (std::list<std::shared_ptr<Network::Address::Instance const>, std::allocator<std::shared_ptr<Network::Address::Instance const> > >&&), Upstream::StrictDnsClusterImpl::ResolveTarget::startResolve()::$_3>::_M_invoke(std::_Any_data const&, std::list<std::shared_ptr<Network::Address::Instance const>, std::allocator<std::shared_ptr<Network::Address::Instance const> > >&&) at functional:2039
  #31 std::function<void (std::list<std::shared_ptr<Network::Address::Instance const>, std::allocator<std::shared_ptr<Network::Address::Instance const> > >&&)>::operator()(std::list<std::shared_ptr<Network::Address::Instance const>, std::allocator<std::shared_ptr<Network::Address::Instance const> > >&&) const at functional:2440
  #32 Network::DnsResolverImpl::PendingResolution::onAresHostCallback(int, hostent*) at dns_impl.cc:70
  #33 operator() at dns_impl.cc:123
  #34 __invoke at dns_impl.cc:122
  #35 next_lookup at ares_gethostbyname.c:?
  #36 host_callback at ares_gethostbyname.c:?
  #37 search_callback at ares_search.c:?
  #38 qcallback at ares_query.c:?
  #39 end_query at ares_process.c:?
  #40 process_answer at ares_process.c:?
  #41 processfds at ares_process.c:?
  #42 Network::DnsResolverImpl::onEventCallback(int, unsigned int) at dns_impl.cc:92
  #43 operator() at dns_impl.cc:110
  #44 std::_Function_handler<void (unsigned int), Network::DnsResolverImpl::onAresSocketStateChange(int, int, int)::$_2>::_M_invoke(std::_Any_data const&, unsigned int) at functional:2039
  #45 std::function<void (unsigned int)>::operator()(unsigned int) const at functional:2440
  #46 operator() at file_event_impl.cc:59
  #47 __invoke at file_event_impl.cc:43
  #48 event_persist_closure at event.c:1580
  #49  (inlined by) event_process_active_single_queue at event.c:1639
  #50 event_process_active at event.c:?
  #51  (inlined by) event_base_loop at event.c:1961
  #52 Event::DispatcherImpl::run(Event::Dispatcher::RunType) at dispatcher_impl.cc:142
  #53 Server::InstanceImpl::run() at server.cc:366
  #54 main at main.cc:65

In frame #24 we are inside the for loop over secondary clusters. In frame #4 we are in removeCluster(), which modifies the list of secondary clusters by removing its only element. When control returns to the loop, advancing the now-invalidated iterator results in an invalid memory access.
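A minimal standalone sketch of this failure mode (hypothetical stand-in names; the real code lives in ClusterManagerInitHelper in cluster_manager_impl.cc):

```cpp
#include <functional>
#include <list>

// Hypothetical stand-in for Upstream::Cluster: initialize() may complete
// synchronously and fire a callback that removes this cluster from the list.
struct Cluster {
  std::function<void()> initialized_callback;
  void initialize() { initialized_callback(); }
};

std::list<Cluster*> secondary_init_clusters_;

void removeCluster(Cluster& cluster) {
  // Erases the node the loop below is currently pointing at.
  secondary_init_clusters_.remove(&cluster);
}

void maybeFinishInitialize() {
  // BUG: range-for keeps a hidden iterator to the current node. If
  // initialize() triggers removeCluster() for that node, the iterator is
  // invalidated, and advancing it afterwards is undefined behavior.
  // Here the list's only element is erased mid-loop, hence the segfault.
  for (Cluster* cluster : secondary_init_clusters_) {
    cluster->initialize();
  }
}

int main() {
  Cluster c;
  c.initialized_callback = [&c] { removeCluster(c); };
  secondary_init_clusters_.push_back(&c);
  maybeFinishInitialize(); // UB: iterator invalidated mid-loop
}
```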

@dnoe (Contributor) commented May 5, 2017

I have a working fix for this specific issue but I need to understand the code better and think a bit more to decide if it's the right solution.

mattklein123 modified the milestone: 1.3.0 May 6, 2017

@mattklein123 (Member) commented

Thanks for fixing my bug @dnoe! Let me know if you need help with anything.

dnoe added a commit to dnoe/envoy that referenced this issue May 8, 2017
A segfault arose when iterating over the secondary_init_clusters_ list and calling initialize() on each element. The call to initialize() may result in the initialized item being removed from the secondary_init_clusters_ list while it is being walked. The initialized item can only remove itself, not any other item in the list; even so, this meant chasing an invalidated iterator when moving to the next element.

To fix this, the loop is modified so that the iterator is advanced to the next item (or end() if there are no more items) before calling Cluster::initialize().

A test is included which exercises this scenario.

Fixes envoyproxy#903
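Under the same hypothetical names as the earlier sketch, the advance-before-call pattern the commit describes looks like this. It is safe because erasing a std::list node only invalidates iterators to that node, so an iterator that has already moved past it remains valid:

```cpp
void maybeFinishInitialize() {
  for (auto iter = secondary_init_clusters_.begin();
       iter != secondary_init_clusters_.end();) {
    Cluster* cluster = *iter;
    ++iter; // advance first: erasing cluster's node leaves `iter` valid
    cluster->initialize(); // may now safely call removeCluster(*cluster)
  }
}
```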
mattklein123 pushed a commit that referenced this issue May 8, 2017
@omnilinguist (Contributor) commented

@dnoe, may I ask how you generated the "Output from a debug build with backtrace decoded"? I am running into a similar issue with the example given in https://www.datawire.io/guide/traffic/envoy-flask-kubernetes/ on a recent latest build, and I am wondering how I can generate the same debug log.
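One plausible way to obtain such a symbolized trace (a sketch assuming a Bazel checkout of Envoy; not necessarily the exact workflow used above) is to build an unstripped debug binary, so the crash handler can resolve symbols, and to resolve any remaining raw addresses with binutils' addr2line:

```sh
# Assumed workflow, not an official Envoy procedure; the target name and
# paths are taken from the Envoy repo layout and may differ across versions.
bazel build -c dbg //source/exe:envoy-static

# Reproduce the crash; a debug build prints a symbolized backtrace on abort.
bazel-bin/source/exe/envoy-static -c /etc/front-envoy.json -l debug

# If a frame shows only a raw address, resolve it by hand (the address below
# is a placeholder for one taken from the crash output).
addr2line -f -C -e bazel-bin/source/exe/envoy-static 0x0000000000abcdef
```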
