Bad SDS Host URL causes seg fault #903

Closed
bndw opened this issue May 5, 2017 · 5 comments · Fixed by #922

@bndw (Contributor) commented May 5, 2017

If an unresolvable URL is specified as the Cluster Manager SDS hosts URL, Envoy crashes with a segmentation fault and no helpful error text, even at the debug log level.

Below are an example Dockerfile and front-envoy.json for reproducing the issue. Replacing tcp://does-not-exist.amazonaws.com:80 with a valid SDS host resolves it.

Dockerfile

FROM lyft/envoy:latest

COPY ./front-envoy.json /etc/front-envoy.json

CMD /usr/local/bin/envoy -c /etc/front-envoy.json -l debug

front-envoy.json

{
  "listeners": [
    {
      "address": "tcp://0.0.0.0:80",
      "filters": [
        {
          "type": "read",
          "name": "http_connection_manager",
          "config": {
            "codec_type": "auto",
            "stat_prefix": "ingress_http",
            "route_config": {
              "virtual_hosts": [
                {
                  "name": "backend",
                  "domains": ["*"],
                  "routes": [
                    {
                      "timeout_ms": 0,
                      "prefix": "/",
                      "cluster": "service_echo"
                    }
                  ]
                }
              ]
            },
            "filters": [
              { "type": "both", "name": "health_check",
                "config": {
                  "pass_through_mode": false, "endpoint": "/healthcheck"
                }
              },
              {
                "type": "decoder",
                "name": "router",
                "config": {}
              }
            ]
          }
        }
      ]
    }
  ],
  "admin": {
    "access_log_path": "/dev/null",
    "address": "tcp://0.0.0.0:8001"
  },
  "cluster_manager": {
    "sds": {
      "cluster": {
        "name": "sds",
        "connect_timeout_ms": 250,
        "type": "strict_dns",
        "lb_type": "round_robin",
        "hosts": [{"url": "tcp://does-not-exist.amazonaws.com:80"}]
      },
      "refresh_delay_ms": 3000
    },
    "clusters": [
      {
        "name": "service_echo",
        "connect_timeout_ms": 250,
        "type": "sds",
        "service_name": "echoer",
        "lb_type": "round_robin",
        "health_check":  {
          "type": "http",
          "timeout_ms": 2000,
          "interval_ms": 5000,
          "interval_jitter_ms": 5000,
          "unhealthy_threshold": 2,
          "healthy_threshold": 2,
          "path": "/",
          "service_name": "echoer"
        }
      }
    ]
  }
}
@dnoe (Contributor) commented May 5, 2017

Output from a debug build with backtrace decoded:

[2017-05-05 16:29:42.099][107762][warning][main] initializing epoch 0 (hot restart version=7.2490552)
[2017-05-05 16:29:42.100][107762][info][main] admin address: 0.0.0.0:8001
[2017-05-05 16:29:42.104][107762][debug][upstream] starting async DNS resolution for does-not-exist.amazonaws.com
[2017-05-05 16:29:42.104][107762][info][upstream] cm init: adding: cluster=sds primary=1 secondary=0
[2017-05-05 16:29:42.107][107762][info][upstream] cm init: adding: cluster=service_echo primary=1 secondary=1
[2017-05-05 16:29:42.107][107762][info][config] loading 1 listener(s)
[2017-05-05 16:29:42.107][107762][info][config] listener #0:
[2017-05-05 16:29:42.107][107762][info][config]   address=0.0.0.0:80
[2017-05-05 16:29:42.108][107762][info][config]   filter #0:
[2017-05-05 16:29:42.108][107762][info][config]     type: read
[2017-05-05 16:29:42.108][107762][info][config]     name: http_connection_manager
[2017-05-05 16:29:42.112][107762][info][config]     filter #0
[2017-05-05 16:29:42.112][107762][info][config]       type: both
[2017-05-05 16:29:42.112][107762][info][config]       name: health_check
[2017-05-05 16:29:42.112][107762][info][config]     filter #1
[2017-05-05 16:29:42.112][107762][info][config]       type: decoder
[2017-05-05 16:29:42.112][107762][info][config]       name: router
[2017-05-05 16:29:42.112][107762][info][config] loading tracing configuration
[2017-05-05 16:29:42.113][107762][warning][main] starting main dispatch loop
[2017-05-05 16:29:42.218][107762][debug][upstream] async DNS resolution complete for does-not-exist.amazonaws.com
[2017-05-05 16:29:42.218][107762][info][upstream] cm init: removing: cluster=sds primary=0 secondary=1
[2017-05-05 16:29:42.218][107762][info][upstream] cm init: initializing secondary clusters
[2017-05-05 16:29:42.218][107762][debug][upstream] starting sds refresh for cluster: service_echo
[2017-05-05 16:29:42.218][107762][debug][router] [C0][S276000705182267] cluster 'sds' match for URL '/v1/registration/echoer'
[2017-05-05 16:29:42.218][107762][debug][http] async http request response headers (end_stream=false):
[2017-05-05 16:29:42.218][107762][debug][http]   ':status':'503'
[2017-05-05 16:29:42.218][107762][debug][http]   'content-length':'19'
[2017-05-05 16:29:42.218][107762][debug][http]   'content-type':'text/plain'
[2017-05-05 16:29:42.218][107762][debug][upstream] sds refresh failure for cluster: service_echo
[2017-05-05 16:29:42.218][107762][debug][upstream] sds refresh complete for cluster: service_echo
[2017-05-05 16:29:42.218][107762][info][upstream] cm init: removing: cluster=service_echo primary=0 secondary=0
[2017-05-05 16:29:42.218][107762][warning][main] all clusters initialized. initializing init manager
[2017-05-05 16:29:42.218][107762][warning][main] all dependencies initialized. starting workers
[2017-05-05 16:29:42.219][107764][info][main] worker entering dispatch loop
[2017-05-05 16:29:42.219][107765][info][main] worker entering dispatch loop
[2017-05-05 16:29:42.220][107766][info][main] worker entering dispatch loop
[2017-05-05 16:29:42.220][107768][info][main] worker entering dispatch loop
[2017-05-05 16:29:42.220][107767][info][main] worker entering dispatch loop
[2017-05-05 16:29:42.220][107769][info][main] worker entering dispatch loop
[2017-05-05 16:29:42.220][107770][info][main] worker entering dispatch loop
[2017-05-05 16:29:42.220][107771][info][main] worker entering dispatch loop
[2017-05-05 16:29:42.220][107772][info][main] worker entering dispatch loop
[2017-05-05 16:29:42.220][107773][info][main] worker entering dispatch loop
[2017-05-05 16:29:42.220][107774][info][main] worker entering dispatch loop
[2017-05-05 16:29:42.220][107775][info][main] worker entering dispatch loop
[2017-05-05 16:29:42.220][107762][critical][backtrace] Caught Segmentation fault, suspect faulting address 0x0
[2017-05-05 16:29:42.221][107762][critical] Backtrace (most recent call first) from thread 0:
  #1 Upstream::ClusterManagerInitHelper::maybeFinishInitialize() at cluster_manager_impl.cc:99
  #2 Upstream::ClusterManagerInitHelper::removeCluster(Upstream::Cluster&) at cluster_manager_impl.cc:76
  #3 operator() at cluster_manager_impl.cc:52
  #4 std::_Function_handler<void (), Upstream::ClusterManagerInitHelper::addCluster(Upstream::Cluster&)::$_0>::_M_invoke(std::_Any_data const&) at functional:2039
  #5 std::function<void ()>::operator()() const at functional:2440
  #6 operator() at upstream_impl.cc:454
  #7 std::_Function_handler<void (std::list<std::shared_ptr<Network::Address::Instance const>, std::allocator<std::shared_ptr<Network::Address::Instance const> > >&&), Upstream::StrictDnsClusterImpl::ResolveTarget::startResolve()::$_3>::_M_invoke(std::_Any_data const&, std::list<std::shared_ptr<Network::Address::Instance const>, std::allocator<std::shared_ptr<Network::Address::Instance const> > >&&) at functional:2039
  #8 std::function<void (std::list<std::shared_ptr<Network::Address::Instance const>, std::allocator<std::shared_ptr<Network::Address::Instance const> > >&&)>::operator()(std::list<std::shared_ptr<Network::Address::Instance const>, std::allocator<std::shared_ptr<Network::Address::Instance const> > >&&) const at functional:2440
  #9 Network::DnsResolverImpl::PendingResolution::onAresHostCallback(int, hostent*) at dns_impl.cc:70
  #10 operator() at dns_impl.cc:123
  #11 __invoke at dns_impl.cc:122
  #12 next_lookup at ares_gethostbyname.c:?
  #13 host_callback at ares_gethostbyname.c:?
  #14 search_callback at ares_search.c:?
  #15 qcallback at ares_query.c:?
  #16 end_query at ares_process.c:?
  #17 process_answer at ares_process.c:?
  #18 processfds at ares_process.c:?
  #19 Network::DnsResolverImpl::onEventCallback(int, unsigned int) at dns_impl.cc:92
  #20 operator() at dns_impl.cc:110
  #21 std::_Function_handler<void (unsigned int), Network::DnsResolverImpl::onAresSocketStateChange(int, int, int)::$_2>::_M_invoke(std::_Any_data const&, unsigned int) at functional:2039
  #22 std::function<void (unsigned int)>::operator()(unsigned int) const at functional:2440
  #23 operator() at file_event_impl.cc:59
  #24 __invoke at file_event_impl.cc:43
  #25 event_persist_closure at event.c:1580
  #26  (inlined by) event_process_active_single_queue at event.c:1639
  #27 event_process_active at event.c:?
  #28  (inlined by) event_base_loop at event.c:1961
  #29 Event::DispatcherImpl::run(Event::Dispatcher::RunType) at dispatcher_impl.cc:142
  #30 Server::InstanceImpl::run() at server.cc:366
  #31 main at main.cc:65
  #32 
  #33 ?? ??:0
  #34 
  #35 _start at start.S:118
  #36 

@dnoe (Contributor) commented May 5, 2017

The segfault is caused by the secondary_init_clusters_ list being modified within the for loop that traverses it inside ClusterManagerInitHelper::maybeFinishInitialize().

[2017-05-05 17:26:02.076][124413][critical] Backtrace (most recent call first) from thread 0:
  #1 unsigned long backward::details::unwind<backward::StackTraceImpl<backward::system_tag::linux_tag>::callback>(backward::StackTraceImpl<backward::system_tag::linux_tag>::callback, unsigned long) at backward.hpp:601
  #2 backward::StackTraceImpl<backward::system_tag::linux_tag>::load_here(unsigned long) at backward.hpp:617
  #3 BackwardsTrace::capture() at backtrace.h:53
  #4 Upstream::ClusterManagerInitHelper::removeCluster(Upstream::Cluster&) at cluster_manager_impl.cc:58
  #5 operator() at cluster_manager_impl.cc:53
  #6 std::_Function_handler<void (), Upstream::ClusterManagerInitHelper::addCluster(Upstream::Cluster&)::$_0>::_M_invoke(std::_Any_data const&) at functional:2039
  #7 std::function<void ()>::operator()() const at functional:2440
  #8 Upstream::SdsClusterImpl::onFetchComplete() at sds.cc:111
  #9 Http::RestApiFetcher::requestComplete() at rest_api_fetcher.cc:59
  #10 Http::RestApiFetcher::onFailure(Http::AsyncClient::FailureReason) at rest_api_fetcher.cc:46
  #11 Http::RestApiFetcher::onSuccess(std::unique_ptr<Http::Message, std::default_delete<Http::Message> >&&) at rest_api_fetcher.cc:31
  #12 Http::AsyncRequestImpl::onComplete() at async_client_impl.cc:169
  #13 Http::AsyncRequestImpl::onData(Buffer::Instance&, bool) at async_client_impl.cc:186
  #14 Http::AsyncStreamImpl::encodeData(Buffer::Instance&, bool) at async_client_impl.cc:88
  #15 Http::Utility::sendLocalReply(Http::StreamDecoderFilterCallbacks&, Http::Code, std::string const&) at utility.cc:155
  #16 Router::Filter::sendNoHealthyUpstreamResponse() at router.cc:263
  #17 Router::Filter::decodeHeaders(Http::HeaderMap&, bool) at router.cc:213
  #18 Http::AsyncStreamImpl::sendHeaders(Http::HeaderMap&, bool) at async_client_impl.cc:107
  #19 AsyncRequestImpl at async_client_impl.cc:162
  #20 Http::AsyncClientImpl::send(std::unique_ptr<Http::Message, std::default_delete<Http::Message> >&&, Http::AsyncClient::Callbacks&, Optional<std::chrono::duration<long, std::ratio<1l, 1000l> > > const&) at async_client_impl.cc:41
  #21 Http::RestApiFetcher::refresh() at rest_api_fetcher.cc:54
  #22 Http::RestApiFetcher::initialize() at rest_api_fetcher.cc:26
  #23 Upstream::SdsClusterImpl::initialize() at sds.h:24
  #24 Upstream::ClusterManagerInitHelper::maybeFinishInitialize() at cluster_manager_impl.cc:105
  #25 Upstream::ClusterManagerInitHelper::removeCluster(Upstream::Cluster&) at cluster_manager_impl.cc:80
  #26 operator() at cluster_manager_impl.cc:53
  #27 std::_Function_handler<void (), Upstream::ClusterManagerInitHelper::addCluster(Upstream::Cluster&)::$_0>::_M_invoke(std::_Any_data const&) at functional:2039
  #28 std::function<void ()>::operator()() const at functional:2440
  #29 operator() at upstream_impl.cc:454
  #30 std::_Function_handler<void (std::list<std::shared_ptr<Network::Address::Instance const>, std::allocator<std::shared_ptr<Network::Address::Instance const> > >&&), Upstream::StrictDnsClusterImpl::ResolveTarget::startResolve()::$_3>::_M_invoke(std::_Any_data const&, std::list<std::shared_ptr<Network::Address::Instance const>, std::allocator<std::shared_ptr<Network::Address::Instance const> > >&&) at functional:2039
  #31 std::function<void (std::list<std::shared_ptr<Network::Address::Instance const>, std::allocator<std::shared_ptr<Network::Address::Instance const> > >&&)>::operator()(std::list<std::shared_ptr<Network::Address::Instance const>, std::allocator<std::shared_ptr<Network::Address::Instance const> > >&&) const at functional:2440
  #32 Network::DnsResolverImpl::PendingResolution::onAresHostCallback(int, hostent*) at dns_impl.cc:70
  #33 operator() at dns_impl.cc:123
  #34 __invoke at dns_impl.cc:122
  #35 next_lookup at ares_gethostbyname.c:?
  #36 host_callback at ares_gethostbyname.c:?
  #37 search_callback at ares_search.c:?
  #38 qcallback at ares_query.c:?
  #39 end_query at ares_process.c:?
  #40 process_answer at ares_process.c:?
  #41 processfds at ares_process.c:?
  #42 Network::DnsResolverImpl::onEventCallback(int, unsigned int) at dns_impl.cc:92
  #43 operator() at dns_impl.cc:110
  #44 std::_Function_handler<void (unsigned int), Network::DnsResolverImpl::onAresSocketStateChange(int, int, int)::$_2>::_M_invoke(std::_Any_data const&, unsigned int) at functional:2039
  #45 std::function<void (unsigned int)>::operator()(unsigned int) const at functional:2440
  #46 operator() at file_event_impl.cc:59
  #47 __invoke at file_event_impl.cc:43
  #48 event_persist_closure at event.c:1580
  #49  (inlined by) event_process_active_single_queue at event.c:1639
  #50 event_process_active at event.c:?
  #51  (inlined by) event_base_loop at event.c:1961
  #52 Event::DispatcherImpl::run(Event::Dispatcher::RunType) at dispatcher_impl.cc:142
  #53 Server::InstanceImpl::run() at server.cc:366
  #54 main at main.cc:65

In frame #24 we are inside the for loop over secondary clusters. In frame #4 we are in removeCluster(), which modifies the list of secondary clusters by removing its only element. When control returns to the loop, advancing the now-invalidated iterator results in an invalid memory access.
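A minimal standalone sketch of this failure mode (hypothetical stand-in names; the real code lives in ClusterManagerInitHelper in cluster_manager_impl.cc):

```cpp
#include <functional>
#include <list>

// Hypothetical stand-in for Upstream::Cluster: initialize() may complete
// synchronously and fire a callback that removes this cluster from the list.
struct Cluster {
  std::function<void()> initialized_callback;
  void initialize() { initialized_callback(); }
};

std::list<Cluster*> secondary_init_clusters_;

void removeCluster(Cluster& cluster) {
  // Erases the node the loop below is currently pointing at.
  secondary_init_clusters_.remove(&cluster);
}

void maybeFinishInitialize() {
  // BUG: range-for keeps a hidden iterator to the current node. If
  // initialize() triggers removeCluster() for that node, the iterator is
  // invalidated, and advancing it afterwards is undefined behavior.
  // Here the list's only element is erased mid-loop, hence the segfault.
  for (Cluster* cluster : secondary_init_clusters_) {
    cluster->initialize();
  }
}

int main() {
  Cluster c;
  c.initialized_callback = [&c] { removeCluster(c); };
  secondary_init_clusters_.push_back(&c);
  maybeFinishInitialize(); // UB: iterator invalidated mid-loop
}
```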

@dnoe (Contributor) commented May 5, 2017

I have a working fix for this specific issue but I need to understand the code better and think a bit more to decide if it's the right solution.

mattklein123 modified the milestone: 1.3.0 May 6, 2017

@mattklein123 (Member) commented

Thanks for fixing my bug @dnoe! Let me know if you need help with anything.

dnoe added a commit to dnoe/envoy that referenced this issue May 8, 2017
A segfault arose when iterating over the secondary_init_clusters_ list and calling initialize() on each element. The call to initialize() may result in the initialized item being removed from the secondary_init_clusters_ list while it is being walked. The initialized item can only remove itself, not any other item in the list; even so, this meant chasing an invalidated iterator when moving to the next element.

To fix this, the loop is modified so that the iterator is advanced to the next item (or end() if there are no more items) before calling Cluster::initialize().

A test is included which exercises this scenario.

Fixes envoyproxy#903
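Under the same hypothetical names as the earlier sketch, the advance-before-call pattern the commit describes looks like this. It is safe because erasing a std::list node only invalidates iterators to that node, so an iterator that has already moved past it remains valid:

```cpp
void maybeFinishInitialize() {
  for (auto iter = secondary_init_clusters_.begin();
       iter != secondary_init_clusters_.end();) {
    Cluster* cluster = *iter;
    ++iter; // advance first: erasing cluster's node leaves `iter` valid
    cluster->initialize(); // may now safely call removeCluster(*cluster)
  }
}
```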
mattklein123 pushed a commit that referenced this issue May 8, 2017
@omnilinguist (Contributor) commented

@dnoe, may I ask how you generated the "Output from a debug build with backtrace decoded"? I am running into a similar issue with the example given in https://www.datawire.io/guide/traffic/envoy-flask-kubernetes/ on a recent latest build, and I am wondering how I can generate the same debug log.
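One plausible way to obtain such a symbolized trace (a sketch assuming a Bazel checkout of Envoy; not necessarily the exact workflow used above) is to build an unstripped debug binary, so the crash handler can resolve symbols, and to resolve any remaining raw addresses with binutils' addr2line:

```sh
# Assumed workflow, not an official Envoy procedure; the target name and
# paths are taken from the Envoy repo layout and may differ across versions.
bazel build -c dbg //source/exe:envoy-static

# Reproduce the crash; a debug build prints a symbolized backtrace on abort.
bazel-bin/source/exe/envoy-static -c /etc/front-envoy.json -l debug

# If a frame shows only a raw address, resolve it by hand (the address below
# is a placeholder for one taken from the crash output).
addr2line -f -C -e bazel-bin/source/exe/envoy-static 0x0000000000abcdef
```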
