[config_validation]: fixing crash in grpc client during shutdown #18532

Closed
tehnerd wants to merge 3 commits

Conversation

@tehnerd (Contributor) commented Oct 8, 2021

Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>

Commit Message:
This is basically a copy-paste of the logic from the main server.cc added in #17403. We found that the validation server can also crash during shutdown when it tries to drain the GrpcMux queues. This diff prevents it from doing so during the shutdown phase.
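
For illustration only, a minimal self-contained sketch of the "skip the drain once shutdown has started" guard described above. The class and member names are hypothetical, not Envoy's real classes and not the actual diff; it only shows the idea of refusing to push requests onto the gRPC stream once teardown has begun.

// Hypothetical sketch of the shutdown guard; names are illustrative only.
#include <string>
#include <vector>

class GrpcMuxSketch {
public:
  // The server flips this flag before tearing down the xDS machinery.
  void shutdown() { shutdown_ = true; }

  // Watch destructors end up here; once shutdown has started we must not
  // touch the gRPC stream, which may already be partially destroyed.
  void queueDiscoveryRequest(const std::string& type_url) {
    if (shutdown_) {
      return; // Drop the request instead of draining onto a dying stream.
    }
    request_queue_.push_back(type_url);
    drainRequests();
  }

private:
  void drainRequests() {
    // Would send every queued request over the async gRPC stream.
    request_queue_.clear();
  }

  bool shutdown_{false};
  std::vector<std::string> request_queue_;
};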

Example of such a crash:

(gdb) bt
#0  0x00007f9b5349297b in raise () from /usr/drte/v3/lib64/libpthread.so.0
#1  0x0000558799b71002 in Envoy::SignalAction::sigHandler(int, siginfo_t*, void*) ()
#2  <signal handler called>
#3  0x00007f9b52fbc6fb in raise () from /usr/drte/v3/lib64/libc.so.6
#4  0x00007f9b52f9e504 in abort () from /usr/drte/v3/lib64/libc.so.6
#5  0x0000558799ba8045 in Envoy::TerminateHandler::logOnTerminate() const::{lambda()#1}::operator()() const ()
#6  0x0000558799ba807e in Envoy::TerminateHandler::logOnTerminate() const::{lambda()#1}::_FUN() ()
#7  0x00007f9b531f2886 in __cxxabiv1::__terminate(void (*)()) () from /usr/drte/v3/lib64/libstdc++.so.6
#8  0x00007f9b531f28c1 in std::terminate() () from /usr/drte/v3/lib64/libstdc++.so.6
#9  0x00007f9b531f35df in __cxa_pure_virtual () from /usr/drte/v3/lib64/libstdc++.so.6
#10 0x00005587993a849c in Envoy::Grpc::Internal::sendMessageUntyped(Envoy::Grpc::RawAsyncStream*, google::protobuf::Message const&, bool) ()
#11 0x000055879936de6b in Envoy::Grpc::AsyncStream<envoy::service::discovery::v3::DiscoveryRequest>::sendMessage(google::protobuf::Message const&, bool) ()
#12 0x000055879936bb7b in Envoy::Config::GrpcStream<envoy::service::discovery::v3::DiscoveryRequest, envoy::service::discovery::v3::DiscoveryResponse>::sendMessage(envoy::service::discovery::v3::DiscoveryRequest const&) ()
#13 0x0000558799364729 in Envoy::Config::GrpcMuxImpl::sendDiscoveryRequest(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#14 0x00005587993681b4 in Envoy::Config::GrpcMuxImpl::drainRequests() ()
#15 0x0000558799367b9f in Envoy::Config::GrpcMuxImpl::queueDiscoveryRequest(std::basic_string_view<char, std::char_traits<char> >) ()
#16 0x0000558799369c38 in Envoy::Config::GrpcMuxImpl::GrpcMuxWatchImpl::~GrpcMuxWatchImpl() ()
#17 0x0000558799369c84 in Envoy::Config::GrpcMuxImpl::GrpcMuxWatchImpl::~GrpcMuxWatchImpl() ()
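
For context, the __cxa_pure_virtual frame is the classic signature of a virtual call made while the object's dynamic type has already reverted to an abstract base, e.g. a send attempted on a stream that is mid-destruction. Below is a standalone illustration of how that abort arises; Stream, RealStream and flush() are made-up names, entirely unrelated to Envoy's real classes, used only to show the mechanism.

// Minimal repro of the "pure virtual method called" abort (hypothetical names).
#include <iostream>

struct Stream {
  virtual void sendMessage() = 0;
  virtual ~Stream() {
    // During ~Stream() the derived part is already gone, so this indirect
    // virtual dispatch resolves to the pure virtual and the runtime calls
    // __cxa_pure_virtual, which terminates the process (as in the trace above).
    flush();
  }
  void flush() { sendMessage(); }
};

struct RealStream : Stream {
  void sendMessage() override { std::cout << "sent\n"; }
};

int main() {
  RealStream s; // aborts with "pure virtual method called" when s is destroyed
}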

Additional Description:
Risk Level: LOW
Testing: existing UTs
Docs Changes:
Release Notes:
Platform Specific Features:

Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
@lambdai (Contributor) previously approved these changes Oct 11, 2021 and left a comment:


Thank you for addressing the missing pieces!

@lambdai (Contributor) commented Oct 11, 2021

Part of #15072

@davinci26 (Member) left a comment


Thanks for the contribution. Is it possible to add a test to prove the regression?

@davinci26 self-assigned this Oct 11, 2021
@davinci26 (Member)

Also, can we add a release note in current.rst?

@tehnerd (Contributor, Author) commented Oct 11, 2021

Thanks for the contribution. Is it possible to add a test to prove the regression?

It feels like a race condition that only occurs in a specific use case (hence no one saw it before): something like "the config should contain an xDS control plane, and a message to it should be sent while the validation server is about to shut down (so the message is in flight during the shutdown phase)". So I'm not really sure it is possible to write a test, because even right now it does not fail every time.

@davinci26 (Member) left a comment


I am ok with the change as it is. But I would defer to @ggreenway for final approval and merge

@ggreenway (Contributor)

I don't think the validation mode should be making any network connections at all. It seems like a bug to be using a GrpcMux. cc @htuch

@htuch (Member) commented Oct 14, 2021

Yes, agreed, this seems broken. @adisuissa could you take a look at this when you get a chance? I think this potentially affects some of our intended uses of config validation mode as well. CC @stevenzzzz

@adisuissa (Contributor) left a comment


LGTM
Looking into the mux creation.

@adisuissa removed the waiting label Oct 27, 2021
@adisuissa (Contributor)

@tehnerd thanks for reporting this.
I've attempted to repro when validating different configs but couldn't get the same error.
Can you share a config (anonymized) that caused this issue?

@tehnerd (Contributor, Author) commented Oct 27, 2021

The exact config is a binary protobuf, so it's hard to share as-is. These are the parts that reference gRPC-related features:

  1. access logging
dynamic_resources {
  cds_config {
    ads {
    }
    initial_fetch_timeout {
    }
    resource_api_version: V3
  }
  ads_config {
    api_type: GRPC
    grpc_services {
      envoy_grpc {
        cluster_name: "envoy_xds"
      }
    }
    transport_api_version: V3
  }
}
layered_runtime {
  layers {
    name: "admin"
    admin_layer {
    }
  }
  layers {
    name: "rtcfg"
    rtds_layer {
      name: "rtcfg://<path>"
      rtds_config {
        api_config_source {
          api_type: GRPC
          grpc_services {
            envoy_grpc {
              cluster_name: "envoy_xds"
            }
          }
          transport_api_version: V3
        }
        resource_api_version: V3
      }
    }
  }
}

@ggreenway (Contributor) left a comment


Under investigation; not mergeable yet

@adisuissa (Contributor)

I've tested the following config:

static_resources:
  listeners:
  - address:
      socket_address:
        protocol: tcp
        address: 0.0.0.0
        port_value: 9211
    traffic_direction: INBOUND
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          codec_type: AUTO
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: dummy_cluster
              domains:
              - "*"
              routes:
              - match:
                  prefix: "/"
                  headers:
                  - name: content-type
                    string_match:
                      exact: application/grpc
                route:
                  cluster: dummy_cluster
              - match:
                  prefix: "/"
                route:
                  cluster: dummy_cluster
          http_filters:
          - name: envoy.filters.http.health_check
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.health_check.v3.HealthCheck
              pass_through_mode: true
              headers:
                - name: ":path"
                  string_match:
                    exact: "/healthcheck"
              cache_time: 2.5s
          - name: envoy.filters.http.buffer
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.buffer.v3.Buffer
              max_request_bytes: 5242880
          - name: envoy.filters.http.router
          access_log:
          - name: envoy.access_loggers.file
            filter:
              not_health_check_filter:  {}
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
              path: "ingress_http.log"
              log_format: {text_format_source: {inline_string: "[%START_TIME%] \"%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%\" %RESPONSE_CODE% %RESPONSE_FLAGS% %BYTES_RECEIVED% %BYTES_SENT% %DURATION% %RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)% \"%REQ(X-FORWARDED-FOR)%\" \"%REQ(USER-AGENT)%\" \"%REQ(X-REQUEST-ID)%\" \"%REQ(:AUTHORITY)%\"\n"}}
          - name: envoy.access_loggers.http_grpc
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.access_loggers.grpc.v3.HttpGrpcAccessLogConfig
              common_config:
                log_name: egressenvoyaccesslog
                grpc_service:
                  envoy_grpc:
                    cluster_name: dummy_cluster
                transport_api_version: V3
  clusters:
    - name: dummy_cluster
      type: STATIC
      connect_timeout: 5s
      load_assignment:
        cluster_name: dummy_cluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: 127.0.0.1
                      port_value: 45101
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          explicit_http_config:
            http2_protocol_options:
              {}
    - name: ads_cluster
      type: STATIC
      connect_timeout: 5s
      transport_socket:
        name: envoy.transport_sockets.tls
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
          common_tls_context:
            validation_context:
              trusted_ca:
                filename: /usr/local/google/home/adip/git/envoy/test/config/integration/certs/upstreamcacert.pem
              match_subject_alt_names:
                - suffix: lyft.com
      load_assignment:
        cluster_name: dummy_cluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: 127.0.0.1
                      port_value: 46147
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          explicit_http_config:
            http2_protocol_options:
              {}
dynamic_resources:
  cds_config:
    ads:
      {}
    resource_api_version: V3
  ads_config:
    api_type: GRPC
    grpc_services:
      - envoy_grpc:
          cluster_name: ads_cluster
        timeout: 300s
    set_node_on_first_message_only: true
    transport_api_version: V3
layered_runtime:
  layers:
    - name: admin
      admin_layer: {}
    - name: rtcfg
      rtds_layer:
        name: rtcfg://my_path
        rtds_config:
          api_config_source:
            api_type: GRPC
            grpc_services:
              - envoy_grpc:
                  cluster_name: ads_cluster
            transport_api_version: V3
          resource_api_version: V3
admin:
  address:
    socket_address:
      address: 127.0.0.1
      port_value: 0

Added a breakpoint for GrpcMuxWatchImpl::~GrpcMuxWatchImpl, but it wasn't tripped.
@tehnerd can you test this config on your end?

@tehnerd (Contributor, Author) commented Oct 28, 2021

Not really. We do not have a way in our infra to use YAML configs. However, I can modify the binary any way you need to get stack traces or whatever else is needed with our configs.

github-actions bot commented Dec 5, 2021

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

github-actions bot added the stale label Dec 5, 2021
github-actions bot commented:

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

github-actions bot closed this Dec 12, 2021
Labels: stale, waiting:any
7 participants