[config_validation]: fixing crash in grpc client during shutdown #18532

Closed
tehnerd wants to merge 3 commits

Conversation

@tehnerd (Contributor) commented Oct 8, 2021

Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>

Commit Message:
This is basically a copy-paste of the logic from the main server.cc added in #17403. We found that the validation server can also crash during shutdown when it tries to drain the GrpcMux queues. This diff prevents it from doing so during the shutdown phase.
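
For illustration only, a minimal self-contained sketch of the "skip the drain once shutdown has started" guard described above. The class and member names are hypothetical, not Envoy's real classes and not the actual diff; it only shows the idea of refusing to push requests onto the gRPC stream once teardown has begun.

// Hypothetical sketch of the shutdown guard; names are illustrative only.
#include <string>
#include <vector>

class GrpcMuxSketch {
public:
  // The server flips this flag before tearing down the xDS machinery.
  void shutdown() { shutdown_ = true; }

  // Watch destructors end up here; once shutdown has started we must not
  // touch the gRPC stream, which may already be partially destroyed.
  void queueDiscoveryRequest(const std::string& type_url) {
    if (shutdown_) {
      return; // Drop the request instead of draining onto a dying stream.
    }
    request_queue_.push_back(type_url);
    drainRequests();
  }

private:
  void drainRequests() {
    // Would send every queued request over the async gRPC stream.
    request_queue_.clear();
  }

  bool shutdown_{false};
  std::vector<std::string> request_queue_;
};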

Example of such a crash:

(gdb) bt
#0  0x00007f9b5349297b in raise () from /usr/drte/v3/lib64/libpthread.so.0
#1  0x0000558799b71002 in Envoy::SignalAction::sigHandler(int, siginfo_t*, void*) ()
#2  <signal handler called>
#3  0x00007f9b52fbc6fb in raise () from /usr/drte/v3/lib64/libc.so.6
#4  0x00007f9b52f9e504 in abort () from /usr/drte/v3/lib64/libc.so.6
#5  0x0000558799ba8045 in Envoy::TerminateHandler::logOnTerminate() const::{lambda()#1}::operator()() const ()
#6  0x0000558799ba807e in Envoy::TerminateHandler::logOnTerminate() const::{lambda()#1}::_FUN() ()
#7  0x00007f9b531f2886 in __cxxabiv1::__terminate(void (*)()) () from /usr/drte/v3/lib64/libstdc++.so.6
#8  0x00007f9b531f28c1 in std::terminate() () from /usr/drte/v3/lib64/libstdc++.so.6
#9  0x00007f9b531f35df in __cxa_pure_virtual () from /usr/drte/v3/lib64/libstdc++.so.6
#10 0x00005587993a849c in Envoy::Grpc::Internal::sendMessageUntyped(Envoy::Grpc::RawAsyncStream*, google::protobuf::Message const&, bool) ()
#11 0x000055879936de6b in Envoy::Grpc::AsyncStream<envoy::service::discovery::v3::DiscoveryRequest>::sendMessage(google::protobuf::Message const&, bool) ()
#12 0x000055879936bb7b in Envoy::Config::GrpcStream<envoy::service::discovery::v3::DiscoveryRequest, envoy::service::discovery::v3::DiscoveryResponse>::sendMessage(envoy::service::discovery::v3::DiscoveryRequest const&) ()
#13 0x0000558799364729 in Envoy::Config::GrpcMuxImpl::sendDiscoveryRequest(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#14 0x00005587993681b4 in Envoy::Config::GrpcMuxImpl::drainRequests() ()
#15 0x0000558799367b9f in Envoy::Config::GrpcMuxImpl::queueDiscoveryRequest(std::basic_string_view<char, std::char_traits<char> >) ()
#16 0x0000558799369c38 in Envoy::Config::GrpcMuxImpl::GrpcMuxWatchImpl::~GrpcMuxWatchImpl() ()
#17 0x0000558799369c84 in Envoy::Config::GrpcMuxImpl::GrpcMuxWatchImpl::~GrpcMuxWatchImpl() ()
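
For context, the __cxa_pure_virtual frame is the classic signature of a virtual call made while the object's dynamic type has already reverted to an abstract base, e.g. a send attempted on a stream that is mid-destruction. Below is a standalone illustration of how that abort arises; Stream, RealStream and flush() are made-up names, entirely unrelated to Envoy's real classes, used only to show the mechanism.

// Minimal repro of the "pure virtual method called" abort (hypothetical names).
#include <iostream>

struct Stream {
  virtual void sendMessage() = 0;
  virtual ~Stream() {
    // During ~Stream() the derived part is already gone, so this indirect
    // virtual dispatch resolves to the pure virtual and the runtime calls
    // __cxa_pure_virtual, which terminates the process (as in the trace above).
    flush();
  }
  void flush() { sendMessage(); }
};

struct RealStream : Stream {
  void sendMessage() override { std::cout << "sent\n"; }
};

int main() {
  RealStream s; // aborts with "pure virtual method called" when s is destroyed
}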

Additional Description:
Risk Level: LOW
Testing: existing UTs
Docs Changes:
Release Notes:
Platform Specific Features:

Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
@lambdai (Contributor) previously approved these changes Oct 11, 2021 and left a comment:


Thank you for addressing the missing pieces!

@lambdai (Contributor) commented Oct 11, 2021

Part of #15072

@davinci26 (Member) left a comment


Thanks for the contribution. Is it possible to add a test to prove the regression?

@davinci26 self-assigned this Oct 11, 2021
@davinci26 (Member)

Also, can we add a release note in current.rst?

@tehnerd (Contributor, Author) commented Oct 11, 2021

Thanks for the contribution. Is it possible to add a test to prove the regression?

It feels like a race condition that only occurs in a specific use case (hence no one saw it before): something like "the config should contain an xDS control plane, and a message to it should be sent while the validation server is about to shut down (so the message is in flight during the shutdown phase)". So I'm not really sure it is possible to write a test, because even right now it does not fail every time.

@davinci26 (Member) left a comment


I am ok with the change as it is. But I would defer to @ggreenway for final approval and merge

@ggreenway (Contributor)

I don't think the validation mode should be making any network connections at all. It seems like a bug to be using a GrpcMux. cc @htuch

@htuch (Member) commented Oct 14, 2021

Yes, agreed, this seems broken. @adisuissa could you take a look at this when you get a chance? I think this potentially affects some of our intended uses of config validation mode as well. CC @stevenzzzz

@adisuissa (Contributor) left a comment


LGTM
Looking into the mux creation.

@adisuissa removed the waiting label Oct 27, 2021
@adisuissa (Contributor)

@tehnerd thanks for reporting this.
I've attempted to repro when validating different configs but couldn't get the same error.
Can you share a config (anonymized) that caused this issue?

@tehnerd (Contributor, Author) commented Oct 27, 2021

The exact config is a binary protobuf, so it's hard to share as-is. These are the parts that reference gRPC-related features:

  1. access logging
dynamic_resources {
  cds_config {
    ads {
    }
    initial_fetch_timeout {
    }
    resource_api_version: V3
  }
  ads_config {
    api_type: GRPC
    grpc_services {
      envoy_grpc {
        cluster_name: "envoy_xds"
      }
    }
    transport_api_version: V3
  }
}
layered_runtime {
  layers {
    name: "admin"
    admin_layer {
    }
  }
  layers {
    name: "rtcfg"
    rtds_layer {
      name: "rtcfg://<path>"
      rtds_config {
        api_config_source {
          api_type: GRPC
          grpc_services {
            envoy_grpc {
              cluster_name: "envoy_xds"
            }
          }
          transport_api_version: V3
        }
        resource_api_version: V3
      }
    }
  }
}

@ggreenway (Contributor) left a comment


Under investigation; not mergeable yet

@adisuissa (Contributor)

I've tested the following config:

static_resources:
  listeners:
  - address:
      socket_address:
        protocol: tcp
        address: 0.0.0.0
        port_value: 9211
    traffic_direction: INBOUND
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          codec_type: AUTO
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: dummy_cluster
              domains:
              - "*"
              routes:
              - match:
                  prefix: "/"
                  headers:
                  - name: content-type
                    string_match:
                      exact: application/grpc
                route:
                  cluster: dummy_cluster
              - match:
                  prefix: "/"
                route:
                  cluster: dummy_cluster
          http_filters:
          - name: envoy.filters.http.health_check
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.health_check.v3.HealthCheck
              pass_through_mode: true
              headers:
                - name: ":path"
                  string_match:
                    exact: "/healthcheck"
              cache_time: 2.5s
          - name: envoy.filters.http.buffer
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.buffer.v3.Buffer
              max_request_bytes: 5242880
          - name: envoy.filters.http.router
          access_log:
          - name: envoy.access_loggers.file
            filter:
              not_health_check_filter:  {}
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
              path: "ingress_http.log"
              log_format: {text_format_source: {inline_string: "[%START_TIME%] \"%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%\" %RESPONSE_CODE% %RESPONSE_FLAGS% %BYTES_RECEIVED% %BYTES_SENT% %DURATION% %RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)% \"%REQ(X-FORWARDED-FOR)%\" \"%REQ(USER-AGENT)%\" \"%REQ(X-REQUEST-ID)%\" \"%REQ(:AUTHORITY)%\"\n"}}
          - name: envoy.access_loggers.http_grpc
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.access_loggers.grpc.v3.HttpGrpcAccessLogConfig
              common_config:
                log_name: egressenvoyaccesslog
                grpc_service:
                  envoy_grpc:
                    cluster_name: dummy_cluster
                transport_api_version: V3
  clusters:
    - name: dummy_cluster
      type: STATIC
      connect_timeout: 5s
      load_assignment:
        cluster_name: dummy_cluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: 127.0.0.1
                      port_value: 45101
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          explicit_http_config:
            http2_protocol_options:
              {}
    - name: ads_cluster
      type: STATIC
      connect_timeout: 5s
      transport_socket:
        name: envoy.transport_sockets.tls
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
          common_tls_context:
            validation_context:
              trusted_ca:
                filename: /usr/local/google/home/adip/git/envoy/test/config/integration/certs/upstreamcacert.pem
              match_subject_alt_names:
                - suffix: lyft.com
      load_assignment:
        cluster_name: dummy_cluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: 127.0.0.1
                      port_value: 46147
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          explicit_http_config:
            http2_protocol_options:
              {}
dynamic_resources:
  cds_config:
    ads:
      {}
    resource_api_version: V3
  ads_config:
    api_type: GRPC
    grpc_services:
      - envoy_grpc:
          cluster_name: ads_cluster
        timeout: 300s
    set_node_on_first_message_only: true
    transport_api_version: V3
layered_runtime:
  layers:
    - name: admin
      admin_layer: {}
    - name: rtcfg
      rtds_layer:
        name: rtcfg://my_path
        rtds_config:
          api_config_source:
            api_type: GRPC
            grpc_services:
              - envoy_grpc:
                  cluster_name: ads_cluster
            transport_api_version: V3
          resource_api_version: V3
admin:
  address:
    socket_address:
      address: 127.0.0.1
      port_value: 0

Added a breakpoint for GrpcMuxWatchImpl::~GrpcMuxWatchImpl, but it wasn't tripped.
@tehnerd can you test this config on your end?

@tehnerd (Contributor, Author) commented Oct 28, 2021

Not really. We do not have a way in our infra to use YAML configs. However, I can modify the binary any way you need to get stack traces or whatever else is needed with our configs.

github-actions bot commented Dec 5, 2021

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

github-actions bot added the stale label Dec 5, 2021
github-actions bot commented:

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

github-actions bot closed this Dec 12, 2021
Labels: stale, waiting:any
7 participants