Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test/healthcheck_test.go: verify existing RPCs can proceed #1

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

chaochn47
Copy link
Owner

@chaochn47 chaochn47 commented Aug 1, 2023

Motivation

The motivation of the PR is to investigate whether existing RPCs (streaming, unary RPCs) can still proceed when its associated subcon is already marked as TRANSIENT_FAILURE after client received the not serving status.

Mainly to address the feedback

My only concern is that based on that https://github.com/grpc/grpc-go/blob/d7f45cdf9ae720256d328fcdb6356ae58122a7f6/health/client.go#L104, it might impact the on-going requests(still in trasfering data) and force to recreate connection. (I haven't verified it yet, update it later)

etcd-io/etcd#16276 (comment)

Background

image

More in Deep Dive: gRPC - Yuxuan Li, Google and slide

Code analysis

When health server pushes the unhealthy status to client, setConnectivityState(connectivity.TransientFailure in the health client will be called as part of func (ac *addrConn) startHealthCheck in the background as long as the clientConn is not closed.

This connectivity state change will be then eventually propagated to func (b *baseBalancer) UpdateSubConnState and generate a new picker that excludes this SubConn.

So any new RPCs will not be sent to the SubConn while existing RPCs are not impacted.

Verification

To run the new test

cd test

GRPC_GO_LOG_VERBOSITY_LEVEL=2 go test -v -run "Test/HealthCheckWithAddrConnDrain"

=== RUN   Test
=== RUN   Test/HealthCheckWithAddrConnDrain
    tlogger.go:116: INFO server.go:629 [core] [Server #1] Server created  (t=+1.992198ms)
    tlogger.go:116: INFO clientconn.go:316 [core] [Channel #2] Channel created  (t=+2.084992ms)
    tlogger.go:116: INFO clientconn.go:1768 [core] [Channel #2] original dial target is: "whatever:///test.server"  (t=+2.115039ms)
    tlogger.go:116: INFO server.go:817 [core] [Server #1 ListenSocket #3] ListenSocket created  (t=+2.181237ms)
    tlogger.go:116: INFO clientconn.go:1775 [core] [Channel #2] parsed dial target is: {Scheme:whatever Authority: URL:{Scheme:whatever Opaque: User: Host: Path:/test.server RawPath: OmitHost:false ForceQuery:false RawQuery: Fragment: RawFragment:}}  (t=+2.213167ms)
    tlogger.go:116: INFO clientconn.go:1875 [core] [Channel #2] Channel authority set to "test.server"  (t=+2.236468ms)
    tlogger.go:116: INFO resolver_conn_wrapper.go:238 [core] [Channel #2] Resolver state updated: {
          "Addresses": [
            {
              "Addr": "127.0.0.1:52256",
              "ServerName": "",
              "Attributes": null,
              "BalancerAttributes": null,
              "Type": 0,
              "Metadata": null
            }
          ],
          "ServiceConfig": {
            "Config": {
              "Config": null,
              "LB": null,
              "Methods": {}
            },
            "Err": null
          },
          "Attributes": null
        } (service config updated; resolver returned new addresses)  (t=+2.484247ms)
    tlogger.go:116: INFO balancer_conn_wrappers.go:192 [core] [Channel #2] Channel switches to new LB policy "round_robin"  (t=+8.2543ms)
    tlogger.go:116: INFO balancer.go:98 [balancer] base.baseBalancer: got new ClientConn state:  {{[{Addr: "127.0.0.1:52256", ServerName: "", }] 0xc000094240 <nil>} <nil>}  (t=+8.324678ms)
    tlogger.go:116: INFO clientconn.go:897 [core] [Channel #2 SubChannel #4] Subchannel created  (t=+8.429799ms)
    tlogger.go:116: INFO roundrobin.go:50 [roundrobin] roundrobinPicker: Build called with info: {map[]}  (t=+8.48485ms)
    tlogger.go:116: INFO clientconn.go:564 [core] [Channel #2] Channel Connectivity change to CONNECTING  (t=+8.517962ms)
    tlogger.go:116: INFO clientconn.go:1283 [core] [Channel #2 SubChannel #4] Subchannel Connectivity change to CONNECTING  (t=+8.540769ms)
    tlogger.go:116: INFO clientconn.go:1396 [core] [Channel #2 SubChannel #4] Subchannel picks a new address "127.0.0.1:52256" to connect  (t=+8.600348ms)
    tlogger.go:116: INFO balancer.go:183 [balancer] base.baseBalancer: handle SubConn state change: 0xc000010408, CONNECTING  (t=+8.614538ms)
    tlogger.go:116: INFO syscall_nonlinux.go:39 [core] CPU time info is unavailable on non-linux environments.  (t=+8.931579ms)
    tlogger.go:116: INFO clientconn.go:1283 [core] [Channel #2 SubChannel #4] Subchannel Connectivity change to READY  (t=+9.584311ms)
    tlogger.go:116: INFO balancer.go:183 [balancer] base.baseBalancer: handle SubConn state change: 0xc000010408, READY  (t=+9.616043ms)
    tlogger.go:116: INFO roundrobin.go:50 [roundrobin] roundrobinPicker: Build called with info: {map[SubConn(id:4):{{Addr: "127.0.0.1:52256", ServerName: "", }}]}  (t=+9.639118ms)
    tlogger.go:116: INFO clientconn.go:564 [core] [Channel #2] Channel Connectivity change to READY  (t=+9.661999ms)
    tlogger.go:116: INFO clientconn.go:1285 [core] [Channel #2 SubChannel #4] Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection active but health check failed. status=NOT_SERVING  (t=+10.211725ms)
    tlogger.go:116: INFO balancer.go:183 [balancer] base.baseBalancer: handle SubConn state change: 0xc000010408, TRANSIENT_FAILURE  (t=+10.235345ms)
    tlogger.go:116: INFO clientconn.go:564 [core] [Channel #2] Channel Connectivity change to TRANSIENT_FAILURE  (t=+10.247573ms)
    tlogger.go:116: INFO resolver_conn_wrapper.go:238 [core] [Channel #2] Resolver state updated: {
          "Addresses": [
            {
              "Addr": "fake address",
              "ServerName": "",
              "Attributes": null,
              "BalancerAttributes": null,
              "Type": 0,
              "Metadata": null
            }
          ],
          "ServiceConfig": {
            "Config": {
              "Config": null,
              "LB": null,
              "Methods": {}
            },
            "Err": null
          },
          "Attributes": null
        } ()  (t=+2.086912116s)
    tlogger.go:116: INFO balancer.go:98 [balancer] base.baseBalancer: got new ClientConn state:  {{[{Addr: "fake address", ServerName: "", }] 0xc000094240 <nil>} <nil>}  (t=+2.086983587s)
    tlogger.go:116: INFO clientconn.go:897 [core] [Channel #2 SubChannel #7] Subchannel created  (t=+2.087061181s)
    tlogger.go:116: INFO clientconn.go:1283 [core] [Channel #2 SubChannel #4] Subchannel Connectivity change to SHUTDOWN  (t=+2.08709784s)
    tlogger.go:116: INFO http2_client.go:1016 [transport] [client-transport 0xc00016a000] GracefulClose called  (t=+2.087120556s)
    tlogger.go:116: INFO clientconn.go:1283 [core] [Channel #2 SubChannel #7] Subchannel Connectivity change to CONNECTING  (t=+2.087134428s)
    tlogger.go:116: INFO clientconn.go:1396 [core] [Channel #2 SubChannel #7] Subchannel picks a new address "fake address" to connect  (t=+2.087192422s)
    tlogger.go:116: INFO clientconn.go:1636 [core] [Channel #2 SubChannel #4] Subchannel deleted  (t=+2.087210272s)
    tlogger.go:116: INFO clientconn.go:1454 [core] Creating new client transport to "{Addr: \"fake address\", ServerName: \"test.server\", }": connection error: desc = "transport: Error while dialing: dial tcp: address fake address: missing port in address"  (t=+2.087236461s)
    tlogger.go:116: WARNING clientconn.go:1458 [core] [Channel #2 SubChannel #7] grpc: addrConn.createTransport failed to connect to {Addr: "fake address", ServerName: "test.server", }. Err: connection error: desc = "transport: Error while dialing: dial tcp: address fake address: missing port in address"  (t=+2.087260894s)
    tlogger.go:116: INFO clientconn.go:1285 [core] [Channel #2 SubChannel #7] Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp: address fake address: missing port in address"  (t=+2.087287678s)
    tlogger.go:116: INFO clientconn.go:564 [core] [Channel #2] Channel Connectivity change to SHUTDOWN  (t=+2.097923858s)
    tlogger.go:116: INFO balancer.go:183 [balancer] base.baseBalancer: handle SubConn state change: 0xc000010408, SHUTDOWN  (t=+2.097963049s)
    tlogger.go:116: INFO controlbuf.go:543 [transport] [client-transport 0xc00016a000] loopyWriter exiting with error: finished processing active streams while in draining mode  (t=+2.09798884s)
    tlogger.go:116: INFO balancer.go:183 [balancer] base.baseBalancer: handle SubConn state change: 0xc00061b110, CONNECTING  (t=+2.098022146s)
    tlogger.go:116: INFO balancer.go:183 [balancer] base.baseBalancer: handle SubConn state change: 0xc00061b110, TRANSIENT_FAILURE  (t=+2.098060178s)
    tlogger.go:116: INFO balancer_conn_wrappers.go:203 [core] [Channel #2] ccBalancerWrapper: closing  (t=+2.098080934s)
    tlogger.go:116: INFO http2_client.go:959 [transport] [client-transport 0xc00016a000] Closing: connection error: desc = "error reading from server: read tcp 127.0.0.1:52257->127.0.0.1:52256: use of closed network connection"  (t=+2.098143071s)
    tlogger.go:116: INFO resolver_conn_wrapper.go:126 [core] [Channel #2] Closing the name resolver  (t=+2.098204814s)
    tlogger.go:116: INFO clientconn.go:1283 [core] [Channel #2 SubChannel #7] Subchannel Connectivity change to SHUTDOWN  (t=+2.098222445s)
    tlogger.go:116: INFO clientconn.go:1636 [core] [Channel #2 SubChannel #7] Subchannel deleted  (t=+2.09826865s)
    tlogger.go:116: INFO clientconn.go:316 [core] [Channel #2] Channel deleted  (t=+2.098288661s)
    tlogger.go:116: INFO types.go:68 [channelz] attempt to delete child with id 4 from a parent (id=2) that doesn't currently exist  (t=+2.098304318s)
    tlogger.go:116: INFO types.go:68 [channelz] attempt to delete child with id 7 from a parent (id=2) that doesn't currently exist  (t=+2.09832953s)
    tlogger.go:116: INFO types.go:68 [channelz] attempt to delete child with id 4 from a parent (id=2) that doesn't currently exist  (t=+2.09834108s)
    tlogger.go:116: INFO http2_server.go:1232 [transport] [server-transport 0xc0003b4000] Closing: EOF  (t=+2.098357493s)
    tlogger.go:116: INFO types.go:68 [channelz] attempt to delete child with id 7 from a parent (id=2) that doesn't currently exist  (t=+2.098384334s)
    tlogger.go:116: INFO controlbuf.go:543 [transport] [server-transport 0xc0003b4000] loopyWriter exiting with error: transport closed by client  (t=+2.098400391s)
    tlogger.go:116: INFO server.go:768 [core] [Server #1 ListenSocket #3] ListenSocket deleted  (t=+2.098428225s)
--- PASS: Test (2.10s)
    --- PASS: Test/HealthCheckWithAddrConnDrain (2.10s)
PASS
ok  	google.golang.org/grpc/test	2.940s

Signed-off-by: Chao Chen <chaochn@amazon.com>
@fuweid
Copy link

fuweid commented Aug 2, 2023

@chaochn47 Thanks! let me try to verify it. I was testing it with containerd. The ongoing streaming request was impacted. Maybe there is only one available service. Will update. (The API is unary)

Looks good to me. The ongoing streaming request isn't impacted. When new request is filed, the client will receive the error last connection error: connection active but health check failed. status=NOT_SERVING: unavailable if there is only one connection.

@chaochn47
Copy link
Owner Author

@fuweid Thanks for the verification!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants