Transport is closing and transient failure errors since upgrading to 1.18 #2663

Closed
cristiangraz opened this issue Mar 1, 2019 · 18 comments

@cristiangraz

Please answer these questions before submitting your issue.

What version of gRPC are you using?

1.18.0

What version of Go are you using (go version)?

1.11.5

What operating system (Linux, Windows, …) and version?

Alpine Linux 3.9 running on Google Kubernetes Engine

What did you do?

Since upgrading, I've noticed a large increase in grpc errors leading to failed requests and service disruptions.

We don't use any connection settings on the clients, nor do we have any special settings on the server (keepalives, failfast, etc). The only thing we have on the connections is mutual TLS. It's roughly configured like this:

Client:

creds := grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{
	RootCAs:      certPool,
	Certificates: []tls.Certificate{cert},
	MinVersion:   tls.VersionTLS12,
}))
conn, err := grpc.Dial("foo.foo.svc.cluster.local:80", creds)
if err != nil {
	log.Fatalf("grpc.Dial failed: %v", err)
}

Server:

creds := grpc.Creds(credentials.NewTLS(&tls.Config{
    Certificates:             []tls.Certificate{cert},
    ClientCAs:                certPool,
    ClientAuth:               tls.RequireAndVerifyClientCert,
    MinVersion:               tls.VersionTLS12,
    NextProtos:               []string{"h2"},
    PreferServerCipherSuites: true,
    CurvePreferences: []tls.CurveID{
        tls.CurveP256,
        tls.X25519,
    },
    CipherSuites: []uint16{
        tls.TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,
        tls.TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,
        tls.TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,
        tls.TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,
        tls.TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,
        tls.TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,
    },
}))
grpc.NewServer(creds)
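
As a stopgap while debugging, something like the following client-side unary interceptor could retry a call once when it fails with codes.Unavailable (the code behind both "transport is closing" and "all SubConns are in TransientFailure"). This is only a sketch, not something we run today; the interceptor name and the one-retry policy are made up for illustration:

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// retryUnavailable is a hypothetical unary client interceptor: it retries a
// call exactly once if it fails with codes.Unavailable.
func retryUnavailable(ctx context.Context, method string, req, reply interface{},
	cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
	err := invoker(ctx, method, req, reply, cc, opts...)
	if status.Code(err) == codes.Unavailable {
		// A real policy would bound total attempts and back off between them.
		err = invoker(ctx, method, req, reply, cc, opts...)
	}
	return err
}

It would be wired in next to the credentials option: grpc.Dial("foo.foo.svc.cluster.local:80", creds, grpc.WithUnaryInterceptor(retryUnavailable)).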

This may be related to one of these other two issues
#2653
#2636

Based on the feedback in one of the above linked issues, I set these ENV vars on one service that had a lot of failures: GRPC_GO_LOG_VERBOSITY_LEVEL=99 GRPC_GO_LOG_SEVERITY_LEVEL=info

I've appended (__GRPC CALL HERE__) to the log lines below where the error surfaced as a failed grpc call made by our service.

WARNING: 2019/03/01 04:41:42 grpc: addrConn.createTransport failed to connect to {foo.foo.svc.cluster.local:80 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: context deadline exceeded". Reconnecting...
INFO: 2019/03/01 04:41:42 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 04:41:42 pickfirstBalancer: HandleSubConnStateChange: 0xc000806290, TRANSIENT_FAILURE
timestamp=2019-03-01T04:41:42.887562062Z level=error error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: authentication handshake failed: context deadline exceeded\"" (__GRPC CALL HERE__)
INFO: 2019/03/01 04:41:43 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 04:41:43 pickfirstBalancer: HandleSubConnStateChange: 0xc000806290, CONNECTING
INFO: 2019/03/01 04:41:44 Subchannel Connectivity change to READY
INFO: 2019/03/01 04:41:44 pickfirstBalancer: HandleSubConnStateChange: 0xc000806290, READY
INFO: 2019/03/01 04:45:22 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 04:45:22 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 04:45:22 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 04:45:22 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 04:45:23 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 04:45:23 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 04:45:28 Subchannel Connectivity change to READY
INFO: 2019/03/01 04:45:28 Subchannel Connectivity change to READY
INFO: 2019/03/01 05:34:23 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 05:34:23 Subchannel Connectivity change to READY
INFO: 2019/03/01 05:35:25 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 05:35:25 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 05:35:27 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 05:35:32 Subchannel Connectivity change to READY
INFO: 2019/03/01 05:35:59 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 05:35:59 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 05:36:01 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 05:36:01 Subchannel Connectivity change to READY
INFO: 2019/03/01 05:38:19 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 05:38:19 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 05:38:19 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 05:38:19 pickfirstBalancer: HandleSubConnStateChange: 0xc0008293d0, TRANSIENT_FAILURE
INFO: 2019/03/01 05:38:19 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
timestamp=2019-03-01T05:38:19.972966567Z level=error error="rpc error: code = Unavailable desc = transport is closing" (__GRPC CALL HERE__)
INFO: 2019/03/01 05:38:19 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 05:38:19 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 05:38:19 pickfirstBalancer: HandleSubConnStateChange: 0xc000806290, TRANSIENT_FAILURE
INFO: 2019/03/01 05:38:21 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 05:38:21 pickfirstBalancer: HandleSubConnStateChange: 0xc0008293d0, CONNECTING
INFO: 2019/03/01 05:38:21 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 05:38:21 pickfirstBalancer: HandleSubConnStateChange: 0xc000806290, CONNECTING
INFO: 2019/03/01 05:38:21 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 05:38:21 Subchannel Connectivity change to READY
INFO: 2019/03/01 05:38:21 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 16:13:05 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 16:13:05 pickfirstBalancer: HandleSubConnStateChange: 0xc0008293d0, TRANSIENT_FAILURE
INFO: 2019/03/01 16:13:05 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
timestamp=2019-03-01T16:13:05.92908361Z level=error error="rpc error: code = Unavailable desc = transport is closing" (__GRPC CALL HERE__)
INFO: 2019/03/01 16:13:07 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 16:13:07 pickfirstBalancer: HandleSubConnStateChange: 0xc0008293d0, CONNECTING
INFO: 2019/03/01 16:13:07 Subchannel Connectivity change to READY
INFO: 2019/03/01 16:13:07 pickfirstBalancer: HandleSubConnStateChange: 0xc0008293d0, READY
INFO: 2019/03/01 16:20:17 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 16:20:17 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 16:20:18 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 16:20:23 Subchannel Connectivity change to READY
INFO: 2019/03/01 16:42:24 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 16:42:24 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 16:42:25 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 16:42:25 Subchannel Connectivity change to READY
INFO: 2019/03/01 17:20:47 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 17:20:47 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 17:20:49 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 17:20:54 Subchannel Connectivity change to READY
INFO: 2019/03/01 17:43:01 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 17:43:01 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 17:43:02 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 17:43:07 Subchannel Connectivity change to READY
@menghanl
Contributor

menghanl commented Mar 1, 2019

Can you also get server-side logs to see if they contain more information?

The handshake deadline exceeded error looks very suspicious. It would help to know why it happened.

@menghanl
Contributor

menghanl commented Mar 1, 2019

One more thing to try: can you set GRPC_GO_REQUIRE_HANDSHAKE=off on the client side and see if that changes anything? (#2406)

@kazegusuri
Contributor

@menghanl I have experienced the same issue. Here is a reproduction procedure.

Example program

client

package main

import (
	"context"
	"log"
	"time"

	"github.com/golang/glog"
	"github.com/kazegusuri/grpc-test/pb"
	"google.golang.org/grpc"
	_ "google.golang.org/grpc/grpclog/glogger"
)

func main() {
	ctx := context.Background()
	conn1, err := grpc.DialContext(ctx, "127.0.0.1:9000", grpc.WithInsecure())
	if err != nil {
		log.Fatalf("grpc.DialContext failed: %v", err)
	}
	defer func() { _ = conn1.Close() }()

	cli1 := pb.NewEchoClient(conn1)
	for {
		time.Sleep(1 * time.Millisecond)
		_, err := cli1.Echo(ctx, &pb.EchoMessage{Message: "hello"})
		if err != nil {
			glog.Infof("goroutine: echo failed: %v", err)
		} else {
			glog.Infof("goroutine: echo succeeded")
		}
	}
}

server

package main

import (
	"log"
	"net"
	"time"

	"github.com/kazegusuri/grpc-test/pb"
	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	server := newServer() // Echo service implementation from the example repo
	s := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge: 10 * time.Second, // this kills the connection 10s after it is established
	}))
	pb.RegisterEchoServer(s, server)

	lis, err := net.Listen("tcp", ":9000")
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}

	if err := s.Serve(lis); err != nil {
		log.Fatalf("err %v\n", err)
	}
}

Results

With the above programs, I ran the server on v1.19.0 and the client on v1.17, v1.18, and v1.19.
The server kills the connection from the client 10 seconds after it connects.
The client should reconnect to the server, but its behavior differs between client versions.
With v1.18 and v1.19 the client takes over 1 second to reconnect to the server.

v1.17.0 (default)

I0303 02:22:59.745529    6773 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00001e030, TRANSIENT_FAILURE
I0303 02:22:59.745566    6773 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00001e030, CONNECTING
I0303 02:22:59.745732    6773 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00001e030, READY

v1.18.0 (default)

I0303 02:20:58.820793    5169 clientconn.go:941] Subchannel Connectivity change to TRANSIENT_FAILURE
I0303 02:20:58.820892    5169 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00016e010, TRANSIENT_FAILURE
I0303 02:20:59.821028    5169 clientconn.go:941] Subchannel Connectivity change to CONNECTING
I0303 02:20:59.821189    5169 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00016e010, CONNECTING
I0303 02:20:59.822058    5169 clientconn.go:941] Subchannel Connectivity change to READY
I0303 02:20:59.822155    5169 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00016e010, READY

v1.18.0 (GRPC_GO_REQUIRE_HANDSHAKE=off)

I0303 02:25:59.393033    8568 clientconn.go:941] Subchannel Connectivity change to TRANSIENT_FAILURE
I0303 02:25:59.393096    8568 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00017c010, TRANSIENT_FAILURE
I0303 02:26:00.393385    8568 clientconn.go:941] Subchannel Connectivity change to CONNECTING
I0303 02:26:00.393510    8568 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00017c010, CONNECTING
I0303 02:26:00.393937    8568 clientconn.go:941] Subchannel Connectivity change to READY
I0303 02:26:00.393979    8568 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00017c010, READY


v1.19.0 (default)

I0303 02:28:57.655932   11292 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc000160010, TRANSIENT_FAILURE
I0303 02:28:58.656125   11292 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc000160010, CONNECTING
I0303 02:28:58.656489   11292 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc000160010, READY

v1.19.0 (GRPC_GO_REQUIRE_HANDSHAKE=off)

I0303 02:30:30.642447   11438 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc000162010, TRANSIENT_FAILURE
I0303 02:30:31.642680   11438 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc000162010, CONNECTING
I0303 02:30:31.643004   11438 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc000162010, READY

@kazegusuri
Contributor

kazegusuri commented Mar 2, 2019

This causes a performance regression for clients of servers that expose a single subconn, like Google Spanner.
Spanner seems to kill connections on the server side after an hour, and the connection then goes into transient failure. For the second it spends in transient failure before becoming ready again, the connection stays unavailable.

@kazegusuri
Contributor

If it's okay to try to create the transport immediately after a transient failure on the first retry, how about adding code like this here?

if ac.backoffIdx == 0 {
	backoffFor = 0
}

@vvakame

vvakame commented Mar 6, 2019

I have a question.
I found baseDelay = 1.0 * time.Second here.
I think this value is a bit large. Why do we use 1 second by default?
One second is too long from a user-facing frontend's perspective.
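
For reference (if I understand correctly), later grpc-go releases expose these knobs per connection through grpc.WithConnectParams and the google.golang.org/grpc/backoff package; the versions discussed in this issue predate that option. A rough sketch of lowering the base delay on the dial side, with illustrative values:

package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/backoff"
)

func main() {
	// Sketch only: assumes a grpc-go version that ships grpc.WithConnectParams
	// and the backoff package (newer than the releases discussed here).
	conn, err := grpc.Dial("127.0.0.1:9000",
		grpc.WithInsecure(),
		grpc.WithConnectParams(grpc.ConnectParams{
			Backoff: backoff.Config{
				BaseDelay:  100 * time.Millisecond, // instead of the 1s default
				Multiplier: 1.6,
				Jitter:     0.2,
				MaxDelay:   10 * time.Second,
			},
			MinConnectTimeout: 5 * time.Second,
		}),
	)
	if err != nil {
		log.Fatalf("grpc.Dial failed: %v", err)
	}
	defer conn.Close()
}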

@mpipet

mpipet commented Mar 6, 2019

I encountered the same issue using the Google Cloud Bigtable Go client (gRPC based). The Bigtable gRPC server seems to have a max connection age of one hour, which means that every hour I literally stop handling my traffic for one second.

I agree with @vvakame: this default value is too high for a lot of use cases and should be easily configurable in the other Google clients based on this repo.

@menghanl
Contributor

menghanl commented Mar 6, 2019

There's a bug in the reconnect logic: the backoff time is not reset after a connection is successfully created.

The fix is in #2669. Can you guys try it and see if it fixes the problems you are seeing? Thanks!
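
To illustrate the bug in general terms, here is a simplified sketch (not the actual grpc-go code): a reconnect loop needs to reset its backoff state once a connection has been established, so that when a long-lived connection later dies the first reconnect attempt happens immediately instead of waiting out a leftover backoff delay. The dial and serve parameters below are hypothetical placeholders.

import (
	"net"
	"time"
)

// reconnectLoop is a simplified sketch of the pattern the fix restores; the
// important line is the "retries = 0" reset after a successful connect.
func reconnectLoop(dial func() (net.Conn, error), serve func(net.Conn)) {
	const baseDelay = time.Second
	retries := 0
	for {
		if retries > 0 {
			// Exponential backoff between consecutive failed attempts.
			delay := baseDelay * time.Duration(1<<uint(retries-1))
			if delay > 30*time.Second {
				delay = 30 * time.Second
			}
			time.Sleep(delay)
		}
		conn, err := dial()
		if err != nil {
			retries++
			continue
		}
		retries = 0 // the missing reset: the next reconnect starts with zero backoff
		serve(conn) // blocks until the connection dies, then we loop and reconnect
	}
}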

@dfawley
Member

dfawley commented Mar 14, 2019

Closing. Please comment if you are still experiencing problems.

@mikesun

mikesun commented Jun 14, 2019

Does the fix in #2669 still require GRPC_GO_REQUIRE_HANDSHAKE=off to be set?

@dfawley
Member

dfawley commented Jun 14, 2019

I hope not, since I intend to remove that option soon (this or next week). Please let me know if you are experiencing any problems with it unset.

@mikesun

mikesun commented Jun 14, 2019

@dfawley

It seems to still be an issue for us. I reported it to Google Cloud support, which pointed me to this thread and suggested setting GRPC_GO_REQUIRE_HANDSHAKE=off as a workaround. It appears that is no longer the case.

We're running a Google App Engine flexible app on Go 1.11 with grpc-go v1.20.0.

With Spanner, we see context timeouts:

spanner: code = "Canceled", desc = "context canceled, lastErr is <<nil>>"

With Cloud Tasks, we see these RPC errors:

all SubConns are in TransientFailure, latest connection error:

These errors are not occurring consistently, only intermittently and in bursts.

@dfawley
Member

dfawley commented Jun 14, 2019

These errors are not occurring consistently, only intermittently and in bursts.

How long do these bursts last and how often do they occur?

Does setting GRPC_GO_REQUIRE_HANDSHAKE=off help for you?

@mikesun

mikesun commented Jun 14, 2019

These errors are not occurring consistently, only intermittently and in bursts.

How long do these bursts last and how often do they occur?

Currently, we do not have a production environment with any meaningful load, but in a trial production run that we ran for 3 days, we had one burst of errors for ~20 minutes. Otherwise, we see it less than once a day, for a minute.

@mikesun

mikesun commented Jun 14, 2019

I turned on gRPC logging on one of our deployments, and it appears I get the TRANSIENT_FAILURE errors continuously regardless of whether GRPC_GO_REQUIRE_HANDSHAKE=off is set.

grpc v1.21.1 and GRPC_GO_REQUIRE_HANDSHAKE=off

2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc000239980, TRANSIENT_FAILURE
2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc000239980, CONNECTING
2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc000239980, READY
2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc000239770, TRANSIENT_FAILURE
2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc000239770, CONNECTING
2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc000239770, READY
2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, TRANSIENT_FAILURE
2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, CONNECTING
2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, READY
2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, TRANSIENT_FAILURE
2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, CONNECTING
2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, READY

The reconnects seem to happen immediately, and I haven't experienced any RPC failures with all SubConns are in TransientFailure, so I guess this is acceptable behavior?

We do configure http.DefaultTransport.(*http.Transport).MaxIdleConnsPerHost = 500; would that be a problem?

@dfawley
Member

dfawley commented Jun 14, 2019

@mikesun

I guess this is acceptable behavior?

It's not normal for your connection to be dying so frequently (several per second). Is this happening continuously or are these entries the only ones? Unfortunately, there isn't enough logging in place to see why the connection is being closed. But it does go into READY, which means at least TLS was successful.

We do configure http.DefaultTransport.(*http.Transport).MaxIdleConnsPerHost = 500; would that be a problem?

Go's net/http transport settings should not affect grpc-go - we have our own transport implementation.

@mikesun

mikesun commented Jun 17, 2019

It's not normal for your connection to be dying so frequently (several per second). Is this happening continuously or are these entries the only ones? Unfortunately, there isn't enough logging in place to see why the connection is being closed. But it does go into READY, which means at least TLS was successful.

It seems to happen in bunches (~3/sec) every few minutes (1-4 min). This is on a GAE flexible instance with basically no load:

A  INFO: 2019/06/17 16:10:15 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A  INFO: 2019/06/17 16:10:15 pickfirstBalancer: HandleSubConnStateChange: 0xc000239900, TRANSIENT_FAILURE
A  INFO: 2019/06/17 16:10:15 pickfirstBalancer: HandleSubConnStateChange: 0xc000239900, CONNECTING
A  INFO: 2019/06/17 16:10:15 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A  INFO: 2019/06/17 16:10:15 pickfirstBalancer: HandleSubConnStateChange: 0xc000239870, TRANSIENT_FAILURE
A  INFO: 2019/06/17 16:10:15 pickfirstBalancer: HandleSubConnStateChange: 0xc000239870, CONNECTING
A  INFO: 2019/06/17 16:10:15 pickfirstBalancer: HandleSubConnStateChange: 0xc000239900, READY
A  INFO: 2019/06/17 16:10:15 pickfirstBalancer: HandleSubConnStateChange: 0xc000239870, READY
A  INFO: 2019/06/17 16:10:37 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A  INFO: 2019/06/17 16:10:37 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, TRANSIENT_FAILURE
A  INFO: 2019/06/17 16:10:37 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, CONNECTING
A  INFO: 2019/06/17 16:10:38 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, READY
A  INFO: 2019/06/17 16:10:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A  INFO: 2019/06/17 16:10:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, TRANSIENT_FAILURE
A  INFO: 2019/06/17 16:10:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, CONNECTING
A  INFO: 2019/06/17 16:10:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, READY
A  INFO: 2019/06/17 16:10:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A  INFO: 2019/06/17 16:10:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, TRANSIENT_FAILURE
A  INFO: 2019/06/17 16:10:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, CONNECTING
A  INFO: 2019/06/17 16:10:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, READY
A  INFO: 2019/06/17 16:14:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A  INFO: 2019/06/17 16:14:38 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, TRANSIENT_FAILURE
A  INFO: 2019/06/17 16:14:38 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, CONNECTING
A  INFO: 2019/06/17 16:14:38 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, READY
A  INFO: 2019/06/17 16:14:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A  INFO: 2019/06/17 16:14:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, TRANSIENT_FAILURE
A  INFO: 2019/06/17 16:14:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, CONNECTING
A  INFO: 2019/06/17 16:14:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, READY
A  INFO: 2019/06/17 16:14:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A  INFO: 2019/06/17 16:14:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, TRANSIENT_FAILURE
A  INFO: 2019/06/17 16:14:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, CONNECTING
A  INFO: 2019/06/17 16:14:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, READY
A  INFO: 2019/06/17 16:18:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A  INFO: 2019/06/17 16:18:38 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, TRANSIENT_FAILURE
A  INFO: 2019/06/17 16:18:38 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, CONNECTING
A  INFO: 2019/06/17 16:18:38 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, READY
A  INFO: 2019/06/17 16:18:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A  INFO: 2019/06/17 16:18:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, TRANSIENT_FAILURE
A  INFO: 2019/06/17 16:18:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, CONNECTING
A  INFO: 2019/06/17 16:18:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, READY
A  INFO: 2019/06/17 16:18:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A  INFO: 2019/06/17 16:18:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, TRANSIENT_FAILURE
A  INFO: 2019/06/17 16:18:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, CONNECTING
A  INFO: 2019/06/17 16:18:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, READY
A  INFO: 2019/06/17 16:22:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A  INFO: 2019/06/17 16:22:38 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, TRANSIENT_FAILURE
A  INFO: 2019/06/17 16:22:38 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, CONNECTING
A  INFO: 2019/06/17 16:22:38 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, READY
A  INFO: 2019/06/17 16:22:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A  INFO: 2019/06/17 16:22:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, TRANSIENT_FAILURE
A  INFO: 2019/06/17 16:22:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, CONNECTING
A  INFO: 2019/06/17 16:22:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, READY
A  INFO: 2019/06/17 16:22:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A  INFO: 2019/06/17 16:22:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, TRANSIENT_FAILURE
A  INFO: 2019/06/17 16:22:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, CONNECTING
A  INFO: 2019/06/17 16:22:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, READY

@dfawley
Member

dfawley commented Jun 20, 2019

@mikesun - one last thing I can think of: make sure you aren't running a grpc-go version between #2740 (82fdf27 - 21d ago) and #2862 (a1d4c28 - 10d ago). We had a bug in that range that could potentially cause problems like this.

If it's not that, then most likely the problem is not on the client side. Logs from the server/proxy side would be needed for further debugging. Also, since your problems are not related to this issue, could you file a new issue if you suspect a problem with grpc-go?

ajwerner added a commit to ajwerner/cockroach that referenced this issue Jul 25, 2019
This PR upgrades gRPC from 1.13.0 to 1.21.2. The primary motivation for this
upgrade is to eliminate the disconnections caused by
grpc/grpc-go#1882. These failures manifest themselves
as the following set of errors:

```
ajwerner-test-0001> I190722 22:15:01.203008 12054 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n1] circuitbreaker: rpc [::]:26257 [n2] tripped: failed to check for ready connection to n2 at ajwerner-test-0002:26257: connection not ready: TRANSIENT_FAILURE
```
These then led to tripped breakers and general badness. I suspect that there
are several other good bug fixes in here, including some purported leaks and
correctness fixes on shutdown.

I have verified that with this upgrade I no longer see connections break in
overload scenarios which reliably reproduced the situation in the above log.

This commit removes one condition from grpcutil.IsClosedConnection which should
be subsumed by the status check above. The `transport` subpackage has not been
around for many releases.

This does not upgrade to the current release 1.22.0 because the maintainer
mentions that it contains a bug
(grpc/grpc-go#2663 (comment)).

This change also unfortunately updates the keepalive behavior to be more spec
compliant (grpc/grpc-go#2642). This change mandates
a minimum ping time of 10s to the client. Given grpc/grpc-go#2638
this means that the rpc test for keepalives now takes over 20s.

I would be okay skipping it but leave that discussion for review.

Also updated the acceptance test to look out for an HTTP/2.0 header because
grpc now does not send RPCs until after the HTTP handshake has completed
(see grpc/grpc-go#2406).

Release note (bug fix): Upgrade grpc library to fix connection state management
bug.
craig bot pushed a commit to cockroachdb/cockroach that referenced this issue Jul 30, 2019
39041: vendor: upgrade grpc from 1.13.0 to 1.21.2 r=ajwerner a=ajwerner

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
lock bot locked as resolved and limited conversation to collaborators Dec 17, 2019