Transport is closing and transient failure errors since upgrading to 1.18 #2663

Closed
cristiangraz opened this issue Mar 1, 2019 · 18 comments

@cristiangraz

Please answer these questions before submitting your issue.

What version of gRPC are you using?

1.18.0

What version of Go are you using (go version)?

1.11.5

What operating system (Linux, Windows, …) and version?

Alpine Linux 3.9 running on Google Kubernetes Engine

What did you do?

Since upgrading, I've noticed a large increase in grpc errors leading to failed requests and service disruptions.

We don't use any connection settings on the clients, nor do we have any special settings on the server (keepalives, failfast, etc). The only thing we have on the connections is mutual TLS. It's roughly configured like this:

Client:

creds := grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{
	RootCAs:      certPool,
	Certificates: []tls.Certificate{cert},
	MinVersion:   tls.VersionTLS12,
}))
conn, err := grpc.Dial("foo.foo.svc.cluster.local:80", creds)
if err != nil {
	log.Fatalf("grpc.Dial failed: %v", err)
}

Server:

creds := grpc.Creds(credentials.NewTLS(&tls.Config{
    Certificates:             []tls.Certificate{cert},
    ClientCAs:                certPool,
    ClientAuth:               tls.RequireAndVerifyClientCert,
    MinVersion:               tls.VersionTLS12,
    NextProtos:               []string{"h2"},
    PreferServerCipherSuites: true,
    CurvePreferences: []tls.CurveID{
        tls.CurveP256,
        tls.X25519,
    },
    CipherSuites: []uint16{
        tls.TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,
        tls.TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,
        tls.TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,
        tls.TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,
        tls.TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,
        tls.TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,
    },
}))
grpc.NewServer(creds)
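
As a stopgap while debugging, something like the following client-side unary interceptor could retry a call once when it fails with codes.Unavailable (the code behind both "transport is closing" and "all SubConns are in TransientFailure"). This is only a sketch, not something we run today; the interceptor name and the one-retry policy are made up for illustration:

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// retryUnavailable is a hypothetical unary client interceptor: it retries a
// call exactly once if it fails with codes.Unavailable.
func retryUnavailable(ctx context.Context, method string, req, reply interface{},
	cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
	err := invoker(ctx, method, req, reply, cc, opts...)
	if status.Code(err) == codes.Unavailable {
		// A real policy would bound total attempts and back off between them.
		err = invoker(ctx, method, req, reply, cc, opts...)
	}
	return err
}

It would be wired in next to the credentials option: grpc.Dial("foo.foo.svc.cluster.local:80", creds, grpc.WithUnaryInterceptor(retryUnavailable)).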

This may be related to one of these other two issues
#2653
#2636

Based on the feedback in one of the above linked issues, I set these ENV vars on one service that had a lot of failures: GRPC_GO_LOG_VERBOSITY_LEVEL=99 GRPC_GO_LOG_SEVERITY_LEVEL=info

I've appended (__GRPC CALL HERE__) to the log lines below where the error surfaced as a failed grpc call made by our service.

WARNING: 2019/03/01 04:41:42 grpc: addrConn.createTransport failed to connect to {foo.foo.svc.cluster.local:80 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: context deadline exceeded". Reconnecting...
INFO: 2019/03/01 04:41:42 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 04:41:42 pickfirstBalancer: HandleSubConnStateChange: 0xc000806290, TRANSIENT_FAILURE
timestamp=2019-03-01T04:41:42.887562062Z level=error error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: authentication handshake failed: context deadline exceeded\"" (__GRPC CALL HERE__)
INFO: 2019/03/01 04:41:43 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 04:41:43 pickfirstBalancer: HandleSubConnStateChange: 0xc000806290, CONNECTING
INFO: 2019/03/01 04:41:44 Subchannel Connectivity change to READY
INFO: 2019/03/01 04:41:44 pickfirstBalancer: HandleSubConnStateChange: 0xc000806290, READY
INFO: 2019/03/01 04:45:22 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 04:45:22 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 04:45:22 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 04:45:22 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 04:45:23 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 04:45:23 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 04:45:28 Subchannel Connectivity change to READY
INFO: 2019/03/01 04:45:28 Subchannel Connectivity change to READY
INFO: 2019/03/01 05:34:23 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 05:34:23 Subchannel Connectivity change to READY
INFO: 2019/03/01 05:35:25 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 05:35:25 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 05:35:27 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 05:35:32 Subchannel Connectivity change to READY
INFO: 2019/03/01 05:35:59 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 05:35:59 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 05:36:01 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 05:36:01 Subchannel Connectivity change to READY
INFO: 2019/03/01 05:38:19 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 05:38:19 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 05:38:19 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 05:38:19 pickfirstBalancer: HandleSubConnStateChange: 0xc0008293d0, TRANSIENT_FAILURE
INFO: 2019/03/01 05:38:19 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
timestamp=2019-03-01T05:38:19.972966567Z level=error error="rpc error: code = Unavailable desc = transport is closing" (__GRPC CALL HERE__)
INFO: 2019/03/01 05:38:19 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 05:38:19 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 05:38:19 pickfirstBalancer: HandleSubConnStateChange: 0xc000806290, TRANSIENT_FAILURE
INFO: 2019/03/01 05:38:21 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 05:38:21 pickfirstBalancer: HandleSubConnStateChange: 0xc0008293d0, CONNECTING
INFO: 2019/03/01 05:38:21 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 05:38:21 pickfirstBalancer: HandleSubConnStateChange: 0xc000806290, CONNECTING
INFO: 2019/03/01 05:38:21 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 05:38:21 Subchannel Connectivity change to READY
INFO: 2019/03/01 05:38:21 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 16:13:05 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 16:13:05 pickfirstBalancer: HandleSubConnStateChange: 0xc0008293d0, TRANSIENT_FAILURE
INFO: 2019/03/01 16:13:05 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
timestamp=2019-03-01T16:13:05.92908361Z level=error error="rpc error: code = Unavailable desc = transport is closing" (__GRPC CALL HERE__)
INFO: 2019/03/01 16:13:07 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 16:13:07 pickfirstBalancer: HandleSubConnStateChange: 0xc0008293d0, CONNECTING
INFO: 2019/03/01 16:13:07 Subchannel Connectivity change to READY
INFO: 2019/03/01 16:13:07 pickfirstBalancer: HandleSubConnStateChange: 0xc0008293d0, READY
INFO: 2019/03/01 16:20:17 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 16:20:17 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 16:20:18 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 16:20:23 Subchannel Connectivity change to READY
INFO: 2019/03/01 16:42:24 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 16:42:24 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 16:42:25 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 16:42:25 Subchannel Connectivity change to READY
INFO: 2019/03/01 17:20:47 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 17:20:47 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 17:20:49 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 17:20:54 Subchannel Connectivity change to READY
INFO: 2019/03/01 17:43:01 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 17:43:01 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 17:43:02 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 17:43:07 Subchannel Connectivity change to READY
@menghanl
Contributor

menghanl commented Mar 1, 2019

Can you also get server-side logs to see if they contain more information?

The handshake deadline exceeded error looks very suspicious. It would help to know why it happened.

@menghanl
Contributor

menghanl commented Mar 1, 2019

One more thing to try: can you set GRPC_GO_REQUIRE_HANDSHAKE=off on the client side and see if that changes anything? (#2406)

@kazegusuri
Contributor

@menghanl I have experienced the same issue. Here is a reproduction procedure.

Example program

client

package main

import (
	"context"
	"log"
	"time"

	"github.com/golang/glog"
	"github.com/kazegusuri/grpc-test/pb"
	"google.golang.org/grpc"
	_ "google.golang.org/grpc/grpclog/glogger"
)

func main() {
	ctx := context.Background()
	conn1, err := grpc.DialContext(ctx, "127.0.0.1:9000", grpc.WithInsecure())
	if err != nil {
		log.Fatalf("grpc.DialContext failed: %v", err)
	}
	defer func() { _ = conn1.Close() }()

	cli1 := pb.NewEchoClient(conn1)
	for {
		time.Sleep(1 * time.Millisecond)
		_, err := cli1.Echo(ctx, &pb.EchoMessage{Message: "hello"})
		if err != nil {
			glog.Infof("goroutine: echo failed: %v", err)
		} else {
			glog.Infof("goroutine: echo succeeded")
		}
	}
}

server

package main

import (
	"log"
	"net"
	"time"

	"github.com/kazegusuri/grpc-test/pb"
	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	server := newServer() // Echo service implementation from the example repo
	s := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge: 10 * time.Second, // this kills the connection 10s after it is established
	}))
	pb.RegisterEchoServer(s, server)

	lis, err := net.Listen("tcp", ":9000")
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}

	if err := s.Serve(lis); err != nil {
		log.Fatalf("err %v\n", err)
	}
}

Results

With the above programs, I ran the server on v1.19.0 and the client on v1.17, v1.18, and v1.19.
The server kills the connection from the client 10 seconds after it connects.
The client should reconnect to the server, but its behavior differs between client versions.
With v1.18 and v1.19 the client takes over 1 second to reconnect to the server.

v1.17.0 (default)

I0303 02:22:59.745529    6773 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00001e030, TRANSIENT_FAILURE
I0303 02:22:59.745566    6773 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00001e030, CONNECTING
I0303 02:22:59.745732    6773 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00001e030, READY

v1.18.0 (default)

I0303 02:20:58.820793    5169 clientconn.go:941] Subchannel Connectivity change to TRANSIENT_FAILURE
I0303 02:20:58.820892    5169 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00016e010, TRANSIENT_FAILURE
I0303 02:20:59.821028    5169 clientconn.go:941] Subchannel Connectivity change to CONNECTING
I0303 02:20:59.821189    5169 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00016e010, CONNECTING
I0303 02:20:59.822058    5169 clientconn.go:941] Subchannel Connectivity change to READY
I0303 02:20:59.822155    5169 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00016e010, READY

v1.18.0 (GRPC_GO_REQUIRE_HANDSHAKE=off)

I0303 02:25:59.393033    8568 clientconn.go:941] Subchannel Connectivity change to TRANSIENT_FAILURE
I0303 02:25:59.393096    8568 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00017c010, TRANSIENT_FAILURE
I0303 02:26:00.393385    8568 clientconn.go:941] Subchannel Connectivity change to CONNECTING
I0303 02:26:00.393510    8568 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00017c010, CONNECTING
I0303 02:26:00.393937    8568 clientconn.go:941] Subchannel Connectivity change to READY
I0303 02:26:00.393979    8568 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00017c010, READY


v1.19.0 (default)

I0303 02:28:57.655932   11292 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc000160010, TRANSIENT_FAILURE
I0303 02:28:58.656125   11292 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc000160010, CONNECTING
I0303 02:28:58.656489   11292 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc000160010, READY

v1.19.0 (GRPC_GO_REQUIRE_HANDSHAKE=off)

I0303 02:30:30.642447   11438 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc000162010, TRANSIENT_FAILURE
I0303 02:30:31.642680   11438 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc000162010, CONNECTING
I0303 02:30:31.643004   11438 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc000162010, READY

@kazegusuri
Contributor

kazegusuri commented Mar 2, 2019

This causes a performance regression for clients of servers that expose a single subconn, like Google Spanner.
Spanner seems to kill connections on the server side after an hour, and the connection then goes into transient failure. For the second it spends in transient failure before becoming ready again, the connection stays unavailable.

@kazegusuri
Contributor

If it's okay to try to create the transport immediately after a transient failure on the first retry, how about adding code like this here?

if ac.backoffIdx == 0 {
	backoffFor = 0
}

@vvakame

vvakame commented Mar 6, 2019

I have a question.
I found baseDelay = 1.0 * time.Second here.
I think this value is a bit large. Why do we use 1 second by default?
One second is too long from a user-facing frontend's perspective.
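
For reference (if I understand correctly), later grpc-go releases expose these knobs per connection through grpc.WithConnectParams and the google.golang.org/grpc/backoff package; the versions discussed in this issue predate that option. A rough sketch of lowering the base delay on the dial side, with illustrative values:

package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/backoff"
)

func main() {
	// Sketch only: assumes a grpc-go version that ships grpc.WithConnectParams
	// and the backoff package (newer than the releases discussed here).
	conn, err := grpc.Dial("127.0.0.1:9000",
		grpc.WithInsecure(),
		grpc.WithConnectParams(grpc.ConnectParams{
			Backoff: backoff.Config{
				BaseDelay:  100 * time.Millisecond, // instead of the 1s default
				Multiplier: 1.6,
				Jitter:     0.2,
				MaxDelay:   10 * time.Second,
			},
			MinConnectTimeout: 5 * time.Second,
		}),
	)
	if err != nil {
		log.Fatalf("grpc.Dial failed: %v", err)
	}
	defer conn.Close()
}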

@mpipet

mpipet commented Mar 6, 2019

I encountered the same issue using the Google Cloud Bigtable Go client (gRPC based). The Bigtable gRPC server seems to have a max connection age of one hour, which means that every hour I literally stop handling my traffic for one second.

I agree with @vvakame: this default value is too high for a lot of use cases and should be easily configurable in the other Google clients based on this repo.

@menghanl
Contributor

menghanl commented Mar 6, 2019

There's a bug in the reconnect logic: the backoff time is not reset after a connection is successfully created.

The fix is in #2669. Can you guys try it and see if it fixes the problems you are seeing? Thanks!
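
To illustrate the bug in general terms, here is a simplified sketch (not the actual grpc-go code): a reconnect loop needs to reset its backoff state once a connection has been established, so that when a long-lived connection later dies the first reconnect attempt happens immediately instead of waiting out a leftover backoff delay. The dial and serve parameters below are hypothetical placeholders.

import (
	"net"
	"time"
)

// reconnectLoop is a simplified sketch of the pattern the fix restores; the
// important line is the "retries = 0" reset after a successful connect.
func reconnectLoop(dial func() (net.Conn, error), serve func(net.Conn)) {
	const baseDelay = time.Second
	retries := 0
	for {
		if retries > 0 {
			// Exponential backoff between consecutive failed attempts.
			delay := baseDelay * time.Duration(1<<uint(retries-1))
			if delay > 30*time.Second {
				delay = 30 * time.Second
			}
			time.Sleep(delay)
		}
		conn, err := dial()
		if err != nil {
			retries++
			continue
		}
		retries = 0 // the missing reset: the next reconnect starts with zero backoff
		serve(conn) // blocks until the connection dies, then we loop and reconnect
	}
}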

@dfawley
Member

dfawley commented Mar 14, 2019

Closing. Please comment if you are still experiencing problems.

@mikesun

mikesun commented Jun 14, 2019

Does the fix in #2669 still require GRPC_GO_REQUIRE_HANDSHAKE=off to be set?

@dfawley
Member

dfawley commented Jun 14, 2019

I hope not, since I intend to remove that option soon (this or next week). Please let me know if you are experiencing any problems with it unset.

@mikesun

mikesun commented Jun 14, 2019

@dfawley

It seems to still be an issue for us. I reported it to Google Cloud support, which pointed me to this thread and suggested setting GRPC_GO_REQUIRE_HANDSHAKE=off as a workaround. It appears that is no longer the case.

We're running a Google App Engine flexible app on Go 1.11 with grpc-go v1.20.0.

With Spanner, we see context timeouts:

spanner: code = "Canceled", desc = "context canceled, lastErr is <<nil>>"

With Cloud Tasks, we see these RPC errors:

all SubConns are in TransientFailure, latest connection error:

These errors are not occurring consistently, only intermittently and in bursts.

@dfawley
Member

dfawley commented Jun 14, 2019

These errors are not occurring consistently, only intermittently and in bursts.

How long do these bursts last and how often do they occur?

Does setting GRPC_GO_REQUIRE_HANDSHAKE=off help for you?

@mikesun

mikesun commented Jun 14, 2019

These errors are not occurring consistently, only intermittently and in bursts.

How long do these bursts last and how often do they occur?

Currently, we do not have a production environment with any meaningful load, but in a trial production run that we ran for 3 days, we had one burst of errors for ~20 minutes. Otherwise, we see it less than once a day, for a minute.

@mikesun

mikesun commented Jun 14, 2019

I turned on gRPC logging on one of our deployments, and it appears I get the TRANSIENT_FAILURE errors continuously regardless of whether GRPC_GO_REQUIRE_HANDSHAKE=off is set.

grpc v1.21.1 and GRPC_GO_REQUIRE_HANDSHAKE=off

2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc000239980, TRANSIENT_FAILURE
2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc000239980, CONNECTING
2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc000239980, READY
2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc000239770, TRANSIENT_FAILURE
2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc000239770, CONNECTING
2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc000239770, READY
2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, TRANSIENT_FAILURE
2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, CONNECTING
2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, READY
2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, TRANSIENT_FAILURE
2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, CONNECTING
2019-06-14 20:50:14 default[20190614t163806]  INFO: 2019/06/14 20:50:14 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, READY

The reconnects seem to happen immediately, and I haven't experienced any RPC failures with all SubConns are in TransientFailure, so I guess this is acceptable behavior?

We do configure http.DefaultTransport.(*http.Transport).MaxIdleConnsPerHost = 500; would that be a problem?

@dfawley
Member

dfawley commented Jun 14, 2019

@mikesun

I guess this is acceptable behavior?

It's not normal for your connection to be dying so frequently (several per second). Is this happening continuously or are these entries the only ones? Unfortunately, there isn't enough logging in place to see why the connection is being closed. But it does go into READY, which means at least TLS was successful.

We do configure http.DefaultTransport.(*http.Transport).MaxIdleConnsPerHost = 500; would that be a problem?

Go's net/http transport settings should not affect grpc-go - we have our own transport implementation.

@mikesun

mikesun commented Jun 17, 2019

It's not normal for your connection to be dying so frequently (several per second). Is this happening continuously or are these entries the only ones? Unfortunately, there isn't enough logging in place to see why the connection is being closed. But it does go into READY, which means at least TLS was successful.

It seems to happen in bunches (~3/sec) every few minutes (1-4 min). This is on a GAE flexible instance with basically no load:

A  INFO: 2019/06/17 16:10:15 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A  INFO: 2019/06/17 16:10:15 pickfirstBalancer: HandleSubConnStateChange: 0xc000239900, TRANSIENT_FAILURE
A  INFO: 2019/06/17 16:10:15 pickfirstBalancer: HandleSubConnStateChange: 0xc000239900, CONNECTING
A  INFO: 2019/06/17 16:10:15 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A  INFO: 2019/06/17 16:10:15 pickfirstBalancer: HandleSubConnStateChange: 0xc000239870, TRANSIENT_FAILURE
A  INFO: 2019/06/17 16:10:15 pickfirstBalancer: HandleSubConnStateChange: 0xc000239870, CONNECTING
A  INFO: 2019/06/17 16:10:15 pickfirstBalancer: HandleSubConnStateChange: 0xc000239900, READY
A  INFO: 2019/06/17 16:10:15 pickfirstBalancer: HandleSubConnStateChange: 0xc000239870, READY
A  INFO: 2019/06/17 16:10:37 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A  INFO: 2019/06/17 16:10:37 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, TRANSIENT_FAILURE
A  INFO: 2019/06/17 16:10:37 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, CONNECTING
A  INFO: 2019/06/17 16:10:38 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, READY
A  INFO: 2019/06/17 16:10:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A  INFO: 2019/06/17 16:10:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, TRANSIENT_FAILURE
A  INFO: 2019/06/17 16:10:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, CONNECTING
A  INFO: 2019/06/17 16:10:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, READY
A  INFO: 2019/06/17 16:10:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A  INFO: 2019/06/17 16:10:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, TRANSIENT_FAILURE
A  INFO: 2019/06/17 16:10:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, CONNECTING
A  INFO: 2019/06/17 16:10:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, READY
A  INFO: 2019/06/17 16:14:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A  INFO: 2019/06/17 16:14:38 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, TRANSIENT_FAILURE
A  INFO: 2019/06/17 16:14:38 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, CONNECTING
A  INFO: 2019/06/17 16:14:38 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, READY
A  INFO: 2019/06/17 16:14:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A  INFO: 2019/06/17 16:14:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, TRANSIENT_FAILURE
A  INFO: 2019/06/17 16:14:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, CONNECTING
A  INFO: 2019/06/17 16:14:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, READY
A  INFO: 2019/06/17 16:14:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A  INFO: 2019/06/17 16:14:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, TRANSIENT_FAILURE
A  INFO: 2019/06/17 16:14:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, CONNECTING
A  INFO: 2019/06/17 16:14:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, READY
A  INFO: 2019/06/17 16:18:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A  INFO: 2019/06/17 16:18:38 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, TRANSIENT_FAILURE
A  INFO: 2019/06/17 16:18:38 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, CONNECTING
A  INFO: 2019/06/17 16:18:38 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, READY
A  INFO: 2019/06/17 16:18:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A  INFO: 2019/06/17 16:18:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, TRANSIENT_FAILURE
A  INFO: 2019/06/17 16:18:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, CONNECTING
A  INFO: 2019/06/17 16:18:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, READY
A  INFO: 2019/06/17 16:18:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A  INFO: 2019/06/17 16:18:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, TRANSIENT_FAILURE
A  INFO: 2019/06/17 16:18:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, CONNECTING
A  INFO: 2019/06/17 16:18:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, READY
A  INFO: 2019/06/17 16:22:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A  INFO: 2019/06/17 16:22:38 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, TRANSIENT_FAILURE
A  INFO: 2019/06/17 16:22:38 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, CONNECTING
A  INFO: 2019/06/17 16:22:38 pickfirstBalancer: HandleSubConnStateChange: 0xc000238ec0, READY
A  INFO: 2019/06/17 16:22:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A  INFO: 2019/06/17 16:22:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, TRANSIENT_FAILURE
A  INFO: 2019/06/17 16:22:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, CONNECTING
A  INFO: 2019/06/17 16:22:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00030c150, READY
A  INFO: 2019/06/17 16:22:38 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
A  INFO: 2019/06/17 16:22:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, TRANSIENT_FAILURE
A  INFO: 2019/06/17 16:22:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, CONNECTING
A  INFO: 2019/06/17 16:22:38 pickfirstBalancer: HandleSubConnStateChange: 0xc00081c350, READY

@dfawley
Member

dfawley commented Jun 20, 2019

@mikesun - one last thing I can think of: make sure you aren't running a grpc-go version between #2740 (82fdf27 - 21d ago) and #2862 (a1d4c28 - 10d ago). We had a bug in that range that could potentially cause problems like this.

If it's not that, then most likely the problem is not on the client side. Logs from the server/proxy side would be needed for further debugging. Also, since your problems are not related to this issue, could you file a new issue if you suspect a problem with grpc-go?

ajwerner added a commit to ajwerner/cockroach that referenced this issue Jul 25, 2019
This PR upgrades gRPC from 1.13.0 to 1.21.2. The primary motivation for this
upgrade is to eliminate the disconnections caused by
grpc/grpc-go#1882. These failures manifest themselves
as the following set of errors:

```
ajwerner-test-0001> I190722 22:15:01.203008 12054 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n1] circuitbreaker: rpc [::]:26257 [n2] tripped: failed to check for ready connection to n2 at ajwerner-test-0002:26257: connection not ready: TRANSIENT_FAILURE
```
These then led to tripped breakers and general badness. I suspect that there
are several other good bug fixes in here, including some purported leaks and
correctness fixes on shutdown.

I have verified that with this upgrade I no longer see connections break in
overload scenarios which reliably reproduced the situation in the above log.

This commit removes one condition from grpcutil.IsClosedConnection which should
be subsumed by the status check above. The `transport` subpackage has not been
around for many releases.

This does not upgrade to the current release 1.22.0 because the maintainer
mentions that it contains a bug
(grpc/grpc-go#2663 (comment)).

This change also unfortunately updates the keepalive behavior to be more spec
compliant (grpc/grpc-go#2642). This change mandates
a minimum ping time of 10s to the client. Given grpc/grpc-go#2638
this means that the rpc test for keepalives now takes over 20s.

I would be okay skipping it but leave that discussion for review.

Also updated the acceptance test to look out for an HTTP/2.0 header because
grpc now does not send RPCs until after the HTTP handshake has completed
(see grpc/grpc-go#2406).

Release note (bug fix): Upgrade grpc library to fix connection state management
bug.
craig bot pushed a commit to cockroachdb/cockroach that referenced this issue Jul 30, 2019
39041: vendor: upgrade grpc from 1.13.0 to 1.21.2 r=ajwerner a=ajwerner

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
lock bot locked as resolved and limited conversation to collaborators Dec 17, 2019