Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix race condition in shutdown #764

Merged
merged 1 commit into from
May 28, 2021

Conversation

amdalal
Copy link
Contributor

@amdalal amdalal commented May 28, 2021

Currently http, tchannel servers and grpc client dispatcher are stopped in parallel. It is possible that gprc client dispatcher closes all outbounds while http/tchannel servers are still serving requests leading to failures.

This change waits for http/tchannel servers to shutdown before closing gprc client dispatcher.

Test:
To reproduce:

  • Add artificial slowness in tchannel shutdown
go func() {
		defer swg.Done()
		gateway.Logger.Info("sleeping before shutting down tchannel server")
		time.Sleep(1 * time.Minute)
		gateway.Logger.Info("resuming tchannel server shutdown")
		if err := gateway.shutdownTChannelServer(ctx); err != nil {
			ec <- errors.Wrap(err, "error shutting down tchannel server")
		}
	}()
  • Issue kill -15 on the pid
  • invoke bounce endpoint while tchannel shutdown in ongoing.
yab -s selective-gateway -m Bounce::bounce -t $GOPATH/src/github.com/uber/zanzibar/examples/selective-gateway/idl/endpoints/bounce/bounce.thrift --caller yab-amdalal --request '{"msg":"bdfb7a20-d7fa-48f4-96f3-d175a49fe2ae"}' -header 'x-uber-source:curl' --timeout=30000 -p 127.0.0.1:5050
  • Error seen
`{"level":"error","ts":1622190556.827372,"msg":"Endpoint failure: handler returned error","zone":"unknown","env":"test","hostname":"amdalal-C02YVFT0LVCG","service":"selective-gateway","pid":83591,"endpointID":"bounce","endpointHandler":"bounce","endpointThriftMethod":"Bounce::bounce","x-uber-source":"curl","requestUUID":"848a7e6d-faa6-497d-831d-a7a229dcf76a","error":"code:failed-precondition message:error waiting for grpc outbound to start for service: echo: could not wait for instance to start running: current state is \"stopped\""}`

After the fix, calls to echo go through.

shutdown logs

{"level":"info","ts":1622196809.018789,"msg":"Channel closed.","serviceName":"selective-gateway","process":"selective-gateway","chID":1,"hostPort":"127.0.0.1:5050"}
{"level":"error","ts":1622196809.018784,"msg":"Error http serving","error":"http: Server closed"}
{"level":"info","ts":1622196809.029137,"logger":"yarpc","msg":"shutting down dispatcher","zone":"unknown","env":"test","hostname":"amdalal-C02YVFT0LVCG","service":"selective-gateway","pid":99459,"yarpc":{"dispatcher":"selective-gateway"}}
{"level":"debug","ts":1622196809.0293381,"logger":"yarpc","msg":"stopping inbounds","zone":"unknown","env":"test","hostname":"amdalal-C02YVFT0LVCG","service":"selective-gateway","pid":99459,"yarpc":{"dispatcher":"selective-gateway"}}
{"level":"debug","ts":1622196809.029359,"logger":"yarpc","msg":"stopped inbounds","zone":"unknown","env":"test","hostname":"amdalal-C02YVFT0LVCG","service":"selective-gateway","pid":99459,"yarpc":{"dispatcher":"selective-gateway"}}
{"level":"debug","ts":1622196809.029372,"logger":"yarpc","msg":"stopping outbounds","zone":"unknown","env":"test","hostname":"amdalal-C02YVFT0LVCG","service":"selective-gateway","pid":99459,"yarpc":{"dispatcher":"selective-gateway"}}
{"level":"debug","ts":1622196809.0295491,"logger":"yarpc","msg":"stopped outbounds","zone":"unknown","env":"test","hostname":"amdalal-C02YVFT0LVCG","service":"selective-gateway","pid":99459,"yarpc":{"dispatcher":"selective-gateway"}}
{"level":"debug","ts":1622196809.0296,"logger":"yarpc","msg":"stopping transports","zone":"unknown","env":"test","hostname":"amdalal-C02YVFT0LVCG","service":"selective-gateway","pid":99459,"yarpc":{"dispatcher":"selective-gateway"}}
{"level":"debug","ts":1622196809.029619,"logger":"yarpc","msg":"stopped transports","zone":"unknown","env":"test","hostname":"amdalal-C02YVFT0LVCG","service":"selective-gateway","pid":99459,"yarpc":{"dispatcher":"selective-gateway"}}
{"level":"debug","ts":1622196809.029632,"logger":"yarpc","msg":"stopping metrics push loop, if any","zone":"unknown","env":"test","hostname":"amdalal-C02YVFT0LVCG","service":"selective-gateway","pid":99459,"yarpc":{"dispatcher":"selective-gateway"}}
{"level":"debug","ts":1622196809.0295491,"msg":"peer status change","zone":"unknown","env":"test","hostname":"amdalal-C02YVFT0LVCG","service":"selective-gateway","pid":99459,"status":"Unavailable","peer":"127.0.0.1:8981","transport":"grpc"}
{"level":"debug","ts":1622196809.0296981,"logger":"yarpc","msg":"stopped metrics push loop, if any","zone":"unknown","env":"test","hostname":"amdalal-C02YVFT0LVCG","service":"selective-gateway","pid":99459,"yarpc":{"dispatcher":"selective-gateway"}}
{"level":"info","ts":1622196809.029728,"msg":"servers are shut down gracefully","zone":"unknown","env":"test","hostname":"amdalal-C02YVFT0LVCG","service":"selective-gateway","pid":99459}

@CLAassistant
Copy link

CLAassistant commented May 28, 2021

CLA assistant check
All committers have signed the CLA.

@coveralls
Copy link

coveralls commented May 28, 2021

Coverage Status

Coverage increased (+0.002%) to 69.239% when pulling c715f3b on amdalal:amdalal/shutdown-race-condition-3 into 5b76c58 on uber:master.

@amdalal amdalal marked this pull request as ready for review May 28, 2021 11:56
Copy link
Contributor

@tejaswiagarwal tejaswiagarwal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@amdalal amdalal merged commit 7f3ed25 into uber:master May 28, 2021
@@ -434,6 +434,9 @@ func (gateway *Gateway) Shutdown() {
}
}()

// wait for servers to shutdown before stopping GRPCClientDispatcher
swg.Wait()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with this change, you can make only 3 channels on line# 407, make the following gRPC shutdown run sequentially and remove the finally wait on line# 451.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants