clientv3: when an endpoint is unavailable, switch it in retry #8327
Conversation
HardySimpson commented on Jul 28, 2017
this needs tests; dangerously broken as-is
clientv3/retry.go
Outdated
@@ -36,14 +36,22 @@ func (c *Client) newRetryWrapper() retryRpcFunc {
 		eErr := rpctypes.Error(err)
 		// always stop retry on etcd errors
 		if _, ok := eErr.(rpctypes.EtcdError); ok {
-			return err
+			if grpc.Code(err) != codes.Unavailable {
This isn't safe. Retrying will violate at-most-once semantics for Put/Del/Txn, since a "request timed out" error can be returned even if the request makes it through raft. This can happen if there's a partition between the time of submitting a raft proposal and the commit.
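To make the concern concrete, here is a minimal sketch (not code from this PR; `retryUnavailable` and `rpcFunc` are invented names) of a retry wrapper that only retries `codes.Unavailable` errors when the caller asserts the RPC is idempotent, so a mutating RPC is never sent twice:

```go
package retrysketch

import (
	"context"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type rpcFunc func(ctx context.Context) error

// retryUnavailable retries f on Unavailable errors, but only when the
// caller asserts the RPC is idempotent (e.g. Range). Mutating RPCs
// (Put/Del/Txn) return the error immediately, preserving at-most-once
// semantics even if the request already committed through raft.
func retryUnavailable(ctx context.Context, idempotent bool, f rpcFunc) error {
	for {
		err := f(ctx)
		if err == nil {
			return nil
		}
		if !idempotent || status.Code(err) != codes.Unavailable {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		default:
			// retry; a real client would switch endpoints first
		}
	}
}
```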
code changed
clientv3/balancer.go
Outdated
	b.updateAddrs(allEpsWithoutPinAddr)

	// at most sleep 10 seconds here
depending on sleeps is slow and unreliable; the balancer probably needs to directly reason about graylisted endpoints
code also changed
I thought our balancer already knows how to switch endpoints on failures/timeouts. No?
@xiang90 no, it's still an open issue. It will only detect/switch on transport errors, not errors returned by etcd.
Isn't timeout a transport error?
@xiang90 transport errors between the client and server, not transport errors between peers.
If that is the case, then this should be done on the server side. The server will cut off the connection if it is partitioned.
@xiang90 I don't think that's enough; the client could retry the same endpoint without graylisting.
Force-pushed from 106bf59 to 0b87a29.
@heyitsanthony I agree with your opinion; a server-side switch is not enough, as the server cannot tell whether the client's request actually succeeded, failed, or timed out.
About the previous code comment:
clientv3/retry.go
Outdated
	// switch at most once per 30 seconds
	// safe to call concurrently
	c.SwitchEndpoint()
}
- About Put/Txn's at-most-once semantics: we only switch here, without retrying. We let this request fail and hope other RPCs will succeed.
@@ -44,6 +52,7 @@ func (c *Client) newRetryWrapper() retryRpcFunc {
 	return err
- But if the client hits a gRPC-level request timeout, that also violates at-most-once semantics, which is not OK.
these are only returned before the request is sent over the wire afaik
Yes, I searched the docs; see comments below.
Force-pushed from 4c29508 to 7398596.
this is not adequately tested
clientv3/balancer.go
Outdated
	b.mu.Lock()
	for prevPinAddr == b.pinAddr {
		b.pinAddrCond.Wait()
there is no way for the ctx attached to the rpc to cancel and break out of this wait
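One common way around this (a sketch with invented names, not the PR's code) is to signal pin changes by closing a channel instead of using a sync.Cond, so the waiter can select on the RPC's ctx at the same time:

```go
package balancersketch

import (
	"context"
	"sync"
)

// pinWatcher signals pinned-address changes by closing a channel, so a
// waiter can abort on ctx cancellation, which a sync.Cond.Wait cannot do.
type pinWatcher struct {
	mu       sync.Mutex
	pinAddr  string
	changedc chan struct{} // closed and replaced whenever pinAddr changes
}

func newPinWatcher() *pinWatcher {
	return &pinWatcher{changedc: make(chan struct{})}
}

// waitSwitch blocks until pinAddr differs from prev or ctx is done.
func (w *pinWatcher) waitSwitch(ctx context.Context, prev string) error {
	for {
		w.mu.Lock()
		pin, ch := w.pinAddr, w.changedc
		w.mu.Unlock()
		if pin != prev {
			return nil
		}
		select {
		case <-ch: // pinAddr changed; loop and re-check
		case <-ctx.Done():
			return ctx.Err() // the RPC's ctx can now break out of the wait
		}
	}
}

func (w *pinWatcher) setPin(addr string) {
	w.mu.Lock()
	w.pinAddr = addr
	close(w.changedc)                // wake all waiters
	w.changedc = make(chan struct{}) // fresh channel for the next change
	w.mu.Unlock()
}
```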
Same as the previous comment; the cond is associated with sb.mu.
clientv3/balancer.go
Outdated
	}

	b.mu.Lock()
this will deadlock because b.mu.Lock isn't released on the Wait; anything that tries to update pinAddrCond will wait forever to acquire the lock
I forgot to initialize the cond; added it.
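For reference, a minimal sketch of the fix being discussed (names are illustrative): the Cond must be constructed with the same lock the waiter holds, so Wait atomically releases it while blocked:

```go
package balancersketch

import "sync"

type pinWaiter struct {
	mu      sync.Mutex
	pinAddr string
	cond    *sync.Cond
}

func newPinWaiter() *pinWaiter {
	w := &pinWaiter{}
	w.cond = sync.NewCond(&w.mu) // Wait now unlocks mu while blocked
	return w
}

// waitUntilSwitched blocks until pinAddr differs from prev.
func (w *pinWaiter) waitUntilSwitched(prev string) {
	w.mu.Lock()
	for w.pinAddr == prev {
		w.cond.Wait() // releases mu; other goroutines can update pinAddr
	}
	w.mu.Unlock()
}

func (w *pinWaiter) setPin(addr string) {
	w.mu.Lock()
	w.pinAddr = addr
	w.mu.Unlock()
	w.cond.Broadcast() // wake waiters so they re-check the condition
}
```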
clientv3/retry.go
Outdated
@@ -44,6 +52,7 @@ func (c *Client) newRetryWrapper() retryRpcFunc {
 	return err
 }
please don't add blank lines if there's no new code attached
OK, fixed
clientv3/retry.go
Outdated
	// try to request another endpoint at next rpc
	// leave this fail
	// already do anti-too-frequently switch
	// at most switch one time per 30 seconds
the implementation details of the balancer policy shouldn't be included in the comment at the callsite since if the balancer code changes, this comment will have to be updated (which probably won't happen)
Yes, I deleted all the comments and renamed it to TrySwitchEndpoint; that is a better name that tells what the function does.
clientv3/balancer.go
Outdated
	b.mu.RUnlock()

	if lastSwitchTime.Add(30 * time.Second).After(tNow) {
This needs to be endpoint graylisting; a single timeout isn't enough. Suppose there are 5 members with 2 partitioned. The client could end up bouncing between both partitioned endpoints for a while.
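As an illustration only (not from the PR; names and the TTL are invented), a time-based graylist where an endpoint returns to service after a fixed interval:

```go
package balancersketch

import (
	"sync"
	"time"
)

// graylist tracks endpoints that recently failed; an endpoint becomes
// usable again once its entry is older than ttl.
type graylist struct {
	mu  sync.Mutex
	bad map[string]time.Time // endpoint -> time it was graylisted
	ttl time.Duration
}

func newGraylist(ttl time.Duration) *graylist {
	return &graylist{bad: make(map[string]time.Time), ttl: ttl}
}

func (g *graylist) add(ep string) {
	g.mu.Lock()
	g.bad[ep] = time.Now()
	g.mu.Unlock()
}

// usable reports whether ep may be dialed, expiring stale entries.
func (g *graylist) usable(ep string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	t, ok := g.bad[ep]
	if !ok {
		return true
	}
	if time.Since(t) > g.ttl {
		delete(g.bad, ep) // TTL elapsed; give the endpoint another chance
		return true
	}
	return false
}
```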
A graylist has one problem: when should we put a graylisted endpoint back to normal? In the client we don't have enough info.
I think a good way is to put a counter on each endpoint; every time we switch, we choose the endpoint with the minimal retry count to connect to. All counters would be reset every time updateEndpoints happens, periodically.
Still, this would change a lot of code in the balancer, and it may cause some imbalance between multiple clients: after a while, most clients might attach to one healthy member and leave a just-recovered member without any clients.
The current algorithm is random choice: when a client finds a good endpoint, it sticks to it. Random choice is close to balanced, since once a client finds a healthy member it sticks to it; it will bounce between partitioned endpoints for a while but eventually work well.
Maybe a better way is to choose the endpoint based on both randomness and retry count. For example, we give every endpoint 100 points, and every time a switch happens we reduce its count by one. The switch then selects by weighted probability. If there are 3 members, their points might look like this:
| member | points | choose probability |
|---|---|---|
| etcd-0 | 100 | 100/(100+80+60) = 42% |
| etcd-1 | 80 | 80/(100+80+60) = 33% |
| etcd-2 | 60 | 60/(100+80+60) = 25% |
If this is OK, it should be implemented in another PR.
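A sketch of the weighted-random pick described above (illustrative only, not code from this PR): with the table's points, etcd-0 is chosen with probability 100/240 ≈ 42%.

```go
package balancersketch

import "math/rand"

// pickWeighted returns an endpoint chosen with probability
// points[ep] / sum(points); returns "" if no endpoint has points.
func pickWeighted(points map[string]int) string {
	total := 0
	for _, p := range points {
		total += p
	}
	if total <= 0 {
		return ""
	}
	n := rand.Intn(total) // uniform in [0, total)
	for ep, p := range points {
		if n < p {
			return ep // ep owns an interval of length p
		}
		n -= p
	}
	return "" // unreachable
}
```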
Force-pushed from 1a3ef27 to fc2fcd5.
About retry: I found in gRPC's doc, as it mentions, when facing a ... But in etcd's errors now ... So, as a result, should we re-classify these errors? The retry code would then be like this:
Force-pushed from 9da2545 to 6c4e421.
Force-pushed from 1e1ca75 to 8421b28.
Force-pushed from f5910ea to a621ee2.
@heyitsanthony I tried for a week to implement a good switch; the best I could achieve is in this PR. I found that because the balancer code is now async, not sync, I cannot write a stable unit test; the unit test's output now looks like this.
Though it works: with sb.trySwitchEndpoint(0), i.e. a duration of 0 seconds, it actually does the switch. So, what's next? Could you solve this problem properly in the next version of etcd? We really need it.
@HardySimpson I'll try to put some time into this. The balancer doesn't really need synchronous endpoint switching as much as the client should eventually switch endpoints if RPCs begin returning unavailable responses, so that it can at least make forward progress.
Force-pushed from 978a6ea to 441c9cf.
Force-pushed from 441c9cf to 5c434bd.
Commit: "when a endpoint is unavailable, switch it in retry" (fix: etcd-io#8326)
Force-pushed from 5c434bd to fcee836.
@heyitsanthony, I have refactored some code based on comparing the last pinAddr with the current pinAddr, which behaves something like a graylist, and I added a unit test to verify correctness. So I think maybe this PR can be merged?
// TrySwitchEndpoint tries to make the balancer change its pinned endpoint.
// It may do nothing when there is only one endpoint, or when a switch has just happened.
// It is called when the current endpoint is not available, e.g. when a network partition happens.
func (c *Client) TrySwitchEndpoint() {
I don't think this should be a public interface; the user shouldn't have to give hints to the client to switch endpoints. Just call c.balancer.trySwitchEndpoint in retry.go?
@@ -237,3 +238,82 @@ func (kcl *killConnListener) close() {
 	close(kcl.stopc)
 	kcl.wg.Wait()
 }
+
+func TestBalancerTrySwitch(t *testing.T) {
The mock grpc notifications this test uses could diverge from grpc behavior and the test itself is rather complicated. A clientv3/integration/kv_test.go test would be better instead since what matters is the Get failing over:
func TestKVGetFailoverPartition(t *testing.T) {
	defer testutil.AfterTest(t)
	clus := integration.NewClusterV3(t, &integration.ClusterConfig{Size: 3})
	defer clus.Terminate(t)
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   clus.Client(0).Endpoints(),
		DialTimeout: time.Second,
	})
	testutil.AssertNil(t, err)
	defer cli.Close()
	// force to connect to member 0
	_, err = cli.Get(context.TODO(), "abc")
	testutil.AssertNil(t, err)
	// permit failover to 1
	cli.SetEndpoints(append(cli.Endpoints(), clus.Client(1).Endpoints()...)...)
	// partition member 0 from rest of cluster
	clus.Members[0].InjectPartition(t, clus.Members[1:])
	// may have triggered a leader election; expect timeouts
	timeout := clus.Members[1].ServerConfig.ReqTimeout()
	// member 0 fails to get consensus for the l-read, balancer switches
	ctx, cancel := context.WithTimeout(context.TODO(), 3*timeout)
	_, err = cli.Get(ctx, "abc")
	cancel()
	testutil.AssertNil(t, err)
}
@@ -310,7 +323,9 @@ func (b *simpleBalancer) Get(ctx context.Context, opts grpc.BalancerGetOptions)
 	return grpc.Address{Addr: addr}, func() {}, nil
 }

-func (b *simpleBalancer) Notify() <-chan []grpc.Address { return b.notifyCh }
+func (b *simpleBalancer) Notify() <-chan []grpc.Address {
don't change
 	// notify client that a connection is up
-	b.readyOnce.Do(func() { close(b.readyc) })
+	b.readyOnce.Do(func() {
don't change
 	close(sb.downc)
 	go sb.updateNotifyLoop()
 	return sb
 }

-func (b *simpleBalancer) Start(target string, config grpc.BalancerConfig) error { return nil }
+func (b *simpleBalancer) Start(target string, config grpc.BalancerConfig) error {
don't change
	func() {
		b.mu.Lock()
		defer b.mu.Unlock()
deferring will deadlock
- acquire b.mu.Lock
- block on b.updateAddrsC <- struct{}{}
- updateNotifyLoop tries to acquire b.mu.RLock and deadlocks.
OK, I put it outside the func().
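A sketch of the resulting pattern (invented names, not the PR's exact code): mutate state under the lock, unlock, and only then send on the channel, so the notify loop can acquire the lock in the meantime:

```go
package balancersketch

import "sync"

type addrUpdater struct {
	mu           sync.RWMutex
	addrs        []string
	updateAddrsC chan struct{}
	stopc        chan struct{}
}

func (b *addrUpdater) updateAddrs(eps []string) {
	b.mu.Lock()
	b.addrs = eps // mutate state while holding the lock
	b.mu.Unlock() // release BEFORE the send; a deferred unlock would deadlock
	// against a notify loop that needs b.mu.RLock to drain this channel

	select {
	case b.updateAddrsC <- struct{}{}: // wake the notify loop
	case <-b.stopc:
	}
}
```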
		b.pinAddrCond.Wait()
	}

	b.host2ep[prevPinAddr] = prevEp
This could corrupt the balancer endpoint list.
- start trySwitchEndpoint, delete prevPinAddr "a" from the addr list
- some other goroutine calls client.SetEndpoints("b", "c")
- trySwitchEndpoint gets notification from pinAddrCond, puts prevEp back into endpoint list.
- client.Endpoints() == {"a", "b", "c"}
OK, I think this is the main reason this implementation is not good. A better way, I think, is not to modify the balancer's endpoint list, but to erase the pinAddr and then put (endpoints - pinAddr) directly on notifyCh, waiting for grpc to Up() one of them. Is this OK?
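A sketch of that alternative using the old grpc.Balancer API this PR targets (names illustrative; this is not the PR's code): build the address list minus the pinned endpoint and hand it to the Notify channel, leaving host2ep untouched:

```go
package balancersketch

import "google.golang.org/grpc"

// addrsMinusPinned builds the address list for grpc's Notify channel,
// excluding the currently pinned (unavailable) endpoint so grpc dials
// and Up()s some other member.
func addrsMinusPinned(eps []string, pinAddr string) []grpc.Address {
	addrs := make([]grpc.Address, 0, len(eps))
	for _, ep := range eps {
		if ep != pinAddr {
			addrs = append(addrs, grpc.Address{Addr: ep})
		}
	}
	return addrs
}

// usage inside a balancer, assuming notifyCh is its Notify channel:
//   b.notifyCh <- addrsMinusPinned(b.eps, b.pinAddr)
```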
@HardySimpson Closing in favor of #8545.