Handle etcd leader changes #228

Closed
lukasertl opened this issue May 22, 2024 · 5 comments · Fixed by #229

@lukasertl
Contributor

This is somewhat related to #208

When the etcd leader restarts, vip-manager decides to remove the VIP:

May 22 08:21:31 hostname vip-manager[4212]: 2024/05/22 08:21:31 IP address 10.x.y.z/16 state is true, desired true
May 22 08:21:41 hostname vip-manager[4212]: 2024/05/22 08:21:41 IP address 10.x.y.z/16 state is true, desired true
May 22 08:21:43 hostname vip-manager[4212]: {"level":"warn","ts":"2024-05-22T08:21:43.182166+0200","logger":"etcd-client","caller":"v3@v3.5.13/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0001ba5a0/vvhu4255.power.inet:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: leader changed"}
May 22 08:21:43 hostname vip-manager[4212]: {"level":"error","ts":"2024-05-22T08:21:43.189178+0200","logger":"etcd-client","caller":"v3@v3.5.13/retry_interceptor.go:114","msg":"clientv3/retry_interceptor: getToken failed","error":"etcdserver: leader changed","stacktrace":"go.etcd.io/etcd/client/v3.(*Client).streamClientInterceptor.func1\n\t/home/runner/go/pkg/mod/go.etcd.io/etcd/client/v3@v3.5.13/retry_interceptor.go:114\ngoogle.golang.org/grpc.(*ClientConn).NewStream\n\t/home/runner/go/pkg/mod/google.golang.org/grpc@v1.63.2/stream.go:167\ngo.etcd.io/etcd/api/v3/etcdserverpb.(*watchClient).Watch\n\t/home/runner/go/pkg/mod/go.etcd.io/etcd/api/v3@v3.5.13/etcdserverpb/rpc.pb.go:6690\ngo.etcd.io/etcd/client/v3.(*watchGrpcStream).openWatchClient\n\t/home/runner/go/pkg/mod/go.etcd.io/etcd/client/v3@v3.5.13/watch.go:1004\ngo.etcd.io/etcd/client/v3.(*watchGrpcStream).newWatchClient\n\t/home/runner/go/pkg/mod/go.etcd.io/etcd/client/v3@v3.5.13/watch.go:901\ngo.etcd.io/etcd/client/v3.(*watchGrpcStream).run\n\t/home/runner/go/pkg/mod/go.etcd.io/etcd/client/v3@v3.5.13/watch.go:661"}
May 22 08:21:43 hostname vip-manager[4212]: 2024/05/22 08:21:43 etcd watcher returned error: etcdserver: leader changed
May 22 08:21:43 hostname vip-manager[4212]: 2024/05/22 08:21:43 IP address 10.x.y.z/16 state is true, desired false
May 22 08:21:43 hostname vip-manager[4212]: 2024/05/22 08:21:43 Removing address 10.x.y.z/16 on ens192
May 22 08:21:43 hostname vip-manager[4212]: 2024/05/22 08:21:43 IP address 10.x.y.z/16 state is false, desired false
May 22 08:21:51 hostname vip-manager[4212]: 2024/05/22 08:21:51 IP address 10.x.y.z/16 state is false, desired false

This happens with vip-manager 2.4.0.

@pashagolub
Collaborator

And that's exactly what we expect, no?

@lukasertl
Contributor Author

No, we don't expect that. This is a change in etcd leadership, not in Patroni leadership, so the VIP should stay where it is.

@pashagolub
Collaborator

Oh, I see! Thanks. Will check this.

pashagolub self-assigned this May 22, 2024
pashagolub added the bug label May 22, 2024
pashagolub linked a pull request May 22, 2024 that will close this issue
@pashagolub
Collaborator

Would you please check whether the PR works for you?

Thanks in advance!

@lukasertl
Contributor Author

Hi Pavlo,

I'm afraid this is not the correct fix. If I trigger the leader change with a patched vip-manager, it leaves the VIP setup intact, but the process then spins on the CPU.

I tried to find out what happens here, and I suspect that the `select` in the watch() function no longer blocks, so it runs in an uncontrolled infinite loop. My guess is that at this point the etcd watch is no longer valid and needs to be set up from scratch.

This is confirmed by the fact that if I switch patroni roles in this situation, the (non-broken) vip-manager on the new primary would add the VIP to the interface, but the (broken) vip-manager on the old primary wouldn't remove it.
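
To illustrate the suspicion above: in Go, a receive from a closed channel returns immediately, so a `select` that keeps reading from a dead watch channel never blocks again. A minimal sketch of the "set up from scratch" idea follows. It uses only the standard library; `event`, `watchFunc`, `watchLoop`, and `runDemo` are hypothetical stand-ins invented for this example, not vip-manager's code or the etcd client API (where a watch is likewise delivered as a channel of responses).

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// event is a hypothetical stand-in for one watch response.
type event struct {
	value string
	err   error
}

// watchFunc is a hypothetical stand-in for opening a watch on the leader key.
type watchFunc func(ctx context.Context) <-chan event

// watchLoop forwards values to onChange and re-opens the watch whenever the
// channel is closed or delivers an error. It returns how often the watch was
// opened, which is only useful for the demonstration below.
func watchLoop(ctx context.Context, open watchFunc, retryDelay time.Duration, onChange func(string)) int {
	opened := 0
	for {
		ch := open(ctx)
		opened++
	recv:
		for {
			select {
			case <-ctx.Done():
				return opened
			case ev, ok := <-ch:
				if !ok || ev.err != nil {
					// A closed channel is always ready to receive, so staying
					// in this select would spin. Leave it and open a new watch.
					break recv
				}
				onChange(ev.value)
			}
		}
		// Pause briefly before re-opening so a persistent failure cannot
		// turn into a busy loop either.
		select {
		case <-ctx.Done():
			return opened
		case <-time.After(retryDelay):
		}
	}
}

// runDemo simulates a watch that dies after one event (as after an etcd
// leader change) and a second watch that works.
func runDemo() (values []string, opened int) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	calls := 0
	open := func(ctx context.Context) <-chan event {
		calls++
		ch := make(chan event, 1)
		if calls == 1 {
			ch <- event{value: "node1"}
			close(ch) // the first watch becomes invalid
		} else {
			ch <- event{value: "node2"}
		}
		return ch
	}
	opened = watchLoop(ctx, open, time.Millisecond, func(v string) {
		values = append(values, v)
		if v == "node2" {
			cancel() // stop the demo once the new watch has delivered
		}
	})
	return values, opened
}

func main() {
	values, opened := runDemo()
	fmt.Println(values, opened)
}
```

In this sketch the loop sees the closed channel once, waits for `retryDelay`, and opens a second watch instead of spinning. The retry delay is an assumption; a real implementation would probably want a backoff.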

pashagolub added a commit that referenced this issue May 23, 2024
* [*] refetch etcd key value if `WATCH` RPC fails, fixes #228
* [*] handle canceled watch
---------------
Co-authored by @lukasertl
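
The first bullet of the commit message ("refetch etcd key value if `WATCH` RPC fails") can be pictured roughly as below. This is a sketch under assumptions, not the code from the commit: `getFunc`, `staticGet`, `resolveDesired`, and `errLeaderChanged` are hypothetical names, and keeping the last known state when the re-read also fails is a policy chosen for this example only.

```go
package main

import (
	"errors"
	"fmt"
)

// errLeaderChanged mimics the error text from the log above.
var errLeaderChanged = errors.New("etcdserver: leader changed")

// getFunc is a hypothetical stand-in for a single read of the leader key.
type getFunc func() (string, error)

// staticGet returns a getFunc with a fixed result, for demonstration.
func staticGet(value string, err error) getFunc {
	return func() (string, error) { return value, err }
}

// resolveDesired is called after a watch error. Instead of treating the error
// as "this node is no longer the leader", it re-reads the key. fresh reports
// whether desired is based on a successful read; if the read fails, the last
// known state is returned unchanged (an assumption of this sketch).
func resolveDesired(get getFunc, self string, lastDesired bool) (desired bool, fresh bool) {
	holder, err := get()
	if err != nil {
		return lastDesired, false
	}
	return holder == self, true
}

func main() {
	// The key still names this node: keep the VIP.
	fmt.Println(resolveDesired(staticGet("node1", nil), "node1", true))
	// The re-read fails too: state unknown, last known value is kept.
	fmt.Println(resolveDesired(staticGet("", errLeaderChanged), "node1", true))
	// The key names another node: release the VIP.
	fmt.Println(resolveDesired(staticGet("node2", nil), "node1", true))
}
```

Returning `fresh` separately leaves the decision about a stale state to the caller, for example dropping the VIP after repeated failed reads.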