
[Release-1.21] Unable to start secondary etcd nodes if initial cluster member is offline #4752

Closed
dereknola opened this issue Dec 15, 2021 · 2 comments

@dereknola (Member)

Backport #4746 to release-1.21

@mdrahman-suse

Validated in k3s with RC v1.21.8-rc1+k3s1: in a 3-node cluster, server 2 and server 3 restarted successfully after all servers were stopped, even though server 1 was started last.

Steps:

  • Install k3s on a 3-node cluster
$ kubectl get nodes,pods -A -o wide
NAME                    STATUS   ROLES                       AGE   VERSION            INTERNAL-IP     EXTERNAL-IP     OS-IMAGE           KERNEL-VERSION   CONTAINER-RUNTIME
node/server2   Ready    control-plane,etcd,master   39m   v1.21.8-rc1+k3s1   <REDACTED>   <REDACTED>   Ubuntu 20.04 LTS   5.4.0-1009-aws   containerd://1.4.12-k3s1
node/agent    Ready    <none>                      38m   v1.21.8-rc1+k3s1   <REDACTED>    <REDACTED>     Ubuntu 20.04 LTS   5.4.0-1009-aws   containerd://1.4.12-k3s1
node/server1    Ready    control-plane,etcd,master   42m   v1.21.8-rc1+k3s1   <REDACTED>    <REDACTED>   Ubuntu 20.04 LTS   5.4.0-1009-aws   containerd://1.4.12-k3s1
node/server3     Ready    control-plane,etcd,master   40m   v1.21.8-rc1+k3s1   <REDACTED>     <REDACTED>     Ubuntu 20.04 LTS   5.4.0-1009-aws   containerd://1.4.12-k3s1
  • Stop all the servers once initialized
  • Start server 2 and server 3
  • Verify they are up and running and no major errors are displayed in the logs
$ kubectl get nodes,pods -A -o wide
NAME                    STATUS     ROLES                       AGE   VERSION            INTERNAL-IP     EXTERNAL-IP     OS-IMAGE           KERNEL-VERSION   CONTAINER-RUNTIME
node/server2   Ready      control-plane,etcd,master   62m   v1.21.8-rc1+k3s1   172.31.12.106   3.138.122.137   Ubuntu 20.04 LTS   5.4.0-1009-aws   containerd://1.4.12-k3s1
node/agent    Ready      <none>                      61m   v1.21.8-rc1+k3s1   172.31.15.46    18.117.9.16     Ubuntu 20.04 LTS   5.4.0-1009-aws   containerd://1.4.12-k3s1
node/server1    NotReady   control-plane,etcd,master   65m   v1.21.8-rc1+k3s1   172.31.2.134    18.221.237.67   Ubuntu 20.04 LTS   5.4.0-1009-aws   containerd://1.4.12-k3s1
node/server3     Ready      control-plane,etcd,master   63m   v1.21.8-rc1+k3s1   172.31.7.22     3.144.13.41     Ubuntu 20.04 LTS   5.4.0-1009-aws   containerd://1.4.12-k3s1
  • Start server 1
  • Verify cluster is up and running
  • Deploy workload and validate
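The stop/start sequence above can be sketched as a small script. The host names and SSH access are hypothetical (the validation was run by hand on each node); with DRY_RUN=1 each command is only printed, so the ordering can be inspected without a cluster.

```shell
#!/usr/bin/env sh
# Sketch of the validation sequence, assuming systemd-managed k3s and
# passwordless SSH to hypothetical hosts server1..server3.
DRY_RUN=${DRY_RUN:-1}
run() {
  # In dry-run mode, print the command instead of executing it.
  if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# 1. Stop k3s on every server once the cluster is initialized.
for host in server1 server2 server3; do
  run ssh "$host" sudo systemctl stop k3s
done

# 2. Start the secondary servers first, then the initial server last.
for host in server2 server3 server1; do
  run ssh "$host" sudo systemctl start k3s
done
```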

NOTE: When tested with RC v1.21.8-rc1+k3s1, it was observed that restarting only server 2 while server 1 and server 3 remain stopped leaves server 2 showing the error

$ kubectl get nodes,pods -A -o wide
Error from server (InternalError): an error on the server ("apiserver not ready") has prevented the request from succeeding (get nodes)
Error from server (InternalError): an error on the server ("apiserver not ready") has prevented the request from succeeding (get pods)

Although I can see that k3s is running on server 2:

$ sudo systemctl status k3s
k3s.service - Lightweight Kubernetes
     Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: enabled)
     Active: active (running) since Fri 2021-12-17 07:34:44 UTC; 22s ago
       Docs: https://k3s.io

Please advise whether this is expected @dereknola
CC: @rancher-max @ShylajaDevadiga

@rancher-max (Contributor)

Awesome! That is correct behavior. The error seen above when starting just one node is due to quorum loss in etcd: a single restarted member is not enough to restore quorum, so the apiserver cannot come up.
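The quorum arithmetic behind this can be sketched as follows (a minimal illustration, not part of the original thread):

```shell
# Majority quorum for an N-member etcd cluster is floor(N/2) + 1,
# so a 3-server k3s cluster tolerates the loss of only one etcd member.
N=3
QUORUM=$(( N / 2 + 1 ))

DOWN=2                 # servers 1 and 3 stopped, as in the scenario above
UP=$(( N - DOWN ))
if [ "$UP" -lt "$QUORUM" ]; then
  echo "quorum lost: $UP of $N members up, need $QUORUM"
fi
```

This is why starting server 2 alone leaves the apiserver unavailable, while starting any two of the three servers restores the cluster.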
