Joining multiple servers at the same time causes one to fail to join #4335

rancher-max · 2021-10-27T16:29:30Z

Environmental Info:
K3s Version:

All commit ids:
9a4ca5978b1cf6ab27f983097edbcd17df73bbf8 on release-1.22
d413f971463a34d0191d1e67f6913e161a037589 on release-1.21
ab3d25a2c5f479f77c9579af719506438f5d4fe2 on master

Node(s) CPU architecture, OS, and Version:

Ubuntu 20.04 LTS

Cluster Configuration:

3 servers, all joining at the same time

Describe the bug:

Joining 3 servers with the etcd backend at the same time causes one to fail to join, with no recovery method other than uninstalling and reinstalling k3s on that node.

Steps To Reproduce:

config.yamls:

# server1:
cluster-init: true
token: test

# server2 & 3:
token: test
server: https://<server1 ip>:6443

Supply the config.yaml files shown above, then run the following on all 3 servers as close to simultaneously as possible:
curl -sfL https://get.k3s.io | INSTALL_K3S_TYPE=server INSTALL_K3S_COMMIT=ab3d25a2c5f479f77c9579af719506438f5d4fe2 sh -
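
One way to get the three installs to start at nearly the same moment is to launch them from a single workstation over SSH, each in the background. This is only a minimal sketch: it assumes passwordless SSH to each host and that each host already has its config.yaml in place. The host names `server1` to `server3`, the `install_all` function, and the `DRY_RUN` switch are hypothetical and not part of the original report.

```shell
#!/bin/sh
# Hypothetical host names; replace with the real addresses of the three servers.
HOSTS="${HOSTS:-server1 server2 server3}"
COMMIT="ab3d25a2c5f479f77c9579af719506438f5d4fe2"
INSTALL_CMD="curl -sfL https://get.k3s.io | INSTALL_K3S_TYPE=server INSTALL_K3S_COMMIT=${COMMIT} sh -"

# Start the installer on every host in the background so the joins overlap.
install_all() {
    for host in $HOSTS; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Print what would run instead of contacting the host.
            echo "ssh ${host} '${INSTALL_CMD}'"
        else
            ssh "$host" "$INSTALL_CMD" &
        fi
    done
    wait  # block until every background ssh session has finished
}
```

Setting `DRY_RUN=1` before calling `install_all` prints the three commands for review without connecting to any host.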

Expected behavior:

All three servers should join successfully after some time, possibly after one of them loops through a few log messages such as: level=fatal msg="ETCD join failed: etcdserver: too many learner members in cluster"

Actual behavior:

One of the servers fails to join, and the following logs loop indefinitely:

Oct 27 01:43:24 ip-172-31-27-228 systemd[1]: k3s.service: Scheduled restart job, restart counter is at 28.
Oct 27 01:43:24 ip-172-31-27-228 systemd[1]: Stopped Lightweight Kubernetes.
Oct 27 01:43:25 ip-172-31-27-228 systemd[1]: Starting Lightweight Kubernetes...
Oct 27 01:43:25 ip-172-31-27-228 sh[6517]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Oct 27 01:43:25 ip-172-31-27-228 sh[6522]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
Oct 27 01:43:25 ip-172-31-27-228 systemd[1]: Started Lightweight Kubernetes.
Oct 27 01:43:25 ip-172-31-27-228 k3s[6541]: time="2021-10-27T01:43:25Z" level=info msg="Starting k3s v1.22.2+k3s-ab3d25a2 (ab3d25a2)"
Oct 27 01:43:25 ip-172-31-27-228 k3s[6541]: time="2021-10-27T01:43:25Z" level=warning msg="Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation."
Oct 27 01:43:25 ip-172-31-27-228 k3s[6541]: time="2021-10-27T01:43:25Z" level=info msg="Managed etcd cluster not yet initialized"
Oct 27 01:43:25 ip-172-31-27-228 k3s[6541]: time="2021-10-27T01:43:25Z" level=info msg="Reconciling bootstrap data between datastore and disk"
Oct 27 01:43:25 ip-172-31-27-228 k3s[6541]: time="2021-10-27T01:43:25Z" level=fatal msg="starting kubernetes: preparing server: etcdclient: no available endpoints"
Oct 27 01:43:25 ip-172-31-27-228 systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
Oct 27 01:43:25 ip-172-31-27-228 systemd[1]: k3s.service: Failed with result 'exit-code'.
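
To check whether a node is stuck in this loop rather than in the transient "too many learner members" retry, the fatal line from the logs above can be counted in the journal. A small sketch, where the function name `count_endpoint_failures` is hypothetical and the pattern is taken from the log excerpt in this report:

```shell
#!/bin/sh
# Count fatal "no available endpoints" lines in log text read from stdin.
# grep -c prints 0 and exits non-zero when nothing matches, so the exit
# status is masked to keep the function usable under `set -e`.
count_endpoint_failures() {
    grep -c 'level=fatal msg=.*etcdclient: no available endpoints' || true
}
```

For example, `journalctl -u k3s --no-pager | count_endpoint_failures` on the affected node gives the number of failed start attempts recorded so far; a count that keeps growing alongside the systemd restart counter suggests the node is not recovering.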

Additional context / logs:

The only workaround I can find is to uninstall k3s on the affected node and reinstall it.
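
A rough sketch of that workaround, to be run as root on the affected node. It assumes the config.yaml lives at the default /etc/rancher/k3s/config.yaml and that the uninstall script removes that directory, so the file is copied aside first. The `BACKUP` path, the `run` helper, the `reinstall_k3s_server` function, and the `DRY_RUN` switch are hypothetical.

```shell
#!/bin/sh
COMMIT="ab3d25a2c5f479f77c9579af719506438f5d4fe2"
CONFIG="/etc/rancher/k3s/config.yaml"           # assumed default config location
BACKUP="${BACKUP:-/tmp/k3s-config.yaml.bak}"    # hypothetical backup path

# Run a command line, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        sh -c "$*"
    fi
}

reinstall_k3s_server() {
    run "cp $CONFIG $BACKUP"                                  # keep the join config
    run "/usr/local/bin/k3s-uninstall.sh"                     # remove k3s and its data
    run "mkdir -p /etc/rancher/k3s && cp $BACKUP $CONFIG"     # restore the join config
    run "curl -sfL https://get.k3s.io | INSTALL_K3S_TYPE=server INSTALL_K3S_COMMIT=$COMMIT sh -"
}
```

Setting `DRY_RUN=1` before calling `reinstall_k3s_server` prints the four steps without changing the node.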

Backporting

Needs backporting to older releases


rancher-max · 2021-11-11T23:19:00Z

This has been validated on the master branch using commit ID 8271d98a766b060463bc73ef66c5085b5797b4cc, following the same steps described in the issue. I should still validate on the release-1.21 and release-1.22 branches for backports before closing.

rancher-max · 2021-11-22T22:51:27Z

I validated this on all branches, so I am closing this out.

rancher-max added the kind/bug Something isn't working label Oct 27, 2021

rancher-max added this to the v1.22.3+k3s1 milestone Oct 27, 2021

briandowns self-assigned this Oct 27, 2021

briandowns mentioned this issue Nov 9, 2021

update bootstrap logic #4438

Merged

rancher-max closed this as completed Nov 22, 2021
