Joining multiple servers at the same time causes one to fail to join #4335

Closed
1 task done
rancher-max opened this issue Oct 27, 2021 · 2 comments
Labels
kind/bug Something isn't working

@rancher-max
Contributor

Environmental Info:
K3s Version:

All commit ids:
9a4ca5978b1cf6ab27f983097edbcd17df73bbf8 on release-1.22
d413f971463a34d0191d1e67f6913e161a037589 on release-1.21
ab3d25a2c5f479f77c9579af719506438f5d4fe2 on master

Node(s) CPU architecture, OS, and Version:

Ubuntu 20.04 LTS

Cluster Configuration:

3 servers, all joining at the same time

Describe the bug:

Joining 3 servers with an etcd backend at the same time causes one of them to fail to join, with no recovery method other than uninstalling and reinstalling k3s on that node.

Steps To Reproduce:

config.yamls:

# server1:
cluster-init: true
token: test
# server2 & 3:
token: test
server: https://<server1 ip>:6443

Supply the config.yaml files shown above, then run the following on all 3 servers as close to simultaneously as possible:
curl -sfL https://get.k3s.io | INSTALL_K3S_TYPE=server INSTALL_K3S_COMMIT=ab3d25a2c5f479f77c9579af719506438f5d4fe2 sh -
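
One way to get the three installs to start at nearly the same moment is to launch them from a single control machine as background jobs. The following is only a rough sketch, not the procedure used for this report: the `REMOTE` variable and the `install_all` helper are hypothetical names, and it assumes passwordless SSH to each host with the config.yaml already in place at /etc/rancher/k3s/config.yaml.

```shell
#!/bin/sh
# Sketch: start the k3s install on several hosts at (nearly) the same time.
# Assumptions (not from the report): hosts are reachable over SSH without a
# password prompt, and /etc/rancher/k3s/config.yaml already exists on each.

# Command used to reach a host; override (e.g. REMOTE=echo) for a dry run.
REMOTE="${REMOTE:-ssh}"

# Install command taken from the reproduction steps.
INSTALL_CMD='curl -sfL https://get.k3s.io | INSTALL_K3S_TYPE=server INSTALL_K3S_COMMIT=ab3d25a2c5f479f77c9579af719506438f5d4fe2 sh -'

# install_all HOST...: run INSTALL_CMD on every host in parallel and
# return non-zero if any of the remote commands failed.
install_all() {
    pids=""
    for host in "$@"; do
        $REMOTE "$host" "$INSTALL_CMD" &
        pids="$pids $!"
    done
    rc=0
    for pid in $pids; do
        wait "$pid" || rc=1
    done
    return "$rc"
}
```

With placeholder host names, this would be called as `install_all server1 server2 server3`. Setting `REMOTE=echo` first prints what would be run instead of connecting.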

Expected behavior:

All three servers should join successfully after some time, possibly after one of them loops a few times on a log line like: level=fatal msg="ETCD join failed: etcdserver: too many learner members in cluster"

Actual behavior:

One of the servers fails to join and loops on the following logs indefinitely:

Oct 27 01:43:24 ip-172-31-27-228 systemd[1]: k3s.service: Scheduled restart job, restart counter is at 28.
Oct 27 01:43:24 ip-172-31-27-228 systemd[1]: Stopped Lightweight Kubernetes.
Oct 27 01:43:25 ip-172-31-27-228 systemd[1]: Starting Lightweight Kubernetes...
Oct 27 01:43:25 ip-172-31-27-228 sh[6517]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Oct 27 01:43:25 ip-172-31-27-228 sh[6522]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
Oct 27 01:43:25 ip-172-31-27-228 systemd[1]: Started Lightweight Kubernetes.
Oct 27 01:43:25 ip-172-31-27-228 k3s[6541]: time="2021-10-27T01:43:25Z" level=info msg="Starting k3s v1.22.2+k3s-ab3d25a2 (ab3d25a2)"
Oct 27 01:43:25 ip-172-31-27-228 k3s[6541]: time="2021-10-27T01:43:25Z" level=warning msg="Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation."
Oct 27 01:43:25 ip-172-31-27-228 k3s[6541]: time="2021-10-27T01:43:25Z" level=info msg="Managed etcd cluster not yet initialized"
Oct 27 01:43:25 ip-172-31-27-228 k3s[6541]: time="2021-10-27T01:43:25Z" level=info msg="Reconciling bootstrap data between datastore and disk"
Oct 27 01:43:25 ip-172-31-27-228 k3s[6541]: time="2021-10-27T01:43:25Z" level=fatal msg="starting kubernetes: preparing server: etcdclient: no available endpoints"
Oct 27 01:43:25 ip-172-31-27-228 systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
Oct 27 01:43:25 ip-172-31-27-228 systemd[1]: k3s.service: Failed with result 'exit-code'.
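
To check whether a node is stuck in this loop, the fatal message can be counted in the unit's journal. A minimal sketch follows; `count_no_endpoints` is a hypothetical helper name, and it reads log lines on stdin so it can be fed from `journalctl`.

```shell
#!/bin/sh
# Sketch: count occurrences of the fatal message from the looping logs.
# count_no_endpoints is a hypothetical helper name, not part of k3s.

# Reads log lines on stdin and prints how many contain the fatal
# "etcdclient: no available endpoints" message. grep -c exits non-zero
# when the count is 0, so "|| true" keeps the helper's status at 0.
count_no_endpoints() {
    grep -c 'level=fatal msg=".*etcdclient: no available endpoints"' || true
}

# Example (on the affected node):
#   journalctl -u k3s --no-pager | count_no_endpoints
```

A count that keeps growing alongside the systemd restart counter suggests the node is in the state described here rather than the transient "too many learner members" retry.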

Additional context / logs:

The only workaround I can find is to uninstall k3s on the affected node and reinstall it.
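
The workaround could be scripted roughly as below. This is a sketch under assumptions, not a verified procedure: it assumes k3s was installed with the get.k3s.io script (which places `k3s-uninstall.sh` in /usr/local/bin), that the config lives at /etc/rancher/k3s/config.yaml, and that it runs as root. The `run` and `reinstall_k3s` helpers and the backup path are hypothetical. The config is copied aside first in case the uninstall removes it.

```shell
#!/bin/sh
# Sketch of the uninstall-and-reinstall workaround on the affected node.
# Assumptions: installed via get.k3s.io, config at the default path, run as root.

CONFIG=/etc/rancher/k3s/config.yaml
BACKUP=/root/k3s-config.yaml.bak   # hypothetical backup location

# run CMD...: execute the command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Back up the config, uninstall, restore the config, then reinstall with
# the same install command as in the reproduction steps. Stops at the
# first step that fails.
reinstall_k3s() {
    run cp "$CONFIG" "$BACKUP" &&
    run /usr/local/bin/k3s-uninstall.sh &&
    run mkdir -p /etc/rancher/k3s &&
    run cp "$BACKUP" "$CONFIG" &&
    run sh -c 'curl -sfL https://get.k3s.io | INSTALL_K3S_TYPE=server INSTALL_K3S_COMMIT=ab3d25a2c5f479f77c9579af719506438f5d4fe2 sh -'
}
```

Running with `DRY_RUN=1` set prints the five steps without changing the node, which is a reasonable first check before running it for real.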

Backporting

  • Needs backporting to older releases
@rancher-max
Contributor Author

This has been validated on the master branch using commit ID 8271d98a766b060463bc73ef66c5085b5797b4cc, following the same steps as described in the issue. I still need to validate on the release-1.21 and release-1.22 branches for the backports before closing.

@rancher-max
Contributor Author

I validated this on all branches, so I am closing this out.
