
Nodes aren't joining a new cluster when using an external DB #3226

Closed
rudimk opened this issue Apr 23, 2021 · 16 comments
Labels: kind/bug · kind/internal · priority/important-soon

@rudimk commented Apr 23, 2021

Environmental Info:
K3s Version:
v1.20.6+k3s1 (8d04328)

Node(s) CPU architecture, OS, and Version:
Linux leankube-master-2 5.4.0-1038-aws #40-Ubuntu SMP Fri Feb 5 23:50:40 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:
Just two servers for running a Rancher cluster. No agents.

Describe the bug:
When one spins up multiple server nodes with an external DB backend, the first node starts up okay, but the rest don't join the cluster. This is essentially the same behaviour as #3130.

Steps To Reproduce:

  • Installed K3s: downloaded the installer script from https://get.k3s.io and ran it: ./k3sInstaller.sh --datastore-endpoint "mysql://user:password@tcp(X.X.X.X:3306)/rancher" --tls-san Y.Y.Y.Y (see the sketch after this list)
  • Repeat on all nodes.
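
A minimal sketch of the equivalent invocation with the stock installer, run identically on every server node (credentials and IPs are placeholders; --datastore-endpoint and --tls-san are standard k3s server flags):

```bash
# Run the same command on each server: all of them point at the same
# external MySQL datastore and advertise the load balancer IP via --tls-san.
curl -sfL https://get.k3s.io | sh -s - server \
  --datastore-endpoint "mysql://user:password@tcp(X.X.X.X:3306)/rancher" \
  --tls-san Y.Y.Y.Y
```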

Expected behavior:

All nodes join the cluster.

Actual behavior:

The first node works okay. Other nodes don't join the cluster and throw exceptions about being unable to read/verify certificates.

Additional context / logs:

Apr 23 19:27:17 leankube-master-2 k3s[17118]: time="2021-04-23T19:27:17.429965524Z" level=error msg="Failed to authenticate request from 172.31.7.94:50818: [x509: certificate signed by unknown authority, verifying certificate SN=8450779571907569382, SKID=, AKID=8C:C2:52:8A:37:23:D0:66:80:A4:EE:67:1B:41:21:AC:5A:F0:4D:B0 failed: x509: certificate signed by unknown authority (possibly because of \"x509: ECDSA verification failure\" while trying to verify candidate authority certificate \"k3s-client-ca@1619205961\")]"
Apr 23 19:27:17 leankube-master-2 k3s[17118]: time="2021-04-23T19:27:17.753177982Z" level=info msg="Waiting for control-plane node leankube-master-2 startup: nodes \"leankube-master-2\" not found"
Apr 23 19:27:18 leankube-master-2 k3s[17118]: I0423 19:27:18.248879   17118 kubelet.go:449] kubelet nodes not sync
Apr 23 19:27:18 leankube-master-2 k3s[17118]: time="2021-04-23T19:27:18.377884930Z" level=info msg="Cluster-Http-Server 2021/04/23 19:27:18 http: TLS handshake error from 127.0.0.1:37414: remote error: tls: bad certificate"
@brandond (Member) commented Apr 23, 2021

Can you attach the complete K3s service logs from all the nodes?

Are you by any chance running the installer on nodes that were previously a part of a different K3s cluster and have old cert files left on disk?

@rudimk (Author) commented Apr 24, 2021

So something weird happened. After posting this issue, I decided to call it a night and get some shut-eye - except the cluster started working on its own.

> kubectl --kubeconfig output/65.2.28.73-kubeconfig.yaml get nodes
NAME                STATUS   ROLES                  AGE     VERSION
leankube-master-2   Ready    control-plane,master   4h41m   v1.20.6+k3s1
leankube-master-1   Ready    control-plane,master   8h      v1.20.6+k3s1

As for your question about running the installer on nodes that previously hosted a different cluster: I thought of that too, but I see the same error on both recycled and fresh nodes. The first node, leankube-master-1, worked fine, while leankube-master-2 didn't join the cluster for about 4 hours, throwing the error included above. Now not only has it joined the cluster after that 4-hour gap, but the error it kept throwing is also being logged on the first node - and interestingly, the first node is still part of the cluster.

One last thing. Not sure if it matters much, but both of these nodes are running K3s with the --tls-san flag; I use it to specify the public IP of another VM in the same network that runs HAProxy and simply passes all traffic on *:6443 through to the two nodes.

Here are the logs for the two nodes: logs.zip

@rudimk (Author) commented Apr 29, 2021

Tried another test with the same installer script and K3s version, but on Ubuntu 18.04 this time, and it works just fine. On Ubuntu 20.04, the same workflow now hits a wildly different issue: using the downloaded kubeconfig results in Unable to connect to the server: x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "k3s-server-ca@1619684858"). That is in spite of using the --tls-san flag when spinning up the masters to include the IP of an HAProxy VM configured to forward to port 6443 on the master nodes. The same setup works fine on Ubuntu 18.04, with no difference in the installer script, the K3s binary version, or the underlying infrastructure.
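
One rough way to check whether the CA being served on 6443 actually matches the one embedded in the downloaded kubeconfig (a diagnostic sketch only; the kubeconfig path and the Y.Y.Y.Y load balancer address are placeholders from this thread):

```bash
# Extract the CA from the kubeconfig and compare it with the issuer of the
# certificate presented on the API endpoint behind HAProxy.
grep certificate-authority-data output/65.2.28.73-kubeconfig.yaml \
  | awk '{print $2}' | base64 -d > kubeconfig-ca.crt
openssl x509 -in kubeconfig-ca.crt -noout -subject

openssl s_client -connect Y.Y.Y.Y:6443 -showcerts </dev/null 2>/dev/null \
  | openssl x509 -noout -issuer
```

If the issuer does not match the kubeconfig's CA subject, the kubeconfig and the serving certificate came from different bootstrap runs.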

Since I'm unable to replicate this issue now, I'm looking for guidance on whether you'd like to close it or look into it further.

@dnoland1 (Contributor) commented:

@brandond I might be hitting this issue or something similar on k3s 1.20.6. I can give you full logs if you DM me.

@brandond (Member) commented:

I am guessing that this is related to some of the certificate bootstrap sequencing changes that went into 1.20.6 for backup-restore, but nothing's jumping out at me.
v1.20.5+k3s1...v1.20.6+k3s1

brandond added the kind/bug label May 20, 2021
brandond added this to the v1.20.8+k3s1 milestone May 20, 2021
@brandond (Member) commented May 20, 2021

Just to be clear on the code path here, which I think has pretty likely always contained a race condition:

If both servers come up at the same time, they will both find the database empty, create new cluster certificates, and then store them in the database under the bootstrap key. One of them will do so first; the other will get an error when trying to store the bootstrap data because the key already exists. This should be a fatal error that causes it to exit and get restarted by systemd. Due to #3015 it will not re-read the proper certificates from the datastore when it starts back up - it will continue joining the cluster using the different certificates it generated the first time it started, which were never written to the datastore.

It's possible that the improvements for backup/restore made the startup a little faster and therefore more likely to race.
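
One rough way to confirm this on a live pair of servers (a sketch, assuming the default k3s data directory and SSH access to both nodes) is to compare the client CA fingerprints each server is actually using:

```bash
# If the bootstrap race happened, the two servers will report different
# fingerprints for what should be the same cluster client CA.
for host in leankube-master-1 leankube-master-2; do
  echo "== ${host}"
  ssh "${host}" sudo openssl x509 \
    -in /var/lib/rancher/k3s/server/tls/client-ca.crt \
    -noout -fingerprint -sha256
done
```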

davidnuzik added the priority/important-soon label May 21, 2021
@rudimk (Author) commented May 23, 2021

> Just to be clear on the code path here, which I think has pretty likely always contained a race condition:
>
> If both servers come up at the same time, they will both find the database empty, create new cluster certificates, and then store them in the database under the bootstrap key. One of them will do so first; the other will get an error when trying to store the bootstrap data because the key already exists. This should be a fatal error that causes it to exit and get restarted by systemd. Due to #3015 it will not re-read the proper certificates from the datastore when it starts back up - it will continue joining the cluster using the different certificates it generated the first time it started, which were never written to the datastore.
>
> It's possible that the improvements for backup/restore made the startup a little faster and therefore more likely to race.

I now see how I managed to run into this error, then. Usually I install K3s manually, but lately I've been doing it with Ansible, which runs the K3s installer script on each node in parallel - that's probably how we're running into this.

@brandond (Member) commented:

> I now see how I managed to run into this error, then. Usually I install K3s manually, but lately I've been doing it with Ansible, which runs the K3s installer script on each node in parallel - that's probably how we're running into this.

If you can reconfigure your playbook to jitter the task initiations by a couple seconds, or run them sequentially, that should fix it.
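
A minimal sketch of the jitter idea (not from this thread; NODE_INDEX is an assumed variable supplied by your provisioning tool, and in Ansible the sequential alternative is simply running the play with serial: 1):

```bash
# Let only the first server bootstrap the datastore; the others start a few
# seconds later and read the existing bootstrap key instead of racing.
if [ "${NODE_INDEX:-0}" -ne 0 ]; then
  sleep $(( 10 + NODE_INDEX * 5 ))   # simple per-node startup jitter
fi
./k3sInstaller.sh --datastore-endpoint "mysql://user:password@tcp(X.X.X.X:3306)/rancher" \
  --tls-san Y.Y.Y.Y
```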

@rudimk (Author) commented May 25, 2021

Yep, I did that yesterday. Before doing that, I had run those playbooks again to deploy a production K3s cluster with three master nodes. Surprisingly, that one worked just fine - because I was running it over a VPN, and the network lag accidentally ensured the installer ran on the first node a couple of seconds before the others. So yes, it's definitely the race condition you mentioned.

cjellick modified the milestones: v1.20.8+k3s1 → v1.20.9+k3s1 Jun 15, 2021
@fapatel1 commented:

This will need issue #3015 to be completed as a dependency.

@davidnuzik (Contributor) commented:

Needed in the July timeframe.

@cjellick (Contributor) commented:

I think the fix for this issue will need to be ported to 1.21 and master.

@zhoub commented Sep 9, 2021

Bump - seeing the same issue on 1.21.4.

cwayne18 modified the milestones: v1.20.11+k3s1 → v1.20.12+k3s1 Sep 27, 2021
@ShylajaDevadiga (Contributor) commented:

Closing this issue, as it was validated as part of #3015:

$ kubectl get nodes
NAME              STATUS   ROLES                  AGE     VERSION
ip-172-31-5-79    Ready    control-plane,master   4m30s   v1.20.13+k3s1
ip-172-31-2-191   Ready    control-plane,master   4m16s   v1.20.13+k3s1
ip-172-31-9-64    Ready    <none>                 3m6s    v1.20.13+k3s1
ip-172-31-4-62    Ready    control-plane,master   6m18s   v1.20.13+k3s1

@rudimk (Author) commented Nov 19, 2022

Hey guys - we're seeing this again on K3s 1.20.15, on EC2 nodes running Ubuntu 20.04. Since the nodes are provisioned by an autoscaling group, there's no real way to introduce a jitter or delay to keep the nodes from coming up in parallel.

@rudimk (Author) commented Nov 19, 2022

Okay, never mind - I just found a comment from @briandowns here: #3950 (comment). Guess it's time to move up to 1.21. :)
