Cluster creation on Azure unreliable #915

Closed
chrkl opened this issue May 25, 2020 · 8 comments · Fixed by #1020
Labels: kind/bug

Comments


chrkl commented May 25, 2020

What happened:

After the kubeadm calls, the cluster is often in a state where frequent etcd leader changes happen and API calls fail because of this. As a result, kubeone install frequently fails because the CNI plugin or the machine-controller cannot be installed. The problem seems to be temporary, as a second kubeone install run fixes the cluster.

What is the expected behavior:

Cluster creation succeeds on the first kubeone install run.

How to reproduce the issue:

Create a kubeone cluster on Azure.

Anything else we need to know?

I tried switching to more powerful Azure resources than those given in the Terraform example. Faster storage and VM types didn't bring a significant change. However, when the number of retries performed by KubeOne is increased, the installation succeeds.
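
A crude manual equivalent of that workaround (the loop and the sleep interval are just an illustration, not a KubeOne feature) is to re-run the install until it succeeds:

$ until kubeone install -m config.yaml -t .; do sleep 60; done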

KubeOne output:
...
INFO[21:41:46 CEST] Downloading kubeconfig…
INFO[21:41:46 CEST] Building Kubernetes clientset…
INFO[21:41:47 CEST] Ensure node local DNS cache…
INFO[21:41:47 CEST] Activating additional features…
INFO[21:41:49 CEST] Applying canal CNI plugin…
WARN[21:41:55 CEST] Task failed…
WARN[21:41:55 CEST] error was: failed to get *v1beta1.CustomResourceDefinition object: etcdserver: leader changed
WARN[21:42:00 CEST] Retrying task…
INFO[21:42:00 CEST] Applying canal CNI plugin…
WARN[21:42:13 CEST] Task failed…
WARN[21:42:13 CEST] error was: failed to get *v1beta1.CustomResourceDefinition object: etcdserver: leader changed
WARN[21:42:23 CEST] Retrying task…
INFO[21:42:23 CEST] Applying canal CNI plugin…
WARN[21:42:32 CEST] Task failed…
WARN[21:42:32 CEST] error was: failed to get *v1.ConfigMap object: etcdserver: leader changed
Error: failed to install cni plugin: failed to get *v1.ConfigMap object: etcdserver: leader changed
failed to install cni plugin: failed to get *v1.ConfigMap object: etcdserver: leader changed
ERROR[0412] Error installing kubernetes:

Information about the environment:
KubeOne version (kubeone version): v1.0.0-alpha
Operating system: CentOS
Provider you're deploying cluster on: Azure
Operating system you're deploying on:

chrkl added the kind/bug label May 25, 2020

kron4eg commented May 25, 2020

@chrkl Can you investigate the etcd logs? And check whether the NTP service is enabled?
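
For reference, those checks could look like this on one of the control-plane nodes (assuming the kubeadm static-pod etcd and chrony, as used on these CentOS images; <node-name> is a placeholder):

$ kubectl -n kube-system logs etcd-<node-name>
$ systemctl is-active chronyd && chronyc tracking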


chrkl commented May 25, 2020

Here are the etcd logs. They indicate some long-running requests. The cluster is healthy after a few minutes.

etcd-0.log
etcd-1.log
etcd-2.log

VMs run chrony.
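
The slow requests surface as etcd "took too long" warnings, which can be filtered straight from the pod logs, e.g.:

$ kubectl -n kube-system logs etcd-k1-tf-cp-0 | grep "took too long"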


kron4eg commented Jun 2, 2020

etcd "leader changed" means that there is a problem with one of the etcd members, and it most likely needs to be replaced.
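
One way to find the problematic member is to query etcd directly from inside one of the etcd pods; the endpoint and certificate paths below are the kubeadm defaults and may need adjusting:

$ kubectl -n kube-system exec etcd-<node-name> -- etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    endpoint health --cluster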


kron4eg commented Jun 2, 2020

To find out which one, please use kubeone status.


chrkl commented Jun 15, 2020

Looks like the cluster became temporarily unhealthy:

$ kubeone install -m config.yaml -t .
WARN[13:55:16 CEST] The provided APIVersion "kubeone.io/v1alpha1" is deprecated. Please use "kubeone config migrate" command to migrate to the latest version.
INFO[13:55:16 CEST] Determine hostname…
INFO[13:55:19 CEST] Determine operating system…
INFO[13:55:20 CEST] Installing prerequisites…
INFO[13:55:20 CEST] Creating environment file…                    node=172.16.125.6 os=centos
INFO[13:55:20 CEST] Creating environment file…                    node=172.16.125.4 os=centos
INFO[13:55:20 CEST] Creating environment file…                    node=172.16.125.5 os=centos
INFO[13:55:20 CEST] Configuring proxy…                            node=172.16.125.5 os=centos
INFO[13:55:20 CEST] Installing kubeadm…                           node=172.16.125.5 os=centos
INFO[13:55:20 CEST] Configuring proxy…                            node=172.16.125.4 os=centos
INFO[13:55:20 CEST] Installing kubeadm…                           node=172.16.125.4 os=centos
INFO[13:55:20 CEST] Configuring proxy…                            node=172.16.125.6 os=centos
INFO[13:55:20 CEST] Installing kubeadm…                           node=172.16.125.6 os=centos
INFO[13:56:57 CEST] Generating kubeadm config file…
INFO[13:56:58 CEST] Uploading config files…                       node=172.16.125.6
INFO[13:56:58 CEST] Uploading config files…                       node=172.16.125.4
INFO[13:56:58 CEST] Uploading config files…                       node=172.16.125.5
INFO[13:56:59 CEST] Configuring certs and etcd on first controller…
INFO[13:56:59 CEST] Ensuring Certificates…                        node=172.16.125.4
INFO[13:57:02 CEST] Downloading PKI…
INFO[13:57:02 CEST] Downloading PKI files…                        node=172.16.125.4
INFO[13:57:03 CEST] Creating local backup…                        node=172.16.125.4
INFO[13:57:03 CEST] Deploying PKI…
INFO[13:57:03 CEST] Uploading PKI files…                          node=172.16.125.6
INFO[13:57:03 CEST] Uploading PKI files…                          node=172.16.125.5
INFO[13:57:05 CEST] Configuring certs and etcd on consecutive controller…
INFO[13:57:05 CEST] Ensuring Certificates…                        node=172.16.125.6
INFO[13:57:05 CEST] Ensuring Certificates…                        node=172.16.125.5
INFO[13:57:07 CEST] Initializing Kubernetes on leader…
INFO[13:57:07 CEST] Running kubeadm…                              node=172.16.125.4
INFO[13:58:12 CEST] Building Kubernetes clientset…
INFO[13:58:13 CEST] Check if cluster needs any repairs…
INFO[13:58:13 CEST] Joining controlplane node…
INFO[13:58:13 CEST] Waiting 15s to ensure main control plane components are up…  node=172.16.125.5
INFO[13:58:28 CEST] Joining control plane node                    node=172.16.125.5
INFO[13:59:39 CEST] Waiting 15s to ensure main control plane components are up…  node=172.16.125.6
INFO[13:59:54 CEST] Joining control plane node                    node=172.16.125.6
INFO[14:01:04 CEST] Copying Kubeconfig to home directory…         node=172.16.125.6
INFO[14:01:04 CEST] Copying Kubeconfig to home directory…         node=172.16.125.4
INFO[14:01:04 CEST] Copying Kubeconfig to home directory…         node=172.16.125.5
INFO[14:01:05 CEST] Downloading kubeconfig…
INFO[14:01:05 CEST] Ensure node local DNS cache…
INFO[14:01:05 CEST] Activating additional features…
INFO[14:01:06 CEST] Applying canal CNI plugin…
INFO[14:01:15 CEST] Skipping applying addons because addons are not enabled…
INFO[14:01:15 CEST] Creating credentials secret…
INFO[14:01:15 CEST] Installing machine-controller…
WARN[14:01:25 CEST] Task failed, error was: failed to deploy machine-controller: failed to ensure machine-controller *v1beta1.CustomResourceDefinition: failed to get *v1beta1.CustomResourceDefinition object: etcdserver: request timed out
WARN[14:01:30 CEST] Retrying task…
INFO[14:01:30 CEST] Installing machine-controller…
INFO[14:01:34 CEST] Installing machine-controller webhooks…
WARN[14:01:36 CEST] Task failed, error was: failed to deploy machine-controller webhook configuration: failed to ensure machine-controller webhook *v1.Service: failed to create *v1.Service: Internal error occurred: failed to allocate a serviceIP: etcdserver: leader changed
WARN[14:01:46 CEST] Retrying task…
INFO[14:01:46 CEST] Installing machine-controller…
INFO[14:01:48 CEST] Installing machine-controller webhooks…
INFO[14:01:52 CEST] Waiting for machine-controller to come up…
WARN[14:02:35 CEST] Task failed, error was: machine-controller-webhook did not come up: failed to list machine-controller's webhook pods: etcdserver: leader changed
WARN[14:02:40 CEST] Retrying task…
INFO[14:02:40 CEST] Waiting for machine-controller to come up…
INFO[14:03:05 CEST] Creating worker machines…

Polling the status several times during the installation resulted in the following output:

$ kubeone status -m config.yaml -t .
WARN[14:02:04 CEST] The provided APIVersion "kubeone.io/v1alpha1" is deprecated. Please use "kubeone config migrate" command to migrate to the latest version.
INFO[14:02:04 CEST] Determine hostname…
INFO[14:02:06 CEST] Determine operating system…
INFO[14:02:06 CEST] Building Kubernetes clientset…
INFO[14:02:06 CEST] Verifying that Docker, Kubelet and Kubeadm are installed…
INFO[14:02:07 CEST] Verifying that nodes in the cluster match nodes defined in the manifest…
INFO[14:02:07 CEST] Verifying that all nodes in the cluster are ready…
INFO[14:02:07 CEST] Verifying that there is no upgrade in the progress…
NODE         VERSION   APISERVER   ETCD
k1-tf-cp-0   v1.18.1   healthy     healthy
k1-tf-cp-1   v1.18.1   healthy     healthy
k1-tf-cp-2   v1.18.1   healthy     healthy
$ kubeone status -m config.yaml -t .
WARN[14:02:10 CEST] The provided APIVersion "kubeone.io/v1alpha1" is deprecated. Please use "kubeone config migrate" command to migrate to the latest version.
INFO[14:02:10 CEST] Determine hostname…
INFO[14:02:12 CEST] Determine operating system…
INFO[14:02:13 CEST] Building Kubernetes clientset…
INFO[14:02:13 CEST] Verifying that Docker, Kubelet and Kubeadm are installed…
INFO[14:02:13 CEST] Verifying that nodes in the cluster match nodes defined in the manifest…
INFO[14:02:13 CEST] Verifying that all nodes in the cluster are ready…
INFO[14:02:13 CEST] Verifying that there is no upgrade in the progress…
NODE         VERSION   APISERVER   ETCD
k1-tf-cp-0   v1.18.1   healthy     healthy
k1-tf-cp-1   v1.18.1   healthy     healthy
k1-tf-cp-2   v1.18.1   healthy     healthy
$ kubeone status -m config.yaml -t .
WARN[14:02:16 CEST] The provided APIVersion "kubeone.io/v1alpha1" is deprecated. Please use "kubeone config migrate" command to migrate to the latest version.
INFO[14:02:16 CEST] Determine hostname…
INFO[14:02:18 CEST] Determine operating system…
INFO[14:02:18 CEST] Building Kubernetes clientset…
INFO[14:02:19 CEST] Verifying that Docker, Kubelet and Kubeadm are installed…
INFO[14:02:19 CEST] Verifying that nodes in the cluster match nodes defined in the manifest…
INFO[14:02:19 CEST] Verifying that all nodes in the cluster are ready…
INFO[14:02:19 CEST] Verifying that there is no upgrade in the progress…
NODE         VERSION   APISERVER   ETCD
k1-tf-cp-0   v1.18.1   unhealthy   unhealthy
k1-tf-cp-1   v1.18.1   unhealthy   unhealthy
k1-tf-cp-2   v1.18.1   unhealthy   unhealthy
$ kubeone status -m config.yaml -t .
WARN[14:02:50 CEST] The provided APIVersion "kubeone.io/v1alpha1" is deprecated. Please use "kubeone config migrate" command to migrate to the latest version.
INFO[14:02:50 CEST] Determine hostname…
INFO[14:02:52 CEST] Determine operating system…
INFO[14:02:52 CEST] Building Kubernetes clientset…
INFO[14:02:52 CEST] Verifying that Docker, Kubelet and Kubeadm are installed…
INFO[14:02:52 CEST] Verifying that nodes in the cluster match nodes defined in the manifest…
INFO[14:02:52 CEST] Verifying that all nodes in the cluster are ready…
INFO[14:02:52 CEST] Verifying that there is no upgrade in the progress…
NODE         VERSION   APISERVER   ETCD
k1-tf-cp-0   v1.18.1   healthy     healthy
k1-tf-cp-1   v1.18.1   healthy     healthy
k1-tf-cp-2   v1.18.1   healthy     healthy
$ kubeone status -m config.yaml -t .
WARN[14:02:56 CEST] The provided APIVersion "kubeone.io/v1alpha1" is deprecated. Please use "kubeone config migrate" command to migrate to the latest version.
INFO[14:02:56 CEST] Determine hostname…
INFO[14:02:57 CEST] Determine operating system…
INFO[14:02:58 CEST] Building Kubernetes clientset…
INFO[14:02:58 CEST] Verifying that Docker, Kubelet and Kubeadm are installed…
INFO[14:02:58 CEST] Verifying that nodes in the cluster match nodes defined in the manifest…
INFO[14:02:58 CEST] Verifying that all nodes in the cluster are ready…
INFO[14:02:58 CEST] Verifying that there is no upgrade in the progress…
NODE         VERSION   APISERVER   ETCD
k1-tf-cp-0   v1.18.1   healthy     healthy
k1-tf-cp-1   v1.18.1   healthy     healthy
k1-tf-cp-2   v1.18.1   healthy     healthy
$ kubeone status -m config.yaml -t .
WARN[14:03:07 CEST] The provided APIVersion "kubeone.io/v1alpha1" is deprecated. Please use "kubeone config migrate" command to migrate to the latest version.
INFO[14:03:07 CEST] Determine hostname…
INFO[14:03:09 CEST] Determine operating system…
INFO[14:03:13 CEST] Building Kubernetes clientset…
INFO[14:03:14 CEST] Verifying that Docker, Kubelet and Kubeadm are installed…
INFO[14:03:15 CEST] Verifying that nodes in the cluster match nodes defined in the manifest…
INFO[14:03:15 CEST] Verifying that all nodes in the cluster are ready…
INFO[14:03:15 CEST] Verifying that there is no upgrade in the progress…
NODE         VERSION   APISERVER   ETCD
k1-tf-cp-0   v1.18.1   healthy     healthy
k1-tf-cp-1   v1.18.1   healthy     healthy
k1-tf-cp-2   v1.18.1   healthy     healthy

For completeness, here again are the etcd logs of this attempt (timezone +2h on my machine):
etcd-0.log
etcd-1.log
etcd-2.log


kron4eg commented Jun 15, 2020

I'd say that the VMs are hosted on overloaded hosts; in the logs some requests take 5-10 seconds, which is unreasonably long.

So either the etcd storage is half-dead, or the hosts are. Anyway, it's not something we can fix.

EDIT:
Or the network between the CP instances is that bad.
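
Both suspicions can be checked on the nodes themselves. A fio run commonly used to measure etcd-style fsync latency (parameters follow the etcd community's disk benchmarking guidance; the test directory is arbitrary), plus a quick peer-latency check:

$ fio --rw=write --ioengine=sync --fdatasync=1 --directory=/tmp/etcd-bench --size=22m --bs=2300 --name=etcd-fsync-test
$ ping -c 20 <peer-cp-ip>    # etcd's default heartbeat interval is 100ms; RTT should stay well below that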


chrkl commented Jun 15, 2020

This seems to happen only shortly after VM creation. Could we make cluster creation more reliable by increasing the number of retries for installing the CNI plugin and the machine-controller?


kron4eg commented Jun 15, 2020

Yeah, we can do this, but since this is a band-aid I'm not sure it would help 🤷‍♂️
