Cluster creation on Azure unreliable #915

Closed
chrkl opened this issue May 25, 2020 · 8 comments · Fixed by #1020
Labels: kind/bug

Comments


chrkl commented May 25, 2020

What happened:

After the kubeadm calls, the cluster is often in a state where frequent etcd leader changes happen and API calls fail because of this. As a result, kubeone install frequently fails because the CNI plugin or the machine-controller cannot be installed. The problem seems to be temporary, as a second kubeone install run fixes the cluster.

What is the expected behavior:

Cluster creation succeeds on the first kubeone install run.

How to reproduce the issue:

Create a kubeone cluster on Azure.

Anything else we need to know?

I tried switching to more powerful Azure resources than those given in the Terraform example. Faster storage and VM types didn't bring a significant change. However, when the number of retries performed by KubeOne is increased, the installation succeeds.
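
A crude manual equivalent of that workaround (the loop and the sleep interval are just an illustration, not a KubeOne feature) is to re-run the install until it succeeds:

$ until kubeone install -m config.yaml -t .; do sleep 60; done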

KubeOne output:
...
INFO[21:41:46 CEST] Downloading kubeconfig…
INFO[21:41:46 CEST] Building Kubernetes clientset…
INFO[21:41:47 CEST] Ensure node local DNS cache…
INFO[21:41:47 CEST] Activating additional features…
INFO[21:41:49 CEST] Applying canal CNI plugin…
WARN[21:41:55 CEST] Task failed…
WARN[21:41:55 CEST] error was: failed to get *v1beta1.CustomResourceDefinition object: etcdserver: leader changed
WARN[21:42:00 CEST] Retrying task…
INFO[21:42:00 CEST] Applying canal CNI plugin…
WARN[21:42:13 CEST] Task failed…
WARN[21:42:13 CEST] error was: failed to get *v1beta1.CustomResourceDefinition object: etcdserver: leader changed
WARN[21:42:23 CEST] Retrying task…
INFO[21:42:23 CEST] Applying canal CNI plugin…
WARN[21:42:32 CEST] Task failed…
WARN[21:42:32 CEST] error was: failed to get *v1.ConfigMap object: etcdserver: leader changed
Error: failed to install cni plugin: failed to get *v1.ConfigMap object: etcdserver: leader changed
failed to install cni plugin: failed to get *v1.ConfigMap object: etcdserver: leader changed
ERROR[0412] Error installing kubernetes:

Information about the environment:
KubeOne version (kubeone version): v1.0.0-alpha
Operating system: CentOS
Provider you're deploying cluster on: Azure
Operating system you're deploying on:

chrkl added the kind/bug label May 25, 2020

kron4eg commented May 25, 2020

@chrkl Can you investigate the etcd logs? And check whether the NTP service is enabled?
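
For reference, those checks could look like this on one of the control-plane nodes (assuming the kubeadm static-pod etcd and chrony, as used on these CentOS images; <node-name> is a placeholder):

$ kubectl -n kube-system logs etcd-<node-name>
$ systemctl is-active chronyd && chronyc tracking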


chrkl commented May 25, 2020

Here are the etcd logs. They indicate some long-running requests. The cluster is healthy after a few minutes.

etcd-0.log
etcd-1.log
etcd-2.log

VMs run chrony.
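
The slow requests surface as etcd "took too long" warnings, which can be filtered straight from the pod logs, e.g.:

$ kubectl -n kube-system logs etcd-k1-tf-cp-0 | grep "took too long"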


kron4eg commented Jun 2, 2020

etcd "leader changed" means that there is a problem with one of the etcd members, and it most likely needs to be replaced.
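
One way to find the problematic member is to query etcd directly from inside one of the etcd pods; the endpoint and certificate paths below are the kubeadm defaults and may need adjusting:

$ kubectl -n kube-system exec etcd-<node-name> -- etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    endpoint health --cluster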


kron4eg commented Jun 2, 2020

To find out which one, please use kubeone status.


chrkl commented Jun 15, 2020

Looks like the cluster became temporarily unhealthy:

$ kubeone install -m config.yaml -t .
WARN[13:55:16 CEST] The provided APIVersion "kubeone.io/v1alpha1" is deprecated. Please use "kubeone config migrate" command to migrate to the latest version.
INFO[13:55:16 CEST] Determine hostname…
INFO[13:55:19 CEST] Determine operating system…
INFO[13:55:20 CEST] Installing prerequisites…
INFO[13:55:20 CEST] Creating environment file…                    node=172.16.125.6 os=centos
INFO[13:55:20 CEST] Creating environment file…                    node=172.16.125.4 os=centos
INFO[13:55:20 CEST] Creating environment file…                    node=172.16.125.5 os=centos
INFO[13:55:20 CEST] Configuring proxy…                            node=172.16.125.5 os=centos
INFO[13:55:20 CEST] Installing kubeadm…                           node=172.16.125.5 os=centos
INFO[13:55:20 CEST] Configuring proxy…                            node=172.16.125.4 os=centos
INFO[13:55:20 CEST] Installing kubeadm…                           node=172.16.125.4 os=centos
INFO[13:55:20 CEST] Configuring proxy…                            node=172.16.125.6 os=centos
INFO[13:55:20 CEST] Installing kubeadm…                           node=172.16.125.6 os=centos
INFO[13:56:57 CEST] Generating kubeadm config file…
INFO[13:56:58 CEST] Uploading config files…                       node=172.16.125.6
INFO[13:56:58 CEST] Uploading config files…                       node=172.16.125.4
INFO[13:56:58 CEST] Uploading config files…                       node=172.16.125.5
INFO[13:56:59 CEST] Configuring certs and etcd on first controller…
INFO[13:56:59 CEST] Ensuring Certificates…                        node=172.16.125.4
INFO[13:57:02 CEST] Downloading PKI…
INFO[13:57:02 CEST] Downloading PKI files…                        node=172.16.125.4
INFO[13:57:03 CEST] Creating local backup…                        node=172.16.125.4
INFO[13:57:03 CEST] Deploying PKI…
INFO[13:57:03 CEST] Uploading PKI files…                          node=172.16.125.6
INFO[13:57:03 CEST] Uploading PKI files…                          node=172.16.125.5
INFO[13:57:05 CEST] Configuring certs and etcd on consecutive controller…
INFO[13:57:05 CEST] Ensuring Certificates…                        node=172.16.125.6
INFO[13:57:05 CEST] Ensuring Certificates…                        node=172.16.125.5
INFO[13:57:07 CEST] Initializing Kubernetes on leader…
INFO[13:57:07 CEST] Running kubeadm…                              node=172.16.125.4
INFO[13:58:12 CEST] Building Kubernetes clientset…
INFO[13:58:13 CEST] Check if cluster needs any repairs…
INFO[13:58:13 CEST] Joining controlplane node…
INFO[13:58:13 CEST] Waiting 15s to ensure main control plane components are up…  node=172.16.125.5
INFO[13:58:28 CEST] Joining control plane node                    node=172.16.125.5
INFO[13:59:39 CEST] Waiting 15s to ensure main control plane components are up…  node=172.16.125.6
INFO[13:59:54 CEST] Joining control plane node                    node=172.16.125.6
INFO[14:01:04 CEST] Copying Kubeconfig to home directory…         node=172.16.125.6
INFO[14:01:04 CEST] Copying Kubeconfig to home directory…         node=172.16.125.4
INFO[14:01:04 CEST] Copying Kubeconfig to home directory…         node=172.16.125.5
INFO[14:01:05 CEST] Downloading kubeconfig…
INFO[14:01:05 CEST] Ensure node local DNS cache…
INFO[14:01:05 CEST] Activating additional features…
INFO[14:01:06 CEST] Applying canal CNI plugin…
INFO[14:01:15 CEST] Skipping applying addons because addons are not enabled…
INFO[14:01:15 CEST] Creating credentials secret…
INFO[14:01:15 CEST] Installing machine-controller…
WARN[14:01:25 CEST] Task failed, error was: failed to deploy machine-controller: failed to ensure machine-controller *v1beta1.CustomResourceDefinition: failed to get *v1beta1.CustomResourceDefinition object: etcdserver: request timed out
WARN[14:01:30 CEST] Retrying task…
INFO[14:01:30 CEST] Installing machine-controller…
INFO[14:01:34 CEST] Installing machine-controller webhooks…
WARN[14:01:36 CEST] Task failed, error was: failed to deploy machine-controller webhook configuration: failed to ensure machine-controller webhook *v1.Service: failed to create *v1.Service: Internal error occurred: failed to allocate a serviceIP: etcdserver: leader changed
WARN[14:01:46 CEST] Retrying task…
INFO[14:01:46 CEST] Installing machine-controller…
INFO[14:01:48 CEST] Installing machine-controller webhooks…
INFO[14:01:52 CEST] Waiting for machine-controller to come up…
WARN[14:02:35 CEST] Task failed, error was: machine-controller-webhook did not come up: failed to list machine-controller's webhook pods: etcdserver: leader changed
WARN[14:02:40 CEST] Retrying task…
INFO[14:02:40 CEST] Waiting for machine-controller to come up…
INFO[14:03:05 CEST] Creating worker machines…

Polling the status several times during the installation resulted in the following output:

$ kubeone status -m config.yaml -t .
WARN[14:02:04 CEST] The provided APIVersion "kubeone.io/v1alpha1" is deprecated. Please use "kubeone config migrate" command to migrate to the latest version.
INFO[14:02:04 CEST] Determine hostname…
INFO[14:02:06 CEST] Determine operating system…
INFO[14:02:06 CEST] Building Kubernetes clientset…
INFO[14:02:06 CEST] Verifying that Docker, Kubelet and Kubeadm are installed…
INFO[14:02:07 CEST] Verifying that nodes in the cluster match nodes defined in the manifest…
INFO[14:02:07 CEST] Verifying that all nodes in the cluster are ready…
INFO[14:02:07 CEST] Verifying that there is no upgrade in the progress…
NODE         VERSION   APISERVER   ETCD
k1-tf-cp-0   v1.18.1   healthy     healthy
k1-tf-cp-1   v1.18.1   healthy     healthy
k1-tf-cp-2   v1.18.1   healthy     healthy
$ kubeone status -m config.yaml -t .
WARN[14:02:10 CEST] The provided APIVersion "kubeone.io/v1alpha1" is deprecated. Please use "kubeone config migrate" command to migrate to the latest version.
INFO[14:02:10 CEST] Determine hostname…
INFO[14:02:12 CEST] Determine operating system…
INFO[14:02:13 CEST] Building Kubernetes clientset…
INFO[14:02:13 CEST] Verifying that Docker, Kubelet and Kubeadm are installed…
INFO[14:02:13 CEST] Verifying that nodes in the cluster match nodes defined in the manifest…
INFO[14:02:13 CEST] Verifying that all nodes in the cluster are ready…
INFO[14:02:13 CEST] Verifying that there is no upgrade in the progress…
NODE         VERSION   APISERVER   ETCD
k1-tf-cp-0   v1.18.1   healthy     healthy
k1-tf-cp-1   v1.18.1   healthy     healthy
k1-tf-cp-2   v1.18.1   healthy     healthy
$ kubeone status -m config.yaml -t .
WARN[14:02:16 CEST] The provided APIVersion "kubeone.io/v1alpha1" is deprecated. Please use "kubeone config migrate" command to migrate to the latest version.
INFO[14:02:16 CEST] Determine hostname…
INFO[14:02:18 CEST] Determine operating system…
INFO[14:02:18 CEST] Building Kubernetes clientset…
INFO[14:02:19 CEST] Verifying that Docker, Kubelet and Kubeadm are installed…
INFO[14:02:19 CEST] Verifying that nodes in the cluster match nodes defined in the manifest…
INFO[14:02:19 CEST] Verifying that all nodes in the cluster are ready…
INFO[14:02:19 CEST] Verifying that there is no upgrade in the progress…
NODE         VERSION   APISERVER   ETCD
k1-tf-cp-0   v1.18.1   unhealthy   unhealthy
k1-tf-cp-1   v1.18.1   unhealthy   unhealthy
k1-tf-cp-2   v1.18.1   unhealthy   unhealthy
$ kubeone status -m config.yaml -t .
WARN[14:02:50 CEST] The provided APIVersion "kubeone.io/v1alpha1" is deprecated. Please use "kubeone config migrate" command to migrate to the latest version.
INFO[14:02:50 CEST] Determine hostname…
INFO[14:02:52 CEST] Determine operating system…
INFO[14:02:52 CEST] Building Kubernetes clientset…
INFO[14:02:52 CEST] Verifying that Docker, Kubelet and Kubeadm are installed…
INFO[14:02:52 CEST] Verifying that nodes in the cluster match nodes defined in the manifest…
INFO[14:02:52 CEST] Verifying that all nodes in the cluster are ready…
INFO[14:02:52 CEST] Verifying that there is no upgrade in the progress…
NODE         VERSION   APISERVER   ETCD
k1-tf-cp-0   v1.18.1   healthy     healthy
k1-tf-cp-1   v1.18.1   healthy     healthy
k1-tf-cp-2   v1.18.1   healthy     healthy
$ kubeone status -m config.yaml -t .
WARN[14:02:56 CEST] The provided APIVersion "kubeone.io/v1alpha1" is deprecated. Please use "kubeone config migrate" command to migrate to the latest version.
INFO[14:02:56 CEST] Determine hostname…
INFO[14:02:57 CEST] Determine operating system…
INFO[14:02:58 CEST] Building Kubernetes clientset…
INFO[14:02:58 CEST] Verifying that Docker, Kubelet and Kubeadm are installed…
INFO[14:02:58 CEST] Verifying that nodes in the cluster match nodes defined in the manifest…
INFO[14:02:58 CEST] Verifying that all nodes in the cluster are ready…
INFO[14:02:58 CEST] Verifying that there is no upgrade in the progress…
NODE         VERSION   APISERVER   ETCD
k1-tf-cp-0   v1.18.1   healthy     healthy
k1-tf-cp-1   v1.18.1   healthy     healthy
k1-tf-cp-2   v1.18.1   healthy     healthy
$ kubeone status -m config.yaml -t .
WARN[14:03:07 CEST] The provided APIVersion "kubeone.io/v1alpha1" is deprecated. Please use "kubeone config migrate" command to migrate to the latest version.
INFO[14:03:07 CEST] Determine hostname…
INFO[14:03:09 CEST] Determine operating system…
INFO[14:03:13 CEST] Building Kubernetes clientset…
INFO[14:03:14 CEST] Verifying that Docker, Kubelet and Kubeadm are installed…
INFO[14:03:15 CEST] Verifying that nodes in the cluster match nodes defined in the manifest…
INFO[14:03:15 CEST] Verifying that all nodes in the cluster are ready…
INFO[14:03:15 CEST] Verifying that there is no upgrade in the progress…
NODE         VERSION   APISERVER   ETCD
k1-tf-cp-0   v1.18.1   healthy     healthy
k1-tf-cp-1   v1.18.1   healthy     healthy
k1-tf-cp-2   v1.18.1   healthy     healthy

For completeness, here again are the etcd logs of this attempt (timezone +2h on my machine):
etcd-0.log
etcd-1.log
etcd-2.log


kron4eg commented Jun 15, 2020

I'd say that the VMs are hosted on overloaded hosts; in the logs some requests take 5-10 seconds, which is unreasonably long.

So either the etcd storage is half-dead, or the hosts are. Anyway, it's not something we can fix.

EDIT:
Or the network between the CP instances is that bad.
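
Both suspicions can be checked on the nodes themselves. A fio run commonly used to measure etcd-style fsync latency (parameters follow the etcd community's disk benchmarking guidance; the test directory is arbitrary), plus a quick peer-latency check:

$ fio --rw=write --ioengine=sync --fdatasync=1 --directory=/tmp/etcd-bench --size=22m --bs=2300 --name=etcd-fsync-test
$ ping -c 20 <peer-cp-ip>    # etcd's default heartbeat interval is 100ms; RTT should stay well below that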


chrkl commented Jun 15, 2020

This seems to happen only shortly after VM creation. Could we make cluster creation more reliable by increasing the number of retries for installing the CNI plugin and the machine-controller?


kron4eg commented Jun 15, 2020

Yeah, we can do this, but since this is a band-aid I'm not sure it would help 🤷‍♂️
