Cluster creation on Azure unreliable #915
Comments
@chrkl Can you investigate the etcd logs? And check whether the NTP service is enabled?
Here are the etcd logs: etcd-0.log. They indicate some requests that run too long. The cluster is healthy after a few minutes. The VMs run chrony.
etcd "leader changed" means there is a problem with one of the etcd members, and most likely it needs to be replaced.
In order to find out which one, please use
Looks like the cluster became temporarily unhealthy:
Polling the status several times during the installation resulted in the following output:
For completeness, here are the etcd logs of this attempt again (timezone +2h on my machine):
I'd say that the VMs are hosted on overloaded hosts. In the logs some requests take 5-10 seconds, which is unreasonably long. So either the etcd storage is half-dead, or the hosts are. Anyway, it's not something we can fix. EDIT:
This seems to happen only shortly after VM creation. Could we make cluster creation more reliable by increasing the number of retries for installing the CNI and machine-controller?
Yeah, we can do this, but since this is a band-aid I'm not sure if it would help 🤷♂️
What happened:
After the kubeadm calls, the cluster is often in a state where frequent etcd leader changes happen and API calls fail because of this. This causes kubeone install to fail frequently, because installing the CNI or machine-controller is not possible. The problem seems to be temporary, as a second kubeone install run fixes the cluster.
What is the expected behavior:
How to reproduce the issue:
Create a KubeOne cluster on Azure.
Anything else we need to know?
I tried switching to more powerful Azure resources than those given in the Terraform example. Faster storage and VM types didn't make a significant difference. However, when the number of retries performed by KubeOne is increased, the installation succeeds.
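To make the retry idea concrete, here is a minimal Go sketch of retrying a task on the transient "etcdserver: leader changed" error with a doubling delay. It is not KubeOne's actual implementation: retryConfig, retryTransient, isTransient and errLeaderChanged are hypothetical names invented for this illustration, and detecting the error by matching its message text is an assumption made only to keep the example self-contained.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// errLeaderChanged is a stand-in for the error seen in the KubeOne output.
// Hypothetical sentinel for this sketch, not an etcd or KubeOne symbol.
var errLeaderChanged = errors.New("etcdserver: leader changed")

// retryConfig is a hypothetical settings struct, not a KubeOne type.
type retryConfig struct {
	attempts     int           // maximum number of attempts
	initialDelay time.Duration // wait before the second attempt; doubles afterwards
}

// isTransient reports whether err looks like a temporary etcd leader change.
// Matching on the message text is an assumption made for this sketch.
func isTransient(err error) bool {
	return err != nil && strings.Contains(err.Error(), "etcdserver: leader changed")
}

// retryTransient runs task until it succeeds, fails with a non-transient
// error, or the configured number of attempts is used up.
func retryTransient(cfg retryConfig, task func() error) error {
	var err error
	delay := cfg.initialDelay
	for i := 1; i <= cfg.attempts; i++ {
		if err = task(); err == nil {
			return nil
		}
		if !isTransient(err) {
			return err // do not retry errors that are not leader changes
		}
		if i < cfg.attempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", cfg.attempts, err)
}

func main() {
	calls := 0
	cfg := retryConfig{attempts: 5, initialDelay: 10 * time.Millisecond}
	err := retryTransient(cfg, func() error {
		calls++
		if calls < 4 { // simulate three failed attempts while etcd settles
			return fmt.Errorf("failed to get *v1.ConfigMap object: %w", errLeaderChanged)
		}
		return nil
	})
	fmt.Println(calls, err) // prints "4 <nil>"
}
```

With three attempts, as in the KubeOne output in this report, the simulated task above would still fail. Raising the attempt count lets it outlast the unstable period, which matches the observation that more retries make the installation succeed, though it does not address why etcd is slow in the first place.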
KubeOne output:
...
INFO[21:41:46 CEST] Downloading kubeconfig…
INFO[21:41:46 CEST] Building Kubernetes clientset…
INFO[21:41:47 CEST] Ensure node local DNS cache…
INFO[21:41:47 CEST] Activating additional features…
INFO[21:41:49 CEST] Applying canal CNI plugin…
WARN[21:41:55 CEST] Task failed…
WARN[21:41:55 CEST] error was: failed to get *v1beta1.CustomResourceDefinition object: etcdserver: leader changed
WARN[21:42:00 CEST] Retrying task…
INFO[21:42:00 CEST] Applying canal CNI plugin…
WARN[21:42:13 CEST] Task failed…
WARN[21:42:13 CEST] error was: failed to get *v1beta1.CustomResourceDefinition object: etcdserver: leader changed
WARN[21:42:23 CEST] Retrying task…
INFO[21:42:23 CEST] Applying canal CNI plugin…
WARN[21:42:32 CEST] Task failed…
WARN[21:42:32 CEST] error was: failed to get *v1.ConfigMap object: etcdserver: leader changed
Error: failed to install cni plugin: failed to get *v1.ConfigMap object: etcdserver: leader changed
failed to install cni plugin: failed to get *v1.ConfigMap object: etcdserver: leader changed
ERROR[0412] Error installing kubernetes:
Information about the environment:
KubeOne version (kubeone version): v1.0.0-alpha
Operating system: CentOS
Provider you're deploying cluster on: Azure
Operating system you're deploying on: