[BUG] Epicli upgrade issue - the process hangs for several hours on the task kubeadm upgrade apply #1399

Closed
przemyslavic opened this issue Jun 30, 2020 · 5 comments · Fixed by #1402, #1427 or #1431

@przemyslavic
Collaborator

przemyslavic commented Jun 30, 2020

Describe the bug
Cannot upgrade the Kubernetes cluster. The failure appears to be intermittent and occurs most often on Azure configurations.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy a 0.4.4 cluster.
  2. Run epicli upgrade (from the develop branch).

Expected behavior
The cluster is upgraded successfully.

OS (please complete the following information):

  • OS: [e.g. RHEL7.7, Ubuntu 18.04]

Cloud Environment (please complete the following information):

  • Cloud Provider [e.g. MS Azure]

Additional context


2020-06-29T22:04:48.1026877Z 22:04:48 INFO cli.engine.ansible.AnsibleCommand - TASK [upgrade : upgrade-master | Upgrade K8s cluster to v1.15.10 (using kubeadm-config.yml file)] ***
2020-06-29T22:49:14.7940500Z 22:49:14 INFO cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: upgrade-master | Upgrade K8s cluster to v1.15.10 (using kubeadm-config.yml file) (20 retries left).
2020-06-29T23:37:26.7290995Z 23:37:26 INFO cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: upgrade-master | Upgrade K8s cluster to v1.15.10 (using kubeadm-config.yml file) (19 retries left).
2020-06-29T23:42:59.1434418Z 23:42:59 INFO cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: upgrade-master | Upgrade K8s cluster to v1.15.10 (using kubeadm-config.yml file) (18 retries left).
2020-06-30T00:31:11.0432338Z 00:31:11 INFO cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: upgrade-master | Upgrade K8s cluster to v1.15.10 (using kubeadm-config.yml file) (17 retries left).
2020-06-30T00:36:43.4952279Z 00:36:43 INFO cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: upgrade-master | Upgrade K8s cluster to v1.15.10 (using kubeadm-config.yml file) (16 retries left).
2020-06-30T01:24:55.4254331Z 01:24:55 INFO cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: upgrade-master | Upgrade K8s cluster to v1.15.10 (using kubeadm-config.yml file) (15 retries left).
2020-06-30T01:30:27.8367170Z 01:30:27 INFO cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: upgrade-master | Upgrade K8s cluster to v1.15.10 (using kubeadm-config.yml file) (14 retries left).
2020-06-30T02:18:39.8409973Z 02:18:39 INFO cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: upgrade-master | Upgrade K8s cluster to v1.15.10 (using kubeadm-config.yml file) (13 retries left).
2020-06-30T02:24:12.2573132Z 02:24:12 INFO cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: upgrade-master | Upgrade K8s cluster to v1.15.10 (using kubeadm-config.yml file) (12 retries left).
2020-06-30T03:12:24.2749881Z 03:12:24 INFO cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: upgrade-master | Upgrade K8s cluster to v1.15.10 (using kubeadm-config.yml file) (11 retries left).
2020-06-30T03:17:56.7150480Z 03:17:56 INFO cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: upgrade-master | Upgrade K8s cluster to v1.15.10 (using kubeadm-config.yml file) (10 retries left).
2020-06-30T04:06:08.6468293Z 04:06:08 INFO cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: upgrade-master | Upgrade K8s cluster to v1.15.10 (using kubeadm-config.yml file) (9 retries left).
2020-06-30T04:11:41.0625442Z 04:11:41 INFO cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: upgrade-master | Upgrade K8s cluster to v1.15.10 (using kubeadm-config.yml file) (8 retries left).
2020-06-30T04:59:53.0070741Z 04:59:53 INFO cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: upgrade-master | Upgrade K8s cluster to v1.15.10 (using kubeadm-config.yml file) (7 retries left).
2020-06-30T05:05:25.4451723Z 05:05:25 INFO cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: upgrade-master | Upgrade K8s cluster to v1.15.10 (using kubeadm-config.yml file) (6 retries left).
2020-06-30T05:53:37.9468084Z 05:53:37 INFO cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: upgrade-master | Upgrade K8s cluster to v1.15.10 (using kubeadm-config.yml file) (5 retries left).
2020-06-30T05:59:09.7340515Z 05:59:09 INFO cli.engine.ansible.AnsibleCommand - FAILED - RETRYING: upgrade-master | Upgrade K8s cluster to v1.15.10 (using kubeadm-config.yml file) (4 retries left).
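
The spacing of the retries in the log above can be checked with a short script. This is a minimal sketch, not part of epicli: the helper name `gaps_in_minutes` is hypothetical, and the timestamps are the first six from the log with fractional seconds dropped.

```python
from datetime import datetime

# First six timestamps copied from the log above (fractional seconds dropped).
STAMPS = [
    "2020-06-29T22:04:48",  # task started
    "2020-06-29T22:49:14",  # 20 retries left
    "2020-06-29T23:37:26",  # 19 retries left
    "2020-06-29T23:42:59",  # 18 retries left
    "2020-06-30T00:31:11",  # 17 retries left
    "2020-06-30T00:36:43",  # 16 retries left
]


def gaps_in_minutes(stamps):
    """Return the whole-minute gap between each pair of consecutive timestamps."""
    times = [datetime.fromisoformat(s) for s in stamps]
    return [int((b - a).total_seconds() // 60) for a, b in zip(times, times[1:])]


print(gaps_in_minutes(STAMPS))
```

For these lines, the gaps after the first one alternate between about 48 minutes and about 5 minutes. The log does not show what distinguishes the long attempts from the short ones.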

E0629 11:49:57.067369       1 reflector.go:125] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.Secret: illegal base64 data at input byte 3
E0629 11:49:58.070606       1 reflector.go:125] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Secret: illegal base64 data at input byte 3
E0629 11:49:58.072162       1 reflector.go:125] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.Secret: illegal base64 data at input byte 3
E0629 11:49:59.075549       1 reflector.go:125] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Secret: illegal base64 data at input byte 3
E0629 11:49:59.075968       1 reflector.go:125] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.Secret: illegal base64 data at input byte 3
E0629 11:50:00.089012       1 reflector.go:125] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Secret: illegal base64 data at input byte 3
E0629 11:50:00.091010       1 reflector.go:125] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to list *v1.Secret: illegal base64 data at input byte 3
E0629 11:50:01.093364       1 reflector.go:125] k8s.io/client-go/informers/factory.go:133: Failed to list *v1.Secret: illegal base64 data at input byte 3

Looks like an issue with etcd encryption.
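
For context, "illegal base64 data at input byte 3" is the message Go's `encoding/base64` package produces when the byte at offset 3 of the input is not valid base64. The sketch below is only an illustration of that kind of offset, under stated assumptions: the helper `first_illegal_offset` is hypothetical, it checks the alphabet only (not padding rules), and the sample string merely follows the `k8s:enc:<provider>:v1:<key name>:` prefix format that Kubernetes documents for values encrypted at rest. The thread does not establish which value actually failed to decode.

```python
import string

# Standard base64 alphabet (RFC 4648) plus the padding character.
B64_CHARS = set(string.ascii_letters + string.digits + "+/=")


def first_illegal_offset(value):
    """Return the index of the first character outside the base64 alphabet, or None."""
    for index, char in enumerate(value):
        if char not in B64_CHARS:
            return index
    return None


# Hypothetical sample shaped like an encrypted-at-rest prefix; not taken from the cluster.
sample = "k8s:enc:aescbc:v1:key1:payload"
print(first_illegal_offset(sample))
```

In the sample, the first character outside the alphabet is the colon at offset 3. A plain base64 string such as `c2VjcmV0` yields `None`.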

@mkyc
Contributor

mkyc commented Jul 2, 2020

@sk4zuzu can you describe what was fixed in #1402? I assume you found the cause of this bug.

@przemyslavic did you check it after #1402 was merged?

@sk4zuzu
Contributor

sk4zuzu commented Jul 7, 2020

@mkyc sorry for the late answer. The fix in question worked, at least in some cases, for problems related to etcd encryption, but it is essentially an awful practice (truly and profoundly). We call it the fix of the month, and it will be happily removed by the current fix for the etcd problems, #1427 (please check the diff). We're sorry we merged it, but maybe this is an example of why we should plan releases properly instead of rushing them. 👍

@sk4zuzu sk4zuzu reopened this Jul 8, 2020
@to-bar to-bar linked a pull request Jul 9, 2020 that will close this issue
@mkyc mkyc reopened this Jul 10, 2020
@przemyslavic przemyslavic self-assigned this Jul 14, 2020
@przemyslavic
Collaborator Author

przemyslavic commented Jul 15, 2020

Fix tested.
Tested many possible configurations (AWS / Azure x Ubuntu / RedHat x flannel / calico / canal).
Upgrades from 0.4.4, 0.5.4, and 0.6.0 to develop have been tested many times.
The "hanging" upgrade issue was no longer encountered.

@mkyc
Contributor

mkyc commented Jul 15, 2020

@przemyslavic how is it that we have the "failed" label in the "develop" section for almost all types of tests? https://github.com/epiphany-platform/epiphany/blob/develop/docs/home/TESTING.md

@przemyslavic
Collaborator Author

@mkyc I have recently run other tests on the same pipelines and the AWS/RedHat environments failed due to exceeding the VPC limit. But that was after testing this fix. On Azure DevOps we have results from the previous run. Only 2 configurations out of 30 partially succeeded (due to problems with the RabbitMQ deployment, which is not part of this task), all others were successful.
