This repository has been archived by the owner on Jan 11, 2023. It is now read-only.

upgrade k8s process is broke #2022

Closed
jalberto opened this issue Jan 10, 2018 · 76 comments

Comments

@jalberto commented Jan 10, 2018

Is this a request for help?:
yes

Is this an ISSUE or FEATURE REQUEST? (choose one):
ISSUE

What version of acs-engine?:
0.11

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
1.7.6 -> 1.8.4

What happened:

Running:

acs-engine upgrade \
  --subscription-id xxx \
  --deployment-dir ./_output/xxx \
  --location westeurope \
  --resource-group xxx \
  --upgrade-version 1.8.4 \
  --auth-method client_secret \
  --client-id xxx \
  --client-secret xxx

got:

[...]
INFO[0138] Master pool set count to: 1 temporarily during upgrade...
 
INFO[0138] Starting ARM Deployment (master-18-01-10T13.40.08-1740684113). This will take some time... 
INFO[0166] Finished ARM Deployment (master-18-01-10T13.40.08-1740684113). 
INFO[0166] Error creating upgraded master VM: k8s-master-11577755-0 
FATA[0166] Error upgrading cluster: resources.DeploymentsClient#CreateOrUpdate: Failure sending request: StatusCode=200 -- Original Error: Long running operation terminated with status 'Failed': Code="DeploymentFailed" Message="At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-debug for usage details." 

In azure portal:

[screenshot from 2018-01-10 13-44-01]

What you expected to happen:
Upgrade cluster without errors

How to reproduce it (as minimally and precisely as possible):

Follow upgrade docs: https://github.com/Azure/acs-engine/tree/master/examples/k8s-upgrade

Anything else we need to know:

I am using VNet peering configured in the Azure portal, not custom VNets.

This has left my cluster in bad shape, so it is quite urgent.
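For the record, the FATA message suggests listing the deployment operations for details. A minimal sketch of doing that with the Azure CLI (the resource group stays elided as xxx, matching the command above; the --query projection is only illustrative):

# list the failed operations of the ARM deployment named in the log above
az group deployment operation list \
  --resource-group xxx \
  --name master-18-01-10T13.40.08-1740684113 \
  --query "[?properties.provisioningState=='Failed'].{code:properties.statusCode, message:properties.statusMessage}"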

@jalberto (Author)

Is this a technical limitation or an unimplemented feature?

The 'upgrade' command should probably check for these common pitfalls and throw a warning before starting the process.
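A hypothetical pre-flight check along those lines could be as simple as listing the peerings on the cluster VNet before starting the upgrade (sketch only; the xxx placeholders follow the command in the original report, and the VNet name must be filled in):

# list peerings the upgrade would collide with
az network vnet peering list \
  --resource-group xxx \
  --vnet-name xxx \
  --output table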

@jalberto (Author) commented Jan 11, 2018

Testing whether deleting the peering makes it work: it does, but it takes quite a long time to upgrade a single master, and even longer for the upgrade command to notice the deployment has finished, as it stays stuck at:

INFO[0210] Starting ARM Deployment (master-18-01-11T12.56.22-617391183). This will take some time...

[screenshot from 2018-01-11 14-08-57]

Meanwhile, because of the previous error, there is now no leader, so the whole cluster is unusable. The documentation says this process is idempotent, but the second time I ran the command it did not try to continue with the already disabled master; it picked a new one, creating a mess.

@jalberto (Author) commented Jan 11, 2018

As expected, now there is this:

INFO[4992] Master VM: k8s-master-11577755-1 status error: client: etcd cluster is unavailable or misconfigured; error #0: client: etcd member http://127.0.0.1:2379 has no leader

So now what?

I already opened a critical ticket with Azure support, but no one seems to be taking care of it.
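For reference, one way to confirm the no-leader state from a master is to query etcd directly; a sketch assuming the etcd v2 tooling these masters ship with:

# check quorum / leader status of the etcd cluster (etcd v2 tooling)
etcdctl --endpoints http://127.0.0.1:2379 cluster-health
etcdctl --endpoints http://127.0.0.1:2379 member list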

@jalberto changed the title from "upgrade k8s failed when using VNET peering" to "upgrade k8s process is broke" on Jan 11, 2018
@jalberto (Author)

After several retries every master got updated, but any kubectl command gets an i/o timeout.

Any clue is welcome.

@jalberto (Author) commented Jan 11, 2018

Dear diary, 3 days since the incident and I have already lost all hope for help. Seems like this place gets more isolated every day.
Now it seems the upgrade broke CNI (Calico) in some way, and neither etcd nor the k8s services will start:

Unable to update cni config: No networks found in /etc/cni/net.d
Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized  
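A sketch of what can be checked on an affected node in this state (assuming Calico drops its config into /etc/cni/net.d and the services are managed by systemd):

# is there any CNI config at all?
ls -l /etc/cni/net.d/
# what does the kubelet say about CNI?
sudo journalctl -u kubelet --no-pager | grep -i cni | tail -n 20
# are etcd and the kubelet even running?
sudo systemctl status etcd kubelet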

@jalberto (Author)

More fun stuff:

-- Unit etcd.service has begun starting up.                                                                                                                                     
Jan 11 18:31:17 k8s-master-11577755-0 systemd[22069]: etcd.service: Failed at step EXEC spawning /usr/bin/etcd: No such file or directory                                       
-- Subject: Process /usr/bin/etcd could not be executed

@brendandburns (Member)

Can you do an ls /usr/bin/etcd? Also, can you look at /var/log/cloud-init.log and see whether cloud-init is installing/running correctly?
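Spelled out as commands (cloud-init-output.log is an assumption; it is usually present on Ubuntu images):

ls -l /usr/bin/etcd
sudo tail -n 100 /var/log/cloud-init.log
sudo tail -n 100 /var/log/cloud-init-output.log
# and whatever systemd recorded for the etcd unit
sudo journalctl -u etcd --no-pager | tail -n 50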

@jalberto (Author) commented Jan 11, 2018

@brendandburns I was investigating etcd when I found it is not installed, so I tried to install it the acs-engine way:

root@k8s-master-11577755-0:~# /opt/azure/containers/setup-etcd.sh                                                                                                               
+ ETCD_VER=v2.5.2                           
+ DOWNLOAD_URL=https://acs-mirror.azureedge.net/github-coreos                           
+ mkdir -p /tmp/etcd-download               
+ curl -L https://acs-mirror.azureedge.net/github-coreos/etcd-v2.5.2-linux-amd64.tar.gz -o /tmp/etcd-v2.5.2-linux-amd64.tar.gz                                                  
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current         
                                 Dload  Upload   Total   Spent    Left  Speed           
100   215  100   215    0     0     74      0  0:00:02  0:00:02 --:--:--    74          
+ tar xzvf /tmp/etcd-v2.5.2-linux-amd64.tar.gz -C /usr/bin/ --strip-components=1        

gzip: stdin: not in gzip format             
tar: Child returned status 1                
tar: Error is not recoverable: exiting now  
+ useradd -U etcd                           
useradd: user 'etcd' already exists         
++ head -c 32 /dev/urandom                  
++ base64                                   
+ usermod -p 5ICuQ0258U/OBqze1SjjZi8g+9+/pSL0LtkG7nquQbU= etcd                          
sent invalidate(passwd) request, exiting    
sent invalidate(group) request, exiting     
sent invalidate(passwd) request, exiting    
sent invalidate(group) request, exiting     
+ passwd -u ''                              
passwd: user '' does not exist              
+ systemctl daemon-reload                   
+ systemctl enable etcd.service             
Failed to execute operation: Invalid argument
root@k8s-master-11577755-0:~# cat /tmp/etcd-v2.5.2-linux-amd64.tar.gz
<?xml version="1.0" encoding="utf-8"?><Error><Code>BlobNotFound</Code><Message>The specified blob does not exist.
RequestId:02f5e54f-001e-008b-390e-8b5fd6000000
Time:2018-01-11T18:59:05.6635280Z</Message></Error>
~# tail /var/log/cloud-init.log                              
2018-01-11 18:09:44,187 - stages.py[DEBUG]: Running module power-state-change (<module 'cloudinit.config.cc_power_state_change' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_power_state_change.py'>) with frequency once-per-instance                      
2018-01-11 18:09:44,188 - handlers.py[DEBUG]: start: modules-final/config-power-state-change: running config-power-state-change with frequency once-per-instance                
2018-01-11 18:09:44,188 - helpers.py[DEBUG]: config-power-state-change already ran (freq=once-per-instance)
2018-01-11 18:09:44,188 - handlers.py[DEBUG]: finish: modules-final/config-power-state-change: SUCCESS: config-power-state-change previously ran
2018-01-11 18:09:44,188 - main.py[DEBUG]: Ran 20 modules with 0 failures                
2018-01-11 18:09:44,189 - util.py[DEBUG]: Creating symbolic link from '/run/cloud-init/result.json' => '../../var/lib/cloud/data/result.json'                                   
2018-01-11 18:09:44,190 - util.py[DEBUG]: Reading from /proc/uptime (quiet=False)       
2018-01-11 18:09:44,190 - util.py[DEBUG]: Read 12 bytes from /proc/uptime               
2018-01-11 18:09:44,190 - util.py[DEBUG]: cloud-init mode 'modules' took 0.378 seconds (0.38)
2018-01-11 18:09:44,190 - handlers.py[DEBUG]: finish: modules-final: SUCCESS: running modules for final

lots of fun

@brendandburns (Member)

What version of acs-engine are you using?

Thanks

@jalberto (Author) commented Jan 11, 2018

@brendandburns 0.11

Thanks for taking a look!

@jalberto (Author) commented Jan 11, 2018

I actually cannot find an official 2.5.2 release:

[screenshot from 2018-01-11 20-38-51]

So maybe 2.5.2 is just a dummy version used as a default for some reason?

@jackfrancis (Member)

Hi @jalberto, could you try modifying the /opt/azure/containers/setup-etcd.sh script to change the assignment of ETCD_VER to v2.3.7?
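A minimal sketch of that edit, assuming the assignment is a plain ETCD_VER=... line as the script trace above suggests:

# change the etcd version the setup script downloads, then re-run it
sudo sed -i 's/^ETCD_VER=.*/ETCD_VER=v2.3.7/' /opt/azure/containers/setup-etcd.sh
sudo /opt/azure/containers/setup-etcd.sh
# verify the binary landed and the unit starts
/usr/bin/etcd --version
sudo systemctl restart etcd && sudo systemctl status etcd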

@jalberto (Author)

I have doubts: why not the latest 3.x version? (I don't want to mess it up more.)

@jackfrancis (Member)

I assume your cluster was running a v2 etcd before; things are likelier to come back online if you use that version. (It's the latest v2 release; v2.5.2 was a typo that should have been v2.2.5. This was fixed in master a few weeks ago.)

@jalberto (Author)

That worked! Thanks!

I am trying to resume the upgrade now.

@jackfrancis (Member)

Thanks for being resilient here. You are contributing to acs-engine-driven upgrades becoming more of a first-class feature; we appreciate it!

@jalberto (Author)

@CecileRobertMichon I think this is not resolved yet; I am still evaluating whether everything is right after the upgrade.

I can see problems mounting azure-disk volumes.

@CecileRobertMichon (Contributor)

@jalberto sorry about that, I thought the issue had been resolved. Re-opening it now.

@jalberto (Author)

No problem @CecileRobertMichon.

So after the upgrade, my PVs using azure-file (shared volumes using CIFS) are not automatically created anymore (I have a StorageClass), and even when created by hand they cannot be mounted because of a permission problem.

More details on how the StorageClass was created: #1506
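For comparison, a minimal azure-file StorageClass sketch (the class name, skuName and the 0777 mount modes are illustrative; #1506 has the original definition, and the mountOptions field requires k8s 1.8+):

cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile
provisioner: kubernetes.io/azure-file
parameters:
  skuName: Standard_LRS
mountOptions:           # CIFS permissions; a common cause of "permission denied" on mount
  - dir_mode=0777
  - file_mode=0777
EOF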

@jalberto (Author)

I fixed azure-file by deleting and recreating the StorageClass.

Now I have a new, quite critical problem:

Multi-Attach error for volume "pvc-3a32c7b6-fd48-11e7-b811-000d3a2da567" Volume is already exclusively attached to one node and can't be attached to another 

This happens each time a pod needs to move from one node to another. I checked in the Azure portal and I can see the disk move to the correct node, and kubectl says the PVC is bound.

Probably related to: #2002

@andyzhangx (Contributor)

@jalberto You got this error when moving a pod with an azure disk (not azure file) to another node, right? This can happen since detaching an azure disk from a node takes minutes in the current k8s version, but waiting a few minutes should work.
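A sketch of how the attach state can be watched from the Azure side while waiting (managed disks assumed; DISK_NAME and VM_NAME are placeholders, xxx is the resource group as in the original command):

# which VM currently holds the disk backing the PVC?
az disk show --resource-group xxx --name DISK_NAME --query managedBy --output tsv
# what the target node's VM has attached right now
az vm show --resource-group xxx --name VM_NAME \
  --query "storageProfile.dataDisks[].{lun:lun, name:name}" --output table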

@jalberto (Author)

@andyzhangx even after 4h it is still not working.

@andyzhangx (Contributor)

@jalberto could you run "kubectl describe po POD-NAME" and paste the output here?

@andyzhangx (Contributor)

@jalberto And what's your k8s version? Is it a managed-disk VM or an unmanaged-disk VM? What are your k8s operations, kubectl drain POD-NAME? I would try to repro this issue in my env if possible.

@andyzhangx (Contributor)

@jalberto
Does the new cluster mean v1.9.6? Could you paste that error msg again?
And do you have a clean config to repro this issue? I would like to repro it in my env if it's after v1.9.5.

@jalberto (Author) commented Apr 4, 2018

Yes, 1.9.6 deployed with acs-engine 0.14.5.

You can check my apimodel in #2567.

I cannot paste it now, but the error is the typical "Multi-Attach error".

@jalberto (Author) commented Apr 6, 2018

@andyzhangx , this is the error:

 Multi-Attach error for volume "pvc-3ca68971-35d7-11e8-a4e3-000d3a205895" Volume is already exclusively attached to one node and can't be attached to another
Unable to mount volumes for pod "vpn-openvpn-69bc585689-fz77t_default(d4a11d6b-3978-11e8-aa0a-000d3a205895)": timeout expired waiting for volumes to attach/mount for pod "default"/"vpn-openvpn-69bc585689-fz77t". list of unattached/unmounted volumes=[certs] 

32 minutes and counting :)

@andyzhangx (Contributor)

Hi @jalberto, so the azure disk has been moved to the right node? The "Multi-Attach error for volume" msg is from the kubelet on those two nodes; a quick workaround is to run sudo systemctl restart kubelet on the two nodes.
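For reference, a sketch of that workaround when SSH access to the nodes is available (the admin username is whatever was set at deploy time; azureuser is only an example, and NODE_1_IP / NODE_2_IP are placeholders):

# restart the kubelet on both nodes involved in the Multi-Attach error
for n in NODE_1_IP NODE_2_IP; do
  ssh azureuser@"$n" 'sudo systemctl restart kubelet'
done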

@jalberto (Author) commented Apr 6, 2018

@andyzhangx yes, usually a combination of a kubelet restart and az vm update works, but requiring manual intervention each time a pod moves is clearly a signal that something is wrong.

@andyzhangx (Contributor)

@jalberto when scheduling from one node to another, how many disks were attached to the one node? I would like to repro this in my env. You were using a Deployment, right?

@jalberto (Author) commented Apr 6, 2018 via email

@andyzhangx (Contributor)

Hi @jalberto, I tried to repro it on my testing env (v1.9.6), and after around 9 min all 3 pods were rescheduled from node#1 to node#2 successfully. Here are the detailed steps: https://github.com/andyzhangx/demo/tree/master/linux/azuredisk/attach-stress-test/deployment

I would like to make sure:

  1. Your controller manager on the master is v1.9.6. Could you paste the output of:
sudo cat /etc/kubernetes/manifests/kube-controller-manager.yaml | grep image
image: "k8s-gcrio.azureedge.net/hyperkube-amd64:v1.9.6"
  2. Your controller manager on the master never dies during the pod rescheduling process (if it dies, the reschedule will never succeed). Could you paste the output of:
docker ps | grep hyperkube

@jalberto (Author) commented Apr 9, 2018

thanks @andyzhangx, here you go:

  1. image: "k8s-gcrio.azureedge.net/hyperkube-amd64:v1.9.6"
b8ca93010771        k8s-gcrio.azureedge.net/hyperkube-amd64            "/hyperkube contro..."   32 hours ago        Up 32 hours                             k8s_kube-controller-manager_kube-controller-manager-k8s-master-22440456-1_kube-system_095c64aa225fa3433b371fd804b5f98f_4
f72d9cb83a0d        k8s-gcrio.azureedge.net/hyperkube-amd64            "/hyperkube proxy ..."   3 days ago          Up 3 days                               k8s_kube-proxy_kube-proxy-m92p8_kube-system_01a9e261-3480-11e8-a4e3-000d3a205895_2
a702569da1d4        k8s-gcrio.azureedge.net/hyperkube-amd64            "/hyperkube schedu..."   3 days ago          Up 3 days                               k8s_kube-scheduler_kube-scheduler-k8s-master-22440456-1_kube-system_5429d7b62d0f50e1c40f54b7f3d7b13f_2
4ffac9c55c50        k8s-gcrio.azureedge.net/hyperkube-amd64            "/hyperkube apiser..."   3 days ago          Up 3 days                               k8s_kube-apiserver_kube-apiserver-k8s-master-22440456-1_kube-system_3e253a421bdf6519a931b2e1c19a4cef_5
ddbe4816f57b        k8s-gcrio.azureedge.net/hyperkube-amd64:v1.9.6     "/hyperkube kubele..."   3 days ago          Up 3 days                               priceless_cori

Note that my problems in 1.9.6 are random in both duration and occurrence. Still, 9 min seems excessive for a mount/unmount operation.

@andyzhangx (Contributor)

@jalberto thanks for the info. I finally reproed this issue by using a StatefulSet (detailed steps with statefulset).
The root cause is that node.volumesInUse is incorrect in my repro; it could be a common k8s issue:
kubernetes/kubernetes#62282

To validate my finding, could you run kubectl get no -o yaml and paste the node.volumesInUse & node.volumesAttached values of the two nodes? As far as I can tell so far, restarting the kubelet manually on both nodes works.
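A sketch of pulling just those two fields instead of the full kubectl get no -o yaml dump (replace the node names):

# replace node-1 / node-2 with the two node names involved
for n in node-1 node-2; do
  echo "== $n"
  kubectl get node "$n" -o jsonpath='{.status.volumesAttached}{"\n"}{.status.volumesInUse}{"\n"}'
done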

@jalberto (Author)

@andyzhangx

    volumesAttached:
    - devicePath: "0"
      name: kubernetes.io/azure-disk//subscriptions/3dd693dd-25f9-471b-a404-19e064398e85/resourceGroups/vlacs01/providers/Microsoft.Compute/disks/vlacs01-5abedde3-dynamic-pvc-c621a5ed-3483-11e8-a344-000d3a205d6f
    volumesInUse:
    - kubernetes.io/azure-disk//subscriptions/3dd693dd-25f9-471b-a404-19e064398e85/resourceGroups/vlacs01/providers/Microsoft.Compute/disks/vlacs01-5abedde3-dynamic-pvc-c621a5ed-3483-11e8-a344-000d3a205d6f
    - kubernetes.io/azure-disk/m3606615189

    volumesAttached:
    - devicePath: "1"
      name: kubernetes.io/azure-disk//subscriptions/3dd693dd-25f9-471b-a404-19e064398e85/resourceGroups/vlacs01/providers/Microsoft.Compute/disks/vlacs01-5abedde3-dynamic-pvc-3ca68971-35d7-11e8-a4e3-000d3a205895
    volumesInUse:
    - kubernetes.io/azure-disk//subscriptions/3dd693dd-25f9-471b-a404-19e064398e85/resourceGroups/vlacs01/providers/Microsoft.Compute/disks/vlacs01-5abedde3-dynamic-pvc-3ca68971-35d7-11e8-a4e3-000d3a205895
    - kubernetes.io/azure-disk/m2462769329

What is the time goal for this task? It should probably be on the order of seconds, not minutes.

@andyzhangx (Contributor)

@jalberto from your node.volumesAttached and node.volumesInUse, the status of both nodes has recovered. A disk detach or attach can take around 1 min, so for two disks, rescheduling from one node to another should take about 4 min. The long wait could be due to node.volumesInUse not being updated correctly by the kubelet.

@jalberto (Author) commented Apr 10, 2018

@andyzhangx yes, I restart the kubelet quite often, because waiting 10 min for a service to move is too much :)

You say 1 disk will take ~2 min and 2 disks ~4 min. I guess these operations are done in parallel, so 1 disk and 10 disks should take the same time, right?

Still, ~2 min for a detach/attach operation seems quite a lot (taking into consideration that the disks are already formatted and ready to be used).

In the current configuration, 1 disk easily takes 10 to 32 min, so far from ~2 min (randomly; sometimes it magically works quite fast).

@jalberto (Author) commented Apr 11, 2018

@andyzhangx this happens too in a vanilla AKS 1.9.6 cluster (just deployed) while trying to upgrade a Helm release.

[screenshot from 2018-04-11 12-54-22]

@andyzhangx (Contributor)

@jalberto is that volume removed from the node while still in that node's volumesInUse list?

@andyzhangx (Contributor)

@jalberto please also paste the volumesAttached and volumesInUse of the two nodes; I am collecting such info. This could be a common k8s issue. Thanks.

@jalberto (Author) commented Apr 11, 2018

@andyzhangx 2 hours already :)

[screenshot from 2018-04-11 15-39-36]

    volumesAttached:
    - devicePath: "0"
      name: kubernetes.io/azure-disk//subscriptions/foo/resourceGroups/MC_aksprod-1_aksprod-1_westeurope/providers/Microsoft.Compute/disks/kubernetes-dynamic-pvc-aa77e11d-3ccc-11e8-94ea-0a58ac1f1128
    volumesInUse:
    - kubernetes.io/azure-disk//subscriptions/foo/resourceGroups/MC_aksprod-1_aksprod-1_westeurope/providers/Microsoft.Compute/disks/kubernetes-dynamic-pvc-aa77e11d-3ccc-11e8-94ea-0a58ac1f1128
    - kubernetes.io/azure-disk//subscriptions/foo/resourceGroups/MC_aksprod-1_aksprod-1_westeurope/providers/Microsoft.Compute/disks/kubernetes-dynamic-pvc-ffa219b5-3d74-11e8-94ea-0a58ac1f1128
    - kubernetes.io/azure-disk/m3576630252
    volumesAttached:
    - devicePath: "2"
      name: kubernetes.io/azure-disk//subscriptions/foo/resourceGroups/MC_aksprod-1_aksprod-1_westeurope/providers/Microsoft.Compute/disks/kubernetes-dynamic-pvc-ffa219b5-3d74-11e8-94ea-0a58ac1f1128
    - devicePath: "0"
      name: kubernetes.io/azure-disk//subscriptions/foo/resourceGroups/MC_aksprod-1_aksprod-1_westeurope/providers/Microsoft.Compute/disks/kubernetes-dynamic-pvc-5e02b94c-3d6a-11e8-94ea-0a58ac1f1128
    volumesInUse:
    - kubernetes.io/azure-disk//subscriptions/foo/resourceGroups/MC_aksprod-1_aksprod-1_westeurope/providers/Microsoft.Compute/disks/kubernetes-dynamic-pvc-06d1f6b9-3d70-11e8-94ea-0a58ac1f1128
    - kubernetes.io/azure-disk//subscriptions/foo/resourceGroups/MC_aksprod-1_aksprod-1_westeurope/providers/Microsoft.Compute/disks/kubernetes-dynamic-pvc-5e02b94c-3d6a-11e8-94ea-0a58ac1f1128
    - kubernetes.io/azure-disk//subscriptions/foo/resourceGroups/MC_aksprod-1_aksprod-1_westeurope/providers/Microsoft.Compute/disks/kubernetes-dynamic-pvc-ffa219b5-3d74-11e8-94ea-0a58ac1f1128
    - kubernetes.io/azure-disk/m2688977947
    - kubernetes.io/azure-disk/m3600819570
    - kubernetes.io/azure-disk/m570158133

@andyzhangx (Contributor)

@jalberto thanks, this is clear evidence that pvc-ffa219b5-3d74-11e8-94ea-0a58ac1f1128 is actually attached to the second node while it is still in the volumesInUse list of the first node. I will keep tracking this issue.
@feiskyer FYI

@jalberto (Author)

@andyzhangx the az vm update trick didn't work; the next step is to restart the kubelet (SSH into AKS is overly complex).

@jalberto (Author)

@andyzhangx az vm run-command invoke -g foo_westeurope -n aks-nodepool1-22460035-2 --command-id RunShellScript --scripts "sudo systemctl restart kubelet" didn't work either

@jalberto (Author)

@andyzhangx if I restart every kubelet I get a new error...

[screenshot from 2018-04-11 17-08-50]

This is a brand-new cluster with almost nonexistent load or stress. These kinds of problems convince me that neither AKS nor acs-engine is ready even for basic usage.

@jalberto (Author)

The only solution that worked is:

  • detach the volume from the portal
  • az vm update
  • delete the pod

This seems to me like a very critical issue
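A rough CLI equivalent of those manual steps (sketch only; DISK_NAME / VM_NAME / POD_NAME are placeholders, and the node resource group on AKS is the MC_* group that appears in the volume paths above):

# 1. detach the disk from whichever VM still holds it
az disk show -g MC_aksprod-1_aksprod-1_westeurope -n DISK_NAME --query managedBy -o tsv
az vm disk detach -g MC_aksprod-1_aksprod-1_westeurope --vm-name VM_NAME -n DISK_NAME
# 2. refresh the VM model
az vm update -g MC_aksprod-1_aksprod-1_westeurope -n VM_NAME
# 3. let the pod be recreated and re-attach cleanly
kubectl delete pod POD_NAME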

@andyzhangx (Contributor)

@jalberto
I have found the fix; this bug is a regression introduced in v1.9.x and v1.10.x. You may use v1.8.11 or later; that version won't have this issue.

@andyzhangx (Contributor)

see kubernetes/kubernetes#62467

@jalberto (Author)

Thanks @andyzhangx for this great work. Sadly there is no way in acs-engine or AKS to downgrade safely (I don't believe it is possible to upgrade safely either, though).

@andyzhangx (Contributor)

@jalberto when that PR is merged into master, I will cherry-pick it to v1.9.x ASAP, and you could build your own hotfix binary based on that fix on top of v1.9.6.

@jalberto (Author)

That sounds great! Thanks again.

@andyzhangx (Contributor)

@jalberto I have rechecked the code logic; it looks like v1.10.0 won't have this issue. You may try that version instead.

@andyzhangx (Contributor)

I also actually tried it on v1.10.0; pod rescheduling is fast and works well.
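For an acs-engine cluster, moving to v1.10.0 would reuse the same upgrade invocation from the original report, only with the new target version (sketch; same placeholders and caveats as above):

acs-engine upgrade \
  --subscription-id xxx \
  --deployment-dir ./_output/xxx \
  --location westeurope \
  --resource-group xxx \
  --upgrade-version 1.10.0 \
  --auth-method client_secret \
  --client-id xxx \
  --client-secret xxx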
