upgrade k8s process is broken #2022
Is this a technical limitation or a non-implemented feature? The `upgrade` command should probably check for these common pitfalls and throw a warning before beginning the process.
Testing whether deleting the peering makes it work: it does (see the az CLI sketch below), but it takes quite long to upgrade a single master, and even longer for the upgrade command to notice the deployment has finished, as it stays stuck in:
Meanwhile, because of the previous error, there is now no leader, so the whole cluster is unusable. The documentation says this process is idempotent, but the 2nd time I ran the command it didn't try to continue with the already disabled master; it took a new one instead, creating a mess.
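For anyone hitting the same conflict, the peering mentioned above can be listed and removed with the az CLI before retrying the upgrade. A rough sketch, where the resource group, vnet, and peering names are placeholders:

```sh
# List the peerings on the cluster's vnet (names are placeholders).
az network vnet peering list \
  --resource-group my-k8s-rg \
  --vnet-name my-k8s-vnet \
  --output table

# Remove the peering that conflicts with the upgrade; it can be recreated afterwards.
az network vnet peering delete \
  --resource-group my-k8s-rg \
  --vnet-name my-k8s-vnet \
  --name my-peering
```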
As expected, now there is this:
So now what? I already opened a critical issue with Azure support, but no one seems to be taking care of this.
So after several retries every master got updated, but... any clue will be welcome.
Dear diary, 3 days since The incident and by now I have already lost any hope for help. It seems like this place gets more isolated every day.
More funny stuff:
Can you do an |
@brendanburns I was investigating etcd when I found it is not installed, so I tried to install it the acs-engine way:
Lots of fun.
What version of acs-engine are you using? Thanks
@brendanburns 0.11. Thanks for taking a look!
Hi @jalberto could you try to modify the |
I have doubts: why not the latest 3.x version? (I don't want to mess it up more)
I assume your cluster was running a v2 of etcd before; things are likelier to come back online if you use that version. (It's the latest v2 released; v2.5.2 was a typo that should have been v2.2.5, and this was fixed in master a few weeks ago.)
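For reference, manually dropping that etcd release onto an affected master might look roughly like the sketch below. The install path and service name are assumptions about how acs-engine lays out masters, not something stated in this thread:

```sh
# Fetch the etcd v2.2.5 release (the version suggested above).
curl -L -o /tmp/etcd-v2.2.5-linux-amd64.tar.gz \
  https://github.com/coreos/etcd/releases/download/v2.2.5/etcd-v2.2.5-linux-amd64.tar.gz
tar -xzf /tmp/etcd-v2.2.5-linux-amd64.tar.gz -C /tmp

# Install the binaries and restart the (assumed) systemd unit, then sanity-check.
sudo cp /tmp/etcd-v2.2.5-linux-amd64/etcd /tmp/etcd-v2.2.5-linux-amd64/etcdctl /usr/bin/
sudo systemctl restart etcd
etcdctl cluster-health
```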
That worked, thanks! I am trying to resume the upgrade now.
Thanks for being resilient here. You are contributing to acs-engine-driven upgrades becoming more of a 1st-class feature; we appreciate it!
@CecileRobertMichon I think this is not closed yet; I am still evaluating whether everything is right after the upgrade. I can see problems mounting azure-disk volumes.
@jalberto sorry about that, I thought the issue had been resolved. Re-opening it now.
No prob @CecileRobertMichon. So after the upgrade my PVs using azure-file (shared volumes using CIFS) are not automatically created anymore (I have a StorageClass), and even when created by hand they cannot be mounted because of a permission problem. More details on how the StorageClass was created: #1506
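For reference, a minimal azure-file StorageClass of the kind deleted and recreated in the next comment might look like the sketch below. The skuName and the CIFS dir_mode/file_mode mount options (a common workaround for this sort of permission problem) are assumptions, not the exact class from #1506:

```sh
# Recreate an azure-file StorageClass (name and parameters are placeholders).
cat <<'EOF' | kubectl apply -f -
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: azurefile
provisioner: kubernetes.io/azure-file
parameters:
  skuName: Standard_LRS
mountOptions:
  - dir_mode=0777
  - file_mode=0777
EOF
```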
I fixed azure-file by deleting and recreating the StorageClass. Now I have a new, quite critical problem:
This happens each time a pod needs to move from one node to another. I checked in the Azure portal and I can see the disk move to the correct node, and kubectl says the PVC is bound. Probably related to #2002
@jalberto You got this error when moving a pod with an azure disk (not azure file) to another node, right? This could happen since azure disk detach from a node takes minutes in the current k8s version, but waiting a few minutes should work.
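While waiting, the attach/detach progress shows up in the pod's events, so it is worth watching them; for example (pod name and namespace are placeholders):

```sh
# Look for FailedAttachVolume / FailedMount events on the affected pod.
kubectl describe pod my-pod -n my-namespace

# Or scan recent events in the namespace, oldest first, for that pod.
kubectl get events -n my-namespace --sort-by=.lastTimestamp | grep my-pod
```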
@andyzhangx even after 4h it is still not working
@jalberto could you run "kubectl describe po POD-NAME" and paste the output here?
@jalberto and what's your k8s version? Is it a managed disk VM or an unmanaged disk VM? What are your k8s operations?
@jalberto |
Yes, 1.9.6 deployed with acs-engine 0.14.5; you can check my apimodel in #2567. I cannot paste it now, but the error is the typical "Multi attach error".
@andyzhangx, this is the error:
32 mins and counting :)
Hi @jalberto, so the azure disk has been moved to the right node?
@andyzhangx yes, usually a combination of |
@jalberto when scheduling from one node to another, how many disks were attached to one node? I would like to repro this in my env, and you were using deployment, right?
This cluster has only 2 PVs, so from 0 to 2.
Hi @jalberto, I tried to repro it on my testing env (v1.9.6), and after around 9 min all 3 pods were scheduled from node#1 to node#2 successfully. Here are the detailed steps: https://github.com/andyzhangx/demo/tree/master/linux/azuredisk/attach-stress-test/deployment I would like to make sure:
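As a side note on reproducing this: one way to trigger the kind of rescheduling described above (not necessarily the exact steps from the linked guide) is to cordon the first node and evict its pods; the node name is a placeholder:

```sh
# Stop scheduling onto node#1 and evict its pods so they land on another node.
kubectl cordon k8s-agentpool1-0
kubectl drain k8s-agentpool1-0 --ignore-daemonsets --delete-local-data

# Watch the pods with azure-disk PVCs come back up elsewhere.
kubectl get pods -o wide -w

# Undo once done.
kubectl uncordon k8s-agentpool1-0
```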
Thanks @andyzhangx, here you go:
Note that my problems in 1.9.6 are random in both duration and occurrence. Still, 9 min seems excessive for a mount/unmount operation.
@jalberto thanks for the info. I finally reproed this issue by using a StatefulSet (detailed steps with statefulset). To validate my finding, could you help run |
What is the time goal for this task? It should probably be on the order of seconds, not minutes.
@jalberto from your |
@andyzhangx yes, I restarted. You say 1 disk will take ~2 min and 2 disks ~4 min; I guess these operations are done in parallel, so 1 disk and 10 disks should take the same time, right? Still, ~2 min for a detach/attach operation seems quite a lot (taking into consideration that the disks are already formatted and ready to be used). In the current configuration 1 disk easily takes 10 to 32 min, so far from ~2 min (randomly, sometimes, it magically works quite fast).
@andyzhangx this also happens in a vanilla AKS cluster 1.9.6 (just deployed) when trying to upgrade a helm release.
@jalberto is that volume removed in the node while still in that node's |
@jalberto pls also paste the |
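To gather that, it can help to compare what the Kubernetes node object reports with what Azure reports for the VM; a sketch where the node, resource group, and VM names are placeholders:

```sh
# What the Kubernetes node object thinks is attached / in use.
kubectl get node k8s-agentpool1-0 -o jsonpath='{.status.volumesAttached}'; echo
kubectl get node k8s-agentpool1-0 -o jsonpath='{.status.volumesInUse}'; echo

# What Azure reports as data disks attached to that VM.
az vm show --resource-group my-k8s-rg --name k8s-agentpool1-0 \
  --query "storageProfile.dataDisks[].{name:name,lun:lun}" --output table
```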
@andyzhangx 2 hours already :)
@andyzhangx the |
@andyzhangx |
@andyzhangx if I restart every kubelet I get a new error... This is a brand new cluster, with mostly nonexistent load or stress. This kind of problem convinces me that neither AKS nor acs-engine is ready even for basic usage.
The only solution that worked is:
This seems to me like a very critical issue.
@jalberto |
Thanks @andyzhangx for this great work. Sadly there is no safe way to downgrade in acs-engine or AKS (I don't believe it is possible to upgrade in a safe way either, though).
@jalberto when that PR is merged into master, I will cherry-pick it to v1.9.x ASAP, and you could build your own hotfix binary based on that fix on v1.9.6.
That sounds great! Thanks again.
@jalberto I have rechecked the code logic; it looks like v1.10.0 won't have this issue, so you may try that version instead.
Actually I also tried on v1.10.0; pod rescheduling is fast and works well.
Is this a request for help?:
yes
Is this an ISSUE or FEATURE REQUEST? (choose one):
ISSUE
What version of acs-engine?:
0.11
Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
1.7.6 -> 1.8.4
What happened:
Running:
got:
In azure portal:
What you expected to happen:
Upgrade cluster without errors
How to reproduce it (as minimally and precisely as possible):
Follow upgrade docs: https://github.com/Azure/acs-engine/tree/master/examples/k8s-upgrade
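For context, the upgrade command roughly follows the shape below. The flag names changed across acs-engine releases, so treat this as a sketch of the invocation from those docs rather than the exact command that was run (all IDs, names, and paths are placeholders):

```sh
acs-engine upgrade \
  --subscription-id 00000000-0000-0000-0000-000000000000 \
  --resource-group my-k8s-rg \
  --location westeurope \
  --deployment-dir _output/my-k8s-cluster \
  --upgrade-version 1.8.4 \
  --auth-method client_secret \
  --client-id 00000000-0000-0000-0000-000000000000 \
  --client-secret xxxx
```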
Anything else we need to know:
I am using vnet peering configured in azure portal, not using custom vnets
This actually left my cluster in bad shape, so it is quite urgent.