Since Intel CPU Azure update, new Azure Disks are not mounting, very critical... #2002
@theobolo could you run the following command?
@andyzhangx Hi Andy, here it is. Don't rely on the logs in the Kubernetes Dashboard; they aren't always the latest. It's happening on all the new PVCs right now. The real state is that the PVCs are bound, but: the disks are up on the Azure Dashboard as I said and "Bound" in Kubernetes, yet they are not mounted, and they are marked "Unattached" on Azure. I'm also pasting the kubelet log from worker-3, where kafka-0 is supposed to be deployed:
@andyzhangx There it is (the Kubernetes worker-3), but the agents are running on the workers I think, since they were before the update:
It seems that worker-2 has a problem, by the way:
Thanks, so where is the data disk?
@andyzhangx Yep, here it is:
@theobolo then I think it's caused by the VM's failed state; the error (attaching the data disk) points to that.
Yep, but basically worker-3 is in a Running state with no error and with a successful agent running on it. I'm going to try to remove the "Failed" status using your method on the other workers, but I think there is something more.
You may update the worker-3 VM as well if possible. The error said attaching disk 1 failed because attaching disk 2 to the worker-3 VM failed, which is quite weird; there is a possibility that the worker-3 VM is also in a wrong state.
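The update being discussed here is a plain `az vm update` with no property changes, which re-applies the VM model and pushes a VM out of a stuck "Failed" provisioning state. A minimal sketch that only prints the command so it is safe to run anywhere (the resource group and VM name are hypothetical placeholders, not values from this thread):

```shell
# Hypothetical names -- substitute your own resource group and agent VM.
RESOURCE_GROUP="k8s-staging-rg"
VM_NAME="k8s-agent-3"

# 'az vm update' with no flags re-applies the current VM model, which is
# enough to clear a "Failed" provisioning state without a reboot.
CMD="az vm update -g $RESOURCE_GROUP -n $VM_NAME"
echo "$CMD"   # printed instead of executed; drop the echo to run it for real
```

Remove the `echo` indirection once the names point at a real cluster.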
BTW, could you check the status of the data disk?
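Since the portal is reported as laggy further down this thread, the disk's real attachment state can be checked from the CLI: `az disk show` exposes `diskState` (`Attached`/`Unattached`) and `managedBy` (the VM holding the lease). A sketch that prints the command (the group and disk names are hypothetical):

```shell
# Hypothetical names; substitute the managed disk behind the stuck PVC.
RESOURCE_GROUP="k8s-staging-rg"
DISK_NAME="kubernetes-dynamic-pvc-1234"

# diskState tells you Attached vs Unattached; managedBy names the owning VM.
CMD="az disk show -g $RESOURCE_GROUP -n $DISK_NAME --query \"{state:diskState, owner:managedBy}\" -o table"
echo "$CMD"
```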
@andyzhangx Yep, since yesterday I've seen a lot of really weird things in the Azure API. I saw a statement yesterday saying that it's not relevant anymore to look at the Azure messages on the dashboard, because they are delayed or laggy. By the way, I put the disk status in my first message; it's absolutely not attached to any node and never has been.
Now all my nodes seem to be healthy: no more "Failed" state on Azure. I'll delete all the PVCs and the statefulsets and just redeploy Kafka.
The thing is that, when I delete my PVCs manually in Kubernetes, the managed disks are actually deleted! So it seems to be about mounting and unmounting.
Ok so I did my deployment again:
You can see the error on the Azure dashboard right now: Azure considers my new disk already attached or mounted????
Please file an Azure support incident, as this appears related to the Azure-wide reboot caused by Meltdown and Spectre...
Update in this thread: according to Azure/ACS#12 (comment), one customer has successfully fixed this issue by using this PowerShell script to update the agent VM that has the disk-mounting issue.
@brendanburns I already opened an issue Friday about that, at Critical priority; I'm still waiting for an answer :/ @andyzhangx I'm trying some workarounds; I'll redeploy each virtual machine.
@andyzhangx I did the same as @rocketraman. The final solution that worked for me was simply to do the following for every agent node in my cluster, one at a time:
With that I was finally able to recover my nodes. Now my disks are correctly mounted on the workers.
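The node-by-node recovery described in this thread (cordon/drain, restart the VM, uncordon) can be scripted. The sketch below only builds and prints the plan rather than executing it; the resource group and node names are hypothetical, and it assumes the ACS convention that Kubernetes node names match the Azure VM names:

```shell
RESOURCE_GROUP="k8s-staging-rg"                 # hypothetical
NODES="k8s-agent-0 k8s-agent-1 k8s-agent-2"     # hypothetical node/VM names

PLAN=""
for NODE in $NODES; do
  # drain (which cordons first) moves pods off, the VM restart clears the
  # stuck state, and uncordon returns the node to scheduling -- one at a time.
  PLAN="$PLAN
kubectl drain $NODE --ignore-daemonsets --delete-local-data
az vm restart -g $RESOURCE_GROUP -n $NODE
kubectl uncordon $NODE"
done
echo "$PLAN"
```

Doing the nodes strictly one at a time matters: draining several at once forces many disk detach/attach operations in parallel, which is exactly the path that is failing here.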
@theobolo we did the same and had positive results for a short period of time, but the issue returned. We also tried to scale the cluster (from 3 to 6 nodes in our case) and get rid of the first 3 nodes, but it had no effect (the problem appeared again in the virtual machines' logs after a few hours). Azure support's recommendation was the uncordon/cordon method, but it seems ineffective in our deployment.
I have the same error message (#2022), but in my case I can see the disk move correctly to the new node; still, pods are unable to mount the disk. This is very critical, as each time something crashes, pods are not able to recover.
Maybe you're right @andyzhangx, but this started with the security patch release.
@pauloeliasjr sorry, I missed your comments. What's your error? Could you paste it again? There are some errors not related to this issue.
@andyzhangx I just ran into this problem again as well, despite solving it last week after the security updates with the cordon/drain/reboot process on each node :-( I have a persistent volume which is showing "unbound" in the portal, and yet Kubernetes keeps reporting a 409 error for it:
As an experiment I decided to delete the disk entirely. Kubernetes even kept reporting a 409 error after I deleted it! Something is clearly wrong here and needs to be fixed ASAP.
@rocketraman what's your
or the PowerShell solution if the VM status is Failed:
@andyzhangx No, I fixed all of the VM states last week: |
@rocketraman could you use the following command to update the VM?
@andyzhangx Ok wow, that worked and solved the issue. Thank you! Is there a way I can identify other VMs that are in this "reporting Running but not running" state? |
@rocketraman There is a command for that.
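A hedged sketch of such a query: `az vm list --show-details` reports each VM's `provisioningState` and `powerState`, so VMs stuck in "Failed" (or, as above, claiming Running while actually wedged) can at least be surveyed in one pass. The resource group name is a hypothetical placeholder:

```shell
RESOURCE_GROUP="k8s-staging-rg"   # hypothetical

# -d / --show-details adds power state per VM; provisioningState flags
# "Failed" VMs, though a VM can report Succeeded and still misbehave.
CMD="az vm list -g $RESOURCE_GROUP -d --query \"[].{name:name, provisioning:provisioningState, power:powerState}\" -o table"
echo "$CMD"
```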
IMHO this is a very critical problem; we cannot be manually restarting master services or nodes each time a pod needs to move. Is there anyone at Azure working on this?
@andyzhangx according to the Azure portal, the disk is mounted.
@andyzhangx I detached the disk manually (in the portal) and restarted every kubelet. @andyzhangx can you confirm whether this is marked as a critical problem in Azure, or do I need to phone and open a new ticket? This is critical and is blocking my production environment.
Update on this thread: here is the PR: "fix race condition issue when detaching azure disk".
For the moment I can say that it's been stable since January with a cluster on Kubernetes version 1.8.5. I haven't had any other mount/unmount problems...
I'm closing this issue; with Kubernetes 1.10 and the latest updates there are no more issues with disks on Azure. Many thanks to @andyzhangx, you did a really good job!
@theobolo that's not accurate: Azure/AKS#477 is still happening on 1.10.3.
@jalberto Alright, following your case I'm reopening this issue, but I have to say: on a 1.10.2 Kubernetes cluster deployed with ACS Engine version 0.16.2, I have had zero problems with mounting and unmounting disks for a month. Since you're on AKS (GA), maybe the deployment is slightly different; I haven't looked at it. I should add that I run more than 150 pods on that cluster, with more than 50% of them mounting managed disks on the Premium tier. @andyzhangx I'll try @jalberto's scenario on my cluster so that I can hopefully reproduce the issue. Cheers
@theobolo I agree! This was actually solved already, twice as far as I can remember (with hard work by @andyzhangx), but somehow it made a comeback, and that worries me as it raises a lot of concerns about safe upgrades.
@jalberto @andyzhangx Alright guys, I managed to reproduce that issue on my cluster (reminder: 1.10.2 and no AKS, just ACS). I did:
But basically, as @andyzhangx said, I waited 1-2 minutes with the pod in this "error" state and then the pod mounted the PV as expected... and Grafana came up healthy. I also ran another test, but with the "managed-premium" storage class for the disks instead of the "default" one. Cheers
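The reproduction above boils down to forcing a pod with an azure-disk PVC onto another node and watching how long the detach/attach cycle takes. A sketch of that check, printing the steps rather than running them (the pod and namespace names are hypothetical):

```shell
POD="grafana-0"      # hypothetical pod backed by a managed-disk PVC
NS="monitoring"      # hypothetical namespace

# Deleting the pod forces a reschedule; the disk must detach from the old
# node and attach to the new one before the pod can start again. Watching
# the pod's events shows the attach/mount errors and how long they last.
STEPS="kubectl delete pod $POD -n $NS
kubectl get events -n $NS --field-selector involvedObject.name=$POD -w"
echo "$STEPS"
```

If the pod recovers within a couple of minutes, you are seeing the transient error described above; if it stays stuck, you are in the failed-VM-state case from earlier in the thread.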
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contribution. Note that acs-engine is deprecated--see https://github.com/Azure/aks-engine instead. |
Is this a request for help?:
New Azure managed disks are not mounting on the Kubernetes nodes; the Azure API is slow.
Is this an ISSUE or FEATURE REQUEST? (choose one):
It's an ISSUE
What version of acs-engine?:
ACS Engine version: 0.7.0
Kubernetes version: 1.7.5
After the big update yesterday, I'm seeing a lot of errors about attaching or detaching Azure managed disks in Kubernetes, especially the newly created ones.
I recreated all my stacks (Elasticsearch/Kafka/Zookeeper/...) using a PersistentVolumeClaim with my statefulsets after the big update yesterday.
None of my new Azure disks can be mounted on my workers, even though they should be:
I'm posting the logs of the kube-controller-manager:
By the way, it's an issue the Azure community has faced before, and so have I. Someone experienced the same kind of issue yesterday in another Git thread: Azure/ACS#12
This is really critical: during the updates yesterday the machines went down in a normal way, but the Azure API was really slow, and so were mounting and detaching.
Kubernetes was trying to move my pods with their disks to another worker while the current one was upgrading (more than 40 min per machine), but it was impossible since the disks couldn't detach from that worker.
I waited more than an hour to get my 3 MongoDB Azure disks attached for my ReplicaSet after each machine reboot. I thought Azure disks were safe in production, but basically they weren't yesterday.
In conclusion, I haven't been able to mount any disks for my statefulsets since yesterday on my staging cluster.
I rebooted the kube-controller-manager and kubelet on the affected workers, and even restarted the virtual machines; it's always the same detach/attach error.
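For reference, a rough sketch of those restart steps on an acs-engine cluster. This only prints the commands; the assumption (not taken from this thread) is the usual acs-engine layout where kubelet is a systemd unit on each node and the controller manager runs as a container on the master:

```shell
# Assumption: kubelet runs as a systemd unit on ACS-engine nodes, and the
# controller manager runs on the master as a kube-controller-manager container.
KUBELET_RESTART="sudo systemctl restart kubelet"
FIND_CONTROLLER="sudo docker ps --filter name=kube-controller-manager"
echo "$KUBELET_RESTART"     # run on each affected agent node
echo "$FIND_CONTROLLER"     # run on the master to locate the controller container
```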
Please, guys, do you have any info on this? My Direct Professional Support plan e-mail has still not been answered after 24 hours...
How to reproduce it (as minimally and precisely as possible):
Apparently someone else has experienced this: Azure/ACS#12