Since Intel CPU Azure update, new Azure Disks are not mounting, very critical... #2002
@theobolo could you run the following command?
@andyzhangx Hi Andy, here it is. Don't rely on the logs in the Kubernetes Dashboard; they aren't always the latest. It's happening on all the new PVCs right now. The real state is that the PVCs are bound, but: the disks are up on the Azure Dashboard as I said and "Bound" in Kubernetes, yet they are not mounted, and they are marked "Unattached" on Azure. I'm also pasting the kubelet log from worker-3, where kafka-0 is supposed to be deployed:
@andyzhangx There it is (the Kubernetes worker-3), but the agents are running on the workers I think, since they were before the update:
It seems that worker-2 has a problem, by the way:
Thanks, so where is the data disk?
@andyzhangx Yep, here it is:
@theobolo then I think it's caused by the VM's failed state; the error (attaching the data disk) points to that.
Yep, but basically worker-3 is in a Running state with no error and with a successful agent running on it. I'm going to try to remove the "Failed" status using your method on the other workers, but I think there is something more.
You may update the worker-3 VM as well if possible. The error said attaching disk 1 failed because attaching disk 2 to the worker-3 VM failed, which is quite weird; there is a possibility that the worker-3 VM is also in a wrong state.
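The update being discussed here is a plain `az vm update` with no property changes, which re-applies the VM model and pushes a VM out of a stuck "Failed" provisioning state. A minimal sketch that only prints the command so it is safe to run anywhere (the resource group and VM name are hypothetical placeholders, not values from this thread):

```shell
# Hypothetical names -- substitute your own resource group and agent VM.
RESOURCE_GROUP="k8s-staging-rg"
VM_NAME="k8s-agent-3"

# 'az vm update' with no flags re-applies the current VM model, which is
# enough to clear a "Failed" provisioning state without a reboot.
CMD="az vm update -g $RESOURCE_GROUP -n $VM_NAME"
echo "$CMD"   # printed instead of executed; drop the echo to run it for real
```

Remove the `echo` indirection once the names point at a real cluster.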
BTW, could you check the status of the data disk?
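Since the portal is reported as laggy further down this thread, the disk's real attachment state can be checked from the CLI: `az disk show` exposes `diskState` (`Attached`/`Unattached`) and `managedBy` (the VM holding the lease). A sketch that prints the command (the group and disk names are hypothetical):

```shell
# Hypothetical names; substitute the managed disk behind the stuck PVC.
RESOURCE_GROUP="k8s-staging-rg"
DISK_NAME="kubernetes-dynamic-pvc-1234"

# diskState tells you Attached vs Unattached; managedBy names the owning VM.
CMD="az disk show -g $RESOURCE_GROUP -n $DISK_NAME --query \"{state:diskState, owner:managedBy}\" -o table"
echo "$CMD"
```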
@andyzhangx Yep, since yesterday I've seen a lot of really weird things in the Azure API. I saw a statement yesterday saying that it's not relevant anymore to look at the Azure messages on the dashboard, because they are delayed or laggy. By the way, I put the disk status in my first message; it's absolutely not attached to any node and never has been.
Now all my nodes seem to be healthy: no more "Failed" state on Azure. I'll delete all the PVCs and the statefulsets and just redeploy Kafka.
The thing is that, when I delete my PVCs manually in Kubernetes, the managed disks are actually deleted! So it seems to be about mounting and unmounting.
Ok so I did my deployment again:
You can see the error on the Azure dashboard right now: Azure considers my new disk already attached or mounted????
Please file an Azure support incident, as this appears related to the Azure-wide reboot caused by Meltdown and Spectre...
Update in this thread: according to Azure/ACS#12 (comment), one customer has successfully fixed this issue by using this PowerShell script to update the agent VM that has the disk-mounting issue.
@brendanburns I already opened an issue Friday about that, at Critical priority; I'm still waiting for an answer :/ @andyzhangx I'm trying some workarounds; I'll redeploy each virtual machine.
@andyzhangx I did the same as @rocketraman. The final solution that worked for me was simply to do the following for every agent node in my cluster, one at a time:
With that I was finally able to recover my nodes. Now my disks are correctly mounted on the workers.
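The node-by-node recovery described in this thread (cordon/drain, restart the VM, uncordon) can be scripted. The sketch below only builds and prints the plan rather than executing it; the resource group and node names are hypothetical, and it assumes the ACS convention that Kubernetes node names match the Azure VM names:

```shell
RESOURCE_GROUP="k8s-staging-rg"                 # hypothetical
NODES="k8s-agent-0 k8s-agent-1 k8s-agent-2"     # hypothetical node/VM names

PLAN=""
for NODE in $NODES; do
  # drain (which cordons first) moves pods off, the VM restart clears the
  # stuck state, and uncordon returns the node to scheduling -- one at a time.
  PLAN="$PLAN
kubectl drain $NODE --ignore-daemonsets --delete-local-data
az vm restart -g $RESOURCE_GROUP -n $NODE
kubectl uncordon $NODE"
done
echo "$PLAN"
```

Doing the nodes strictly one at a time matters: draining several at once forces many disk detach/attach operations in parallel, which is exactly the path that is failing here.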
@theobolo we did the same and had positive results for a short period of time, but the issue returned. We also tried to scale the cluster (from 3 to 6 nodes in our case) and get rid of the first 3 nodes, but it had no effect (the problem appeared again in the virtual machines' logs after a few hours). Azure support's recommendation was the uncordon/cordon method, but it seems ineffective in our deployment.
I have the same error message (#2022), but in my case I can see the disk move correctly to the new node; still, pods are unable to mount the disk. This is very critical, as each time something crashes, pods are not able to recover.
Maybe you're right @andyzhangx, but this started with the security patch release.
@pauloeliasjr sorry, I missed your comments. What's your error? Could you paste it again? There are some errors not related to this issue.
@andyzhangx I just ran into this problem again as well, despite solving it last week after the security updates with the cordon/drain/reboot process on each node :-( I have a persistent volume which is showing "unbound" in the portal, and yet Kubernetes keeps reporting a 409 error for it:
As an experiment I decided to delete the disk entirely. Kubernetes even kept reporting a 409 error after I deleted it! Something is clearly wrong here and needs to be fixed ASAP.
@rocketraman what's your
or the PowerShell solution if the VM status is Failed:
@andyzhangx No, I fixed all of the VM states last week: |
@rocketraman could you use the following command to update the VM?
@andyzhangx Ok wow, that worked and solved the issue. Thank you! Is there a way I can identify other VMs that are in this "reporting Running but not running" state? |
@rocketraman There is a command for that.
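A hedged sketch of such a query: `az vm list --show-details` reports each VM's `provisioningState` and `powerState`, so VMs stuck in "Failed" (or, as above, claiming Running while actually wedged) can at least be surveyed in one pass. The resource group name is a hypothetical placeholder:

```shell
RESOURCE_GROUP="k8s-staging-rg"   # hypothetical

# -d / --show-details adds power state per VM; provisioningState flags
# "Failed" VMs, though a VM can report Succeeded and still misbehave.
CMD="az vm list -g $RESOURCE_GROUP -d --query \"[].{name:name, provisioning:provisioningState, power:powerState}\" -o table"
echo "$CMD"
```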
IMHO this is a very critical problem; we cannot be manually restarting master services or nodes each time a pod needs to move. Is there anyone at Azure working on this?
@andyzhangx according to the Azure portal, the disk is mounted.
@andyzhangx I detached the disk manually (in the portal) and restarted every kubelet. @andyzhangx can you confirm whether this is marked as a critical problem in Azure, or do I need to phone and open a new ticket? This is critical and is blocking my production environment.
Update on this thread: here is the PR: "fix race condition issue when detaching azure disk".
For the moment I can say that it's been stable since January with a cluster on Kubernetes version 1.8.5. I haven't had any other mount/unmount problems...
I'm closing this issue; with Kubernetes 1.10 and the latest updates there are no more issues with disks on Azure. Many thanks to @andyzhangx, you did a really good job!
@theobolo that's not accurate: Azure/AKS#477 is still happening on 1.10.3.
@jalberto Alright, following your case I'm reopening this issue, but I have to say: on a 1.10.2 Kubernetes cluster deployed with ACS Engine version 0.16.2, I have had zero problems with mounting and unmounting disks for a month. Since you're on AKS (GA), maybe the deployment is slightly different; I haven't looked at it. I should add that I run more than 150 pods on that cluster, with more than 50% of them mounting managed disks on the Premium tier. @andyzhangx I'll try @jalberto's scenario on my cluster so that I can hopefully reproduce the issue. Cheers
@theobolo I agree! This was actually solved already, twice as far as I can remember (with hard work by @andyzhangx), but somehow it made a comeback, and that worries me as it raises a lot of concerns about safe upgrades.
@jalberto @andyzhangx Alright guys, I managed to reproduce that issue on my cluster (reminder: 1.10.2 and no AKS, just ACS). I did:
But basically, as @andyzhangx said, I waited 1-2 minutes with the pod in this "error" state and then the pod mounted the PV as expected... and Grafana came up healthy. I also ran another test, but with the "managed-premium" storage class for the disks instead of the "default" one. Cheers
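The reproduction above boils down to forcing a pod with an azure-disk PVC onto another node and watching how long the detach/attach cycle takes. A sketch of that check, printing the steps rather than running them (the pod and namespace names are hypothetical):

```shell
POD="grafana-0"      # hypothetical pod backed by a managed-disk PVC
NS="monitoring"      # hypothetical namespace

# Deleting the pod forces a reschedule; the disk must detach from the old
# node and attach to the new one before the pod can start again. Watching
# the pod's events shows the attach/mount errors and how long they last.
STEPS="kubectl delete pod $POD -n $NS
kubectl get events -n $NS --field-selector involvedObject.name=$POD -w"
echo "$STEPS"
```

If the pod recovers within a couple of minutes, you are seeing the transient error described above; if it stays stuck, you are in the failed-VM-state case from earlier in the thread.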
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contribution. Note that acs-engine is deprecated--see https://github.com/Azure/aks-engine instead. |
Is this a request for help?:
New Azure managed disks are not mounting on the Kubernetes nodes; the Azure API is slow.
Is this an ISSUE or FEATURE REQUEST? (choose one):
It's an ISSUE
What version of acs-engine?:
ACS Engine version: 0.7.0
Kubernetes version: 1.7.5
After the big update yesterday, I'm seeing a lot of errors about attaching or detaching Azure managed disks in Kubernetes, especially the newly created ones.
I recreated all my stacks (Elasticsearch/Kafka/Zookeeper/...) using a PersistentVolumeClaim with my statefulsets after the big update yesterday.
None of my new Azure disks can be mounted on my workers, even though they should be:
I'm posting the logs of the kube-controller-manager:
By the way, it's an issue the Azure community has faced before, and so have I. Someone experienced the same kind of issue yesterday in another Git thread: Azure/ACS#12
This is really critical: during the updates yesterday the machines went down in a normal way, but the Azure API was really slow, and so were mounting and detaching.
Kubernetes was trying to move my pods with their disks to another worker while the current one was upgrading (more than 40 min per machine), but it was impossible since the disks couldn't detach from that worker.
I waited more than an hour to get my 3 MongoDB Azure disks attached for my ReplicaSet after each machine reboot. I thought Azure disks were safe in production, but basically they weren't yesterday.
In conclusion, I haven't been able to mount any disks for my statefulsets since yesterday on my staging cluster.
I rebooted the kube-controller-manager and kubelet on the affected workers, and even restarted the virtual machines; it's always the same detach/attach error.
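For reference, a rough sketch of those restart steps on an acs-engine cluster. This only prints the commands; the assumption (not taken from this thread) is the usual acs-engine layout where kubelet is a systemd unit on each node and the controller manager runs as a container on the master:

```shell
# Assumption: kubelet runs as a systemd unit on ACS-engine nodes, and the
# controller manager runs on the master as a kube-controller-manager container.
KUBELET_RESTART="sudo systemctl restart kubelet"
FIND_CONTROLLER="sudo docker ps --filter name=kube-controller-manager"
echo "$KUBELET_RESTART"     # run on each affected agent node
echo "$FIND_CONTROLLER"     # run on the master to locate the controller container
```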
Please, guys, do you have any info on this? My Direct Professional Support plan e-mail has still not been answered after 24 hours...
How to reproduce it (as minimally and precisely as possible):
Apparently someone else has experienced this: Azure/ACS#12