This repository has been archived by the owner on Jan 11, 2023. It is now read-only.

Since Intel CPU Azure update, new Azure Disks are not mounting, very critical... #2002

Closed
theobolo opened this issue Jan 5, 2018 · 42 comments

@theobolo

theobolo commented Jan 5, 2018

Is this a request for help?:

New Azure managed disks are not mounting on the Kubernetes nodes; the Azure API is slow.

Is this an ISSUE or FEATURE REQUEST? (choose one):

It's an ISSUE

What version of acs-engine?:

ACS Engine version : 0.7.0
Kubernetes version : 1.7.5

kubectl version
Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.6", GitCommit:"4bc5e7f9a6c25dc4c03d4d656f2cefd21540e28c", GitTreeState:"clean", BuildDate:"2017-09-14T06:55:55Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.5", GitCommit:"17d7182a7ccbb167074be7a87f0a68bd00d58d97", GitTreeState:"clean", BuildDate:"2017-08-31T08:56:23Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

After the big update yesterday, I'm seeing a lot of errors about attaching and detaching Azure Managed Disks in Kubernetes, especially on the newly created ones.

After the big update yesterday, I recreated all my stacks (Elasticsearch/Kafka/Zookeeper/...) using a PersistentVolumeClaim with my StatefulSets.

None of the new Azure Disks can be mounted on my workers, even though they should be:

[screenshots]

Here are the logs from the kube-controller-manager:

W0105 10:24:00.133401       1 reconciler.go:267] Multi-Attach error for volume "pvc-4c0b9e2b-a48d-11e7-a344-000d3ab7665a" (UniqueName: "kubernetes.io/azure-disk//subscriptions/3770e200-bb58-4d3b-afcb-66c7ce083c3f/resourceGroups/k8s-fleeters-cluster-preproduction/providers/Microsoft.Compute/disks/fleeters-59cd2dcd-dynamic-pvc-4c0b9e2b-a48d-11e7-a344-000d3ab7665a") from node "k8s-k8sworkers-20163042-3" Volume is already exclusively attached to one node and can't be attached to another
I0105 10:24:00.133439       1 event.go:218] Event(v1.ObjectReference{Kind:"Pod", Namespace:"integration", Name:"grafana-1826545959-8qw5q", UID:"20425f47-f1f7-11e7-92c5-000d3ab769d6", APIVersion:"v1", ResourceVersion:"22791080", FieldPath:""}): type: 'Warning' reason: 'FailedAttachVolume' Multi-Attach error for volume "pvc-4c0b9e2b-a48d-11e7-a344-000d3ab7665a" Volume is already exclusively attached to one node and can't be attached to another
I0105 10:24:00.133458       1 event.go:218] Event(v1.ObjectReference{Kind:"Pod", Namespace:"preproduction", Name:"mongo-0", UID:"5492929f-f1b4-11e7-92c5-000d3ab769d6", APIVersion:"v1", ResourceVersion:"22247369", FieldPath:""}): type: 'Warning' reason: 'FailedAttachVolume' Multi-Attach error for volume "pvc-8743f77c-a472-11e7-b780-000d3ab769d6" Volume is already exclusively attached to one node and can't be attached to another
W0105 10:24:00.233803       1 reconciler.go:267] Multi-Attach error for volume "pvc-8743f77c-a472-11e7-b780-000d3ab769d6" (UniqueName: "kubernetes.io/azure-disk//subscriptions/3770e200-bb58-4d3b-afcb-66c7ce083c3f/resourceGroups/k8s-fleeters-cluster-preproduction/providers/Microsoft.Compute/disks/fleeters-59cd2dcd-dynamic-pvc-8743f77c-a472-11e7-b780-000d3ab769d6") from node "k8s-k8sworkers-20163042-2" Volume is already exclusively attached to one node and can't be attached to another
I0105 10:24:00.233910       1 event.go:218] Event(v1.ObjectReference{Kind:"Pod", Namespace:"preproduction", Name:"mongo-0", UID:"5492929f-f1b4-11e7-92c5-000d3ab769d6", APIVersion:"v1", ResourceVersion:"22247369", FieldPath:""}): type: 'Warning' reason: 'FailedAttachVolume' Multi-Attach error for volume "pvc-8743f77c-a472-11e7-b780-000d3ab769d6" Volume is already exclusively attached to one node and can't be attached to another
W0105 10:24:00.234034       1 reconciler.go:267] Multi-Attach error for volume "pvc-4c0b9e2b-a48d-11e7-a344-000d3ab7665a" (UniqueName: "kubernetes.io/azure-disk//subscriptions/3770e200-bb58-4d3b-afcb-66c7ce083c3f/resourceGroups/k8s-fleeters-cluster-preproduction/providers/Microsoft.Compute/disks/fleeters-59cd2dcd-dynamic-pvc-4c0b9e2b-a48d-11e7-a344-000d3ab7665a") from node "k8s-k8sworkers-20163042-3" Volume is already exclusively attached to one node and can't be attached to another
I0105 10:24:00.234063       1 event.go:218] Event(v1.ObjectReference{Kind:"Pod", Namespace:"integration", Name:"grafana-1826545959-8qw5q", UID:"20425f47-f1f7-11e7-92c5-000d3ab769d6", APIVersion:"v1", ResourceVersion:"22791080", FieldPath:""}): type: 'Warning' reason: 'FailedAttachVolume' Multi-Attach error for volume "pvc-4c0b9e2b-a48d-11e7-a344-000d3ab7665a" Volume is already exclusively attached to one node and can't be attached to another
W0105 10:24:00.334429       1 reconciler.go:267] Multi-Attach error for volume "pvc-8743f77c-a472-11e7-b780-000d3ab769d6" (UniqueName: "kubernetes.io/azure-disk//subscriptions/3770e200-bb58-4d3b-afcb-66c7ce083c3f/resourceGroups/k8s-fleeters-cluster-preproduction/providers/Microsoft.Compute/disks/fleeters-59cd2dcd-dynamic-pvc-8743f77c-a472-11e7-b780-000d3ab769d6") from node "k8s-k8sworkers-20163042-2" Volume is already exclusively attached to one node and can't be attached to another
W0105 10:24:00.334518       1 reconciler.go:267] Multi-Attach error for volume "pvc-4c0b9e2b-a48d-11e7-a344-000d3ab7665a" (UniqueName: "kubernetes.io/azure-disk//subscriptions/3770e200-bb58-4d3b-afcb-66c7ce083c3f/resourceGroups/k8s-fleeters-cluster-preproduction/providers/Microsoft.Compute/disks/fleeters-59cd2dcd-dynamic-pvc-4c0b9e2b-a48d-11e7-a344-000d3ab7665a") from node "k8s-k8sworkers-20163042-3" Volume is already exclusively attached to one node and can't be attached to another
I0105 10:24:00.334462       1 event.go:218] Event(v1.ObjectReference{Kind:"Pod", Namespace:"preproduction", Name:"mongo-0", UID:"5492929f-f1b4-11e7-92c5-000d3ab769d6", APIVersion:"v1", ResourceVersion:"22247369", FieldPath:""}): type: 'Warning' reason: 'FailedAttachVolume' Multi-Attach error for volume "pvc-8743f77c-a472-11e7-b780-000d3ab769d6" Volume is already exclusively attached to one node and can't be attached to another
I0105 10:24:00.334586       1 event.go:218] Event(v1.ObjectReference{Kind:"Pod", Namespace:"integration", Name:"grafana-1826545959-8qw5q", UID:"20425f47-f1f7-11e7-92c5-000d3ab769d6", APIVersion:"v1", ResourceVersion:"22791080", FieldPath:""}): type: 'Warning' reason: 'FailedAttachVolume' Multi-Attach error for volume "pvc-4c0b9e2b-a48d-11e7-a344-000d3ab7665a" Volume is already exclusively attached to one node and can't be attached to another
W0105 10:24:00.434953       1 reconciler.go:267] Multi-Attach error for volume "pvc-8743f77c-a472-11e7-b780-000d3ab769d6" (UniqueName: "kubernetes.io/azure-disk//subscriptions/3770e200-bb58-4d3b-afcb-66c7ce083c3f/resourceGroups/k8s-fleeters-cluster-preproduction/providers/Microsoft.Compute/disks/fleeters-59cd2dcd-dynamic-pvc-8743f77c-a472-11e7-b780-000d3ab769d6") from node "k8s-k8sworkers-20163042-2" Volume is already exclusively attached to one node and can't be attached to another
I0105 10:24:00.434987       1 event.go:218] Event(v1.ObjectReference{Kind:"Pod", Namespace:"preproduction", Name:"mongo-0", UID:"5492929f-f1b4-11e7-92c5-000d3ab769d6", APIVersion:"v1", ResourceVersion:"22247369", FieldPath:""}): type: 'Warning' reason: 'FailedAttachVolume' Multi-Attach error for volume "pvc-8743f77c-a472-11e7-b780-000d3ab769d6" Volume is already exclusively attached to one node and can't be attached to another
W0105 10:24:00.435103       1 reconciler.go:267] Multi-Attach error for volume "pvc-4c0b9e2b-a48d-11e7-a344-000d3ab7665a" (UniqueName: "kubernetes.io/azure-disk//subscriptions/3770e200-bb58-4d3b-afcb-66c7ce083c3f/resourceGroups/k8s-fleeters-cluster-preproduction/providers/Microsoft.Compute/disks/fleeters-59cd2dcd-dynamic-pvc-4c0b9e2b-a48d-11e7-a344-000d3ab7665a") from node "k8s-k8sworkers-20163042-3" Volume is already exclusively attached to one node and can't be attached to another
I0105 10:24:00.435139       1 event.go:218] Event(v1.ObjectReference{Kind:"Pod", Namespace:"integration", Name:"grafana-1826545959-8qw5q", UID:"20425f47-f1f7-11e7-92c5-000d3ab769d6", APIVersion:"v1", ResourceVersion:"22791080", FieldPath:""}): type: 'Warning' reason: 'FailedAttachVolume' Multi-Attach error for volume "pvc-4c0b9e2b-a48d-11e7-a344-000d3ab7665a" Volume is already exclusively attached to one node and can't be attached to another
W0105 10:24:00.535432       1 reconciler.go:267] Multi-Attach error for volume "pvc-4c0b9e2b-a48d-11e7-a344-000d3ab7665a" (UniqueName: "kubernetes.io/azure-disk//subscriptions/3770e200-bb58-4d3b-afcb-66c7ce083c3f/resourceGroups/k8s-fleeters-cluster-preproduction/providers/Microsoft.Compute/disks/fleeters-59cd2dcd-dynamic-pvc-4c0b9e2b-a48d-11e7-a344-000d3ab7665a") from node "k8s-k8sworkers-20163042-3" Volume is already exclusively attached to one node and can't be attached to another
I0105 10:24:00.535472       1 event.go:218] Event(v1.ObjectReference{Kind:"Pod", Namespace:"integration", Name:"grafana-1826545959-8qw5q", UID:"20425f47-f1f7-11e7-92c5-000d3ab769d6", APIVersion:"v1", ResourceVersion:"22791080", FieldPath:""}): type: 'Warning' reason: 'FailedAttachVolume' Multi-Attach error for volume "pvc-4c0b9e2b-a48d-11e7-a344-000d3ab7665a" Volume is already exclusively attached to one node and can't be attached to another
W0105 10:24:00.535692       1 reconciler.go:267] Multi-Attach error for volume "pvc-8743f77c-a472-11e7-b780-000d3ab769d6" (UniqueName: "kubernetes.io/azure-disk//subscriptions/3770e200-bb58-4d3b-afcb-66c7ce083c3f/resourceGroups/k8s-fleeters-cluster-preproduction/providers/Microsoft.Compute/disks/fleeters-59cd2dcd-dynamic-pvc-8743f77c-a472-11e7-b780-000d3ab769d6") from node "k8s-k8sworkers-20163042-2" Volume is already exclusively attached to one node and can't be attached to another
I0105 10:24:00.535715       1 event.go:218] Event(v1.ObjectReference{Kind:"Pod", Namespace:"preproduction", Name:"mongo-0", UID:"5492929f-f1b4-11e7-92c5-000d3ab769d6", APIVersion:"v1", ResourceVersion:"22247369", FieldPath:""}): type: 'Warning' reason: 'FailedAttachVolume' Multi-Attach error for volume "pvc-8743f77c-a472-11e7-b780-000d3ab769d6" Volume is already exclusively attached to one node and can't be attached to another
W0105 10:24:00.636100       1 reconciler.go:267] Multi-Attach error for volume "pvc-8743f77c-a472-11e7-b780-000d3ab769d6" (UniqueName: "kubernetes.io/azure-disk//subscriptions/3770e200-bb58-4d3b-afcb-66c7ce083c3f/resourceGroups/k8s-fleeters-cluster-preproduction/providers/Microsoft.Compute/disks/fleeters-59cd2dcd-dynamic-pvc-8743f77c-a472-11e7-b780-000d3ab769d6") from node "k8s-k8sworkers-20163042-2" Volume is already exclusively attached to one node and can't be attached to another
W0105 10:24:00.636142       1 reconciler.go:267] Multi-Attach error for volume "pvc-4c0b9e2b-a48d-11e7-a344-000d3ab7665a" (UniqueName: "kubernetes.io/azure-disk//subscriptions/3770e200-bb58-4d3b-afcb-66c7ce083c3f/resourceGroups/k8s-fleeters-cluster-preproduction/providers/Microsoft.Compute/disks/fleeters-59cd2dcd-dynamic-pvc-4c0b9e2b-a48d-11e7-a344-000d3ab7665a") from node "k8s-k8sworkers-20163042-3" Volume is already exclusively attached to one node and can't be attached to another
I0105 10:24:00.636168       1 event.go:218] Event(v1.ObjectReference{Kind:"Pod", Namespace:"integration", Name:"grafana-1826545959-8qw5q", UID:"20425f47-f1f7-11e7-92c5-000d3ab769d6", APIVersion:"v1", ResourceVersion:"22791080", FieldPath:""}): type: 'Warning' reason: 'FailedAttachVolume' Multi-Attach error for volume "pvc-4c0b9e2b-a48d-11e7-a344-000d3ab7665a" Volume is already exclusively attached to one node and can't be attached to another
I0105 10:24:00.636183       1 event.go:218] Event(v1.ObjectReference{Kind:"Pod", Namespace:"preproduction", Name:"mongo-0", UID:"5492929f-f1b4-11e7-92c5-000d3ab769d6", APIVersion:"v1", ResourceVersion:"22247369", FieldPath:""}): type: 'Warning' reason: 'FailedAttachVolume' Multi-Attach error for volume "pvc-8743f77c-a472-11e7-b780-000d3ab769d6" Volume is already exclusively attached to one node and can't be attached to another
W0105 10:24:00.736504       1 reconciler.go:267] Multi-Attach error for volume "pvc-8743f77c-a472-11e7-b780-000d3ab769d6" (UniqueName: "kubernetes.io/azure-disk//subscriptions/3770e200-bb58-4d3b-afcb-66c7ce083c3f/resourceGroups/k8s-fleeters-cluster-preproduction/providers/Microsoft.Compute/disks/fleeters-59cd2dcd-dynamic-pvc-8743f77c-a472-11e7-b780-000d3ab769d6") from node "k8s-k8sworkers-20163042-2" Volume is already exclusively attached to one node and can't be attached to another

By the way, this is an issue the Azure community has faced before, and so have I. Someone ran into the same kind of problem yesterday on another GitHub thread: Azure/ACS#12

This is really critical. During yesterday's updates the machines went down in an orderly way, but the Azure API was really slow, and so were disk attach and detach operations.
Kubernetes tried to reschedule my pods with their disks to another worker while the current one was upgrading (more than 40 min per machine), but it couldn't, since the disks were unable to detach from that worker.

After each machine reboot I waited almost more than an hour for the three MongoDB Azure Disks of my ReplicaSet to attach. I thought Azure Disks were safe for production, but yesterday they basically weren't.

In conclusion, since yesterday I am not able to mount any disks for my StatefulSets on my staging cluster.

I restarted the kube-controller-manager and the kubelet on the affected workers, and even restarted the virtual machines; it's always the same detach/attach error.

Please, do you have any information on this? The mail to my Direct Professional Support plan has gone unanswered for 24 hours...

How to reproduce it (as minimally and precisely as possible):

Apparently someone else has experienced this: Azure/ACS#12

@andyzhangx
Contributor

@theobolo could you run kubectl describe pvc PVC-NAME? The screenshot says "PersistentVolumeClaim is not bound".
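For reference, a minimal form of that check (the claim name below is a placeholder; use the one from the screenshot, the namespace is one from the logs above):

kubectl get pvc -n preproduction
kubectl describe pvc <pvc-name> -n preproduction
# the Status, Volume and Events sections show whether the claim is bound and any attach errors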

@theobolo
Author

theobolo commented Jan 5, 2018

@andyzhangx Hi Andy, here it is:

[screenshot]

Don't rely on the log shown in the Kubernetes Dashboard; it doesn't always show the latest entries. This is happening on all the new PVCs right now. The real state is that the PVC is bound, but:

[screenshot]

The disks exist in the Azure dashboard, as I said, and the PVCs are "Bound" in Kubernetes, but the disks are not mounted, even though Azure marks them as "Unattached".

I'm also pasting the kubelet log from worker-3, where kafka-0 is supposed to be deployed:

[screenshot]

@andyzhangx
Contributor

@theobolo
Could you use the Azure CLI to run az vm show -g RESOURCE-GROUP -n AGENT-VM-NAME -d to get the agent VM details? You may need to run az login first. Thanks.

@theobolo
Author

theobolo commented Jan 5, 2018

@andyzhangx Here it is (Kubernetes worker 3). The agents are running on the workers, I think, since they were before the update:

ghosty@workstation-ghosty:~/kubernetes-deployment/kafka_deployment$ az vm show -g k8s-fleeters-cluster-preproduction -n k8s-k8sworkers-20163042-3 -d
{
  "additionalProperties": {},
  "availabilitySet": {
    "additionalProperties": {},
    "id": "/subscriptions/3770e200-bb58-4d3b-afcb-66c7ce083c3f/resourceGroups/k8s-fleeters-cluster-preproduction/providers/Microsoft.Compute/availabilitySets/K8SWORKERS-AVAILABILITYSET-20163042",
    "resourceGroup": "k8s-fleeters-cluster-preproduction"
  },
  "diagnosticsProfile": null,
  "fqdns": "",
  "hardwareProfile": {
    "additionalProperties": {},
    "vmSize": "Standard_DS13_v2_Promo"
  },
  "id": "/subscriptions/3770e200-bb58-4d3b-afcb-66c7ce083c3f/resourceGroups/k8s-fleeters-cluster-preproduction/providers/Microsoft.Compute/virtualMachines/k8s-k8sworkers-20163042-3",
  "identity": null,
  "licenseType": null,
  "location": "northeurope",
  "macAddresses": "00-0D-3A-B7-64-8B",
  "name": "k8s-k8sworkers-20163042-3",
  "networkProfile": {
    "additionalProperties": {},
    "networkInterfaces": [
      {
        "additionalProperties": {},
        "id": "/subscriptions/3770e200-bb58-4d3b-afcb-66c7ce083c3f/resourceGroups/k8s-fleeters-cluster-preproduction/providers/Microsoft.Network/networkInterfaces/k8s-k8sworkers-20163042-nic-3",
        "primary": null,
        "resourceGroup": "k8s-fleeters-cluster-preproduction"
      }
    ]
  },
  "osProfile": {
    "additionalProperties": {},
    "adminPassword": null,
    "adminUsername": "fleeters",
    "computerName": "k8s-k8sworkers-20163042-3",
    "customData": null,
    "linuxConfiguration": {
      "additionalProperties": {},
      "disablePasswordAuthentication": true,
      "ssh": {
        "additionalProperties": {},
        "publicKeys": [
          {
            "additionalProperties": {},
            "keyData": "HIDDEN",
            "path": "/home/fleeters/.ssh/authorized_keys"
          }
        ]
      }
    },
    "secrets": [],
    "windowsConfiguration": null
  },
  "plan": null,
  "powerState": "VM running",
  "privateIps": "10.240.0.5",
  "provisioningState": "Succeeded",
  "publicIps": "",
  "resourceGroup": "k8s-fleeters-cluster-preproduction",
  "resources": [
    {
      "additionalProperties": {},
      "autoUpgradeMinorVersion": true,
      "forceUpdateTag": null,
      "id": "/subscriptions/3770e200-bb58-4d3b-afcb-66c7ce083c3f/resourceGroups/k8s-fleeters-cluster-preproduction/providers/Microsoft.Compute/virtualMachines/k8s-k8sworkers-20163042-3/extensions/cse3",
      "instanceView": null,
      "location": "northeurope",
      "name": "cse3",
      "protectedSettings": null,
      "provisioningState": "Succeeded",
      "publisher": "Microsoft.Azure.Extensions",
      "resourceGroup": "k8s-fleeters-cluster-preproduction",
      "settings": {},
      "tags": null,
      "type": "Microsoft.Compute/virtualMachines/extensions",
      "typeHandlerVersion": "2.0",
      "virtualMachineExtensionType": "CustomScript"
    }
  ],
  "storageProfile": {
    "additionalProperties": {},
    "dataDisks": [],
    "imageReference": {
      "additionalProperties": {},
      "id": null,
      "offer": "UbuntuServer",
      "publisher": "Canonical",
      "sku": "16.04-LTS",
      "version": "16.04.201706191"
    },
    "osDisk": {
      "additionalProperties": {},
      "caching": "ReadWrite",
      "createOption": "FromImage",
      "diskSizeGb": 128,
      "encryptionSettings": null,
      "image": null,
      "managedDisk": {
        "additionalProperties": {},
        "id": "/subscriptions/3770e200-bb58-4d3b-afcb-66c7ce083c3f/resourceGroups/k8s-fleeters-cluster-preproduction/providers/Microsoft.Compute/disks/k8s-k8sworkers-20163042-3_OsDisk_1_b24ffe34a2504df2924985f9c9fd013f",
        "resourceGroup": "k8s-fleeters-cluster-preproduction",
        "storageAccountType": "Premium_LRS"
      },
      "name": "k8s-k8sworkers-20163042-3_OsDisk_1_b24ffe34a2504df2924985f9c9fd013f",
      "osType": "Linux",
      "vhd": null
    }
  },
  "tags": {
    "creationSource": "acsengine-k8s-k8sworkers-20163042-3",
    "orchestrator": "Kubernetes:1.7.5",
    "poolName": "k8sworkers",
    "resourceNameSuffix": "20163042"
  },
  "type": "Microsoft.Compute/virtualMachines",
  "vmId": "30c29441-6c41-4de0-b88b-1a50293fd257",
  "zones": null
}

@theobolo
Author

theobolo commented Jan 5, 2018

It seems that worker-2 has a problem, by the way:

{
  "additionalProperties": {},
  "availabilitySet": {
    "additionalProperties": {},
    "id": "/subscriptions/3770e200-bb58-4d3b-afcb-66c7ce083c3f/resourceGroups/k8s-fleeters-cluster-preproduction/providers/Microsoft.Compute/availabilitySets/K8SWORKERS-AVAILABILITYSET-20163042",
    "resourceGroup": "k8s-fleeters-cluster-preproduction"
  },
  "diagnosticsProfile": null,
  "fqdns": "",
  "hardwareProfile": {
    "additionalProperties": {},
    "vmSize": "Standard_DS13_v2_Promo"
  },
  "id": "/subscriptions/3770e200-bb58-4d3b-afcb-66c7ce083c3f/resourceGroups/k8s-fleeters-cluster-preproduction/providers/Microsoft.Compute/virtualMachines/k8s-k8sworkers-20163042-2",
  "identity": null,
  "licenseType": null,
  "location": "northeurope",
  "macAddresses": "00-0D-3A-B7-60-BC",
  "name": "k8s-k8sworkers-20163042-2",
  "networkProfile": {
    "additionalProperties": {},
    "networkInterfaces": [
      {
        "additionalProperties": {},
        "id": "/subscriptions/3770e200-bb58-4d3b-afcb-66c7ce083c3f/resourceGroups/k8s-fleeters-cluster-preproduction/providers/Microsoft.Network/networkInterfaces/k8s-k8sworkers-20163042-nic-2",
        "primary": null,
        "resourceGroup": "k8s-fleeters-cluster-preproduction"
      }
    ]
  },
  "osProfile": {
    "additionalProperties": {},
    "adminPassword": null,
    "adminUsername": "fleeters",
    "computerName": "k8s-k8sworkers-20163042-2",
    "customData": null,
    "linuxConfiguration": {
      "additionalProperties": {},
      "disablePasswordAuthentication": true,
      "ssh": {
        "additionalProperties": {},
        "publicKeys": [
          {
            "additionalProperties": {},
            "keyData": "NOT THIS TIME",
            "path": "/home/fleeters/.ssh/authorized_keys"
          }
        ]
      }
    },
    "secrets": [],
    "windowsConfiguration": null
  },
  "plan": null,
  "powerState": "VM running",
  "privateIps": "10.240.0.4",
  "provisioningState": "Failed",
  "publicIps": "",
  "resourceGroup": "k8s-fleeters-cluster-preproduction",
  "resources": [
    {
      "additionalProperties": {},
      "autoUpgradeMinorVersion": true,
      "forceUpdateTag": null,
      "id": "/subscriptions/3770e200-bb58-4d3b-afcb-66c7ce083c3f/resourceGroups/k8s-fleeters-cluster-preproduction/providers/Microsoft.Compute/virtualMachines/k8s-k8sworkers-20163042-2/extensions/cse2",
      "instanceView": null,
      "location": "northeurope",
      "name": "cse2",
      "protectedSettings": null,
      "provisioningState": "Updating",
      "publisher": "Microsoft.Azure.Extensions",
      "resourceGroup": "k8s-fleeters-cluster-preproduction",
      "settings": {},
      "tags": null,
      "type": "Microsoft.Compute/virtualMachines/extensions",
      "typeHandlerVersion": "2.0",
      "virtualMachineExtensionType": "CustomScript"
    }
  ],
  "storageProfile": {
    "additionalProperties": {},
    "dataDisks": [
      {
        "additionalProperties": {},
        "caching": "ReadWrite",
        "createOption": "Attach",
        "diskSizeGb": 20,
        "image": null,
        "lun": 2,
        "managedDisk": {
          "additionalProperties": {},
          "id": "/subscriptions/3770e200-bb58-4d3b-afcb-66c7ce083c3f/resourceGroups/k8s-fleeters-cluster-preproduction/providers/Microsoft.Compute/disks/fleeters-59cd2dcd-dynamic-pvc-287cfe7d-b999-11e7-a344-000d3ab7665a",
          "resourceGroup": "k8s-fleeters-cluster-preproduction",
          "storageAccountType": "Premium_LRS"
        },
        "name": "fleeters-59cd2dcd-dynamic-pvc-287cfe7d-b999-11e7-a344-000d3ab7665a",
        "vhd": null
      },
      {
        "additionalProperties": {},
        "caching": "ReadWrite",
        "createOption": "Attach",
        "diskSizeGb": 30,
        "image": null,
        "lun": 3,
        "managedDisk": {
          "additionalProperties": {},
          "id": "/subscriptions/3770e200-bb58-4d3b-afcb-66c7ce083c3f/resourceGroups/k8s-fleeters-cluster-preproduction/providers/Microsoft.Compute/disks/fleeters-59cd2dcd-dynamic-pvc-d68e7210-bee9-11e7-a344-000d3ab7665a",
          "resourceGroup": "k8s-fleeters-cluster-preproduction",
          "storageAccountType": "Premium_LRS"
        },
        "name": "fleeters-59cd2dcd-dynamic-pvc-d68e7210-bee9-11e7-a344-000d3ab7665a",
        "vhd": null
      },
      {
        "additionalProperties": {},
        "caching": "ReadWrite",
        "createOption": "Attach",
        "diskSizeGb": 64,
        "image": null,
        "lun": 0,
        "managedDisk": {
          "additionalProperties": {},
          "id": "/subscriptions/3770e200-bb58-4d3b-afcb-66c7ce083c3f/resourceGroups/k8s-fleeters-cluster-preproduction/providers/Microsoft.Compute/disks/fleeters-59cd2dcd-dynamic-pvc-c48ca983-b0d1-11e7-b9f4-000d3ab76964",
          "resourceGroup": "k8s-fleeters-cluster-preproduction",
          "storageAccountType": "Premium_LRS"
        },
        "name": "fleeters-59cd2dcd-dynamic-pvc-c48ca983-b0d1-11e7-b9f4-000d3ab76964",
        "vhd": null
      },
      {
        "additionalProperties": {},
        "caching": "ReadWrite",
        "createOption": "Attach",
        "diskSizeGb": 32,
        "image": null,
        "lun": 4,
        "managedDisk": {
          "additionalProperties": {},
          "id": "/subscriptions/3770e200-bb58-4d3b-afcb-66c7ce083c3f/resourceGroups/k8s-fleeters-cluster-preproduction/providers/Microsoft.Compute/disks/fleeters-59cd2dcd-dynamic-pvc-4c0b9e2b-a48d-11e7-a344-000d3ab7665a",
          "resourceGroup": "k8s-fleeters-cluster-preproduction",
          "storageAccountType": "Premium_LRS"
        },
        "name": "fleeters-59cd2dcd-dynamic-pvc-4c0b9e2b-a48d-11e7-a344-000d3ab7665a",
        "vhd": null
      },
      {
        "additionalProperties": {},
        "caching": "ReadWrite",
        "createOption": "Attach",
        "diskSizeGb": 128,
        "image": null,
        "lun": 1,
        "managedDisk": {
          "additionalProperties": {},
          "id": "/subscriptions/3770e200-bb58-4d3b-afcb-66c7ce083c3f/resourceGroups/k8s-fleeters-cluster-preproduction/providers/Microsoft.Compute/disks/fleeters-59cd2dcd-dynamic-pvc-848a1391-a472-11e7-b780-000d3ab769d6",
          "resourceGroup": "k8s-fleeters-cluster-preproduction",
          "storageAccountType": "Premium_LRS"
        },
        "name": "fleeters-59cd2dcd-dynamic-pvc-848a1391-a472-11e7-b780-000d3ab769d6",
        "vhd": null
      },
      {
        "additionalProperties": {},
        "caching": "ReadWrite",
        "createOption": "Attach",
        "diskSizeGb": 64,
        "image": null,
        "lun": 5,
        "managedDisk": {
          "additionalProperties": {},
          "id": "/subscriptions/3770e200-bb58-4d3b-afcb-66c7ce083c3f/resourceGroups/k8s-fleeters-cluster-preproduction/providers/Microsoft.Compute/disks/fleeters-59cd2dcd-dynamic-pvc-984109f2-c9da-11e7-a941-000d3ab76964",
          "resourceGroup": "k8s-fleeters-cluster-preproduction",
          "storageAccountType": "Premium_LRS"
        },
        "name": "fleeters-59cd2dcd-dynamic-pvc-984109f2-c9da-11e7-a941-000d3ab76964",
        "vhd": null
      }
    ],
    "imageReference": {
      "additionalProperties": {},
      "id": null,
      "offer": "UbuntuServer",
      "publisher": "Canonical",
      "sku": "16.04-LTS",
      "version": "16.04.201706191"
    },
    "osDisk": {
      "additionalProperties": {},
      "caching": "ReadWrite",
      "createOption": "FromImage",
      "diskSizeGb": 128,
      "encryptionSettings": null,
      "image": null,
      "managedDisk": {
        "additionalProperties": {},
        "id": "/subscriptions/3770e200-bb58-4d3b-afcb-66c7ce083c3f/resourceGroups/k8s-fleeters-cluster-preproduction/providers/Microsoft.Compute/disks/k8s-k8sworkers-20163042-2_OsDisk_1_3bb94d8af0d449d38f1f19123975094a",
        "resourceGroup": "k8s-fleeters-cluster-preproduction",
        "storageAccountType": "Premium_LRS"
      },
      "name": "k8s-k8sworkers-20163042-2_OsDisk_1_3bb94d8af0d449d38f1f19123975094a",
      "osType": "Linux",
      "vhd": null
    }
  },
  "tags": {
    "creationSource": "acsengine-k8s-k8sworkers-20163042-2",
    "orchestrator": "Kubernetes:1.7.5",
    "poolName": "k8sworkers",
    "resourceNameSuffix": "20163042"
  },
  "type": "Microsoft.Compute/virtualMachines",
  "vmId": "9f8e9bb3-2152-4aca-8141-e41a31b39d35",
  "zones": null
}

@theobolo
Author

theobolo commented Jan 5, 2018

There is certainly something wrong with the Azure API: the Azure dashboard says that my worker-2 and worker-4 have failed, but Kubernetes sees the nodes as "Ready" and I have pods running on them...

Azure status:
[screenshot]

My pods on worker-2:
[screenshot]

@andyzhangx
Contributor

Thanks. So where is the data disk fleeters-59cd2dcd-dynamic-pvc-f4be8993-a472-11e7-b780-000d3ab769d6? The error says that disk is being attached to worker-3. Could you find it in your resource group?
az disk list | grep f4be8993

@theobolo
Author

theobolo commented Jan 5, 2018

@andyzhangx Yep, here it is:

[screenshot]

@andyzhangx
Contributor

@theobolo Then I think it's caused by the VM's Failed state; the error (attaching data disk fleeters-59cd2dcd-dynamic-pvc-f4be8993-a472-11e7-b780-000d3ab769d6 to worker-3 failed) could be a stale error from long ago. I think you should first resolve the VM Failed state. I only have the PowerShell solution here:
https://blogs.technet.microsoft.com/mckittrick/azure-vm-stuck-in-failed-state-arm/

@theobolo
Author

theobolo commented Jan 5, 2018

Yep, but worker-3 is in a Running state with no errors and a successful agent running on it.
Nor are my disks attached to a Failed machine, since they are brand new: I created my StatefulSets a few hours ago, and the disks were never mounted anywhere else before.

I'm going to try to clear the "Failed" status on the other workers using your method, but I think there is something more going on.

@andyzhangx
Contributor

You may update the worker-3 VM as well if possible. The error said that attaching disk 1 failed because attaching disk 2 to the worker-3 VM failed, which is quite weird; there is a possibility that the worker-3 VM is also in a wrong state.

@andyzhangx
Contributor

BTW, could you check the status of the data disk fleeters-59cd2dcd-dynamic-pvc-f4be8993-a472-11e7-b780-000d3ab769d6? Is it attached to a VM? That's important. Thanks.
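One way to check that from the CLI (a sketch using the resource group and disk name from this thread; managedBy holds the ID of the VM the disk is attached to, or is empty if the disk is unattached):

az disk show -g k8s-fleeters-cluster-preproduction \
  -n fleeters-59cd2dcd-dynamic-pvc-f4be8993-a472-11e7-b780-000d3ab769d6 \
  --query "{name:name, managedBy:managedBy, provisioningState:provisioningState}" -o table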

@theobolo
Author

theobolo commented Jan 5, 2018

@andyzhangx Yep, since yesterday I'm seeing a lot of really weird things in the Azure API. I saw a statement yesterday saying that it's no longer reliable to look at the Azure messages in the dashboard, because they are delayed or laggy.

By the way, I included the disk status in my first message: it is absolutely not attached to any node, and it never has been.

[screenshot]

@theobolo
Author

theobolo commented Jan 5, 2018

Now all my nodes seem to be healthy according to the az vm show -g k8s-fleeters-cluster-preproduction -n k8s-k8sworkers-20163042-0 -d command.

No more "Failed" state on Azure. I'll delete all the PVCs and the StatefulSets and just redeploy Kafka.

@theobolo
Author

theobolo commented Jan 5, 2018

The thing is that when I manually delete my PVCs in Kubernetes, the Managed Disks are actually deleted!
And the other way around, when I create my StatefulSets, the disks are correctly created and show up in the dashboard.

So it seems to be about mounting and unmounting.

@theobolo
Author

theobolo commented Jan 5, 2018

OK, so I redid my deployment:

  • Nodes are healthy
  • Deployed the Kafka StatefulSet
  • The PVC is created
  • The PV is created and "Bound"; I can see it in the Azure dashboard with "Unattached" status
  • It was scheduled on worker-3, and again it can't be mounted

You can see the error on the Azure dashboard right now:

[screenshot]

Azure considers my new disk as already attached or mounted????

@brendandburns
Member

Please file an Azure support incident, as this appears to be related to the Azure-wide reboot caused by Meltdown and Spectre...

@andyzhangx
Contributor

Update for this thread: according to Azure/ACS#12 (comment), one customer has successfully fixed this issue by using the PowerShell script to update the agent VM that has the disk mounting issue.
The root cause is that after the big upgrade, the VM is sometimes left in an inconsistent state.

@theobolo
Author

theobolo commented Jan 8, 2018

@brendandburns I already opened a support case about this on Friday, at Critical priority; I'm still waiting for an answer :/

@andyzhangx I'm trying some workarounds; I'll redeploy each virtual machine.

@theobolo
Author

theobolo commented Jan 8, 2018

@andyzhangx I did the same as @rocketraman:

The final solution that worked for me was to simply do the following for every agent node in my cluster, one at a time:

kubectl cordon <node>
delete any pods with stateful sets on the node
kubectl drain <node>
restart the Azure VM via the API or portal
kubectl uncordon <node>

With that I was finally able to recover my nodes. Now my disks are correctly mounted on the workers.
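A rough sketch of that per-node procedure using the names from this cluster (the drain flags and the pod cleanup step are assumptions; adjust them to your own StatefulSets):

NODE=k8s-k8sworkers-20163042-3
RG=k8s-fleeters-cluster-preproduction

kubectl cordon $NODE
# list what is running on the node, then delete the StatefulSet pods so they reschedule
kubectl get pods --all-namespaces -o wide | grep $NODE
kubectl drain $NODE --ignore-daemonsets --delete-local-data
az vm restart -g $RG -n $NODE    # or restart the VM from the portal
kubectl uncordon $NODE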

@pauloeliasjr

pauloeliasjr commented Jan 12, 2018

@theobolo We did the same and had positive results for a short period of time, but the issue returned. We also tried scaling the cluster (from 3 to 6 nodes in our case) and getting rid of the first 3 nodes, but it had no effect (the problem reappeared in the virtual machines' logs after a few hours). Azure support's recommendation was the cordon/uncordon method, but it seems ineffective in our deployment.

@jalberto

jalberto commented Jan 19, 2018

I have the same error message (#2022), but in my case I can see the disk move correctly to the new node; still, the pods are unable to mount the disk.

This is very critical, as each time something crashes the pods are not able to recover.

@andyzhangx
Contributor

@jalberto I am quite sure your issue is different from this one; I have replied in #2022.

@jalberto

Maybe you're right, @andyzhangx, but this started with the security patch release.

@andyzhangx
Contributor

@pauloeliasjr Sorry, I missed your comments. What's your error? Could you paste it again? Some of the errors are not related to this issue.

@rocketraman
Contributor

@andyzhangx I just ran into this problem again as well, despite solving it last week after the security updates with the cordon/drain/reboot process on each node :-(

I have a persistent volume which is showing "unbound" in the portal, and yet Kubernetes keeps reporting a 409 error for it:

2018-01-22 20:32:09 -0500 EST   2018-01-22 20:20:57 -0500 EST   5         eslogging-2.150c4bbec69ffd5e   Pod                 Warning   FailedMount   attachdetach   AttachVolume.Attach failed for volume "pvc-85783916-9e82-11e7-a717-000d3af4357e" : Attach volume "devkube1-dynamic-pvc-85783916-9e82-11e7-a717-000d3af4357e" to instance "k8s-agentpool1-18117938-4" failed with compute.VirtualMachinesClient#CreateOrUpdate: Failure responding to request: StatusCode=409 -- Original Error: autorest/azure: Service returned an error. Status=409 Code="AttachDiskWhileBeingDetached" Message="Cannot attach data disk 'devkube1-dynamic-pvc-85783916-9e82-11e7-a717-000d3af4357e' to VM 'k8s-agentpool1-18117938-4' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again."

As an experiment I decided to delete the disk entirely. Kubernetes even kept reporting a 409 error after I deleted it!

Something is clearly wrong here and needs to be fixed asap.

@andyzhangx
Contributor

@rocketraman What's the status of your VM 'k8s-agentpool1-18117938-4'?
It could be a stale error from long ago. I think you should first resolve the VM Failed state; use the following command to update the VM status:

az vm update -g <group> -n <name>

or the PowerShell solution if the VM status is Failed:
https://blogs.technet.microsoft.com/mckittrick/azure-vm-stuck-in-failed-state-arm/

@rocketraman
Contributor

@andyzhangx No, I fixed all of the VM states last week:

[screenshot]

@andyzhangx
Contributor

@rocketraman Could you run the following command to update the VM k8s-agentpool1-18117938-4 anyway? Sometimes the VM status in the Azure portal is misleading.

az vm update -g <group> -n <name>

@rocketraman
Contributor

@andyzhangx Ok wow, that worked and solved the issue. Thank you! Is there a way I can identify other VMs that are in this "reporting Running but not running" state?

@andyzhangx
Contributor

@rocketraman There is a command, az vm get-instance-view -g <group> -n <name>, though I'm not sure it would surface the error info since I don't have a VM with this issue at hand. You could also run az vm update on all your VMs; it's harmless.
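A quick sketch for running that update across every VM in a resource group (the resource group name is a placeholder; az vm update without --set simply re-applies the existing VM model, which is why it is described as harmless above):

RG=<resource-group>
for VM in $(az vm list -g $RG --query "[].name" -o tsv); do
  echo "Updating $VM ..."
  az vm update -g $RG -n "$VM"
done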

@jalberto

jalberto commented Feb 13, 2018

❯ az vm update -g k8svl -n k8s-pool01-xxxxx-1
Cannot attach data disk 'k8svl-dynamic-pvc-xxxxx' to VM 'k8s-pool01-11577755-1' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again.

IMHO this is a very critical problem; we cannot be manually restarting master services or nodes every time a pod needs to move.

Is there anyone at Azure working on this?

@jalberto

@andyzhangx According to the Azure portal, the disk is mounted on vm-1, the same VM that is failing with the az vm update command.

I rebooted the node; still the same problem.

@jalberto

@andyzhangx I detached the disk manually (in the portal), restarted every kubelet, ran the VM status update on every node (no errors now), and deleted the pod; Kubernetes is still complaining with the same error.

@andyzhangx can you confirm whether this is marked as a critical problem at Azure, or do I need to phone in and open a new ticket?

This is critical and is blocking my production environment.

@andyzhangx
Contributor

andyzhangx commented Feb 25, 2018

Update for this thread:
I recently fixed a race condition that could cause this disk attach error. The fix has been merged into v1.10, and I am trying to cherry-pick it to other k8s versions. You can find details about the issue here:
https://github.com/andyzhangx/Demo/blob/master/issues/README.md#1-disk-attach-error

Here is the PR: fix race condition issue when detaching azure disk

@theobolo
Author

theobolo commented Mar 7, 2018

For the moment I can say that it has been stable since January on a cluster running Kubernetes 1.8.5. I haven't had any other mount/unmount problems...

@theobolo
Author

I'm closing this issue; with Kubernetes 1.10 and the latest updates there are no more disk issues on Azure.

Many thanks to @andyzhangx, you did a really good job!

@jalberto

jalberto commented Jul 2, 2018

@theobolo That's not accurate: Azure/AKS#477 is still happening in 10.3.

@theobolo theobolo reopened this Jul 2, 2018
@theobolo
Author

theobolo commented Jul 2, 2018

@jalberto Alright, given your case I'm reopening this issue, but I have to say: on a 1.10.2 Kubernetes cluster deployed with ACS Engine 0.16.2, I have had zero problems with mounting and unmounting disks for the last month.

Since you're on AKS (GA), maybe the deployment is slightly different; I haven't looked into it.

I should point out that I run more than 150 pods on that cluster, with more than 50% of them mounting managed disks on the Premium tier.

@andyzhangx I'll try @jalberto's scenario on my cluster; hopefully I can at least reproduce the issue.

Cheers

@jalberto

jalberto commented Jul 2, 2018

@theobolo I agree! This has actually been solved twice already, as far as I can remember (with hard work by @andyzhangx), but somehow it made a comeback, and that worries me, as it raises a lot of concerns about safe upgrades.

@theobolo
Author

theobolo commented Jul 2, 2018

@jalberto @andyzhangx Alright guys, I managed to reproduce the issue on my cluster (reminder: 1.10.2, plain ACS, not AKS).

I did the following (sketched below):

  • helm install stable/grafana (persistence enabled)
  • wait 5 seconds and delete the Grafana pod
  • the new pod shows the multi-attach error...
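A sketch of those steps (Helm 2 syntax of the era; the release name and the chart's persistence.enabled flag are assumptions):

helm install stable/grafana --name grafana-repro --set persistence.enabled=true
sleep 5
kubectl get pods | grep grafana-repro     # find the Grafana pod name
kubectl delete pod <grafana-pod-name>
# the replacement pod initially reports the Multi-Attach / FailedAttachVolume warning,
# then mounts the PV after a minute or two
kubectl describe pod <new-grafana-pod-name>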

But as @andyzhangx said, I waited 1-2 minutes with the pod in this "error" state and then the pod mounted the PV as expected... and Grafana came up healthy:

[screenshot]

I also ran another test using the "managed-premium" storage class for the disks instead of the "default" one.
I did manage to reproduce the bug with the same methodology, but like the first time, the pod recovered by itself after 1 or 2 minutes... So I don't know if it's an AKS-specific issue or something else, because on my cluster it's working fine.

Cheers

@stale

stale bot commented Mar 9, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contribution. Note that acs-engine is deprecated--see https://github.com/Azure/aks-engine instead.

@stale stale bot added the stale label Mar 9, 2019
@stale stale bot closed this as completed Mar 16, 2019