Multi Attach Error #477
Comments
I'm getting the same error using 1.10.3 in canada-central. |
@jalberto what's your k8s version? does it work after retry finally? |
and also provide your steps about how to repro this issue on AKS, thanks. |
Same issue for me. It can be easily reproduced by using an arbitrary Helm chart to deploy a PVC and the pod using it:
After that you can do 3 and 4 over and over again. The only thing that seems to fix the problem immediately is to also delete the PVC. |
@andyzhangx version 1.10.3. I just hit this again with the grafana helm chart, using the default StorageClass. Why does this issue reappear so often? |
@jalberto how do you repro this issue? I tried AKS 1.10.3 with grafana helm chart, there would be
|
@andyzhangx how long is "a while" for you? I waited 45 minutes and it had not recovered. If it helps, I just triggered the situation by upgrading the helm release. Now let's imagine it takes 30 minutes on average, and that it happens in a real situation: I cannot wait 30, 15, or even 5 minutes for a disk to be mounted; that means production downtime. |
@jalberto you could see in the above logs, "a while" means less than 1 min. How could I exactly repro your issue?
|
- AKS 1.10.3 with 3 nodes (DS8)
- helm install grafana (with persistent disk enabled)
- do some config changes (enough to trigger pod to be recreated)
- no profit :(
…On Mon, 2 Jul 2018 at 11:05, Andy Zhang ***@***.***> wrote:
@jalberto <https://github.com/jalberto> you could see in the above logs,
"a while" means less than 1 min. How could I exactly repro your issue?
- Set up a AKS cluster with k8s v1.10.3 with 2 nodes
- helm install stable/grafana
- drain a node?
|
@jalberto in what region was your AKS cluster created? What are the config changes: deleting the pod and making it reschedule on another node? Could you access the azure portal to check whether that disk is still attached to the original node after a few minutes?
And could you also paste |
hi guys, I could not repro on my k8s cluster with helm grafana or wordpress,
anything I am missing? |
Sorry I am out right now, I will provide as much info as I can:
I still think 1-2 minutes of downtime is too much for this kind of process (mount/umount is mostly instant). The longest time I saw was 45 minutes, but I usually purge and recreate if it's not fixed after 10 minutes. |
Using AKS 1.9.6, disk attaches are taking a long time. Normally with ACS I could work around this by scaling up by one node, shutting down the nodes in sequence, manually clearing any disks still attached, starting the nodes up again, and finally scaling the cluster down by a node. When I look at the attached disks of nodes in the portal, they are "updating" for a long time. This basically means any pods have to be kept around; otherwise it takes hours for them to reattach their PVs and start the container. |
We are also experiencing this issue when updating helm charts and the pod gets recreated on a different node. Downtime is always between 3 and 10 minutes which is not acceptable in our opinion. |
@vogelinho10 what's your k8s version and node size? From v1.9.2 and v1.10.0 on, detaching or attaching one azure disk can take around 30s, so detaching one disk from one node and attaching it to another takes around 1 min. If you have 4 disks, for example, moving all 4 from one node to another would take about 4 min in total. |
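The estimate above can be sketched as a small calculation. This only illustrates the arithmetic in the comment, assuming roughly 30 s per detach or attach operation and strictly sequential processing; the function name and its parameters are hypothetical.

```python
def estimate_move_seconds(num_disks: int, per_operation_seconds: float = 30.0) -> float:
    """Estimate the time to move disks from one node to another.

    Assumption: each disk needs one detach and one attach, each taking about
    per_operation_seconds, and disks are processed one after another.
    """
    if num_disks < 0:
        raise ValueError("num_disks must be non-negative")
    operations_per_disk = 2  # one detach + one attach
    return num_disks * operations_per_disk * per_operation_seconds


if __name__ == "__main__":
    for disks in (1, 4):
        minutes = estimate_move_seconds(disks) / 60
        print(f"{disks} disk(s): about {minutes:.0f} min")
```

With these assumptions the total grows linearly with the number of disks, which is why a pod with several volumes sees a longer downtime than a pod with one.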
@andyzhangx |
@vogelinho10 could you provide |
@andyzhangx yes:
|
Thanks for the info, looks like disk detach/attach in |
Update: |
@andyzhangx westeurope too (that info is in the description of this issue). Thanks for continuing to dig into this :) On the other hand, about your math of 4 min for 4 disks: if the optimal umount+mount time is 1 min (still too much TBH), I'd guess N disks should take the same amount of time, as the operations should be parallel, not sequential. So 1 disk should take 1 min, and 5 disks should take 1 min too. |
Here is one of mine, the events expired but generally just failing to attach and you can see it takes about 1:40 per pod to attach a disk:
|
@jalberto I could confirm attaching multiple disks is not parallel, it's sequential. You may try westus2 region if possible, attach/detach disk is much faster in that region. |
I cannot, all my data should be in EU
…On Thu, 5 Jul 2018 at 13:02, Andy Zhang ***@***.***> wrote:
@jalberto <https://github.com/jalberto> I could confirm attaching
multiple disks is not parallel, it's sequential. You may try westus2
region if possible, attach/detach disk is much faster in that region.
|
Just did another deploy:
|
looks like the disk attachment time issue has been resolved in West Europe region
|
Yup, back to normal, thanks. |
Do we know why this happened?
Is there any measure in place so we don't need to open this issue again?
…On Fri, 6 Jul 2018 at 16:12, Iain Colledge ***@***.***> wrote:
Yup, back to normal, thanks.
|
pls try again in West Europe, and I will contact support about your concern, thanks. |
I'm having the same issue on east us |
Same here :( |
@Viroos @luisdavim |
I'm having this issue in canada central. It has been an hour or so, and I've tried deleting the pod a few times to try and get it rescheduled on a matching node. |
@derekperkins could you provide more details by:
|
thanks for the response @andyzhangx. After looking at it, the PV is already attached to the correct node that the pod is running on, so I don't know why it's throwing the multi-attach error. StatefulSet Pod
PVC
PV
Azure Portal |
I am having the same error as well on uksouth (on version 1.11.4). I am quite new to kubernetes, but it looks like the deployment isn't correctly scaling down the old replica set that is attached to the volume. As a workaround for now I just scale the old replica set, but it's kind of annoying as it means I have to keep doing this for every deployment. |
@stevetsang one PVC disk can only be used by one pod, and in some regions, like westeurope, disk attachment is slow; there would be a few multi-attach errors and the disk mount should finally recover. |
@andyzhangx I am currently using one pvc for the one pod. What I am doing with scaling is setting it down to 0 to force the disk to unmount then scaling back to 1 again so it would mount. Do you recommend I just leave it until it figures it out to unmount from the old pod? |
@stevetsang could you share the |
@andyzhangx I have use old pod that needs to unmount the volume:
new pod that needs to mount the volume:
|
@stevetsang the error is clear, you can not use one PVC disk
|
@andyzhangx I am using helm upgrade to upgrade the pod to a new release. Shouldn't it tear down the old pod and unmount the volume before bringing up the new pod? |
@stevetsang I don't know the helm upgrade logic, anyway, the old pod is still living well... |
ah finally figured it out! By default the deployment strategy is 'RollingUpdate'; because of this, the old pod doesn't get torn down until the new one is up, and because we are unable to mount the drive onto the new pod, it gets stuck. I was able to fix this by changing the strategy to 'Recreate'. Many thanks to @andyzhangx for your help :) |
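For reference, a minimal sketch of the setting described above: a Deployment whose `spec.strategy.type` is `Recreate`, so the old pod is terminated before the new one is created. It is built as a Python dictionary and printed as JSON (which `kubectl apply -f` also accepts). The names `grafana` and `grafana-data`, the image, and the mount path are placeholders, not values taken from the charts discussed in this thread.

```python
import json


def deployment_with_recreate(name: str, image: str, claim_name: str) -> dict:
    """Build a single-replica Deployment manifest that uses the Recreate strategy."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": 1,
            # Recreate: terminate the old pod first, then start the new one,
            # so the volume can be released before it is mounted again.
            "strategy": {"type": "Recreate"},
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "volumeMounts": [{"name": "data", "mountPath": "/data"}],
                    }],
                    "volumes": [{
                        "name": "data",
                        "persistentVolumeClaim": {"claimName": claim_name},
                    }],
                },
            },
        },
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    manifest = deployment_with_recreate("grafana", "grafana/grafana", "grafana-data")
    print(json.dumps(manifest, indent=2))
```

The trade-off is a short gap with no running pod during each rollout, which is usually acceptable for a single-replica workload bound to one disk.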
@stevetsang |
@andyzhangx I am using a Deployment set to replica 1. Unsure if your fix kubernetes/kubernetes#71377 would fix my issue when my deployment strategy set to |
@stevetsang I could confirm that all v1.11.x won't have that issue(kubernetes/kubernetes#71344), here are the details: https://github.com/andyzhangx/demo/blob/master/issues/azuredisk-issues.md#12-create-azure-disk-pvc-failed-due-to-account-creation-failure |
@andyzhangx okies, well, I'll just leave it on |
@andyzhangx I'm on v1.11.5 and I'm still having the issue in east us. |
FYI, last month I fixed an azure disk attach/detach issue caused by a dirty VM cache, which was introduced in k8s v1.9.2. This fix has been verified on ~150 clusters (~750 nodes in total) with disk attach/detach scenarios by one customer running for more than one month.
Please upgrade to the above k8s versions when they are available on AKS(it's not available now), thanks. |
I will close this issue, just let me know if you have any questions. Thanks. |
@andyzhangx still reproducible in 1.12.8. could you please help?
on my stateful set with Elasticsearch |
If that I just added 19. disk attach/detach self-healing
Issue details:
Fix: The following PR would first check whether the current disk is already attached to another node; if so, it would trigger a dangling error, the k8s controller would detach the disk first, and then perform the attach volume operation. This PR would also fix a "disk not found" issue when detaching an azure disk, caused by the disk URI being treated as case sensitive. Without this PR, the error logs look like the following:
Work around: manually detach disk in problem |
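The self-healing behaviour described above can be illustrated with a toy model. This is not the Kubernetes code (the real change lives in the Go cloud provider); every class and function name below is hypothetical, and the in-memory `ToyCloud` only stands in for the cloud API. Disk URIs are compared case-insensitively to mirror the "disk not found" fix mentioned in the comment.

```python
class DanglingAttachError(Exception):
    """Raised when a disk is still attached to a different node."""

    def __init__(self, disk_uri, current_node):
        super().__init__(f"disk {disk_uri} is still attached to {current_node}")
        self.disk_uri = disk_uri
        self.current_node = current_node


class ToyCloud:
    """In-memory stand-in for the cloud API: maps disk URI -> node name."""

    def __init__(self):
        self._attached = {}

    @staticmethod
    def _key(disk_uri):
        # Normalize case so differently cased URIs refer to the same disk.
        return disk_uri.lower()

    def node_of(self, disk_uri):
        return self._attached.get(self._key(disk_uri))

    def attach(self, disk_uri, node):
        current = self.node_of(disk_uri)
        if current is not None and current != node:
            raise DanglingAttachError(disk_uri, current)
        self._attached[self._key(disk_uri)] = node

    def detach(self, disk_uri, node):
        if self.node_of(disk_uri) == node:
            del self._attached[self._key(disk_uri)]


def attach_with_self_healing(cloud, disk_uri, node):
    """Attach a disk; on a dangling attachment, detach it first and retry once."""
    try:
        cloud.attach(disk_uri, node)
    except DanglingAttachError as err:
        cloud.detach(err.disk_uri, err.current_node)
        cloud.attach(disk_uri, node)


if __name__ == "__main__":
    cloud = ToyCloud()
    cloud.attach("/disks/PVC-1", "node-a")
    attach_with_self_healing(cloud, "/disks/pvc-1", "node-b")
    print(cloud.node_of("/disks/pvc-1"))
```

In this sketch the retry happens inline; in the behaviour described in the comment, the detach is driven by the k8s controller reacting to the dangling error.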
AKS GA 1.9 in west EU
This is common in acs-engine too, but I didn't expect it in AKS GA
rel: