fix: detach azure disk issue using dangling error #81266
@andyzhangx just to double check: the volume will only be detached if another pod comes along and tries to use it. If no pod is attempting to use it, the volume will remain attached to the old node. Is that what happened? If the volume is still not detaching even when another pod tries to use it, that would be kind of weird. Can we get the controller-manager logs from when this happened?
I used the following steps to "repro" this issue:
So the disk would be detached from node#A, right? @gnufied
@andyzhangx I think so, yeah. Can you post controller-manager logs with v4 verbosity?
@gnufied here is the useful logging; the detach volume operation is skipped:
It's due to kubernetes/pkg/controller/volume/attachdetach/reconciler/reconciler.go Lines 199 to 201 in 61af419
At this time, the volume is in use by node#B. And if I hack the above code by commenting out the
You can find all the controller-manager logs here: https://gist.githubusercontent.com/andyzhangx/6d6460ad8802f78016b0b4cbce292399/raw/3ea7142feadaf8f7bb9da7c7e75016ce7559771d/controller-manager.log
okay.. so:
This condition will automatically go away after
I think you have discovered a bug. When I made the #78595 change (adding dangling volumes as uncertain), we stopped reporting the volume in the node's status, and hence it is failing. But this looks like a general problem with detaching volumes that are in an uncertain state; I will open a PR to fix this. This needs better coverage to prevent regressions too.
002c072 to 10a7eae
@andyzhangx actually, part of my comment was not true. The following error log:
does not stop the volume from being detached (https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/reconciler/reconciler.go#L208); the volume should still be detached.
224133b to 1fe12be
@gnufied thanks a lot for your help! I finally figured out that this is another issue in detaching the disk: the detach disk operation does happen, but it could not find the disk in the list. This PR fixes both issues, so it is ready for code review. cc @MSSedusch @dkistner @vlerenc
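For illustration, the "could not find the disk in the list" symptom can be sketched as below. This is a minimal sketch, not the code from this PR: it assumes the mismatch comes from disk URIs that differ only in letter case (the PR description names strings.EqualFold as the fix), and the dataDisk type, the findDiskIndex helper, and the sample URIs are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// dataDisk is a hypothetical, simplified stand-in for one entry in a
// VM's list of attached data disks.
type dataDisk struct {
	Name string
	URI  string
}

// findDiskIndex returns the index of the disk whose URI matches diskURI,
// ignoring letter case, or -1 if no disk matches.
func findDiskIndex(disks []dataDisk, diskURI string) int {
	for i, d := range disks {
		if strings.EqualFold(d.URI, diskURI) {
			return i
		}
	}
	return -1
}

func main() {
	// Hypothetical URIs: the same disk, with different casing in the
	// resource group segment.
	disks := []dataDisk{
		{Name: "disk-a", URI: "/subscriptions/sub/resourceGroups/RG/providers/Microsoft.Compute/disks/disk-a"},
	}
	requested := "/subscriptions/sub/resourcegroups/rg/providers/Microsoft.Compute/disks/disk-a"

	// An exact string comparison treats these as different disks.
	fmt.Println("exact match:", disks[0].URI == requested)
	// The case-insensitive lookup finds the disk.
	fmt.Println("EqualFold index:", findDiskIndex(disks, requested))
}
```

With an exact comparison, a detach call would walk the list, match nothing, and leave the disk attached, which is consistent with the symptom described in this comment.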
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: andyzhangx, feiskyer.
staging/src/k8s.io/legacy-cloud-providers/azure/azure_controller_common.go
/hold
/hold cancel
@andyzhangx Can you cherry-pick that fix to 1.15?
It's been cherry-picked to 1.13, 1.14, and 1.15.
…1266-upstream-release-1.13 Automated cherry pick of #81266: fix: detach azure disk issue using dangling error
…1266-upstream-release-1.15 Automated cherry pick of #81266: fix: detach azure disk issue using dangling error
…1266-upstream-release-1.14 Automated cherry pick of #81266: fix: detach azure disk issue using dangling error
Below are the fixed versions:
What type of PR is this?
/kind bug
What this PR does / why we need it:
fix: detach azure disk issue using dangling error
I think this PR could fix most of the disk attach issues that are caused by a disk detach not succeeding (which could happen for many reasons). With this PR, recovery happens automatically. I will cherry-pick the fix to all old releases.
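To illustrate the automatic recovery idea: when an attach finds the disk still attached to another node, the attach path can return an error that names that node, so the caller knows where a detach is needed. The following is a minimal sketch under stated assumptions, not the implementation in this PR. The names danglingAttachError, attachDisk, and reconcileAttach are hypothetical, an in-memory map stands in for the cloud state, and the detach-and-retry happens inline here, whereas the real attach-detach controller does this asynchronously through its reconciler.

```go
package main

import (
	"errors"
	"fmt"
)

// danglingAttachError is a hypothetical error type reporting that a disk
// is still attached to a node other than the one requested.
type danglingAttachError struct {
	DiskURI     string
	CurrentNode string
}

func (e *danglingAttachError) Error() string {
	return fmt.Sprintf("disk %s is already attached to node %s", e.DiskURI, e.CurrentNode)
}

// attachDisk simulates an attach call. attachedTo maps a disk URI to the
// node it is currently attached to (a stand-in for the cloud state).
func attachDisk(attachedTo map[string]string, diskURI, node string) error {
	if owner, ok := attachedTo[diskURI]; ok && owner != node {
		return &danglingAttachError{DiskURI: diskURI, CurrentNode: owner}
	}
	attachedTo[diskURI] = node
	return nil
}

// reconcileAttach tries to attach the disk. On a dangling error it detaches
// the disk from the stale node and retries once.
func reconcileAttach(attachedTo map[string]string, diskURI, node string) error {
	err := attachDisk(attachedTo, diskURI, node)
	var dangling *danglingAttachError
	if errors.As(err, &dangling) {
		// Simulated detach from the node named in the error.
		delete(attachedTo, dangling.DiskURI)
		return attachDisk(attachedTo, diskURI, node)
	}
	return err
}

func main() {
	attachedTo := map[string]string{"disk-a": "node-a"}

	// A plain attach to node-b fails and names the node holding the disk.
	fmt.Println(attachDisk(attachedTo, "disk-a", "node-b"))

	// The reconciling attach recovers by detaching from node-a first.
	fmt.Println(reconcileAttach(attachedTo, "disk-a", "node-b"), attachedTo["disk-a"])
}
```

The design point is that the error carries enough information (which disk, which node) for the caller to recover without operator intervention.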
Actually, this PR fixes two issues:
1. A "disk not found" issue when detaching an Azure disk. This is fixed by comparing disk URIs case-insensitively (strings.EqualFold). Without this PR, the error logs look like the following:
2. The second issue is fixed by returning a dangling error. The success event logs look like the following:
Which issue(s) this PR fixes:
Fixes #81079
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
/kind bug
/assign @gnufied
/priority important-soon
/sig cloud-provider
/area provider/azure