-
Notifications
You must be signed in to change notification settings - Fork 807
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
idempotent unmount from NodeUnstageVolume / NodeUnpublishVolume #1605
idempotent unmount from NodeUnstageVolume / NodeUnpublishVolume #1605
Conversation
c0665b3
to
0a06d01
Compare
Ah, #1597 just merged yesterday, which helps if the error returned contains 'not mounted', but does not help if it fails due to |
I had some discussion with @torredil and ran some extra tests with readOnlyRootFilesystem enabled to make sure that
@torredil do you think I'm missing anything? If you have any concerns about removing the explicit 'not mounted' error check, then I could reduce this patch to a 1-liner to just replace the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks this lgtm as written, will leave it open for more feedback.
/test pull-aws-ebs-csi-driver-test-e2e-external-eks-windows |
I think we should consider doing this. I'm not sure I'm 100% convinced that Otherwise, this lgtm as written. For what it's worth, I think that #1597 should fix 95+% of real world cases of this type of failure (and was released in Monday's 1.19.0 release), so I would recommend upgrading to 1.19.0 in the meantime to fix the problem you outlined in the issue description. |
0a06d01
to
1b1e877
Compare
Thanks a lot for the feedback, I made this change and re-tested the original scenario in my PR description, works as expected. |
/test pull-aws-ebs-csi-driver-external-test |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ConnorJC3 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Is this a bug fix or adding new feature?
bug fix
What is this PR about? / Why do we need it?
We observed an issue where a workload mounting multiple EBS volumes gets stuck Terminating because unmount is failing on the node. It's racy and difficult to reproduce, but from the kubelet logs, the unmount succeeded once and about 0.1s later Unmounter.TearDownAt starts failing with
exit status 32
. Then the pod is left in the Terminating state until kubelet is restarted.There was some work to make mounts idempotent in #1019 but it did not address the unmount path. Other CSI drivers use
mountutils.CleanupMountPoint
which checks if the path exists, if the mount is corrupted, and if the mount exists before attempting to call Unmount. This PR simply changes the Unmount calls to use CleanupMountPoint instead.What testing is done?
Since this is not consistently reproducible, I resorted to manually umount and rmdir on the node before deleting the pod, and this left the pod in the Terminating state until kubelet was restarted.
Before this change:
After this change:
/cc @wongma7 @gnufied @jsafrane
/kind bug
/sig storage
/triage accepted
/priority important-soon