Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

idempotent unmount from NodeUnstageVolume / NodeUnpublishVolume #1605

Merged
merged 1 commit into from
May 18, 2023

Conversation

dobsonj
Copy link
Member

@dobsonj dobsonj commented May 16, 2023

Is this a bug fix or adding new feature?

bug fix

What is this PR about? / Why do we need it?

We observed an issue where a workload mounting multiple EBS volumes gets stuck Terminating because unmount is failing on the node. It's racy and difficult to reproduce, but from the kubelet logs, the unmount succeeded once and about 0.1s later Unmounter.TearDownAt starts failing with exit status 32. Then the pod is left in the Terminating state until kubelet is restarted.

There was some work to make mounts idempotent in #1019 but it did not address the unmount path. Other CSI drivers use mountutils.CleanupMountPoint which checks if the path exists, if the mount is corrupted, and if the mount exists before attempting to call Unmount. This PR simply changes the Unmount calls to use CleanupMountPoint instead.

What testing is done?

Since this is not consistently reproducible, I resorted to manually umount and rmdir on the node before deleting the pod, and this left the pod in the Terminating state until kubelet was restarted.

Before this change:

$ cat test_pod_pvc.yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: claim1
  namespace: testns
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: ubuntu
  namespace: testns
  labels:
    app: ubuntu
spec:
  volumes:
  - name: claim1-vol
    persistentVolumeClaim:
      claimName: claim1
  containers:
  - image: ubuntu
    command:
      - "sleep"
      - "604800"
    imagePullPolicy: IfNotPresent
    name: ubuntu
    volumeMounts:
      - mountPath: "/mnt/claim1"
        name: claim1-vol
    securityContext:
      allowPrivilegeEscalation: true
  restartPolicy: Always

$ oc apply -f test_pod_pvc.yaml
persistentvolumeclaim/claim1 created
pod/ubuntu created

$ oc get pods -o wide
NAME     READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
ubuntu   1/1     Running   0          115s   10.131.0.25   ip-10-0-216-169.us-east-2.compute.internal   <none>           <none>

$ oc debug node/ip-10-0-216-169.us-east-2.compute.internal
Starting pod/ip-10-0-216-169us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.216.169
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# mount | grep csi
/dev/nvme1n1 on /var/lib/kubelet/plugins/kubernetes.io/csi/ebs.csi.aws.com/64739da0c8f20ecdb049190c95e2587609871a2c69e89d5e2a0627bcb9b47cb6/globalmount type ext4 (rw,relatime,seclabel)
/dev/nvme1n1 on /var/lib/kubelet/pods/5ccf5697-f495-494f-9b5b-44eb3631266a/volumes/kubernetes.io~csi/pvc-69ce02b2-62a2-40b6-82b4-fc4ea9bb81a7/mount type ext4 (rw,relatime,seclabel)
sh-5.1# umount /var/lib/kubelet/pods/5ccf5697-f495-494f-9b5b-44eb3631266a/volumes/kubernetesio~csi/pvc-69ce02b2-62a2-40b6-82b4-fc4ea9bb81a7/mount && rmdir /var/lib/kubelet/pods/5ccf569-f495-494f-9b5b-44eb3631266a/volumes/kubernetes.io~csi/pvc-69ce02b2-62a2-40b6-82b4-fc4ea9bb8a7/mount

$ oc delete pod/ubuntu
pod "ubuntu" deleted
^C
$ oc get pods
NAME     READY   STATUS        RESTARTS   AGE
ubuntu   0/1     Terminating   0          46m

sh-5.1# journalctl -u kubelet | less
...
May 16 18:02:23 ip-10-0-216-169 kubenswrapper[1353]: E0516 18:02:23.165860    1353 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/ebs.csi.aws.com^vol-0fcba08aaa4ed58ee podName:5ccf5697-f495-494f-9b5b-44eb3631266a nodeName:}" failed. No retries permitted until 2023-05-16 18:02:24.165847203 +0000 UTC m=+5151.497100820 (durationBeforeRetry 1s). Error: UnmountVolume.TearDown failed for volume "claim1-vol" (UniqueName: "kubernetes.io/csi/ebs.csi.aws.com^vol-0fcba08aaa4ed58ee") pod "5ccf5697-f495-494f-9b5b-44eb3631266a" (UID: "5ccf5697-f495-494f-9b5b-44eb3631266a") : kubernetes.io/csi: Unmounter.TearDownAt failed: rpc error: code = Internal desc = Could not unmount "/var/lib/kubelet/pods/5ccf5697-f495-494f-9b5b-44eb3631266a/volumes/kubernetes.io~csi/pvc-69ce02b2-62a2-40b6-82b4-fc4ea9bb81a7/mount": unmount failed: exit status 32
...
sh-5.1# systemctl restart kubelet

$ oc get pods
No resources found in testns namespace.

After this change:

$ oc apply -f test_pod_pvc.yaml
persistentvolumeclaim/claim1 unchanged
pod/ubuntu created

$ oc get pods -o wide
NAME     READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
ubuntu   1/1     Running   0          99s   10.131.0.31   ip-10-0-216-169.us-east-2.compute.internal   <none>           <none>

$ oc debug node/ip-10-0-216-169.us-east-2.compute.internal
Starting pod/ip-10-0-216-169us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.216.169
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# mount | grep csi
/dev/nvme1n1 on /var/lib/kubelet/plugins/kubernetes.io/csi/ebs.csi.aws.com/64739da0c8f20ecdb049190c95e2587609871a2c69e89d5e2a0627bcb9b47cb6/globalmount type ext4 (rw,relatime,seclabel)
/dev/nvme1n2 on /var/lib/kubelet/plugins/kubernetes.io/csi/ebs.csi.aws.com/64739da0c8f20ecdb049190c95e2587609871a2c69e89d5e2a0627bcb9b47cb6/globalmount type ext4 (rw,relatime,seclabel)
/dev/nvme1n2 on /var/lib/kubelet/pods/e1308dfd-6d95-4b4e-8f90-09120dee0a0d/volumes/kubernetes.io~csi/pvc-69ce02b2-62a2-40b6-82b4-fc4ea9bb81a7/mount type ext4 (rw,relatime,seclabel)
sh-5.1# umount /var/lib/kubelet/pods/e1308dfd-6d95-4b4e-8f90-09120dee0a0d/volumes/kubernetes.io~csi/pvc-69ce02b2-62a2-40b6-82b4-fc4ea9bb81a7/mount && rmdir /var/lib/kubelet/pods/e1308dfd-6d95-4b4e-8f90-09120dee0a0d/volumes/kubernetes.io~csi/pvc-69ce02b2-62a2-40b6-82b4-fc4ea9bb81a7/mount

$ oc delete pod/ubuntu
pod "ubuntu" deleted
$ oc get pods
No resources found in testns namespace.

/cc @wongma7 @gnufied @jsafrane

Fix unmount idempotency

/kind bug
/sig storage
/triage accepted
/priority important-soon

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. sig/storage Categorizes an issue or PR as relevant to SIG Storage. triage/accepted Indicates an issue or PR is ready to be actively worked on. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 16, 2023
@dobsonj dobsonj force-pushed the idempotent-unmount branch from c0665b3 to 0a06d01 Compare May 16, 2023 19:57
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 16, 2023
@dobsonj
Copy link
Member Author

dobsonj commented May 16, 2023

Ah, #1597 just merged yesterday, which helps if the error returned contains 'not mounted', but does not help if it fails due to exit status 32 when the mount directory does not exist.
mountutils.CleanupMountPoint should address both cases. It's consistent with other drivers, and IMO is more reliable than checking for a specific error string.
/cc @ConnorJC3 @torredil @hanyuel
for reviews since it touches the same code

@dobsonj
Copy link
Member Author

dobsonj commented May 17, 2023

I had some discussion with @torredil and ran some extra tests with readOnlyRootFilesystem enabled to make sure that CleanupMountPoint handles that case correctly.

$ oc get daemonset -n openshift-cluster-csi-drivers aws-ebs-csi-driver-node -o yaml | grep readOnlyRootFilesystem
          readOnlyRootFilesystem: true
  1. Simple create / delete:
$ oc apply -f test_pod_pvc.yaml
persistentvolumeclaim/claim1 unchanged
pod/ubuntu created

$ oc get pods
NAME     READY   STATUS    RESTARTS   AGE
ubuntu   1/1     Running   0          19s
$ oc delete pod/ubuntu
pod "ubuntu" deleted
$ oc get pods
No resources found in testns namespace
  1. Unmounting and removing the directory before deleting the pod:
$ oc apply -f test_pod_pvc.yaml
persistentvolumeclaim/claim1 unchanged
pod/ubuntu created

$ oc get pods -o wide
NAME     READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
ubuntu   1/1     Running   0          48s   10.131.0.52   ip-10-0-216-169.us-east-2.compute.internal   <none>           <none>

$ oc debug node/ip-10-0-216-169.us-east-2.compute.internal
Starting pod/ip-10-0-216-169us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.216.169
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# mount | grep csi | grep \/mount
/dev/nvme1n1 on /var/lib/kubelet/pods/5ec4cb82-5617-4e19-ad5e-faf2641da175/volumes/kubernetes.io~csi/pvc-69ce02b2-62a2-40b6-82b4-fc4ea9bb81a7/mount type ext4 (rw,relatime,seclabel)
sh-5.1# umount /var/lib/kubelet/pods/5ec4cb82-5617-4e19-ad5e-faf2641da175/volumes/kubernetes.io~csi/pvc-69ce02b2-62a2-40b6-82b4-fc4ea9bb81a7/mount && rmdir /var/lib/kubelet/pods/5ec4cb82-5617-4e19-ad5e-faf2641da175/volumes/kubernetes.io~csi/pvc-69ce02b2-62a2-40b6-82b4-fc4ea9bb81a7/mount

$ oc delete pod/ubuntu
pod "ubuntu" deleted
$ oc get pods
No resources found in testns namespace.
  1. Just unmounting (but not calling rmdir) before deleting the pod:
$ oc apply -f test_pod_pvc.yaml
persistentvolumeclaim/claim1 unchanged
pod/ubuntu created

$ oc get pods -o wide
NAME     READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
ubuntu   1/1     Running   0          48s   10.131.0.53   ip-10-0-216-169.us-east-2.compute.internal   <none>           <none>

$ oc debug node/ip-10-0-216-169.us-east-2.compute.internal
Starting pod/ip-10-0-216-169us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.216.169
If you don't see a command prompt, try pressing enter.
sh-4.4# mount | grep csi | grep \/mount
/dev/nvme1n1 on /host/var/lib/kubelet/pods/0747dc0f-a384-48b4-8492-7ae8a0ca8c7d/volumes/kubernetes.io~csi/pvc-69ce02b2-62a2-40b6-82b4-fc4ea9bb81a7/mount type ext4 (rw,relatime,seclabel)
sh-4.4# umount /host/var/lib/kubelet/pods/0747dc0f-a384-48b4-8492-7ae8a0ca8c7d/volumes/kuberetes.io~csi/pvc-69ce02b2-62a2-40b6-82b4-fc4ea9bb81a7/mount

$ oc delete pod/ubuntu
pod "ubuntu" deleted
$ oc get pods
No resources found in testns namespace.

@torredil do you think I'm missing anything? If you have any concerns about removing the explicit 'not mounted' error check, then I could reduce this patch to a 1-liner to just replace the m.Unmount call with mountutils.CleanupMountPoint.

Copy link
Member

@torredil torredil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this lgtm as written, will leave it open for more feedback.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 17, 2023
@torredil
Copy link
Member

/test pull-aws-ebs-csi-driver-test-e2e-external-eks-windows

@ConnorJC3
Copy link
Contributor

If you have any concerns about removing the explicit 'not mounted' error check, then I could reduce this patch to a 1-liner to just replace the m.Unmount call with mountutils.CleanupMountPoint.

I think we should consider doing this. I'm not sure I'm 100% convinced that CleanupMountPoint always handles this correctly in our case (especially with weird race-y possible TOCTOU problems), and performing the check is practically free and guarantees that particularly nasty bug never re-emerges.

Otherwise, this lgtm as written. For what it's worth, I think that #1597 should fix 95+% of real world cases of this type of failure (and was released in Monday's 1.19.0 release), so I would recommend upgrading to 1.19.0 in the meantime to fix the problem you outlined in the issue description.

@dobsonj dobsonj force-pushed the idempotent-unmount branch from 0a06d01 to 1b1e877 Compare May 17, 2023 17:36
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 17, 2023
@dobsonj
Copy link
Member Author

dobsonj commented May 17, 2023

If you have any concerns about removing the explicit 'not mounted' error check, then I could reduce this patch to a 1-liner to just replace the m.Unmount call with mountutils.CleanupMountPoint.

I think we should consider doing this. I'm not sure I'm 100% convinced that CleanupMountPoint always handles this correctly in our case (especially with weird race-y possible TOCTOU problems), and performing the check is practically free and guarantees that particularly nasty bug never re-emerges.

Thanks a lot for the feedback, I made this change and re-tested the original scenario in my PR description, works as expected.

@torredil
Copy link
Member

/test pull-aws-ebs-csi-driver-external-test

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 18, 2023
@ConnorJC3
Copy link
Contributor

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ConnorJC3

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 18, 2023
@k8s-ci-robot k8s-ci-robot merged commit 9b7b951 into kubernetes-sigs:master May 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants