From e2d4c5555b76b284e8bbfe439b2e759759a5c502 Mon Sep 17 00:00:00 2001
From: Deepak Kinni
Date: Thu, 26 Aug 2021 16:17:39 -0700
Subject: [PATCH] Update 2644 KEP

Signed-off-by: Deepak Kinni
---
 .../2644-honor-pv-reclaim-policy/README.md | 251 ++++++++++++++----
 .../2644-honor-pv-reclaim-policy/kep.yaml  |   9 +-
 2 files changed, 201 insertions(+), 59 deletions(-)

diff --git a/keps/sig-storage/2644-honor-pv-reclaim-policy/README.md b/keps/sig-storage/2644-honor-pv-reclaim-policy/README.md
index 13b64246614b..c561ee974975 100644
--- a/keps/sig-storage/2644-honor-pv-reclaim-policy/README.md
+++ b/keps/sig-storage/2644-honor-pv-reclaim-policy/README.md
@@ -81,35 +81,34 @@ are currently using the PVC.
 3. If there are no Pods using the PVC, the PVC finalizers are removed, eventually triggering a PVC delete event.
 4. The PV-PVC-controller processes delete PVC event, leading to removal of the PVC from it's in-memory cache followed
 by triggering an explicit sync on the PV.
-5. The PV-PVC-controller processes the triggered explicit sync, here, it observes that the PVC is not longer available
+5. The PV-PVC-controller processes the triggered explicit sync, here, it observes that the PVC is no longer available
 and updates the PV phase to `Released` state.
-6. If the PVC-protection-controller picks the update, it observes that there is a `deletionTimestamp` and the PV is not
+6. If the PV-protection-controller picks the update, it observes that there is a `deletionTimestamp` and the PV is not
 in a `Bound` phase, this causes the finalizers to be removed.
 7. This is followed by PV-PVC-controller initiating the reclaim volume workflows.
 8. The reclaim volume workflow observes the `persistentVolumeReclaimPolicy` as `Delete` and schedules a volume deletion.
-9. During delete volume operation, it is observed that there is already a `deletionTimestamp`, and the call is never
-forwarded to the CSI driver to delete the underlying volume.
-
-As observed from the description above, (9) prevents the volume deletion to be forwarded to the plugin.
-
-Preventing the check should allow the volume deletion.
-
-
-### Risks and Mitigations
-
-The primary risk is volume deletion for user that expect the current behavior, i.e, they do not expect the volume to be
-deleted from the underlying storage infrastructure when for a bound PV-PVC pair, the PV is deleted followed by deleting
-the PVC. This can be mitigated by initially introducing the fix behind a feature gate, followed by explicitly deprecating
-the current behavior.
-
-## Design Details
-
-The below code is called when `persistentVolumeReclaimPolicy` is `Delete`, as a part of the proposed fix, removing the
-`deletionTimestamp` check would allow the volume deletion through the plugin.
+9. In the event that (6) has already occurred, when `deleteVolumeOperation` executes it attempts to retrieve the latest
+PV state from the API server. However, because of (6) the PV has already been removed from the API server, resulting in
+a `NotFound` error. This means the plugin volume deletion is never exercised, hence leaking the volume.
+```go
+func (ctrl *PersistentVolumeController) deleteVolumeOperation(volume *v1.PersistentVolume) (string, error) {
+    klog.V(4).Infof("deleteVolumeOperation [%s] started", volume.Name)
+    // This method may have been waiting for a volume lock for some time.
+    // Previous deleteVolumeOperation might just have saved an updated version, so
+    // read current volume state now.
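+    // Note (comment added for this KEP writeup, not in the upstream code): if the PV object was
+    // deleted first, step (6) has already removed the pv-protection finalizer and the object is
+    // gone from the API server, so the Get below fails with NotFound and the plugin is never invoked.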
+    newVolume, err := ctrl.kubeClient.CoreV1().PersistentVolumes().Get(context.TODO(), volume.Name, metav1.GetOptions{}) //<========== NotFound error thrown
+    if err != nil {
+        klog.V(3).Infof("error reading persistent volume %q: %v", volume.Name, err)
+        return "", nil
+    }
+    // Remaining code below skipped for brevity...
+    // Please refer pkg/controller/volume/persistentvolume/pv_controller.go in kubernetes/kubernetes for the full code.
+```
+10. In the event that (6) has not occurred yet, during execution of `deleteVolumeOperation` it is observed that the
+PV has a pre-existing `deletionTimestamp`. This makes the method assume that the delete is already being processed,
+so the plugin volume deletion is never exercised, hence leaking the volume.
 ```go
-// deleteVolumeOperation deletes a volume. This method is running in standalone
-// goroutine and already has all necessary locks.
 func (ctrl *PersistentVolumeController) deleteVolumeOperation(volume *v1.PersistentVolume) (string, error) {
     klog.V(4).Infof("deleteVolumeOperation [%s] started", volume.Name)
@@ -122,49 +121,178 @@ func (ctrl *PersistentVolumeController) deleteVolumeOperation(volume *v1.Persist
     return "", nil
 }
-    if newVolume.GetDeletionTimestamp() != nil {//<==========================Remove check.
+    if newVolume.GetDeletionTimestamp() != nil {//<==========================DeletionTimestamp set since the PV was deleted first.
         klog.V(3).Infof("Volume %q is already being deleted", volume.Name)
         return "", nil
     }
-    needsReclaim, err := ctrl.isVolumeReleased(newVolume)
-    if err != nil {
-        klog.V(3).Infof("error reading claim for volume %q: %v", volume.Name, err)
-        return "", nil
+    // Remaining code below skipped for brevity...
+    // Please refer pkg/controller/volume/persistentvolume/pv_controller.go in kubernetes/kubernetes for the full code.
+```
+11. Meanwhile, the external-provisioner checks whether there is a `deletionTimestamp` on the PV; if so, it assumes that
+the PV is in a transitory state and returns false from the `shouldDelete` check.
+```go
+// shouldDelete returns whether a volume should have its backing volume
+// deleted, i.e. whether a Delete is "desired"
+func (ctrl *ProvisionController) shouldDelete(ctx context.Context, volume *v1.PersistentVolume) bool {
+    if deletionGuard, ok := ctrl.provisioner.(DeletionGuard); ok {
+        if !deletionGuard.ShouldDelete(ctx, volume) {
+            return false
+        }
     }
-    if !needsReclaim {
-        klog.V(3).Infof("volume %q no longer needs deletion, skipping", volume.Name)
-        return "", nil
+
+    if ctrl.addFinalizer && !ctrl.checkFinalizer(volume, finalizerPV) && volume.ObjectMeta.DeletionTimestamp != nil {
+        return false
+    } else if volume.ObjectMeta.DeletionTimestamp != nil { //<========== DeletionTimestamp is set on the PV since it's deleted first.
+        return false
     }
+    // Remaining code below skipped for brevity...
+    // Please refer sig-storage-lib-external-provisioner/controller/controller.go for the full code.
+```
+
+The main approach to fixing the issue is to introduce a new annotation `pv.kubernetes.io/request-to-delete-volume`
+on the PV. The PV is annotated when it is observed that its `DeletionTimestamp` is set but it is still associated with
+a valid claim.
+
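+For illustration only, the condition under which the controller would add this annotation can be sketched as below.
+The helper name and package are hypothetical; the actual change is the `syncVolume` snippet shown in Design Details.
+
+```go
+package sketch
+
+import (
+    v1 "k8s.io/api/core/v1"
+    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+)
+
+// annRequestToDeleteVolume is the proposed annotation key.
+const annRequestToDeleteVolume = "pv.kubernetes.io/request-to-delete-volume"
+
+// needsRequestToDeleteVolumeAnnotation (hypothetical helper) captures the condition described above:
+// the PV carries a deletionTimestamp, it is still bound to a claim, and it has not been annotated yet.
+func needsRequestToDeleteVolumeAnnotation(pv *v1.PersistentVolume) bool {
+    if pv.DeletionTimestamp == nil {
+        // The PV itself was not deleted; nothing to record.
+        return false
+    }
+    if pv.Spec.ClaimRef == nil {
+        // The PV is not bound to a claim; the normal reclaim flow covers this case.
+        return false
+    }
+    return !metav1.HasAnnotation(pv.ObjectMeta, annRequestToDeleteVolume)
+}
+```
+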
+When the PVC is deleted, the PV is moved into the `Released` state and the following checks are made:
+
+1. Plugin:
+
+If there was a `NotFound` error while retrieving the PV, then it is checked whether the PV received as part of the
+update event has the annotation; if so, the `NotFound` error is ignored and the call to delete the volume is allowed to
+proceed.
+
+2. CSI driver:
+
+If, while processing the PV update, it is observed that the PV has the annotation and `DeletionTimestamp` set, then the
+volume deletion request is forwarded to the driver (provided other pre-defined conditions are met).
-    pluginName, deleted, err := ctrl.doDeleteVolume(volume)
-    // Remaining code below skipped for brevity...
-    // Please refer pkg/controller/volume/persistentvolume/pv_controller.go in kubernetes/kubernetes for the full code.
-
+
+### Risks and Mitigations
+
+The primary risk is volume deletion for users who expect the current behavior, i.e., they do not expect the volume to be
+deleted from the underlying storage infrastructure when, for a bound PV-PVC pair, the PV is deleted followed by deleting
+the PVC. This can be mitigated by initially introducing the fix behind a feature gate, followed by explicitly deprecating
+the current behavior.
+
+## Design Details
+
+When processing the PV update in `kube-controller-manager`, if it is observed that the PV is still bound to a valid claim
+and also has `DeletionTimestamp` set, then the PV is annotated.
+
+```go
+// AnnRequestToDeleteVolume annotation is added to the PV when the PV is still bound to a valid claim but has a
+// deletionTimestamp set. This indicates to the driver that the volume needs to be deleted from the underlying storage.
+AnnRequestToDeleteVolume = "pv.kubernetes.io/request-to-delete-volume"
+```
+
+```go
+// Function: syncVolume
+// Code above skipped for brevity...
+
+if !found {
+    klog.V(4).Infof("synchronizing PersistentVolume[%s]: claim %s not found", volume.Name, claimrefToClaimKey(volume.Spec.ClaimRef))
+    // Fall through with claim = nil
+} else {
+    var ok bool
+    claim, ok = obj.(*v1.PersistentVolumeClaim)
+    if !ok {
+        return fmt.Errorf("Cannot convert object from volume cache to volume %q!?: %#v", claim.Spec.VolumeName, obj)
+    }
+    klog.V(4).Infof("synchronizing PersistentVolume[%s]: claim %s found: %s", volume.Name, claimrefToClaimKey(volume.Spec.ClaimRef), getClaimStatusForLogging(claim))
+    // Check if the PV has a deletion timestamp; if so, annotate the PV to indicate that the PV can be deleted
+    // from the underlying storage.
+    if volume.DeletionTimestamp != nil && !metav1.HasAnnotation(volume.ObjectMeta, pvutil.AnnRequestToDeleteVolume) {
+        klog.Infof("synchronizing PersistentVolume[%s]: the volume was requested to be deleted but still"+
+            " associated with the claim %s", volume.Name, claimrefToClaimKey(volume.Spec.ClaimRef))
+        _, err := ctrl.updateRequestToDeleteVolumeAnnotations(volume)
+        if err != nil {
+            klog.V(4).Infof("synchronizing PersistentVolume[%s]: failed to add the delete volume annotation", volume.Name)
+        }
+        klog.Infof("successfully added annotation to indicate the volume is to be deleted.")
+    }
+}
+
+// Remaining code below skipped for brevity...
+// Please refer kubernetes/kubernetes/pkg/controller/volume/persistentvolume/pv_controller.go for the full code.
 ```
 
-The proposed fix will not have side effects in existing code flows as the `deleteVolumeOperation` is scheduled with an
-identifier generated using the volume name and UID that is being deleted. All duplicate identifiers are skipped.
+When `deleteVolumeOperation` executes, a `NotFound` error while retrieving the PV is ignored if the annotation is found.
```go - case v1.PersistentVolumeReclaimDelete: - klog.V(4).Infof("reclaimVolume[%s]: policy is Delete", volume.Name) - opName := fmt.Sprintf("delete-%s[%s]", volume.Name, string(volume.UID)) - // create a start timestamp entry in cache for deletion operation if no one exists with - // key = volume.Name, pluginName = provisionerName, operation = "delete" - ctrl.operationTimestamps.AddIfNotExist(volume.Name, ctrl.getProvisionerNameFromVolume(volume), "delete") - ctrl.scheduleOperation(opName, func() error { - _, err := ctrl.deleteVolumeOperation(volume) - if err != nil { - // only report error count to "volume_operation_total_errors" - // latency reporting will happen when the volume get finally - // deleted and a volume deleted event is captured - metrics.RecordMetric(volume.Name, &ctrl.operationTimestamps, err) +// deleteVolumeOperation deletes a volume. This method is running in standalone +// goroutine and already has all necessary locks. +func (ctrl *PersistentVolumeController) deleteVolumeOperation(volume *v1.PersistentVolume) (string, error) { + klog.V(4).Infof("deleteVolumeOperation [%s] started", volume.Name) + var requestToDeleteAnn bool + // This method may have been waiting for a volume lock for some time. + // Previous deleteVolumeOperation might just have saved an updated version, so + // read current volume state now. + newVolume, err := ctrl.kubeClient.CoreV1().PersistentVolumes().Get(context.TODO(), volume.Name, metav1.GetOptions{}) + if err != nil { //<===============================Changed here START + if !apierrors.IsNotFound(err) { + return "", nil + } else { + // Check if the volume is annotated with explicit request to delete. + // In this case due to race with the pv protection controller, it was not possible to retrieve the latest PV + // state. However, the annotation clearly indicates the volume deletion can proceed. + requestToDeleteAnn = metav1.HasAnnotation(volume.ObjectMeta, pvutil.AnnRequestToDeleteVolume) + if !requestToDeleteAnn { + klog.Infof("no request to delete annotation found on the volume %s", volume.Name) + // Does not have the annotation, throw the NotFound error. + return "", err + } else { + klog.Infof("request to delete annotation found on the volume %s", volume.Name) + newVolume = volume } - return err - }) + } + } //<===============================Changed here END + // Note: no need to check for deletion timestamp, the schedule operation ensures that duplicate delete + // are not processed. + if newVolume.GetDeletionTimestamp() != nil {//<===========================Check to be removed + klog.V(3).Infof("Volume %q is already being deleted", volume.Name) + return "", nil + } +// Remaining code below skipped for brevity... +// Please refer pkg/controller/volume/persistentvolume/pv_controller.go in kubernetes/kubernetes for the full code. +``` + +The above changes ensure that the plugin receives the volume deletion request. + +On the driver side, additional checks are made whenever `DeletionTimestamp` is set. + +```go +// shouldDelete returns whether a volume should have its backing volume +// deleted, i.e. whether a Delete is "desired" +func (ctrl *ProvisionController) shouldDelete(ctx context.Context, volume *v1.PersistentVolume) bool { + if deletionGuard, ok := ctrl.provisioner.(DeletionGuard); ok { + if !deletionGuard.ShouldDelete(ctx, volume) { + return false + } + } + + // In 1.9+ PV protection means the object will exist briefly with a + // deletion timestamp even after our successful Delete. Ignore it. 
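+    // Note (comment added for this KEP writeup, not in the upstream code): with the proposed change,
+    // a deletion timestamp alone no longer suppresses deletion; the annotation check below lets an
+    // explicitly requested volume deletion proceed.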
+    if ctrl.kubeVersion.AtLeast(utilversion.MustParseSemantic("v1.9.0")) {
+        if ctrl.addFinalizer && !ctrl.checkFinalizer(volume, finalizerPV) && volume.ObjectMeta.DeletionTimestamp != nil {
+            return false
+        } else if volume.ObjectMeta.DeletionTimestamp != nil {
+            // Check if there is a request to delete the volume explicitly.
+            if !metav1.HasAnnotation(volume.ObjectMeta, annRequestToDeleteVolume) { //<===============================Changed here START
+                return false
+            } else {
+                klog.Infof("The volume %s was explicitly requested to be deleted from underlying storage", volume.Name)
+                // NOTE: This allows further conditions to be checked, such as Phase=="Released", reclaimPolicy="Delete", etc.
+                // Please refer to the code for the other checks.
+            } //<===============================Changed here END
+        }
+    }
+// Remaining code below skipped for brevity...
+// Please refer sig-storage-lib-external-provisioner/controller/controller.go for the full code.
+```
+When the `shouldDelete` checks succeed, a delete volume request is initiated on the driver. This ensures that the volume
+is deleted in the backend.
+
 ### Test Plan
 
 #### E2E tests
@@ -185,17 +313,30 @@ An e2e test to exercise deletion of PV prior to deletion of PVC for a `Bound`
 PV-PVC pair
 * Deployed in production and have gone through at least one K8s upgrade.
 
 ### Upgrade / Downgrade Strategy
-* Upgrade from old Kubernetes(1.21) to new Kubernetes(1.22) with `HonorPvReclaim` flag enabled
+* Upgrade from old Kubernetes(1.22) to new Kubernetes(1.23) with the `HonorPvReclaim` feature gate enabled
 
 In the above scenario, the upgrade will cause change in default behavior as described in the current KEP. Additionally,
-no efforts are made to delete volumes that were not previously as a result of old behavior.
+if there are PVs that have a valid associated PVC and a deletion timestamp set, then they will be annotated and processed
+as explained in this KEP.
 
 * Downgrade from new Kubernetes(1.22) to old Kubernetes(1.21).
 
 In the above scenario, downgrade will retain the current behavior.
 
 ### Version Skew Strategy
-No Issues, the fix is present on a single component `kube-controller-manager`.
+The fix is part of both `kube-controller-manager` and the `external-provisioner`.
+
+1. `kube-controller-manager` is upgraded, but `external-provisioner` is not:
+
+In this case the changed behavior is observed only for in-tree plugins; CSI drivers would still have the issue. Users
+will observe the new annotation on the PV, but it has no effect since the `external-provisioner` is not updated.
+
+2. `external-provisioner` is upgraded, but `kube-controller-manager` is not:
+
+In this case the annotation is never added, so the upgraded `external-provisioner` has no effect and the issue would
+still exist.
+
 ## Production Readiness Review Questionnaire
 
diff --git a/keps/sig-storage/2644-honor-pv-reclaim-policy/kep.yaml b/keps/sig-storage/2644-honor-pv-reclaim-policy/kep.yaml
index d0daaaeb052a..64ffae3e4557 100644
--- a/keps/sig-storage/2644-honor-pv-reclaim-policy/kep.yaml
+++ b/keps/sig-storage/2644-honor-pv-reclaim-policy/kep.yaml
@@ -19,17 +19,18 @@ replaces:
 
 stage: alpha
 
-latest-milestone: "v1.22"
+latest-milestone: "v1.23"
 
 milestone:
-  alpha: "v1.22"
-  beta: "v1.23"
-  stable: "v1.24"
+  alpha: "v1.23"
+  beta: "v1.24"
+  stable: "v1.25"
 
 feature-gates:
   - name: HonorPvReclaim
     components:
       - kube-controller-manager
+      - external-provisioner
     disable-supported: true
 
 metrics:
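
Illustrative sketch, not part of the patch above: the Test Plan calls for an e2e test that deletes the PV before the
PVC for a `Bound` pair and expects the backing volume to be removed. A rough sketch of that flow using client-go is
shown below; the `existsInBackend` check and all names here are assumptions, since the real test would live in the
Kubernetes e2e framework.

```go
package e2esketch

import (
    "context"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/wait"
    "k8s.io/client-go/kubernetes"
)

// BackendVolumeExists is a placeholder for a storage-backend specific check; a real
// e2e test would implement it against the CSI driver or cloud provider under test.
type BackendVolumeExists func(volumeHandle string) (bool, error)

// DeletePVThenPVCAndExpectCleanup deletes the PV first, then the PVC, and waits for the
// backing volume to disappear from the storage backend, which is the behavior this KEP enables.
func DeletePVThenPVCAndExpectCleanup(ctx context.Context, cs kubernetes.Interface,
    pvName, pvcNamespace, pvcName, volumeHandle string, existsInBackend BackendVolumeExists) error {
    // Delete the PV first; pv-protection keeps the object around while it is still bound.
    if err := cs.CoreV1().PersistentVolumes().Delete(ctx, pvName, metav1.DeleteOptions{}); err != nil {
        return err
    }
    // Delete the PVC next, which releases the PV and should trigger the plugin/driver deletion.
    if err := cs.CoreV1().PersistentVolumeClaims(pvcNamespace).Delete(ctx, pvcName, metav1.DeleteOptions{}); err != nil {
        return err
    }
    // With this KEP, the volume must eventually disappear from the storage backend.
    return wait.PollImmediate(5*time.Second, 5*time.Minute, func() (bool, error) {
        exists, err := existsInBackend(volumeHandle)
        if err != nil {
            return false, err
        }
        return !exists, nil
    })
}
```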