Update 2644 KEP
Signed-off-by: Deepak Kinni <dkinni@vmware.com>
Deepak Kinni committed Aug 26, 2021
1 parent 17097f4 commit e2d4c55
Showing 2 changed files with 201 additions and 59 deletions.
251 changes: 196 additions & 55 deletions keps/sig-storage/2644-honor-pv-reclaim-policy/README.md
…are currently using the PVC.
3. If there are no Pods using the PVC, the PVC finalizers are removed, eventually triggering a PVC delete event.
4. The PV-PVC-controller processes the delete PVC event, leading to removal of the PVC from its in-memory cache, followed
by triggering an explicit sync on the PV.
5. The PV-PVC-controller processes the triggered explicit sync; here, it observes that the PVC is no longer available
and updates the PV phase to `Released` state.
6. If the PV-protection-controller picks up the update, it observes that there is a `deletionTimestamp` and the PV is not
in a `Bound` phase, which causes the finalizers to be removed.
7. This is followed by the PV-PVC-controller initiating the reclaim volume workflows.
8. The reclaim volume workflow observes the `persistentVolumeReclaimPolicy` as `Delete` and schedules a volume deletion.
9. In the event that (6) has already occurred, when `deleteVolumeOperation` executes it attempts to retrieve the latest PV
state from the API server; however, due to (6) the PV has been removed from the API server, leading to an error. This
results in the plugin volume deletion not being exercised, hence leaking the volume.
```go
func (ctrl *PersistentVolumeController) deleteVolumeOperation(volume *v1.PersistentVolume) (string, error) {
klog.V(4).Infof("deleteVolumeOperation [%s] started", volume.Name)

// This method may have been waiting for a volume lock for some time.
// Previous deleteVolumeOperation might just have saved an updated version, so
// read current volume state now.
newVolume, err := ctrl.kubeClient.CoreV1().PersistentVolumes().Get(context.TODO(), volume.Name, metav1.GetOptions{}) //<========== NotFound error thrown
if err != nil {
klog.V(3).Infof("error reading persistent volume %q: %v", volume.Name, err)
return "", nil
}
// Remaining code below skipped for brevity...
// Please refer pkg/controller/volume/persistentvolume/pv_controller.go in kubernetes/kubernetes for the full code.
```
10. In the event that (6) has not occurred yet, during execution of `deleteVolumeOperation` it is observed that the
PV has a pre-existing `deletionTimestamp`, which makes the method assume that a delete is already being processed. This
results in the plugin volume deletion not being exercised, hence leaking the volume.
```go
// deleteVolumeOperation deletes a volume. This method is running in standalone
// goroutine and already has all necessary locks.
func (ctrl *PersistentVolumeController) deleteVolumeOperation(volume *v1.PersistentVolume) (string, error) {
klog.V(4).Infof("deleteVolumeOperation [%s] started", volume.Name)

// This method may have been waiting for a volume lock for some time.
// Previous deleteVolumeOperation might just have saved an updated version, so
// read current volume state now.
newVolume, err := ctrl.kubeClient.CoreV1().PersistentVolumes().Get(context.TODO(), volume.Name, metav1.GetOptions{})
if err != nil {
klog.V(3).Infof("error reading persistent volume %q: %v", volume.Name, err)
return "", nil
}

if newVolume.GetDeletionTimestamp() != nil {//<==========================DeletionTimestamp set since the PV was deleted first.
klog.V(3).Infof("Volume %q is already being deleted", volume.Name)
return "", nil
}
// Remaining code below skipped for brevity...
// Please refer pkg/controller/volume/persistentvolume/pv_controller.go in kubernetes/kubernetes for the full code.
```
11. Meanwhile, the external-provisioner checks if there is a `deletionTimestamp` on the PV; if so, it assumes that the PV
is in a transitory state and returns false for the `shouldDelete` check.
```go
// shouldDelete returns whether a volume should have its backing volume
// deleted, i.e. whether a Delete is "desired"
func (ctrl *ProvisionController) shouldDelete(ctx context.Context, volume *v1.PersistentVolume) bool {
if deletionGuard, ok := ctrl.provisioner.(DeletionGuard); ok {
if !deletionGuard.ShouldDelete(ctx, volume) {
return false
}
}

if ctrl.addFinalizer && !ctrl.checkFinalizer(volume, finalizerPV) && volume.ObjectMeta.DeletionTimestamp != nil {
return false
} else if volume.ObjectMeta.DeletionTimestamp != nil { //<========== DeletionTimestamp is set on the PV since it's deleted first.
return false
}
// Remaining code below skipped for brevity...
// Please refer sig-storage-lib-external-provisioner/controller/controller.go for the full code.
```
The main approach to fixing the issue is to introduce a new annotation, `pv.kubernetes.io/request-to-delete-volume`,
on the PV. The PV is annotated when it is observed that it has its `DeletionTimestamp` set but is still associated with
a valid claim.
When the PVC is deleted, the PV is moved into the `Released` state and the following checks are made (see the
illustrative object after this list):
1. In-tree plugin:
If a `NotFound` error occurs while retrieving the PV, the PV received as part of the update event is checked for the
annotation; if it is present, the `NotFound` error is ignored and the call to delete the volume is allowed to proceed.
2. CSI driver:
If, when processing the PV update, it is observed that the PV has the annotation and the `DeletionTimestamp` set, then
the volume deletion request is forwarded to the driver (provided other pre-defined conditions are met).
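For illustration, this is roughly the PV state that both components react to under the proposed fix. The object below is
only a sketch: the annotation value `"yes"` and the field values are illustrative, and `deletionTimestamp` is set by the
API server when the PV delete is requested, not by clients.
```go
package example

import (
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// examplePV sketches a Released PV that carries the new annotation, has a deletionTimestamp,
// and uses the Delete reclaim policy; this is the state in which the backing volume should be deleted.
var examplePV = &v1.PersistentVolume{
	ObjectMeta: metav1.ObjectMeta{
		Name: "example-pv",
		Annotations: map[string]string{
			"pv.kubernetes.io/request-to-delete-volume": "yes", // value is illustrative
		},
		DeletionTimestamp: &metav1.Time{Time: time.Now()},
	},
	Spec: v1.PersistentVolumeSpec{
		PersistentVolumeReclaimPolicy: v1.PersistentVolumeReclaimDelete,
	},
	Status: v1.PersistentVolumeStatus{Phase: v1.VolumeReleased},
}
```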

### Risks and Mitigations
The primary risk is volume deletion for users that expect the current behavior, i.e., they do not expect the volume to be
deleted from the underlying storage infrastructure when, for a bound PV-PVC pair, the PV is deleted followed by deleting
the PVC. This can be mitigated by initially introducing the fix behind a feature gate, followed by explicitly deprecating
the current behavior.
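A minimal sketch of how the new behavior could be guarded by the `HonorPvReclaim` feature gate follows. The registration
site and the helper below are illustrative assumptions; the actual wiring in `kube-controller-manager` may differ.
```go
package example

import (
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/component-base/featuregate"
)

// HonorPvReclaim is the feature gate named in this KEP; where it is registered is an implementation detail.
const HonorPvReclaim featuregate.Feature = "HonorPvReclaim"

func init() {
	// Alpha gates default to disabled, preserving today's behavior unless explicitly opted in.
	_ = utilfeature.DefaultMutableFeatureGate.Add(map[featuregate.Feature]featuregate.FeatureSpec{
		HonorPvReclaim: {Default: false, PreRelease: featuregate.Alpha},
	})
}

// honorPvReclaimEnabled would gate the annotation handling and the relaxed NotFound/deletionTimestamp
// checks described in the Design Details below.
func honorPvReclaimEnabled() bool {
	return utilfeature.DefaultFeatureGate.Enabled(HonorPvReclaim)
}
```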
## Design Details
When processing the PV update in `kube-controller-manager`, if it's observed that the PV is still bound to a valid claim,
and also has `DeletionTimestamp` set, then the PV is annotated.
```go
// AnnRequestToDeleteVolume annotation is added to the PV when the PV is still bound to a valid claim but has a deletionTimestamp set.
// This indicates to the driver that the volume needs to be deleted from the underlying storage.
AnnRequestToDeleteVolume = "pv.kubernetes.io/request-to-delete-volume"
```
```go
// Function: syncVolume
// Code above skipped for brevity...

if !found {
klog.V(4).Infof("synchronizing PersistentVolume[%s]: claim %s not found", volume.Name, claimrefToClaimKey(volume.Spec.ClaimRef))
// Fall through with claim = nil
} else {
var ok bool
claim, ok = obj.(*v1.PersistentVolumeClaim)
if !ok {
return fmt.Errorf("Cannot convert object from volume cache to volume %q!?: %#v", claim.Spec.VolumeName, obj)
}
klog.V(4).Infof("synchronizing PersistentVolume[%s]: claim %s found: %s", volume.Name, claimrefToClaimKey(volume.Spec.ClaimRef), getClaimStatusForLogging(claim))
// Check if the PV has a deletion timestamp, if so, annotate the PV to indicate that the PV can be deleted
// from the underlying storage.
if volume.DeletionTimestamp != nil && !metav1.HasAnnotation(volume.ObjectMeta, pvutil.AnnRequestToDeleteVolume) {
klog.Infof("synchronizing PersistentVolume[%s]: the volume was requested to be deleted but still" +
" associated with the claim %s", volume.Name, claimrefToClaimKey(volume.Spec.ClaimRef))
if _, err := ctrl.updateRequestToDeleteVolumeAnnotations(volume); err != nil {
klog.V(4).Infof("synchronizing PersistentVolume[%s]: failed to add the delete volume annotation", volume.Name)
} else {
klog.Infof("successfully added annotation to indicate the volume is to be deleted.")
}
}
}

// Remaining code below skipped for brevity...
// Please refer kubernetes/kubernetes/pkg/controller/volume/persistentvolume/pv_controller.go for the full code.
```
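The helper `ctrl.updateRequestToDeleteVolumeAnnotations` referenced above is not shown in this KEP. A rough sketch of what
it might look like, written in the style of the surrounding `pv_controller.go` snippets, is given below; the annotation
value and the exact error handling are assumptions.
```go
// Sketch only: annotate the PV and push the update to the API server and the controller's local store.
func (ctrl *PersistentVolumeController) updateRequestToDeleteVolumeAnnotations(volume *v1.PersistentVolume) (*v1.PersistentVolume, error) {
	// Work on a copy so the shared informer cache is never mutated directly.
	volumeClone := volume.DeepCopy()
	metav1.SetMetaDataAnnotation(&volumeClone.ObjectMeta, pvutil.AnnRequestToDeleteVolume, "yes")

	newVol, err := ctrl.kubeClient.CoreV1().PersistentVolumes().Update(context.TODO(), volumeClone, metav1.UpdateOptions{})
	if err != nil {
		return newVol, err
	}
	// Keep the controller's local store in sync with what was just written.
	if _, err = ctrl.storeVolumeUpdate(newVol); err != nil {
		return newVol, err
	}
	return newVol, nil
}
```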
The proposed fix will not have side effects on existing code flows, as `deleteVolumeOperation` is scheduled with an
identifier generated from the name and UID of the volume being deleted; duplicate identifiers are skipped.
```go
case v1.PersistentVolumeReclaimDelete:
    klog.V(4).Infof("reclaimVolume[%s]: policy is Delete", volume.Name)
    opName := fmt.Sprintf("delete-%s[%s]", volume.Name, string(volume.UID))
    // create a start timestamp entry in cache for deletion operation if no one exists with
    // key = volume.Name, pluginName = provisionerName, operation = "delete"
    ctrl.operationTimestamps.AddIfNotExist(volume.Name, ctrl.getProvisionerNameFromVolume(volume), "delete")
    ctrl.scheduleOperation(opName, func() error {
        _, err := ctrl.deleteVolumeOperation(volume)
        if err != nil {
            // only report error count to "volume_operation_total_errors"
            // latency reporting will happen when the volume gets finally
            // deleted and a volume deleted event is captured
            metrics.RecordMetric(volume.Name, &ctrl.operationTimestamps, err)
        }
        return err
    })
```
When `deleteVolumeOperation` executes, a `NotFound` error while retrieving the PV is ignored if the annotation is found.
```go
// deleteVolumeOperation deletes a volume. This method is running in standalone
// goroutine and already has all necessary locks.
func (ctrl *PersistentVolumeController) deleteVolumeOperation(volume *v1.PersistentVolume) (string, error) {
    klog.V(4).Infof("deleteVolumeOperation [%s] started", volume.Name)
    var requestToDeleteAnn bool
    // This method may have been waiting for a volume lock for some time.
    // Previous deleteVolumeOperation might just have saved an updated version, so
    // read current volume state now.
    newVolume, err := ctrl.kubeClient.CoreV1().PersistentVolumes().Get(context.TODO(), volume.Name, metav1.GetOptions{})
    if err != nil { //<===============================Changed here START
        if !apierrors.IsNotFound(err) {
            return "", nil
        } else {
            // Check if the volume is annotated with an explicit request to delete.
            // In this case, due to a race with the pv-protection controller, it was not possible to retrieve
            // the latest PV state. However, the annotation clearly indicates that the volume deletion can proceed.
            requestToDeleteAnn = metav1.HasAnnotation(volume.ObjectMeta, pvutil.AnnRequestToDeleteVolume)
            if !requestToDeleteAnn {
                klog.Infof("no request to delete annotation found on the volume %s", volume.Name)
                // Does not have the annotation, return the NotFound error.
                return "", err
            } else {
                klog.Infof("request to delete annotation found on the volume %s", volume.Name)
                newVolume = volume
            }
        }
    } //<===============================Changed here END
    // Note: no need to check the deletion timestamp; the scheduled operation ensures that duplicate
    // deletes are not processed.
    if newVolume.GetDeletionTimestamp() != nil { //<===========================Check to be removed
        klog.V(3).Infof("Volume %q is already being deleted", volume.Name)
        return "", nil
    }
    // Remaining code below skipped for brevity...
    // Please refer pkg/controller/volume/persistentvolume/pv_controller.go in kubernetes/kubernetes for the full code.
```
The above changes ensure that the plugin receives the volume deletion request.
On the driver side, additional checks are made whenever `DeletionTimestamp` is set.
```go
// shouldDelete returns whether a volume should have its backing volume
// deleted, i.e. whether a Delete is "desired"
func (ctrl *ProvisionController) shouldDelete(ctx context.Context, volume *v1.PersistentVolume) bool {
if deletionGuard, ok := ctrl.provisioner.(DeletionGuard); ok {
if !deletionGuard.ShouldDelete(ctx, volume) {
return false
}
}

// In 1.9+ PV protection means the object will exist briefly with a
// deletion timestamp even after our successful Delete. Ignore it.
if ctrl.kubeVersion.AtLeast(utilversion.MustParseSemantic("v1.9.0")) {
if ctrl.addFinalizer && !ctrl.checkFinalizer(volume, finalizerPV) && volume.ObjectMeta.DeletionTimestamp != nil {
return false
} else if volume.ObjectMeta.DeletionTimestamp != nil {
// Check if there is a request to delete the volume explicitly.
if !metav1.HasAnnotation(volume.ObjectMeta, annRequestToDeleteVolume) { //<===============================Changed here START
return false
} else {
klog.Infof("The volume %s was explicitly requested to be deleted from underlying storage", volume.Name)
// NOTE: This allows further conditions to be checked, such as Phase=="Released", reclaimPolicy="Delete", etc.
// Please refer code to see other checks.
} //<===============================Changed here END
}
}
// Remaining code below skipped for brevity...
// Please refer sig-storage-lib-external-provisioner/controller/controller.go for the full code.
```
When the `shouldDelete` checks succeed, a delete volume request is initiated against the driver. This ensures that the
volume is deleted from the backend storage.
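For context, below is a simplified sketch of the hand-off in sig-storage-lib-external-provisioner once `shouldDelete`
returns true. The function name `deleteBackingVolume` is illustrative; the real `deleteVolumeOperation` in
controller.go carries additional bookkeeping, metrics, and finalizer handling.
```go
// Sketch only: when shouldDelete passes, the library calls the provisioner's Delete method,
// which for a CSI provisioner results in a DeleteVolume RPC against the driver.
func (ctrl *ProvisionController) deleteBackingVolume(ctx context.Context, volume *v1.PersistentVolume) error {
	if !ctrl.shouldDelete(ctx, volume) {
		// Either the PV is not eligible (phase/reclaim policy) or a deletionTimestamp guard applied.
		return nil
	}
	if err := ctrl.provisioner.Delete(ctx, volume); err != nil {
		return fmt.Errorf("failed to delete backing volume for PV %s: %v", volume.Name, err)
	}
	// The PV API object itself is cleaned up afterwards by the library.
	return nil
}
```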
### Test Plan
#### E2E tests
An e2e test to exercise deletion of the PV prior to deletion of the PVC for a `Bound` PV-PVC pair.
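A sketch of how that scenario could be driven with client-go follows; the helper name and polling details are
illustrative, not the actual e2e framework code.
```go
package example

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// deletePVThenPVC exercises the KEP scenario: delete the PV of a Bound PV-PVC pair first, then the
// PVC, and wait for the PV object to disappear. A real e2e test would additionally verify, via the
// storage backend, that the underlying volume was deleted.
func deletePVThenPVC(ctx context.Context, c kubernetes.Interface, ns, pvcName, pvName string) error {
	// Delete the PV first; pv-protection keeps it around while it is still Bound.
	if err := c.CoreV1().PersistentVolumes().Delete(ctx, pvName, metav1.DeleteOptions{}); err != nil {
		return err
	}
	// Deleting the PVC releases the PV and should now trigger deletion of the backing volume.
	if err := c.CoreV1().PersistentVolumeClaims(ns).Delete(ctx, pvcName, metav1.DeleteOptions{}); err != nil {
		return err
	}
	// Wait for the PV object to be removed from the API server.
	return wait.PollImmediate(2*time.Second, 2*time.Minute, func() (bool, error) {
		_, err := c.CoreV1().PersistentVolumes().Get(ctx, pvName, metav1.GetOptions{})
		return apierrors.IsNotFound(err), nil
	})
}
```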
* Deployed in production and has gone through at least one K8s upgrade.
### Upgrade / Downgrade Strategy
* Upgrade from old Kubernetes(1.22) to new Kubernetes(1.23) with `HonorPvReclaim` flag enabled
In the above scenario, the upgrade will cause a change in the default behavior as described in the current KEP. Additionally,
if there are PVs that have a valid associated PVC and deletion timestamp set, then they will be annotated and processed
as explained in this KEP.
* Downgrade from new Kubernetes(1.23) to old Kubernetes(1.22).
In the above scenario, downgrade will retain the current behavior.
### Version Skew Strategy
The fix spans two components: `kube-controller-manager` and `external-provisioner`.
1. `kube-controller-manager` is upgraded, but `external-provisioner` is not:
In this case the changed behavior is observed only for in-tree plugins; CSI drivers would still have the issue. Users
will observe the new annotation on the PV, but it has no effect since the `external-provisioner` is not
updated.
2. `external-provisioner` is upgraded, but `kube-controller-manager` is not:
In this case the annotation is never added, so the upgraded `external-provisioner` has no effect and the issue would
still exist.
## Production Readiness Review Questionnaire
9 changes: 5 additions & 4 deletions keps/sig-storage/2644-honor-pv-reclaim-policy/kep.yaml
replaces:

stage: alpha

latest-milestone: "v1.22"
latest-milestone: "v1.23"

milestone:
alpha: "v1.22"
beta: "v1.23"
stable: "v1.24"
alpha: "v1.23"
beta: "v1.24"
stable: "v1.25"

feature-gates:
- name: HonorPvReclaim
components:
- kube-controller-manager
- external-provisioner
disable-supported: true

metrics:
