# KEP-2644: Honor Persistent Volume Reclaim Policy

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
- [Proposal](#proposal)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Test Plan](#test-plan)
    - [E2E tests](#e2e-tests)
  - [Graduation Criteria](#graduation-criteria)
    - [Alpha](#alpha)
    - [Alpha -> Beta](#alpha---beta)
    - [Beta -> GA](#beta---ga)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
  - [Scalability](#scalability)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
<!-- /toc -->

## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
  - [ ] e2e Tests for all Beta API Operations (endpoints)
  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [ ] (R) Graduation criteria is in place
  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Production readiness review completed
- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website

## Summary
The reclaim policy associated with a Persistent Volume (PV) is currently ignored under certain circumstances. For a
`Bound` PV-PVC pair, the order in which the PV and PVC are deleted determines whether the `Delete` reclaim policy is
honored. The reclaim policy is honored if the PVC is deleted before the PV; however, if the PV is deleted before the
PVC, the reclaim policy is not exercised. As a result of this behavior, the associated storage asset in the external
infrastructure is not removed.

## Motivation
Prevent volumes from being leaked by honoring the PV reclaim policy once the `Bound` PVC is also deleted.

## Proposal
To better understand the issue, we first walk through the existing behavior when a `Bound` PV is deleted prior to
deleting the PVC.

```
kubectl delete pv <pv-name>
```
On deleting a `Bound` PV, a `deletionTimestamp` is added to the PV object. This triggers an update that is consumed
by the PV-PVC-controller. The PV-PVC-controller observes that the PV is still associated with a PVC and that the PVC
has not been deleted, so it attempts to update the PV phase to `Bound`. Assuming the PV was already bound, this update
amounts to a no-op.

The PVC is deleted at a later point in time.
```
kubectl -n <namespace> delete pvc <pvc-name>
```

1. A `deletionTimestamp` is added on the PVC object.
2. The PVC-protection-controller picks up the update and verifies that a `deletionTimestamp` is present and that no
pods are currently using the PVC.
3. If no pods are using the PVC, the PVC finalizers are removed, eventually triggering a PVC delete event.
4. The PV-PVC-controller processes the PVC delete event, removes the PVC from its in-memory cache, and triggers an
explicit sync of the PV.
5. While processing the explicit sync, the PV-PVC-controller observes that the PVC is no longer available and updates
the PV phase to `Released`.
6. When the PV-protection-controller picks up the update, it observes that the PV has a `deletionTimestamp` and is not
in the `Bound` phase, and removes the PV finalizers.
7. The PV-PVC-controller then initiates the reclaim volume workflow.
8. The reclaim volume workflow observes that the `persistentVolumeReclaimPolicy` is `Delete` and schedules a volume deletion.
9. The delete volume operation observes that the PV already has a `deletionTimestamp`, and the call is never forwarded
to the CSI driver to delete the underlying volume.

As described above, step (9) prevents the volume deletion from being forwarded to the plugin.

Removing the check should allow the volume deletion to proceed.
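
Under the current behavior, the leak can be reproduced as follows (names are placeholders):

```
# For a Bound PV-PVC pair whose reclaim policy is Delete:
kubectl delete pv <pv-name>              # PV gets a deletionTimestamp but stays Bound
kubectl -n <namespace> delete pvc <pvc-name>  # PV object is eventually removed...
# ...but the volume still exists in the storage backend and must be cleaned up manually.
```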

### Risks and Mitigations

The primary risk is volume deletion for users who expect the current behavior, i.e., users who do not expect the volume
to be deleted from the underlying storage infrastructure when, for a bound PV-PVC pair, the PV is deleted followed by
the PVC. This can be mitigated by initially introducing the fix behind a feature gate and then explicitly deprecating
the current behavior.

## Design Details

The code below runs when `persistentVolumeReclaimPolicy` is `Delete`. As part of the proposed fix, removing the
`deletionTimestamp` check would allow the volume deletion to be forwarded to the plugin.

```go
// deleteVolumeOperation deletes a volume. This method is running in standalone
// goroutine and already has all necessary locks.
func (ctrl *PersistentVolumeController) deleteVolumeOperation(volume *v1.PersistentVolume) (string, error) {
	klog.V(4).Infof("deleteVolumeOperation [%s] started", volume.Name)

	// This method may have been waiting for a volume lock for some time.
	// Previous deleteVolumeOperation might just have saved an updated version, so
	// read current volume state now.
	newVolume, err := ctrl.kubeClient.CoreV1().PersistentVolumes().Get(context.TODO(), volume.Name, metav1.GetOptions{})
	if err != nil {
		klog.V(3).Infof("error reading persistent volume %q: %v", volume.Name, err)
		return "", nil
	}

	if newVolume.GetDeletionTimestamp() != nil { // <========================== Remove this check.
		klog.V(3).Infof("Volume %q is already being deleted", volume.Name)
		return "", nil
	}
	needsReclaim, err := ctrl.isVolumeReleased(newVolume)
	if err != nil {
		klog.V(3).Infof("error reading claim for volume %q: %v", volume.Name, err)
		return "", nil
	}
	if !needsReclaim {
		klog.V(3).Infof("volume %q no longer needs deletion, skipping", volume.Name)
		return "", nil
	}

	pluginName, deleted, err := ctrl.doDeleteVolume(volume)
	// Remaining code below skipped for brevity...
	// Please refer to pkg/controller/volume/persistentvolume/pv_controller.go in kubernetes/kubernetes for the full code.

}
```
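
As a sketch (not the final implementation), the early return could instead be skipped only while the proposed
`HonorPvReclaim` feature gate is disabled; the `features.HonorPvReclaim` constant is assumed to be registered following
the usual feature-gate pattern in `pkg/features`:

```go
// Assumed imports (standard Kubernetes feature-gate plumbing):
//   utilfeature "k8s.io/apiserver/pkg/util/feature"
//   "k8s.io/kubernetes/pkg/features"
//
// Replacement for the highlighted check inside deleteVolumeOperation: keep
// the legacy early return only while the HonorPvReclaim gate is off, so that
// with the gate on, the deletion is forwarded to the plugin/CSI driver.
if newVolume.GetDeletionTimestamp() != nil &&
	!utilfeature.DefaultFeatureGate.Enabled(features.HonorPvReclaim) {
	klog.V(3).Infof("Volume %q is already being deleted", volume.Name)
	return "", nil
}
```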

The proposed fix will not have side effects on existing code flows, because `deleteVolumeOperation` is scheduled with an
identifier generated from the name and UID of the volume being deleted, and duplicate identifiers are skipped (see the
sketch after the snippet below).

```go
	case v1.PersistentVolumeReclaimDelete:
		klog.V(4).Infof("reclaimVolume[%s]: policy is Delete", volume.Name)
		opName := fmt.Sprintf("delete-%s[%s]", volume.Name, string(volume.UID))
		// create a start timestamp entry in cache for deletion operation if no one exists with
		// key = volume.Name, pluginName = provisionerName, operation = "delete"
		ctrl.operationTimestamps.AddIfNotExist(volume.Name, ctrl.getProvisionerNameFromVolume(volume), "delete")
		ctrl.scheduleOperation(opName, func() error {
			_, err := ctrl.deleteVolumeOperation(volume)
			if err != nil {
				// only report error count to "volume_operation_total_errors"
				// latency reporting will happen when the volume get finally
				// deleted and a volume deleted event is captured
				metrics.RecordMetric(volume.Name, &ctrl.operationTimestamps, err)
			}
			return err
		})
```
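
For reference, the deduplication that `scheduleOperation` relies on looks roughly like the following sketch, based on
the controller's goroutine map (`pkg/util/goroutinemap`), which runs at most one operation per name:

```go
// Assumed context: ctrl.runningOperations is the controller's
// goroutinemap.GoRoutineMap. Run starts the operation in its own goroutine
// and returns an "already exists" error when an operation with the same
// name is still running.
err := ctrl.runningOperations.Run(opName, operation)
if err != nil && goroutinemap.IsAlreadyExists(err) {
	// A delete for this volume is already in flight; skip the duplicate.
	klog.V(4).Infof("operation %q is already running, skipping", opName)
}
```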

### Test Plan

#### E2E tests
An e2e test that exercises deletion of the PV prior to deletion of the PVC for a `Bound` PV-PVC pair.

### Graduation Criteria

#### Alpha
* Gather feedback.
* Fix the issue behind a feature flag.
* Implement the unit tests and e2e tests outlined in the design proposal.

#### Alpha -> Beta
* Allow the Alpha fix to soak for one release.

#### Beta -> GA
* Deployed in production and gone through at least one K8s upgrade.

### Upgrade / Downgrade Strategy
* Upgrade from old Kubernetes (1.21) to new Kubernetes (1.22) with the `HonorPvReclaim` feature gate enabled.

In this scenario, the upgrade changes the default behavior as described in this KEP. Additionally, no attempt is made
to delete volumes that were previously leaked as a result of the old behavior.

* Downgrade from new Kubernetes (1.22) to old Kubernetes (1.21).

In this scenario, the downgrade retains the current behavior.

### Version Skew Strategy
No issues; the fix resides in a single component, `kube-controller-manager`.

## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

###### How can this feature be enabled / disabled in a live cluster?

- [x] Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: `HonorPvReclaim`
  - Components depending on the feature gate: kube-controller-manager
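
For example, the gate would be enabled on the controller manager using the standard feature-gate flag syntax (any
other configured gates would be listed alongside it):

```
kube-controller-manager --feature-gates=HonorPvReclaim=true ...
```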

###### Does enabling the feature change any default behavior?

Enabling the feature will delete the volume from the underlying storage when, for a bound PV-PVC pair whose PV reclaim
policy is `Delete`, the PV is deleted followed by the PVC.
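
To see whether a given PV would be affected, its reclaim policy can be inspected directly, for example:

```
kubectl get pv <pv-name> -o jsonpath='{.spec.persistentVolumeReclaimPolicy}'
```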

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes. Disabling the feature flag restores the previous behavior.

###### What happens if we reenable the feature if it was previously rolled back?
The new behavior described above takes effect again.

###### Are there any tests for feature enablement/disablement?

There will be unit tests for enablement/disablement of the `HonorPvReclaim` feature.

### Scalability

###### Will enabling / using this feature result in any new API calls?
No.

###### Will enabling / using this feature result in introducing new API types?
No.

###### Will enabling / using this feature result in any new calls to the cloud provider?
Yes. Previously, the delete volume call would not reach the provider under the described circumstances; with this
change, volume deletion will be invoked to remove the volume from the underlying storage.

###### Will enabling / using this feature result in increasing size or count of the existing API objects?
No.

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No.

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
No.

## Implementation History

1.22: Alpha

## Drawbacks
None. The current behavior could itself be considered a drawback, and this KEP presents the fix for it. The current
behavior was reported in [Issue-546](https://github.com/kubernetes-csi/external-provisioner/issues/546) and [Issue-195](https://github.com/kubernetes-csi/external-provisioner/issues/195).

## Alternatives
No other approaches to the fix have been considered; alternatives will be evaluated if presented during the KEP review.