KEP-3926: handling undecryptable resources #3927
Conversation
stlaz commented on Mar 27, 2023
- One-line PR description: Improve the identification of resources that won't decrypt. Make it possible to remove undecryptable resources by using the API.
- Issue link: Handling undecryptable resources #3926
- Other comments: Enable deleting API objects even when storage-level decryption is not working properly kubernetes#86489
kubernetes/kubernetes#116943 shows a proof of concept for the list options proposed in this KEP.
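To make the intended end state concrete, here is a minimal client-go sketch of the kind of API-driven cleanup the KEP is after. The kubeconfig handling and the `broken-secret` name are assumptions for illustration, and the `DeleteOptions` field (one of the draft names from this thread) is left commented out because it does not exist in released client-go.

```go
// Hedged sketch only: assumes a standard kubeconfig; the proposed
// DeleteOptions field is hypothetical and therefore commented out.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Today this fails for an undecryptable object, because the apiserver has
	// to read and decode the stored object before it can delete it.
	err = client.CoreV1().Secrets("default").Delete(
		context.TODO(), "broken-secret", metav1.DeleteOptions{})
	fmt.Printf("regular delete: %v\n", err)

	// The KEP proposes an explicit opt-in on DeleteOptions, roughly:
	//   opts := metav1.DeleteOptions{UnconditionalDeleteWithClusterBreakingPotential: true}
	//   err = client.CoreV1().Secrets("default").Delete(context.TODO(), "broken-secret", opts)
}
```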
```go
// doing.
// WARNING: Vendors will most likely consider using this option to be breaking the
// support of their product.
UnconditionalDeleteWithClusterBreakingPotential bool
```
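For readers skimming the diff, a rough sketch of what honoring such an option could mean in a store's delete path. This is a guess at the semantics under discussion, not the actual generic-registry code; `decodeFromStorage`, `gracefulDelete`, and `rawDelete` are hypothetical stand-ins.

```go
package storesketch

import (
	"context"
	"fmt"
)

// DeleteOptions carries only the draft flag quoted above.
type DeleteOptions struct {
	UnconditionalDeleteWithClusterBreakingPotential bool
}

// store is a hypothetical stand-in for the generic registry store; the
// function fields mimic the real read/transform/decode and raw-delete steps.
type store struct {
	decodeFromStorage func(ctx context.Context, key string) (interface{}, error)
	gracefulDelete    func(ctx context.Context, key string, obj interface{}) error
	rawDelete         func(ctx context.Context, key string) error
}

// Delete shows one plausible reading of the opt-in: skip the read/decode of
// the persisted object and remove the raw key directly.
func (s *store) Delete(ctx context.Context, key string, opts DeleteOptions) error {
	if !opts.UnconditionalDeleteWithClusterBreakingPotential {
		// Normal path: the stored object is transformed (decrypted) and
		// decoded so preconditions, finalizers and admission can see it.
		obj, err := s.decodeFromStorage(ctx, key)
		if err != nil {
			return fmt.Errorf("reading %q before delete: %w", key, err)
		}
		return s.gracefulDelete(ctx, key, obj)
	}
	// Opt-in path: delete the raw key without reading it back. Anything that
	// needs the decoded object necessarily cannot run, hence the warning.
	return s.rawDelete(ctx, key)
}
```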
it's unclear what this API lets you skip... I would not expect to be allowed to skip admission or finalizers if the existing persisted object can be decoded successfully
+1 - I think it should only allow deleting objects (with all the consequences) that we cannot decode. Nothing else should be allowed.
Just to get our lingo synchronized - I would consider `decoding` and `transforming` two separate cases (you'll see why in the code refs below).
What about objects we cannot retrieve from storage for other reasons?
The generic storage Get implementation fails in one additional case, that is, if the key cannot be constructed - https://github.com/kubernetes/kubernetes/blob/3cf9f66e90d560ac080687610933c712bcf37b39/staging/src/k8s.io/apiserver/pkg/registry/generic/registry/store.go#L756-L769
The etcd3 store implementation fails in 6 other cases:
https://github.com/kubernetes/kubernetes/blob/3cf9f66e90d560ac080687610933c712bcf37b39/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go#L132-L166
Most of the cases indeed seem like errors we may not want to ignore. Should we focus solely on the transformation error, then?
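To make the decoding/transforming distinction concrete, a condensed sketch of the read path the links above point at; the helper parameters here are made up, but the ordering (raw read, then value transformation such as decryption, then decoding) mirrors the linked etcd3 store code.

```go
package readpath

import (
	"context"
	"fmt"
)

// getState is a hypothetical condensation of the etcd3 store's Get flow,
// kept only to show where the two failure classes under discussion occur.
func getState(
	ctx context.Context,
	fetch func(ctx context.Context, key string) ([]byte, error), // raw etcd read
	transform func(ctx context.Context, data []byte) ([]byte, error), // value transformer (encryption at rest)
	decode func(data []byte) (interface{}, error), // protobuf/JSON decoding
	key string,
) (interface{}, error) {
	raw, err := fetch(ctx, key)
	if err != nil {
		return nil, fmt.Errorf("storage error: %w", err) // e.g. etcd unavailable, key not found
	}
	plain, err := transform(ctx, raw)
	if err != nil {
		return nil, fmt.Errorf("transformation error: %w", err) // e.g. missing or rotated KMS key
	}
	obj, err := decode(plain)
	if err != nil {
		return nil, fmt.Errorf("decoding error: %w", err) // e.g. removed API version, corrupt bytes
	}
	return obj, nil
}
```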
> Should we focus solely on the transformation error, then?
I don't think so... transformation and decoding failures have almost identical implications, symptoms, and considerations for deleting anyway. Solving both problems together and consistently makes sense to me
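One hedged way to picture "solving both problems together" is a single error classification that both failure modes feed into, which the delete path can then check; the names and shape below are illustrative only, not the eventual implementation.

```go
package corrupterr

import "errors"

// corruptObjectError marks a stored object that cannot be read back, whether
// the failure happened while transforming (decrypting) or decoding it.
type corruptObjectError struct {
	key string
	err error
}

func (e *corruptObjectError) Error() string {
	return "corrupt object at " + e.key + ": " + e.err.Error()
}

func (e *corruptObjectError) Unwrap() error { return e.err }

// markCorrupt wraps either kind of read failure in the common type.
func markCorrupt(key string, err error) error {
	return &corruptObjectError{key: key, err: err}
}

// isCorruptObject lets the delete path treat both failure classes uniformly,
// e.g. to decide whether an opt-in unconditional delete may proceed.
func isCorruptObject(err error) bool {
	var c *corruptObjectError
	return errors.As(err, &c)
}
```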
@liggitt @wojtek-t @dgrisonnet The addressable comments have been addressed; please take a look when you get a chance.
Currently, removing a resource that causes such failures is not possible.
A cluster administrator must access etcd directly and remove the malformed data manually.

This KEP proposes a way to identify resources that fail to decrypt, and introduces
the mechanics of this seem to belong more to api-machinery (who owns the list and delete functionality that would be modified as part of this) ... there are multiple reasons a resource could fail to decode from storage, decryption is only one of them (e.g. kubernetes/kubernetes#69579)
- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Nit: make sure to check the `(R)` items above.
- Impact of its degraded performance or high-error rates on the feature:
-->

### Scalability
I mean more about the scalability impact overall, not only focusing on the feature. Imagine a cluster with hundreds of nodes where this feature is enabled, and then look at the questions below and see if you can answer them in a reasonable fashion. It doesn't have to be perfect, it's not a required step for alpha, and based on the initial testing at alpha you'll be able to expand on it when promoting to beta 😄
For example, David's question is a good way to answer the question "Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?", and limiting the number of values you handle at once is a potential safeguard to ensure that CPU and/or RAM usage isn't increased.
A different question from that set is about SLIs/SLOs: is it possible that the extra operations required to force-delete all the resources will affect the guaranteed time to delete a resource? Will it be negligible, or is the expected increase significant? I hope these examples will help 😉
stlaz force-pushed from c294537 to 0c0a5f6
I squashed the previous changes and addressed all the most recent comments in the "address comments" commit.
#prr-shadow
lgtm - thx :)
/label tide-merge-method-squash
@deads2k: The label(s) `tide-merge-method-squash` cannot be applied. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, stlaz

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
stlaz force-pushed from 0c0a5f6 to c32d3dc
(I squashed the changes to a single commit in the latest force push)

/label tide/merge-method-squash
The unconditional deletion admission:
1. checks if a "delete" request contains the `IgnoreStoreReadErrorWithClusterBreakingPotential` option
2. if it does, it checks the RBAC of the request's user for the `delete-ignore-read-errors` verb of the given resource
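A rough illustration of steps 1 and 2 above, using a SubjectAccessReview for the authorization check; the plumbing is simplified, the function and parameter names are made up, and the verb string is the draft one from the quoted text.

```go
package admissionsketch

import (
	"context"
	"fmt"

	authorizationv1 "k8s.io/api/authorization/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// validateUnconditionalDelete sketches the two checks quoted above: only act
// when the opt-in option is present, then require the (draft)
// "delete-ignore-read-errors" verb on the target resource.
func validateUnconditionalDelete(
	ctx context.Context,
	client kubernetes.Interface,
	optionSet bool, // whether the delete request carried the opt-in option
	user, namespace, group, resource, name string,
) error {
	if !optionSet {
		return nil // ordinary delete, nothing extra to check
	}
	sar := &authorizationv1.SubjectAccessReview{
		Spec: authorizationv1.SubjectAccessReviewSpec{
			User: user,
			ResourceAttributes: &authorizationv1.ResourceAttributes{
				Verb:      "delete-ignore-read-errors", // draft verb from the KEP text
				Group:     group,
				Resource:  resource,
				Namespace: namespace,
				Name:      name,
			},
		},
	}
	resp, err := client.AuthorizationV1().SubjectAccessReviews().Create(ctx, sar, metav1.CreateOptions{})
	if err != nil {
		return err
	}
	if !resp.Status.Allowed {
		return fmt.Errorf("user %q is not allowed to ignore store read errors on %s/%s", user, resource, name)
	}
	return nil
}
```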
can we align these two words, the option and the verb?
@tkashem this is still open in the implementation.
```go
// doing.
// WARNING: Vendors will most likely consider using this option to be breaking the
// support of their product.
IgnoreStoreReadErrorWithClusterBreakingPotential bool
```
could this be `withoutReadingFromStorage`, i.e. a mode of deletion that has nothing to do with errors but is just highly unconditional deletion? (disclaimer: have only read the KEP briefly, maybe I missed it)
Looking at today's `DeleteOptions`, this could even be called `unconditional` or `force`.
xref old thread around this topic #3927 (comment)
Worth summarizing as a non-goal:
- give clients control over skipping steps of a delete request flow other than decoding errors
> Worth summarizing as a non-goal:
> - give clients control over skipping steps of a delete request flow other than decoding errors

Agree.
/lgtm