Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-3926: handling undecryptable resources #3927

Merged
merged 1 commit into from
Oct 5, 2023

Conversation

stlaz
Copy link
Member

@stlaz stlaz commented Mar 27, 2023

  • One-line PR description: Improve the identification of resources that won't decrypt. Make it possible to remove undecryptable resources by using the API.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 27, 2023
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/auth Categorizes an issue or PR as relevant to SIG Auth. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 27, 2023
@stlaz stlaz mentioned this pull request Mar 27, 2023
4 tasks
@stlaz stlaz force-pushed the failed_transform branch from 512970f to f78f0e1 Compare March 27, 2023 10:48
@stlaz
Copy link
Member Author

stlaz commented Mar 27, 2023

/cc @ibihim @deads2k @enj

@k8s-ci-robot k8s-ci-robot requested review from deads2k, enj and ibihim March 27, 2023 10:49
@stlaz stlaz force-pushed the failed_transform branch from f78f0e1 to 3ff9a45 Compare March 27, 2023 10:58
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 27, 2023
@stlaz
Copy link
Member Author

stlaz commented Mar 27, 2023

kubernetes/kubernetes#116943 shows a proof of concept for the list options proposed in this KEP.

// doing.
// WARNING: Vendors will most likely consider using this option to be breaking the
// support of their product.
UnconditionalDeleteWithClusterBreakingPotential bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's unclear what this API lets you skip... I would not expect to be allowed to skip admission or finalizers if the existing persisted object can be decoded successfully

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 - I thikn it should only allow deleting objects (with all the consequences) that we cannot decode. Nothing else should be allowed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to get our lingo synchronized - I would consider decoding and transforming two separate cases (you'll see why in code ref below).

What about objects we cannot retrieve from storage for other reasons?

The generic storage Get implementation fails in one additional case, that is if key cannot be constructed - https://github.com/kubernetes/kubernetes/blob/3cf9f66e90d560ac080687610933c712bcf37b39/staging/src/k8s.io/apiserver/pkg/registry/generic/registry/store.go#L756-L769

The etcd3 store implementation fails in 6 other cases:
https://github.com/kubernetes/kubernetes/blob/3cf9f66e90d560ac080687610933c712bcf37b39/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go#L132-L166

Most of the cases indeed seem like errors we may not want to ignore. Should we focus solely on the transformation error, then?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we focus solely on the transformation error, then?

I don't think so... transformation and decoding failures have almost identical implications, symptoms, and considerations for deleting anyway. Solving both problems together and consistently makes sense to me

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 3, 2023
@stlaz stlaz force-pushed the failed_transform branch from a80bf28 to 3b808a1 Compare April 3, 2023 11:11
@stlaz
Copy link
Member Author

stlaz commented Apr 20, 2023

@liggitt @wojtek-t @dgrisonnet The addressable comments were addressed, please take a look when you get a chance

Currently, removing a resource that causes such failures is not possible.
A cluster administrator must access etcd directly and remove the malformed data manually.

This KEP proposes a way to identify resources that fail to decrypt, and introduces
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the mechanics of this seem to belong more to api-machinery (who owns the list and delete functionality that would be modified as part of this) ... there are multiple reasons a resource could fail to decode from storage, decryption is only one of them (e.g. kubernetes/kubernetes#69579)

// doing.
// WARNING: Vendors will most likely consider using this option to be breaking the
// support of their product.
UnconditionalDeleteWithClusterBreakingPotential bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we focus solely on the transformation error, then?

I don't think so... transformation and decoding failures have almost identical implications, symptoms, and considerations for deleting anyway. Solving both problems together and consistently makes sense to me

@liggitt liggitt self-assigned this Jul 5, 2023
@enj enj added this to the v1.29 milestone Sep 7, 2023
- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: make sure to check (R) items above.

- Impact of its degraded performance or high-error rates on the feature:
-->

### Scalability
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean more about the scalability impact overall, not only focusing on the feature. Imagine a cluster with hundreds of nodes, where the this feature is enabled, and then look at the questions below and see if you can answer them in a reasonable fashion. It doesn't have to be perfect, it's not a required step for alpha, and based on the initial testing at alpha you'll be able to expand on it when promoting to beta 😄

- Impact of its degraded performance or high-error rates on the feature:
-->

### Scalability
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, David's question is a good way to answer the question: Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? and limiting the number of values you handle at once is a potential safe guard to ensure that CPU and/or RAM isn't increased.

- Impact of its degraded performance or high-error rates on the feature:
-->

### Scalability
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A different question from that set is about SLI/SLO, is it possible that the extra operations required to force delete all the resources will affect the guaranteed time to delete a resource, will it be negligible or the expected increase is significant, etc. I hope these examples will help 😉

@stlaz stlaz force-pushed the failed_transform branch from c294537 to 0c0a5f6 Compare October 5, 2023 10:37
@stlaz
Copy link
Member Author

stlaz commented Oct 5, 2023

I squashed the previous changes and addressed all the most recent comments in the "address comments" commit.

Copy link
Contributor

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#prr-shadow
lgtm - thx :)

@deads2k
Copy link
Contributor

deads2k commented Oct 5, 2023

/label tide-merge-method-squash
/approve

@k8s-ci-robot
Copy link
Contributor

@deads2k: The label(s) /label tide-merge-method-squash cannot be applied. These labels are supported: api-review, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, team/katacoda, refactor, lead-opted-in, tracked/no, tracked/out-of-tree, tracked/yes. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to this:

/label tide-merge-method-squash
/approve

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, stlaz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 5, 2023
@stlaz stlaz force-pushed the failed_transform branch from 0c0a5f6 to c32d3dc Compare October 5, 2023 13:24
@stlaz
Copy link
Member Author

stlaz commented Oct 5, 2023

(I squashed the changes to a single commit in the latest force push)

@deads2k
Copy link
Contributor

deads2k commented Oct 5, 2023

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Oct 5, 2023

The unconditional deletion admission:
1. checks if a "delete" request contains the `IgnoreStoreReadErrorWithClusterBreakingPotential` option
2. if it does, it checks the RBAC of the request's user for the `delete-ignore-read-errors` verb of the given resource
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we align these two words, the option and the verb?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tkashem this is still open in the implementation.

// doing.
// WARNING: Vendors will most likely consider using this option to be breaking the
// support of their product.
IgnoreStoreReadErrorWithClusterBreakingPotential bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could this be withoutReadingFromStorage, i.e. a mode of deletion that has nothing to do with errors but is just highly unconditional deletion? (disclaimer: have only read the kep briefly, maybe I missed it)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at today's DeleteOptions, this could even be called unconditional or force.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xref old thread around this topic #3927 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worth to summarize as a non-goal:

  • give clients control over skipping other steps of a delete request flow than decoding errors

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worth to summarize as a non-goal:

  • give clients control over skipping other steps of a delete request flow than decoding errors

Agree.

@deads2k
Copy link
Contributor

deads2k commented Oct 5, 2023

/lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/auth Categorizes an issue or PR as relevant to SIG Auth. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
Status: API review completed, 1.29
Archived in project
Development

Successfully merging this pull request may close these issues.