SSA failure after removal of old API versions #10051
Comments
Original Slack thread: https://kubernetes.slack.com/archives/C8TSNPY4T/p1706114063308299
/triage accepted This looks bad - thanks for reporting it. Need to get time to reproduce.
Q: do we know which K8s version was used when the object was created/updated with v1alpha3 (around "2021-09-24T14:24:20Z"), and which K8s version we had when the object was last updated (around "2023-12-15T14:02:26Z")? This could help reproduce.
General comment (without having done any investigation): another data point is that, in this example, CAPI is doing a plain patch (not an SSA patch), but the issue about version removal leads to errors when something else does SSA.
Last note: it might be worth asking for guidance in API machinery, but at this stage I even struggle to explain the problem properly...
/assign
I could reproduce it locally and have some ideas around mitigating it. Need some more time to provide more details.
Some more data. I could see old apiVersions in managedFields in these jobs:
Seems like it's a pretty regular case that old apiVersions remain in managedFields. The issue then occurs as soon as the apiVersion (e.g. v1alpha3 / v1alpha4) is entirely dropped from the CRDs, because then conversions cannot be handled anymore and (at least) server-side apply requests start to fail. Here is a full example to reproduce the issue from scratch:
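The actual reproduction files are in the gist linked below; as a rough sketch of the shape of that sequence (the manifest file names here are placeholders, not the files from the gist):

```sh
# 1. Install a CRD that serves both an old and a new version (e.g. v1alpha4 and v1beta1).
kubectl apply -f crd-old-and-new-version.yaml

# 2. Server-side apply an object at the old version, so that managedFields records
#    an entry with the old apiVersion.
kubectl apply --server-side -f object-old-version.yaml

# 3. After migrating stored objects and pruning the old version from
#    status.storedVersions, drop the old version from spec.versions entirely.
kubectl apply -f crd-new-version-only.yaml

# 4. Server-side apply the same object at the new version: the request fails,
#    because the apiserver can no longer convert the stale entry for the removed
#    apiVersion that is still present in metadata.managedFields.
kubectl apply --server-side -f object-new-version.yaml
```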
The situation can be recovered via:
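One way such a recovery can look (a sketch, not necessarily the exact steps used here; it relies on the documented behaviour that overwriting metadata.managedFields with a list containing a single empty entry through a regular, non-apply write resets the field ownership information; name and namespace are placeholders):

```sh
# Resetting managedFields via a merge patch drops the stale entries that still
# reference the removed apiVersion, after which server-side apply works again.
kubectl patch machinedeployment <name> -n <namespace> --type=merge \
  -p '{"metadata":{"managedFields":[{}]}}'
```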
Files can be downloaded here: https://gist.github.com/sbueringer/4301e6160e3cfc35b57ff517599efe10
I'm now working on:
cc @killianmuldoon @fabriziopandini @chrischdi (just fyi)
@sbueringer This also seems like the cause of the flake that we see in clusterctl upgrade tests upgrading from older releases.
No, I think there is a difference between the two errors. The former means the kube-apiserver tries to communicate with the webhook but the webhook is not reachable. The latter means the kube-apiserver doesn't even find the apiVersion in the CRD.
I wrote a hacky version of an upgrade test to reproduce the issue: #10146 (ran as part of e2e-blocking). Now implementing a short-term workaround.
/reopen
@vincepri: Reopened this issue. In response to this:
does this mean existing persisted objects were touched to update the stored version to a later version, and v1alpha3 was dropped from spec.versions and status.storedVersions in the CRD?
cc @jpbetz for an aspect that seems missing from https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/#upgrade-existing-objects-to-a-new-stored-version ... I'm not sure what I would tell someone to do to update existing managedFields tied to no-longer-existing versions, or how to update those as part of storage migration.
this looks like kubernetes/kubernetes#111937, which would possibly be resolved by kubernetes/kubernetes#115650 (but kubernetes/kubernetes#115650 needs close review to make sure there are no other undesirable side effects)
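For context, a common way to perform the storage migration that page describes is to read every existing object and write it back unchanged, which re-persists it at the current storage version but leaves metadata.managedFields untouched. A rough sketch of that step, using MachineDeployments as an example:

```sh
# Re-persist every MachineDeployment at the current storage version by reading it
# and writing it back unchanged. This does not rewrite the apiVersions recorded
# in metadata.managedFields, which is why stale entries survive the migration.
kubectl get machinedeployments --all-namespaces -o json | kubectl replace -f -
```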
Sharing more information that Fabrizio shared in the office hours just now: A nasty issue requires an urgent fix for CAPI v1.6.x and main; this fix will be included in the next patch release, which we deferred to next week to give some time to get the fix merged.
Yes. We had a multi-step migration process:
After step 4 we started getting the errors described in this issue.
Agree. This looks exactly like #111937 (cc @MadhavJivrajani, just fyi :))
This issue is labeled with You can:
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
/triage accepted
What steps did you take and what happened?
We have some CAPI clusters that have been around for quite a while. After upgrading to CAPI v1.6.x (so after the v1alpha3 removal and v1alpha4 deprecation) we saw server-side apply issues pop up, more specifically Flux not being able to apply our resources anymore.
We have seen this on MachineDeployments, MachineHealthChecks and also VSphereClusters (those are CAPV, which also had API version removals in the latest version). I'm adding the MD here as an example.
The same behaviour occurs when manually using kubectl apply --server-side; it does, however, not happen when not using SSA.
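For illustration, a minimal way to see this contrast (a sketch; machinedeployment.yaml is a placeholder for the MachineDeployment manifest from git shown below):

```sh
# Server-side apply of the manifest fails with the conversion error shown below.
kubectl apply --server-side -f machinedeployment.yaml

# A regular client-side apply of the same manifest still succeeds.
kubectl apply -f machinedeployment.yaml
```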
MD in API-Server:
MD in git (so the one flux tries to SSA):
error message:
What did you expect to happen?
SSA continues to work and doesn't reference removed apiVersions.
Cluster API version
v1.6.1
Kubernetes version
No response
Anything else you would like to add?
No response
Label(s) to be applied
/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.