SSA failure after removal of old API versions #10051
Comments
Original Slack thread: https://kubernetes.slack.com/archives/C8TSNPY4T/p1706114063308299
/triage accepted This looks bad - thanks for reporting it. Need to get time to reproduce.
Q: do we know which K8s version was used when the object was created/updated with v1alpha3 (around "2021-09-24T14:24:20Z"), and which K8s version we had when the object was last updated (around "2023-12-15T14:02:26Z")? This could help reproduce.
General comment (without having done any investigation): another data point is that, in this example, CAPI is doing a plain patch (not an SSA patch), but the issue about version removal leads to errors when something else does SSA.
Last note: it might be worth asking for guidance in API machinery, but at this stage I even struggle to explain the problem properly...
/assign
I could reproduce it locally and have some ideas around mitigating it. Need some more time to provide more details.
Some more data. I could see old apiVersions in managedFields in these jobs:
Seems like it's a pretty regular case that old apiVersions remain in managedFields. The issue then occurs as soon as the apiVersion (e.g. v1alpha3 / v1alpha4) is entirely dropped from the CRDs, because then conversions cannot be handled anymore and (at least) server-side apply requests start to fail. Here is a full example to reproduce the issue from scratch:
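The actual reproduction files are in the gist linked below; as a rough sketch of the shape of that sequence (the manifest file names here are placeholders, not the files from the gist):

```sh
# 1. Install a CRD that serves both an old and a new version (e.g. v1alpha4 and v1beta1).
kubectl apply -f crd-old-and-new-version.yaml

# 2. Server-side apply an object at the old version, so that managedFields records
#    an entry with the old apiVersion.
kubectl apply --server-side -f object-old-version.yaml

# 3. After migrating stored objects and pruning the old version from
#    status.storedVersions, drop the old version from spec.versions entirely.
kubectl apply -f crd-new-version-only.yaml

# 4. Server-side apply the same object at the new version: the request fails,
#    because the apiserver can no longer convert the stale entry for the removed
#    apiVersion that is still present in metadata.managedFields.
kubectl apply --server-side -f object-new-version.yaml
```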
The situation can be recovered via:
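One way such a recovery can look (a sketch, not necessarily the exact steps used here; it relies on the documented behaviour that overwriting metadata.managedFields with a list containing a single empty entry through a regular, non-apply write resets the field ownership information; name and namespace are placeholders):

```sh
# Resetting managedFields via a merge patch drops the stale entries that still
# reference the removed apiVersion, after which server-side apply works again.
kubectl patch machinedeployment <name> -n <namespace> --type=merge \
  -p '{"metadata":{"managedFields":[{}]}}'
```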
Files can be downloaded here: https://gist.github.com/sbueringer/4301e6160e3cfc35b57ff517599efe10
I'm now working on:
cc @killianmuldoon @fabriziopandini @chrischdi (just fyi)
@sbueringer This also seems like the cause of the flake that we see in clusterctl upgrade tests upgrading from older releases.
No, I think there is a difference between the two errors. The former means the kube-apiserver tries to communicate with the webhook but the webhook is not reachable. The latter means the kube-apiserver doesn't even find the apiVersion in the CRD.
I wrote a hacky version of an upgrade test to reproduce the issue: #10146 (ran as part of e2e-blocking). Now implementing a short-term workaround.
/reopen
@vincepri: Reopened this issue. In response to this:
does this mean existing persisted objects were touched to update the stored version to a later version, and v1alpha3 was dropped from spec.versions and status.storedVersions in the CRD?
cc @jpbetz for an aspect that seems missing from https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/#upgrade-existing-objects-to-a-new-stored-version ... I'm not sure what I would tell someone to do to update existing managedFields tied to no-longer-existing versions, or how to update those as part of storage migration.
this looks like kubernetes/kubernetes#111937, which would possibly be resolved by kubernetes/kubernetes#115650 (but kubernetes/kubernetes#115650 needs close review to make sure there are no other undesirable side effects)
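For context, a common way to perform the storage migration that page describes is to read every existing object and write it back unchanged, which re-persists it at the current storage version but leaves metadata.managedFields untouched. A rough sketch of that step, using MachineDeployments as an example:

```sh
# Re-persist every MachineDeployment at the current storage version by reading it
# and writing it back unchanged. This does not rewrite the apiVersions recorded
# in metadata.managedFields, which is why stale entries survive the migration.
kubectl get machinedeployments --all-namespaces -o json | kubectl replace -f -
```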
Sharing more information that Fabrizio shared in the office hours just now: A nasty issue requires an urgent fix for CAPI v1.6.x and main; this fix will be included in the next patch release, which we deferred to next week to give some time to get the fix merged.
Yes. We had a multi-step migration process:
After step 4 we started getting the errors described in this issue.
Agree. This looks exactly like #111937 (cc @MadhavJivrajani, just fyi :))
This issue is labeled with You can:
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
/triage accepted
What steps did you take and what happened?
We have some CAPI clusters that have been around for quite a while. After upgrading to CAPI v1.6.x (so after the v1alpha3 removal and v1alpha4 deprecation) we saw server-side apply issues pop up, more specifically Flux not being able to apply our resources anymore.
We have seen this on MachineDeployments, MachineHealthChecks and also VSphereClusters (those are CAPV, which also had API version removals in the latest version). I'm adding the MD here as an example.
The same behaviour occurs when manually using kubectl apply --server-side; it does, however, not happen when not using SSA.
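For illustration, a minimal way to see this contrast (a sketch; machinedeployment.yaml is a placeholder for the MachineDeployment manifest from git shown below):

```sh
# Server-side apply of the manifest fails with the conversion error shown below.
kubectl apply --server-side -f machinedeployment.yaml

# A regular client-side apply of the same manifest still succeeds.
kubectl apply -f machinedeployment.yaml
```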
MD in API-Server:
MD in git (so the one flux tries to SSA):
error message:
What did you expect to happen?
SSA continues to work and doesn't reference removed apiVersions.
Cluster API version
v1.6.1
Kubernetes version
No response
Anything else you would like to add?
No response
Label(s) to be applied
/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.