
Upgrade to 1.6.1 prevents applying Serving and Eventing CRs using server-side apply #1184

Open
tshak opened this issue Aug 17, 2022 · 16 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@tshak

tshak commented Aug 17, 2022

After upgrading to Operator 1.6.1, some Serving and Eventing CRs can no longer be applied using server-side apply. This appears to happen on clusters that at one point had pre-v1beta1 CRs.

Repro Steps

  1. Install Operator 1.2.2
  2. Apply a Serving or Eventing CR at v1alpha1 using server-side apply (e.g. kubectl apply --server-side=true)
  3. Upgrade to Operator version 1.5.3 (note that this also repros when upgrading through 1.3.2 and 1.4.1 as intermediate steps)
  4. Update CRs to use v1beta1 using server-side apply
  5. Upgrade to Operator version 1.6.1
  6. Perform a server-side apply on any CR (no changes necessary)

Expected

Server-side apply succeeds.

Actual

Server-side apply fails with the following error:

Error from server: request to convert CR to an invalid group/version: operator.knative.dev/v1alpha1

Additional Information

This was reproduced on live clusters that have used the Operator since at least release v1.0. This appears to be due to a lingering metadata.managedFields entry that references operator.knative.dev/v1alpha1. You can find a script that easily reproduces this issue in this repo.
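For illustration, the kind of check involved can be sketched as follows (a Python sketch, not the operator's code; `stale_managers` and the sample values are hypothetical, but the entry shape follows the Kubernetes `managedFields` schema as seen in `kubectl get knativeserving -o json`):

```python
# Sketch: find managedFields entries that still reference the old
# group/version on a CR's metadata. Values are illustrative.
OLD_API_VERSION = "operator.knative.dev/v1alpha1"

def stale_managers(metadata: dict, old_api_version: str = OLD_API_VERSION):
    """Return manager names whose managedFields entry pins the old apiVersion."""
    return [
        entry.get("manager")
        for entry in metadata.get("managedFields", [])
        if entry.get("apiVersion") == old_api_version
    ]

metadata = {
    "name": "knative-serving",
    "managedFields": [
        {"manager": "kubectl", "apiVersion": "operator.knative.dev/v1alpha1"},
        {"manager": "knative-operator", "apiVersion": "operator.knative.dev/v1beta1"},
    ],
}
print(stale_managers(metadata))  # ['kubectl']
```

Any CR where this list is non-empty will hit the conversion error above on the next server-side apply.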

@tshak tshak added the kind/bug Categorizes issue or PR as related to a bug. label Aug 17, 2022
@houshengbo
Contributor

houshengbo commented Aug 17, 2022

https://github.com/tshak/knoperator-v1beta1-ssa-repro/pull/1 This is the change I requested: upgrade the operator incrementally, one minor version at a time.

Give it more time, e.g. 30s, between each installation and before applying the v1beta1 CR.
@tshak

I tested it with docker-desktop, and it worked.

Regarding the managedFields, I am not sure whether waiting longer before applying the v1beta1 CR can resolve your issue.

@tshak
Author

tshak commented Aug 17, 2022

I do not think it's a timing issue: our automation in the live cluster retries applying the CR and was still failing many hours after the Operator upgrade to v1.6.1. For the synthetic repro, I ran your PR and am still seeing the issue in a brand-new Kind cluster.

@houshengbo
Contributor

@evankanderson @dprotaso

Here is what we can possibly do with the migrateresource function in knative/pkg. Currently we do:

_, err := client.Namespace(item.GetNamespace()).
			Patch(ctx, item.GetName(), types.MergePatchType, []byte("{}"), metav1.PatchOptions{})

to migrate the existing CRs from the old to the new version.
Can we change it to:

_, err := client.Namespace(item.GetNamespace()).
			Patch(ctx, item.GetName(), types.MergePatchType, []byte(`{"metadata":{"managedFields": [{}]}}`), metav1.PatchOptions{})

to clear the managedFields and resolve this issue, where a managedFields entry left over from the old version blocks applying the CR at the new version?
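For context on why that patch body can work: with a JSON merge patch (RFC 7386), a list in the patch replaces the target's list wholesale, and the API server treats setting managedFields to `[{}]` as a request to reset field management. A minimal sketch of the merge-patch semantics (illustrative Python, not client-go; `json_merge_patch` is a hypothetical helper):

```python
def json_merge_patch(target, patch):
    """Minimal RFC 7386 JSON merge patch: dicts merge recursively,
    everything else (including lists) replaces the target wholesale."""
    if not isinstance(patch, dict):
        return patch
    if not isinstance(target, dict):
        target = {}
    result = dict(target)
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)  # null in a merge patch deletes the key
        else:
            result[key] = json_merge_patch(result.get(key), value)
    return result

cr = {"metadata": {"name": "knative-serving",
                   "managedFields": [{"manager": "kubectl",
                                      "apiVersion": "operator.knative.dev/v1alpha1"}]}}
patched = json_merge_patch(cr, {"metadata": {"managedFields": [{}]}})
print(patched["metadata"]["managedFields"])  # [{}]
```

Because merge patch replaces lists wholesale, this clears every managedFields entry, not just the stale one; other metadata such as the name is untouched.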

@tshak
Author

tshak commented Aug 18, 2022

This should solve the issue. I'm not sure if there are any unintended consequences of deleting all managedFields entries. The best solution is probably to remove only the operator.knative.dev/v1alpha1 entry. I'm not very opinionated either way, though.
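The more targeted variant described here would keep other managers' entries intact and drop only the stale one. Sketched in Python (a hypothetical helper for illustration, not knative/pkg code):

```python
def drop_stale_managed_fields(metadata: dict, old_api_version: str) -> dict:
    """Return a copy of metadata keeping only the managedFields entries
    whose apiVersion differs from the stale group/version."""
    kept = [
        entry for entry in metadata.get("managedFields", [])
        if entry.get("apiVersion") != old_api_version
    ]
    return {**metadata, "managedFields": kept}

metadata = {
    "managedFields": [
        {"manager": "kubectl", "apiVersion": "operator.knative.dev/v1alpha1"},
        {"manager": "knative-operator", "apiVersion": "operator.knative.dev/v1beta1"},
    ],
}
cleaned = drop_stale_managed_fields(metadata, "operator.knative.dev/v1alpha1")
print([e["manager"] for e in cleaned["managedFields"]])  # ['knative-operator']
```

Note that the API server only accepts direct managedFields updates in specific forms, so an operator implementing this would still need to send the result through an update or patch the server permits.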

@dprotaso
Member

My guess is the post-install job isn't being run after installing 1.5.3

https://github.com/knative/operator/blob/knative-v1.5.3/config/post-install/storage-version-migrator.yaml

This should migrate the CRD to a new storage version

@houshengbo
Contributor

@dprotaso I actually moved the implementation into the operator controller itself in 1.5.3.
So we do not need to run the migration job; the operator does it automatically when it is launched, which is equivalent to running the post-install job.
What I am proposing here is to clear the managedFields, or to remove only the operator.knative.dev/v1alpha1 entry, as @tshak said.

@tshak
Author

tshak commented Aug 18, 2022

It is correct that we did not run the post-install job. My contention is that this goes against the biggest value proposition of the Operator pattern: automating resource lifecycle issues such as this one.

@dprotaso
Member

dprotaso commented Aug 18, 2022

This seems like a k8s bug - let me see if I can repro with a trivial example

@tshak
Author

tshak commented Aug 19, 2022

I agree that it's probably a k8s bug. If a managedFields entry has an invalid apiVersion, it should not prevent a server-side apply. There may be another bug around API conversions missing the managedFields entry in the first place. Regardless, I still think that the Operator should perform either of the workarounds proposed by @houshengbo.

@houshengbo
Contributor

@tshak Another way, without touching knative/pkg, could be for the Knative Operator to access only the Serving and Eventing CRs and fix the managedFields for both of them, if they exist.

@dprotaso
Member

Created the upstream issue based on this Slack discussion:

kubernetes/kubernetes#111937

@github-actions

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 18, 2022
@knative-prow-robot
Contributor

This issue or pull request is stale because it has been open for 90 days with no activity.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close

/lifecycle stale

@github-actions github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 19, 2022

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 21, 2023
@dprotaso
Member

/reopen
/lifecycle frozen

The issue isn't resolved upstream

@knative-prow knative-prow bot reopened this Apr 20, 2023
@knative-prow

knative-prow bot commented Apr 20, 2023

@dprotaso: Reopened this issue.

In response to this:

/reopen
/lifecycle frozen

The issue isn't resolved upstream


@knative-prow knative-prow bot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 20, 2023