This repository has been archived by the owner on Jun 19, 2022. It is now read-only.

Switch API Groups - pubsub.cloud.google.com -> internal.events.cloud.google.com #905

Closed
Harwayne opened this issue Apr 21, 2020 · 9 comments · Fixed by #1066

@Harwayne
Contributor

Harwayne commented Apr 21, 2020

Problem
Change the API Groups of existing resources.

  • Topic and PullSubscription should be moved to internal.events.cloud.google.com.

Additional context (optional)
Should we make a breaking change and just drop the existing resources? Or should we move things over safely?

@Harwayne Harwayne added the kind/feature-request (New feature or request) label Apr 21, 2020
@grantr grantr added the priority/1 (Blocks current release defined by release/* label or blocks current milestone) and release/1 labels Apr 21, 2020
@akashrv akashrv added this to the v0.15.0-M1 milestone Apr 21, 2020
@Harwayne Harwayne changed the title from "Switch API Groups - pubsub.cloud.google.com -> topics.internal.events.cloud.google.com" to "Switch API Groups - pubsub.cloud.google.com -> internal.events.cloud.google.com" Apr 21, 2020
@Harwayne
Contributor Author

Harwayne commented Apr 22, 2020

Some ideas about how we can make this change.

Note: This assumes that all our controllers atomically swap. Either they are all on the old release or all on the new. Because we only have one replica and can use leader election, I think this is reasonable. But it does mean we also need to think about what happens during downgrades.

1. Ignoring the old

Idea: Pretend the old one doesn't exist at all.

  1. Move the CRD to the new Group.
  2. Users upgrade.
  3. The old CRD and COs continue to exist. The user has to manually delete them. The user also has to delete the now orphaned resources in GCP.

Pros

  • Easy for us, no technical debt.
  • Events are less likely to be lost, as everything will continue to work until a reconcile is required.

Cons

  • Hard to clean up for the user.
  • COs look normal. They don't tell users that they are effectively ghosts. Users would need to read the release notes.

2. Deprecate, Delete, Remove

Idea: Make user do the work, but give them a release to get it done.

  1. Copy the CRDs to the new Group. Adjust all code to use the new groups exclusively.
  2. Keep running the old CRD controllers. Have them mark all the resources with deprecation notices.
  3. Wait a release, let's call it 0.x+1.
  4. Have the Webhook deny creation of any new COs of the old CRDs (see the webhook sketch after this option's cons).
  5. Have the old CRD controllers delete the underlying resources.
  • Possibly via a secondary controller which simply marks them for deletion.
  6. Wait a release, let's call it 0.x+2.
  7. Delete the old CRD controllers.
  8. Tell the user to delete the old CRDs.
  • An upgrade script could do this too.

Pros

  • Users can see what is going on by inspecting objects (e.g. the CO will have the deprecation warning).
  • All resources are cleaned up.
    • Well, the clean up is attempted. It may fail continuously while release 0.x+1 is installed.

Cons

  • Takes a minimum of three releases.
  • Will likely lead to duplicate events for most of release 0.x+1 because both old and new PullSubscriptions exist.
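
To make step 4 concrete, here is a minimal sketch of the denial logic, assuming a validating admission webhook written directly against k8s.io/api/admission/v1; the package, function, and message text are illustrative, not the project's actual webhook code.

```go
// Hypothetical sketch only; not the project's actual webhook.
package webhook

import (
	"fmt"

	admissionv1 "k8s.io/api/admission/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const oldGroup = "pubsub.cloud.google.com"

// denyOldGroupCreates rejects CREATE requests for resources in the deprecated
// API group and allows everything else.
func denyOldGroupCreates(req *admissionv1.AdmissionRequest) *admissionv1.AdmissionResponse {
	if req.Operation == admissionv1.Create && req.Resource.Group == oldGroup {
		return &admissionv1.AdmissionResponse{
			UID:     req.UID,
			Allowed: false,
			Result: &metav1.Status{
				Code:    403,
				Message: fmt.Sprintf("%s/%s is deprecated; use internal.events.cloud.google.com instead", oldGroup, req.Kind.Kind),
			},
		}
	}
	return &admissionv1.AdmissionResponse{UID: req.UID, Allowed: true}
}
```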

3. Doppelgänger

Idea: Have the new PullSubscription controller determine if it is a Doppelgänger and if so, copy the other's GCP Subscription name.

  1. Copy the CRDs to the new Group. Adjust all code to use the new groups exclusively.
  2. The old CRD controller adds a deprecation warning.
  3. The Webhook denies creation of any new COs with the old CRDs.
  4. Add the ability to the new PullSubscription to specify the name of the GCP subscription.
  5. The new CRD controller looks for a Doppelgänger during reconciliation (an old-CRD CO with the same name, namespace, and owner). If one is found, set the GCP subscription name override (see the reconcile sketch after this option's cons).
  6. Wait a release, let's call it 0.x+1.
  7. The old CRD controller:
  • Adds a final leakage warning if there is no Doppelgänger.
  • No longer creates or deletes anything in GCP.
  • Deletes its receive adapter.
  8. Wait a release, let's call it 0.x+2.
  9. Remove the old CRDs and controllers.
  10. Have a job or script to clean up the old CRDs and COs.

Pros

  • Automatic migration of everything created by our controllers. A path for migration of COs made by things other than our controllers.
  • No event double delivery.

Cons

  • Adds a GCP subscription name field to our API, which will be permanent.
  • Takes a minimum of three releases.
  • Effect at a distance, due to Doppelgängers.
  • Doesn't work with Kubernetes GenerateName.
    • Our controllers don't use GenerateName, but someone else may have.
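
As a rough illustration of the Doppelgänger lookup in step 5, the new-group reconciler could do something like the following; the group/version strings, resource name, and the status.subscriptionID field path are assumptions for illustration only, not the project's actual API.

```go
// Hypothetical sketch of the Doppelgänger lookup; field paths are assumed.
package doppelganger

import (
	"context"

	apierrs "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var oldPullSubscriptionGVR = schema.GroupVersionResource{
	Group:    "pubsub.cloud.google.com",
	Version:  "v1alpha1",
	Resource: "pullsubscriptions",
}

// subscriptionOverride returns the GCP subscription name used by the old-group
// Doppelgänger of the given new-group object, or "" if there is none.
func subscriptionOverride(ctx context.Context, dc dynamic.Interface, ns, name string) (string, error) {
	old, err := dc.Resource(oldPullSubscriptionGVR).Namespace(ns).Get(ctx, name, metav1.GetOptions{})
	if apierrs.IsNotFound(err) {
		return "", nil // no Doppelgänger; reconcile normally
	}
	if err != nil {
		return "", err
	}
	// Assumed field path: the old controller records the GCP subscription ID in status.
	sub, _, err := unstructured.NestedString(old.Object, "status", "subscriptionID")
	return sub, err
}
```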

4. Delegate

Idea: Have the old CRD delegate to the new CRD before being deleted.

  1. Copy the CRDs to the new Group. All code continues to use the old CRD.
  2. Add the ability to the new PullSubscription to specify the name of the GCP subscription.
  3. The old CRD controllers (see the sketch after this option's cons):
  • Mark the resources with deprecation notices.
  • Create a new CO with the new CRD representing this CO. It is owned by this CO.
  • No longer manipulate anything outside the cluster.
  4. Wait a release, let's call it 0.x+1.
  5. The Webhook denies creation of any new COs of the old CRDs.
  6. The old CRD controllers:
  • No longer create or delete anything.
  • Copy their owner references to their owned new-CRD COs.
  7. All other code now uses the new CRD.
  8. Wait a release, let's call it 0.x+2.
  9. Remove the old CRDs and controllers.
  10. Have a job or script to clean up the old CRDs and COs.

Pros

  • Automatic migration of everything created by our controllers. A path for migration of COs made by things other than our controllers.
  • No event double delivery.

Cons

  • Adds a GCP subscription name field to our API, which will be permanent.
  • Takes a minimum of three releases.
  • Upstream COs will flap un-ready during the initial rollout of 0.x+1, because they will be looking for an owned CO of the new CRD, which will not be marked as owned until after the old CRD controller has finished.
  • Doesn't work with Kubernetes GenerateName.
    • Our controllers don't use GenerateName, but someone else may have.
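
A hypothetical sketch of the delegation step in this option: the old-group controller creates a new-group twin that it owns and then stops touching GCP itself. The group/version strings, resource name, and the assumption that the new spec is schema-compatible with the old one are all illustrative.

```go
// Hypothetical sketch of delegation; not the project's actual controller code.
package delegate

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var newPullSubscriptionGVR = schema.GroupVersionResource{
	Group:    "internal.events.cloud.google.com",
	Version:  "v1alpha1",
	Resource: "pullsubscriptions",
}

// createDelegate makes a new-group copy of the old CO, owned by the old CO so
// garbage collection cleans it up if the old CO is deleted first.
func createDelegate(ctx context.Context, dc dynamic.Interface, old *unstructured.Unstructured) error {
	delegate := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": newPullSubscriptionGVR.Group + "/" + newPullSubscriptionGVR.Version,
		"kind":       old.GetKind(),
		"metadata": map[string]interface{}{
			"name":      old.GetName(),
			"namespace": old.GetNamespace(),
			"ownerReferences": []interface{}{
				map[string]interface{}{
					"apiVersion": old.GetAPIVersion(),
					"kind":       old.GetKind(),
					"name":       old.GetName(),
					"uid":        string(old.GetUID()),
					"controller": true,
				},
			},
		},
		// Assumption: the new CRD's spec is schema-compatible with the old one.
		"spec": old.Object["spec"],
	}}
	_, err := dc.Resource(newPullSubscriptionGVR).Namespace(old.GetNamespace()).Create(ctx, delegate, metav1.CreateOptions{})
	return err
}
```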

@yolocs
Member

yolocs commented Apr 22, 2020

It would help with the decision if we get to know the potential impact. Since it's an alpha API, maybe option 1?

@akashrv

akashrv commented Apr 24, 2020

Notes from GVC call with @Harwayne

Assume next release is X.
There are two cases to take care of here.

  1. Auto-created COs, made by other COs such as PubsubSource or Broker.
  • The higher level objects delete the old COs (including the backing GCP resources) and create new ones (see the sketch at the end of this comment).
  • Deletion will cause loss of events (if the cost savings in terms of effort are huge, this is okay in an alpha product).
  2. Users have created these objects.
  • We want to do a scream test to see if users need these objects as public objects or if we are good to make them internal.
  • Deprecate the CO in vX: add a status message that the object is deprecated and made internal, and delete the objects to de-risk leaking GCP resources. Users are not supposed to use them.
    • The CO and controllers need to exist
    • The Webhook must stop accepting this CO - it is a scream test, and if we hear strong feedback we can do a vX.1 release that reverts the webhook.

Let's make sure our release notes capture this clearly.

Goal is in vX+1 - CRD is removed and Controllers don't exist.
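
A hypothetical sketch of case 1, a higher-level reconciler replacing the old-group CO it owns with a new-group one; the GVRs and the verbatim spec copy are assumptions for illustration only.

```go
// Hypothetical sketch of the old-to-new swap done by a higher-level reconciler.
package swap

import (
	"context"

	apierrs "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var (
	oldGVR = schema.GroupVersionResource{Group: "pubsub.cloud.google.com", Version: "v1alpha1", Resource: "pullsubscriptions"}
	newGVR = schema.GroupVersionResource{Group: "internal.events.cloud.google.com", Version: "v1alpha1", Resource: "pullsubscriptions"}
)

// swapPullSubscription deletes the old-group CO (letting its controller tear
// down the backing GCP subscription) and creates a new-group CO with the same spec.
func swapPullSubscription(ctx context.Context, dc dynamic.Interface, ns, name string) error {
	old, err := dc.Resource(oldGVR).Namespace(ns).Get(ctx, name, metav1.GetOptions{})
	if apierrs.IsNotFound(err) {
		return nil // already swapped
	}
	if err != nil {
		return err
	}
	if err := dc.Resource(oldGVR).Namespace(ns).Delete(ctx, name, metav1.DeleteOptions{}); err != nil {
		return err
	}
	replacement := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": newGVR.Group + "/" + newGVR.Version,
		"kind":       old.GetKind(),
		"metadata": map[string]interface{}{
			"name":      name,
			"namespace": ns,
			// Assumption: the real controller would also set an owner reference
			// back to the higher-level object here.
		},
		"spec": old.Object["spec"],
	}}
	_, err = dc.Resource(newGVR).Namespace(ns).Create(ctx, replacement, metav1.CreateOptions{})
	return err
}
```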

@Harwayne
Contributor Author

Plan based on #905 (comment).

For brevity, 'old COs' means 'custom objects made with the old CRDs'. And 'new COs' means 'custom objects made with the new CRDs'.

In this release, 0.x:

  1. Create the new CRDs.
  2. Have all existing higher level objects delete their old COs and create new COs. E.g. CloudAuditLogsSource deletes its old PullSubscription and creates a new one.
    • This means we will likely lose events during the upgrade (anything that was not yet delivered at the time).
      • If desired we can mitigate this by adding a delay in how long we keep the old COs after the new COs have been created. E.g. only delete the old COs if this reconcile found the new COs ready=true. This will cause event duplication and still has the potential to lose events unless we set it to not delete the old COs until one week after the new COs.
  3. The controllers for the old COs will add deprecation warnings to the status of the old COs, along the lines of the message below (a status-condition sketch follows this list):
    This object is deprecated and should be deleted. There is no public replacement. If you want a public replacement, please comment on https://github.com/google/knative-gcp/issues/905. This object should be deleted before upgrading to 0.x+1. If it is not deleted before the 0.x+1 upgrade, then it will leak the Topic/PullSubscription created in Google Cloud Platform.
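
As a rough illustration of step 3, a deprecation condition could be set on the old CO's status along these lines; the use of metav1.Condition with meta.SetStatusCondition (rather than the project's own condition helpers) and the condition type/reason names are assumptions.

```go
// Hypothetical sketch of marking an old CO deprecated via a status condition.
package deprecation

import (
	apimeta "k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const deprecationMessage = "This object is deprecated and should be deleted. " +
	"There is no public replacement. If you want a public replacement, please comment on " +
	"https://github.com/google/knative-gcp/issues/905. This object should be deleted before " +
	"upgrading to 0.x+1. If it is not deleted before the 0.x+1 upgrade, then it will leak the " +
	"Topic/PullSubscription created in Google Cloud Platform."

// markDeprecated adds (or updates) a Deprecated condition in the given status conditions.
func markDeprecated(conditions *[]metav1.Condition, generation int64) {
	apimeta.SetStatusCondition(conditions, metav1.Condition{
		Type:               "Deprecated",
		Status:             metav1.ConditionTrue,
		Reason:             "APIGroupMoved",
		Message:            deprecationMessage,
		ObservedGeneration: generation,
	})
}
```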

Just before upgrade to 0.x+1:

  1. Provide a pre-upgrade script that deletes all old COs (a sketch follows).
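
One possible shape for that script, sketched as a small Go program against the dynamic client (a kubectl-based script would work just as well); the group, version, and resource names mirror the old CRDs discussed above.

```go
// Hypothetical pre-upgrade cleanup: delete every old-group CO in every namespace.
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	ctx := context.Background()

	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	dc, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	for _, resource := range []string{"topics", "pullsubscriptions"} {
		gvr := schema.GroupVersionResource{Group: "pubsub.cloud.google.com", Version: "v1alpha1", Resource: resource}
		// List across all namespaces, then delete each object individually.
		list, err := dc.Resource(gvr).Namespace(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
		if err != nil {
			log.Fatalf("listing %s: %v", resource, err)
		}
		for _, item := range list.Items {
			if err := dc.Resource(gvr).Namespace(item.GetNamespace()).Delete(ctx, item.GetName(), metav1.DeleteOptions{}); err != nil {
				log.Fatalf("deleting %s %s/%s: %v", resource, item.GetNamespace(), item.GetName(), err)
			}
		}
	}
}
```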

In release 0.x+1:

  1. Delete the old CRDs and their controllers.

@alexander-alvarez

Chiming in here: if we care about programmatically creating Pub/Sub Topics, which we previously did with the k8s manifests (we haven't used the event publishing feature), what will be the preferred approach for doing this?

@Harwayne
Contributor Author

Harwayne commented May 6, 2020

    Chiming in here: if we care about programmatically creating Pub/Sub Topics, which we previously did with the k8s manifests (we haven't used the event publishing feature), what will be the preferred approach for doing this?

Our eventual plan was to release something that allowed this. We didn't know anyone was using them right now. So that we can better understand your use case, how do you use them? Do you use only Topic and not PullSubscription?

@alexander-alvarez

We just use Topics as a simple way to create Pub/Sub topics within our CI/CD pipeline, though we could just as easily switch to https://www.terraform.io/docs/providers/google/r/pubsub_topic.html

We don't use PullSubscription

@Harwayne
Contributor Author

Harwayne commented May 8, 2020

    We just use Topics as a simple way to create Pub/Sub topics within our CI/CD pipeline, though we could just as easily switch to https://www.terraform.io/docs/providers/google/r/pubsub_topic.html

    We don't use PullSubscription

If you only want to create Topics, but not send events to them, would something like Config Connector work for you? It allows you to make K8s objects and have them reified into GCP resources. Topic in particular is documented here.

@alexander-alvarez

It would indeed. I had no idea about this. Thanks @Harwayne
