Helm drift detection causes endless upgrade loop with CRDs #805

rsafonseca · 2023-11-09T17:57:57Z

CR objects don't support StrategicMergePatch out of the box in Kubernetes, because their spec isn't builtin.
As a consequence of that, helm applies a MergePatch instead of StrategicMergePatch for CR objects.

The problem:

Helm-controller drift detection triggers a helm upgrade when there are changes in CR objects
If spec.upgrade.force is set to false (the default) on the helmrelease, this causes an endless loop because the helm upgrade will not effect any changes (effectively it just compares the previously applied configuration with the would-be-applied configuration, not taking into account the current cluster state

The (easier) solution:

Helm-controller drift detection should take into account helm's behaviour when upgrading releases and not trigger an upgrade when spec.upgrade.force = false and the modified object is a CR only

Relevant change of helm's code (which was introduced because previously trying to patch CR objects was simply failing): https://github.com/helm/helm/pull/7269/files

stefanprodan · 2023-11-24T09:14:08Z

Use Flux CRD upgrade feature and the drift should be solved, to let Flux manage the CRDs:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: my-operator
spec:
  install:
    crds: CreateReplace
  upgrade:
    crds: CreateReplace

rsafonseca · 2023-11-24T23:40:52Z

I don't think you got the problem @stefanprodan , the issue is not with CRD definitions, so that option doesn't apply.
The issue is with CRD resources deployed by a helm chart. I work around this by setting upgrade.force=true to use the replace strategy, but it's still a bug when upgrade.force=false because the drift detection doesn't behave in the same way with these resources as helm does, leading to this endless loop condition

stefanprodan · 2023-11-25T06:49:49Z

I think you need to provide an example for this, so we ca replicate the behaviour.

rsafonseca · 2023-11-25T07:26:08Z

Simply apply any helm chart that creates a custom resource with upgrade.force=false specified (or omitted) in the helm release, enable drift detection, change any field on the CR that is managed by helm (in the cluster), drift detection kicks in and creates an endless helm upgrade loop.

I've linked the code above or why this happens in helm

stefanprodan · 2023-11-25T07:50:36Z

Ok I managed to reproduce this using the ServiceMonitor from podinfo's chart. So basically Helm can't correct any changes made in-cluster for custom resources. Crazy that people consider using Helm at this point.

stefanprodan · 2023-11-25T07:53:00Z

@hiddeco looks like we have a major issue with Helm SDK and drift correction. We may need an exclusion rule for all non native APIs.

rsafonseca · 2023-11-25T08:04:52Z

Ok I managed to reproduce this using the ServiceMonitor from podinfo's chart. So basically Helm can't correct any changes made in-cluster for custom resources. Crazy that people consider using Helm at this point.

Indeed, helm is unstable and has problems left and right, but we have years worth of deployments (many hundreds) of apps using it, and it still does solve some templating/simplification/versioning problems for us, which kustomization doesn't and makes a few things harder to manage in a large distributed scope.

rsafonseca · 2023-11-25T08:12:17Z

@hiddeco looks like we have a major issue with Helm SDK and drift correction. We may need an exclusion rule for all non native APIs.

I'd only apply this exclusion if upgrade.force=false, we do want drift to be detected and fixed on CRs and it works properly with the replacement strategy which is used when upgrade.force=true
It would just be nice to not have things landing on that loop when the users haven't specified upgrade.force=true, in some cases they might care about using the strategic merge patching for the native api resources (to keep the manually modified fields which arent managed by helm) and ignore drift on the CR

stefanprodan · 2023-11-25T08:16:13Z

I see upgrade.force as something that you would temporary enable, this operations is destructive as does a delete and create, for all CRD controllers this will imply downtime and major issues, imagine recreating a DB cluster, a Flagger canary, etc. IMO no one should have upgrade.force in their HelmReleases.

rsafonseca · 2023-11-25T08:23:58Z

It depends on how the CR is used, for example we have a number of critical ETCD clusters managed by the ETCD operator, and replacement strategy works fine for us and doesn't really cause any downtime and the CR doesn't actually get deleted, just the strategy replacing the content is different, but IIRC the object metadata and status is unchanged

…

On Sat 25 Nov 2023, 08:16 Stefan Prodan, ***@***.***> wrote: I see upgrade.force as something that you would temporary enable, this operations is destructive as does a delete and create, for all CRD controllers this will imply downtime and major issues, imagine recreating a DB cluster, a Flagger canary, etc. IMO no one should have upgrade.force in their HelmReleases. — Reply to this email directly, view it on GitHub <#805 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AC4UIY56GPA2GS42TRW3FADYGGSNPAVCNFSM6AAAAAA7E66T3GVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMRWGI2TCMZVGY> . You are receiving this because you authored the thread.Message ID: ***@***.***>

rsafonseca · 2023-11-25T08:32:26Z

Tbh, a decent way to solve this would be to be able to inflate/template the helm chart manifests and then apply the generated output as a kustomization, that is roughly similar to what we were doing with flux1 using generators, but there's no way to do it with flux2 sadly

stefanprodan · 2023-11-25T10:30:13Z

You can do it with helm template | flux push artifact in CI, then use patches in the Flux Kustomization reconciling the manifests to further customize the resources.

hiddeco · 2023-11-25T15:17:54Z

One of the options I see, is to not make use of Helm at all to correct drift.

As with the upcoming changes to the helm-controller, we calculate a JSON patch to be able to report drift, which we could also use to issue an apply to correct it.

stefanprodan · 2023-11-25T15:28:10Z

One of the options I see, is to not make use of Helm at all to correct drift.

Yeah using Flux SSA for this would be ideal and way less heavy on the API server.

hiddeco · 2023-11-28T22:52:38Z

@rsafonseca I think you will be very happy with the solution from the above pull request.

rsafonseca · 2023-11-29T00:16:28Z

Cool, I'm OOO this week in a conference but I'll try to test it out when I'm back. Thanks @hiddeco !

rsafonseca mentioned this issue Nov 9, 2023

Helm release unable to rollback manual changes after enabling drift detection #742

Open

stefanprodan added the area/drift Drift detection/correction related issues and pull requests label Nov 25, 2023

hiddeco mentioned this issue Nov 28, 2023

Correct cluster drift using patches #822

Merged

hiddeco closed this as completed in #822 Nov 30, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Helm drift detection causes endless upgrade loop with CRDs #805

Helm drift detection causes endless upgrade loop with CRDs #805

rsafonseca commented Nov 9, 2023 •

edited

Loading

stefanprodan commented Nov 24, 2023

rsafonseca commented Nov 24, 2023 •

edited

Loading

stefanprodan commented Nov 25, 2023

rsafonseca commented Nov 25, 2023 •

edited

Loading

stefanprodan commented Nov 25, 2023

stefanprodan commented Nov 25, 2023 •

edited

Loading

rsafonseca commented Nov 25, 2023

rsafonseca commented Nov 25, 2023

stefanprodan commented Nov 25, 2023

rsafonseca commented Nov 25, 2023 via email

rsafonseca commented Nov 25, 2023

stefanprodan commented Nov 25, 2023

hiddeco commented Nov 25, 2023

stefanprodan commented Nov 25, 2023

hiddeco commented Nov 28, 2023

rsafonseca commented Nov 29, 2023

Helm drift detection causes endless upgrade loop with CRDs #805

Helm drift detection causes endless upgrade loop with CRDs #805

Comments

rsafonseca commented Nov 9, 2023 • edited Loading

stefanprodan commented Nov 24, 2023

rsafonseca commented Nov 24, 2023 • edited Loading

stefanprodan commented Nov 25, 2023

rsafonseca commented Nov 25, 2023 • edited Loading

stefanprodan commented Nov 25, 2023

stefanprodan commented Nov 25, 2023 • edited Loading

rsafonseca commented Nov 25, 2023

rsafonseca commented Nov 25, 2023

stefanprodan commented Nov 25, 2023

rsafonseca commented Nov 25, 2023 via email

rsafonseca commented Nov 25, 2023

stefanprodan commented Nov 25, 2023

hiddeco commented Nov 25, 2023

stefanprodan commented Nov 25, 2023

hiddeco commented Nov 28, 2023

rsafonseca commented Nov 29, 2023

rsafonseca commented Nov 9, 2023 •

edited

Loading

rsafonseca commented Nov 24, 2023 •

edited

Loading

rsafonseca commented Nov 25, 2023 •

edited

Loading

stefanprodan commented Nov 25, 2023 •

edited

Loading