Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Helm drift detection causes endless upgrade loop with CRDs #805

Closed
rsafonseca opened this issue Nov 9, 2023 · 16 comments · Fixed by #822
Closed

Helm drift detection causes endless upgrade loop with CRDs #805

rsafonseca opened this issue Nov 9, 2023 · 16 comments · Fixed by #822
Labels
area/drift Drift detection/correction related issues and pull requests

Comments

@rsafonseca
Copy link

rsafonseca commented Nov 9, 2023

CR objects don't support StrategicMergePatch out of the box in Kubernetes, because their spec isn't builtin.
As a consequence of that, helm applies a MergePatch instead of StrategicMergePatch for CR objects.

The problem:

  • Helm-controller drift detection triggers a helm upgrade when there are changes in CR objects
  • If spec.upgrade.force is set to false (the default) on the helmrelease, this causes an endless loop because the helm upgrade will not effect any changes (effectively it just compares the previously applied configuration with the would-be-applied configuration, not taking into account the current cluster state

The (easier) solution:

  • Helm-controller drift detection should take into account helm's behaviour when upgrading releases and not trigger an upgrade when spec.upgrade.force = false and the modified object is a CR only

Relevant change of helm's code (which was introduced because previously trying to patch CR objects was simply failing): https://github.com/helm/helm/pull/7269/files

@stefanprodan
Copy link
Member

Use Flux CRD upgrade feature and the drift should be solved, to let Flux manage the CRDs:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: my-operator
spec:
  install:
    crds: CreateReplace
  upgrade:
    crds: CreateReplace

@rsafonseca
Copy link
Author

rsafonseca commented Nov 24, 2023

I don't think you got the problem @stefanprodan , the issue is not with CRD definitions, so that option doesn't apply.
The issue is with CRD resources deployed by a helm chart. I work around this by setting upgrade.force=true to use the replace strategy, but it's still a bug when upgrade.force=false because the drift detection doesn't behave in the same way with these resources as helm does, leading to this endless loop condition

@stefanprodan
Copy link
Member

I think you need to provide an example for this, so we ca replicate the behaviour.

@rsafonseca
Copy link
Author

rsafonseca commented Nov 25, 2023

Simply apply any helm chart that creates a custom resource with upgrade.force=false specified (or omitted) in the helm release, enable drift detection, change any field on the CR that is managed by helm (in the cluster), drift detection kicks in and creates an endless helm upgrade loop.

I've linked the code above or why this happens in helm

@stefanprodan
Copy link
Member

Ok I managed to reproduce this using the ServiceMonitor from podinfo's chart. So basically Helm can't correct any changes made in-cluster for custom resources. Crazy that people consider using Helm at this point.

@stefanprodan
Copy link
Member

stefanprodan commented Nov 25, 2023

@hiddeco looks like we have a major issue with Helm SDK and drift correction. We may need an exclusion rule for all non native APIs.

@stefanprodan stefanprodan added the area/drift Drift detection/correction related issues and pull requests label Nov 25, 2023
@rsafonseca
Copy link
Author

Ok I managed to reproduce this using the ServiceMonitor from podinfo's chart. So basically Helm can't correct any changes made in-cluster for custom resources. Crazy that people consider using Helm at this point.

Indeed, helm is unstable and has problems left and right, but we have years worth of deployments (many hundreds) of apps using it, and it still does solve some templating/simplification/versioning problems for us, which kustomization doesn't and makes a few things harder to manage in a large distributed scope.

@rsafonseca
Copy link
Author

@hiddeco looks like we have a major issue with Helm SDK and drift correction. We may need an exclusion rule for all non native APIs.

I'd only apply this exclusion if upgrade.force=false, we do want drift to be detected and fixed on CRs and it works properly with the replacement strategy which is used when upgrade.force=true
It would just be nice to not have things landing on that loop when the users haven't specified upgrade.force=true, in some cases they might care about using the strategic merge patching for the native api resources (to keep the manually modified fields which arent managed by helm) and ignore drift on the CR

@stefanprodan
Copy link
Member

I see upgrade.force as something that you would temporary enable, this operations is destructive as does a delete and create, for all CRD controllers this will imply downtime and major issues, imagine recreating a DB cluster, a Flagger canary, etc. IMO no one should have upgrade.force in their HelmReleases.

@rsafonseca
Copy link
Author

rsafonseca commented Nov 25, 2023 via email

@rsafonseca
Copy link
Author

Tbh, a decent way to solve this would be to be able to inflate/template the helm chart manifests and then apply the generated output as a kustomization, that is roughly similar to what we were doing with flux1 using generators, but there's no way to do it with flux2 sadly

@stefanprodan
Copy link
Member

You can do it with helm template | flux push artifact in CI, then use patches in the Flux Kustomization reconciling the manifests to further customize the resources.

@hiddeco
Copy link
Member

hiddeco commented Nov 25, 2023

One of the options I see, is to not make use of Helm at all to correct drift.

As with the upcoming changes to the helm-controller, we calculate a JSON patch to be able to report drift, which we could also use to issue an apply to correct it.

@stefanprodan
Copy link
Member

One of the options I see, is to not make use of Helm at all to correct drift.

Yeah using Flux SSA for this would be ideal and way less heavy on the API server.

@hiddeco
Copy link
Member

hiddeco commented Nov 28, 2023

@rsafonseca I think you will be very happy with the solution from the above pull request.

@rsafonseca
Copy link
Author

Cool, I'm OOO this week in a conference but I'll try to test it out when I'm back. Thanks @hiddeco !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/drift Drift detection/correction related issues and pull requests
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants