-
Notifications
You must be signed in to change notification settings - Fork 166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Helm drift detection causes endless upgrade loop with CRDs #805
Comments
Use Flux CRD upgrade feature and the drift should be solved, to let Flux manage the CRDs:
|
I don't think you got the problem @stefanprodan , the issue is not with CRD definitions, so that option doesn't apply. |
I think you need to provide an example for this, so we ca replicate the behaviour. |
Simply apply any helm chart that creates a custom resource with upgrade.force=false specified (or omitted) in the helm release, enable drift detection, change any field on the CR that is managed by helm (in the cluster), drift detection kicks in and creates an endless helm upgrade loop. I've linked the code above or why this happens in helm |
Ok I managed to reproduce this using the ServiceMonitor from podinfo's chart. So basically Helm can't correct any changes made in-cluster for custom resources. Crazy that people consider using Helm at this point. |
@hiddeco looks like we have a major issue with Helm SDK and drift correction. We may need an exclusion rule for all non native APIs. |
Indeed, helm is unstable and has problems left and right, but we have years worth of deployments (many hundreds) of apps using it, and it still does solve some templating/simplification/versioning problems for us, which kustomization doesn't and makes a few things harder to manage in a large distributed scope. |
I'd only apply this exclusion if upgrade.force=false, we do want drift to be detected and fixed on CRs and it works properly with the replacement strategy which is used when upgrade.force=true |
I see |
It depends on how the CR is used, for example we have a number of critical
ETCD clusters managed by the ETCD operator, and replacement strategy works
fine for us and doesn't really cause any downtime and the CR doesn't
actually get deleted, just the strategy replacing the content is different,
but IIRC the object metadata and status is unchanged
…On Sat 25 Nov 2023, 08:16 Stefan Prodan, ***@***.***> wrote:
I see upgrade.force as something that you would temporary enable, this
operations is destructive as does a delete and create, for all CRD
controllers this will imply downtime and major issues, imagine recreating a
DB cluster, a Flagger canary, etc. IMO no one should have upgrade.force
in their HelmReleases.
—
Reply to this email directly, view it on GitHub
<#805 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AC4UIY56GPA2GS42TRW3FADYGGSNPAVCNFSM6AAAAAA7E66T3GVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMRWGI2TCMZVGY>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
Tbh, a decent way to solve this would be to be able to inflate/template the helm chart manifests and then apply the generated output as a kustomization, that is roughly similar to what we were doing with flux1 using generators, but there's no way to do it with flux2 sadly |
You can do it with |
One of the options I see, is to not make use of Helm at all to correct drift. As with the upcoming changes to the helm-controller, we calculate a JSON patch to be able to report drift, which we could also use to issue an apply to correct it. |
Yeah using Flux SSA for this would be ideal and way less heavy on the API server. |
@rsafonseca I think you will be very happy with the solution from the above pull request. |
Cool, I'm OOO this week in a conference but I'll try to test it out when I'm back. Thanks @hiddeco ! |
CR objects don't support StrategicMergePatch out of the box in Kubernetes, because their spec isn't builtin.
As a consequence of that, helm applies a MergePatch instead of StrategicMergePatch for CR objects.
The problem:
The (easier) solution:
Relevant change of helm's code (which was introduced because previously trying to patch CR objects was simply failing): https://github.com/helm/helm/pull/7269/files
The text was updated successfully, but these errors were encountered: