This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

Please provide "helm rollback" option in the HelmRelease #1960

Closed
gtseres opened this issue Apr 19, 2019 · 3 comments · Fixed by #2006

Comments

@gtseres
Contributor

gtseres commented Apr 19, 2019

When rolling out a new version of a HelmRelease, the release can end up marked as FAILED by Helm because of some error. Once that happens, Flux can no longer deploy new versions of the HelmRelease, even after the error has been corrected, and the following error appears in the logs:

ts=2019-03-15T16:54:32.871180927Z caller=release.go:186 component=release error="Chart release failed: staging-test: &status.statusError{Code:2, Message:\"a release named staging-test already exists.\nRun: helm ls --all staging-test; to check the status of the release\nOr run: helm del --purge staging-test; to delete it\", Details:[]*any.Any(nil)}"

To overcome this issue, there are two options I am aware of that avoid downtime:

  1. Create a new HelmRelease with a different name, let it deploy, delete the old one.
  2. Do a helm rollback

Since the latest Helm revision will be in the FAILED state, the revision actually running in the cluster will be the last successful one. Running helm rollback <release_name> <last_successful_revision> simply makes Helm point back to that last successful revision, which gets us past this error.
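For illustration, a rough sketch of that manual rollback, using the staging-test release from the log above (the revision number 3 is a made-up placeholder):

    # Check the release status and its revision history; the latest revision
    # will show FAILED, while an earlier one is the last successfully DEPLOYED.
    helm ls --all staging-test
    helm history staging-test

    # Point Helm back at the last successful revision (assumed here to be 3),
    # after which corrected upgrades from the HelmRelease can apply again.
    helm rollback staging-test 3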

Regarding the two options above, the first one involves some manual labor, whereas the second one can be automated. The downside of the second option is that helm rollback is not stable for stateful applications.

Could we please implement this as an option that one has to explicitly enable for Flux to perform the helm rollback when this issue is hit? That would be backwards compatible, would not create problems for stateful workloads, and would still allow anyone who wants to force the rollback to do so. It would be very handy for stateless applications where this issue occurs.

Thanks!

@kingdonb
Member

kingdonb commented Apr 19, 2019

Is that the best way to solve this? I think I've hit this issue too: I had a Helm release that failed, I rolled forward by manually upgrading to another release that eventually succeeded, and now my HelmRelease has recovered.

I caused a failed release by changing the service type from LoadBalancer to ClusterIP, which I've just learned Kubernetes will not allow because the field is immutable (or something along those lines). This broke my Helm release and marked it FAILED, which is the same thing that would have happened without helm-operator, of course.

So I manually deleted the service that was blocking the upgrade, then ran helm upgrade by hand (or waited long enough for helm-operator to try again), and it worked. I suppose this is a bigger UX problem if the Flux users don't have permission to delete a service manually. Also worth noting that a change from ClusterIP to LoadBalancer does not trigger the same issue.
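Roughly, the manual recovery described above looks like this (release name, chart path, and service name are placeholders):

    # Delete the Service whose immutable field (spec.type) is blocking the upgrade.
    kubectl delete service my-app

    # Re-run the upgrade by hand (or wait for helm-operator to retry);
    # with the conflicting Service gone, the chart can recreate it and the upgrade succeeds.
    helm upgrade my-app ./charts/my-app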

Maybe you can offer this as a flag and also a way to trigger it manually. I haven't used fluxctl yet, but I'm going to go try it now.

@kingdonb
Member

kingdonb commented Apr 19, 2019

I may have said it wrong. It looks like helm-operator will not recover on its own, even if you delete the service. But I'm pretty sure it can be recovered by manually triggering another release with helm upgrade. There may be a way for helm-operator to handle that situation better.

@hiddeco
Member

hiddeco commented May 2, 2019

I started working on this in #2006, and what is currently there already works.

One of the challenges I am facing is that when an upgrade fails and the release is rolled back, the operator ends up in an infinite loop of retrying the upgrade (and rolling back again) until a human notices and fixes the error.

This is harmless in theory; it will, however, result in about 40 nonsense revisions per hour with the default interval settings.
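As a rough sanity check of that number: assuming a default sync interval of around 3 minutes, each retry cycle writes two revisions (the failed upgrade plus its rollback), i.e. roughly 2 × 20 = 40 revisions per hour. The loop shows up as a rapidly growing revision history:

    # A release stuck in the upgrade/rollback loop accumulates revisions quickly;
    # the history alternates between failed upgrades and rollbacks.
    helm history staging-test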
