This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

Please provide "helm rollback" option in the HelmRelease #1960

Closed
gtseres opened this issue Apr 19, 2019 · 3 comments · Fixed by #2006

Comments

@gtseres
Contributor

gtseres commented Apr 19, 2019

When rolling out a new version of a HelmRelease, the release can end up marked as FAILED by Helm because of some error. Once that happens, Flux can no longer deploy new versions of the HelmRelease, even after the error has been corrected, and the following error appears in the logs:

ts=2019-03-15T16:54:32.871180927Z caller=release.go:186 component=release error="Chart release failed: staging-test: &status.statusError{Code:2, Message:\"a release named staging-test already exists.\nRun: helm ls --all staging-test; to check the status of the release\nOr run: helm del --purge staging-test; to delete it\", Details:[]*any.Any(nil)}"

To overcome this issue, there are two options I am aware of that avoid downtime:

  1. Create a new HelmRelease with a different name, let it deploy, delete the old one.
  2. Do a helm rollback

Since the latest Helm revision will be in the FAILED state, the revision actually running in the cluster will be the last successful one. Running helm rollback <release_name> <last_successful_revision> simply makes Helm point back to that last successful revision, which gets us past this error.
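For illustration, a rough sketch of that manual rollback, using the staging-test release from the log above (the revision number 3 is a made-up placeholder):

    # Check the release status and its revision history; the latest revision
    # will show FAILED, while an earlier one is the last successfully DEPLOYED.
    helm ls --all staging-test
    helm history staging-test

    # Point Helm back at the last successful revision (assumed here to be 3),
    # after which corrected upgrades from the HelmRelease can apply again.
    helm rollback staging-test 3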

Regarding the two options above, the first one involves some manual labor, whereas the second one can be automated. The downside of the second option is that helm rollback is not stable for stateful applications.

Could we please implement this as an option that one has to explicitly enable for Flux to perform the helm rollback when this issue is hit? That would be backwards compatible, would not create problems for stateful workloads, and would still allow anyone who wants to force the rollback to do so. It would be very handy for stateless applications where this issue occurs.

Thanks!

@kingdonb
Member

kingdonb commented Apr 19, 2019

Is that the best way to solve this? I think I've hit this issue too: I had a Helm release that failed, I rolled forward by manually upgrading to another release that eventually succeeded, and now my HelmRelease has recovered.

I caused a failed release by changing the service type from LoadBalancer to ClusterIP, which I've just learned Kubernetes will not allow because the field is immutable (or something along those lines). This broke my Helm release and marked it FAILED, which is the same thing that would have happened without helm-operator, of course.

So I manually deleted the service that was blocking the upgrade, then ran helm upgrade by hand (or waited long enough for helm-operator to try again), and it worked. I suppose this is a bigger UX problem if the Flux users don't have permission to delete a service manually. Also worth noting that a change from ClusterIP to LoadBalancer does not trigger the same issue.
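Roughly, the manual recovery described above looks like this (release name, chart path, and service name are placeholders):

    # Delete the Service whose immutable field (spec.type) is blocking the upgrade.
    kubectl delete service my-app

    # Re-run the upgrade by hand (or wait for helm-operator to retry);
    # with the conflicting Service gone, the chart can recreate it and the upgrade succeeds.
    helm upgrade my-app ./charts/my-app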

Maybe you can offer this as a flag and also a way to trigger it manually. I haven't used fluxctl yet, but I'm going to go try it now.

@kingdonb
Member

kingdonb commented Apr 19, 2019

I may have said it wrong. It looks like helm-operator will not recover on its own, even if you delete the service. But I'm pretty sure it can be recovered by manually triggering another release with helm upgrade. There may be a way for helm-operator to handle that situation better.

@hiddeco
Member

hiddeco commented May 2, 2019

I started working on this in #2006, and what is currently there already works.

One of the challenges I am facing is that when an upgrade fails and the release is rolled back, the operator ends up in an infinite loop of retrying the upgrade (and rolling back again) until a human notices and fixes the error.

This is harmless in theory; it will, however, result in about 40 nonsense revisions per hour with the default interval settings.
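As a rough sanity check of that number: assuming a default sync interval of around 3 minutes, each retry cycle writes two revisions (the failed upgrade plus its rollback), i.e. roughly 2 × 20 = 40 revisions per hour. The loop shows up as a rapidly growing revision history:

    # A release stuck in the upgrade/rollback loop accumulates revisions quickly;
    # the history alternates between failed upgrades and rollbacks.
    helm history staging-test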
