Please provide "helm rollback" option in the HelmRelease #1960
Is that the best way to solve this? I think I've hit this issue too: I had a Helm release that failed, I rolled forward by manually upgrading to another release that eventually succeeded, and now my HelmRelease has recovered. I created the failed release by changing the service type from LoadBalancer to ClusterIP, which I've just learned Kubernetes will not allow because the field is immutable, or something. This crashed my Helm release and marked it FAILED, which is the same thing that would have happened without helm-operator, of course. So I manually deleted the Service that was blocking the upgrade and ran the helm upgrade manually (or waited long enough for helm-operator to try again), and it worked. I suppose this is a bigger UX problem if Flux users don't have permission to delete a Service manually. It's probably also worth noting that a change from ClusterIP to LoadBalancer does not trigger the same issue. Maybe you can offer this as a flag and also a way to trigger it manually. I haven't used fluxctl yet, but I'm going to go try it now.
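For reference, a minimal sketch of the manual recovery described above, assuming a hypothetical release named `my-release` in the `default` namespace with its chart at `./chart`:

```sh
# Remove the Service that rejects the LoadBalancer -> ClusterIP change
# (hypothetical names; adjust the release, namespace and chart path)
kubectl delete service my-release --namespace default

# Re-run the upgrade by hand instead of waiting for helm-operator's next attempt
helm upgrade my-release ./chart --namespace default
```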
I may have said it wrong. It looks like helm-operator will not recover on its own, even if you delete the Service. But I'm pretty sure that helm-operator can be recovered by manually triggering another release with
I started working on this in #2006, and what is currently there already works. One of the challenges I am facing is that when an upgrade fails and the release is rolled back, it ends up in an infinite loop of retrying the upgrade (and rolling back) until a human notices and fixes the error. This is harmless in theory, but each retry cycle adds a failed upgrade revision plus a rollback revision, which works out to about 40 nonsense revisions per hour with the default interval settings.
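As a quick way to see that churn for yourself (assuming a hypothetical release named `my-release`), the revision history grows by a failed upgrade and a rollback on every cycle:

```sh
# Inspect the accumulating revisions for the release
# (hypothetical release name; --max limits how many rows are shown)
helm history my-release --max 10
```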
When rolling out new versions of a HelmRelease, there is a case where a release might be marked as `FAILED` by Helm due to some error. In that case, Flux cannot deploy new versions of the `HelmRelease` file, even if the error has been corrected, as the following error appears in the logs:

In order to overcome this issue, there are two options without downtime that I am aware of:
`helm rollback`

Given that the last revision of the Helm release will be in the `FAILED` state, the one actually running in the cluster will be the last successful one. Running `helm rollback <release_name> <last_successful_revision>` will just make Helm point to the last successful revision, so that we overcome this issue.

Regarding the two available options above, the first one involves some manual labor, whereas the second one can be automated. The downside of the second option is that `helm rollback` is not stable for stateful applications.
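To make the manual path concrete, here is a minimal sketch, assuming a hypothetical release named `my-release` whose latest revision failed and whose revision 2 was the last successful one:

```sh
# Find the last successful revision (the most recent DEPLOYED entry)
helm history my-release

# Point Helm back at that revision; 2 is the hypothetical last good revision
helm rollback my-release 2
```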
Could we please implement this as an option that one has to explicitly enable for Flux to perform the `helm rollback` when this issue is hit? This would be backwards compatible, would not create problems for stateful workloads, and would allow anyone who wants to force this rollback to do so. It would be very handy for stateless applications where this issue is hit.

Thanks!
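For illustration, a minimal sketch of what such an explicit opt-in could look like on a HelmRelease. The `rollback` field names are assumptions for the sake of the example (the actual shape is being worked out in #2006), and the release and chart details are hypothetical:

```sh
# Apply a HelmRelease that opts in to automatic rollback of failed upgrades
# (assumed field names; hypothetical chart/release details)
cat <<EOF | kubectl apply -f -
apiVersion: flux.weave.works/v1beta1
kind: HelmRelease
metadata:
  name: my-release
  namespace: default
spec:
  releaseName: my-release
  chart:
    repository: https://example.com/charts
    name: my-chart
    version: 1.2.3
  rollback:
    enable: true   # assumed opt-in flag; off by default so stateful workloads are unaffected
  values:
    service:
      type: LoadBalancer
EOF
```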