-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update Ready condition during drift correction #885
Conversation
Some more details based on more testing. I used the metallb example configurations to reproduce fluxcd/flux2#4524 based on the understanding of the issue so far. Since metallb has admission webhooks which cause a drift by injecting CA bundle in CRD, it seems to be a good example to reproduce this easily. The report talks about upgrade which means the release already existed. With helm-controller v0.37.2, I installed metallb helmrelease first in a good state without any To introduce the dependency failure, I updated the helmrelease to add a status:
conditions:
- lastTransitionTime: "2024-02-02T23:52:31Z"
message: 'unable to get ''default/foo'' dependency: HelmRelease.helm.toolkit.fluxcd.io
"foo" not found'
observedGeneration: 4
reason: ProgressingWithRetry
status: "True"
type: Reconciling
- lastTransitionTime: "2024-02-02T23:50:59Z"
message: 'unable to get ''default/foo'' dependency: HelmRelease.helm.toolkit.fluxcd.io
"foo" not found'
observedGeneration: 3
reason: DependencyNotReady
status: "False"
type: Ready
- lastTransitionTime: "2024-02-02T23:47:49Z"
message: Helm install succeeded for release default/metallb.v1 with chart metallb@0.13.12
observedGeneration: 1
reason: InstallSucceeded
status: "True"
type: Released Ready=False with Now if I update helm-controller with the changes in this PR (which includes #884) using status:
conditions:
- lastTransitionTime: "2024-02-02T23:53:49Z"
message: correcting cluster drift
observedGeneration: 4
reason: ProgressingWithRetry
status: "True"
type: Reconciling
- lastTransitionTime: "2024-02-02T23:53:46Z"
message: correcting cluster drift
observedGeneration: 4
reason: Progressing
status: Unknown
type: Ready
- lastTransitionTime: "2024-02-02T23:47:49Z"
message: Helm install succeeded for release default/metallb.v1 with chart metallb@0.13.12
observedGeneration: 1
reason: InstallSucceeded
status: "True"
type: Released The status accurately shows what's going on. |
b919354
to
59c577a
Compare
Update the Ready condition during drift correction to reflect the current state of reconciliation. Without this, any previous Ready condition value continues to persist on the object. If there was a previous failure due to which Ready=False condition is present on the object, the same value continues to persist if the atomic release reconciliation enters a drift detection and correction loop. Resulting in the status to show inaccurate state of the reconciliation. Examples of two different scenarios that arise due to this issue: - If a release without any dependency is installed, the status shows Ready=True for InstallSucceeded reason. But right after the installation, if a drift is detected the status continues to show the same Ready=True value. There's no indication that a drift correction is going on in the status. The events and logs do show that drift correction is taking place. But it can be confusing to see positive Ready value. Also, since the Ready condition message is copied for Reconciling condition, Reconciling=True with a "Helm install succeeded..." is seen. - If a release depends on another release, and reconciliation results in dependency not ready error at first, Ready=False condition is added on the object. On subsequent runs, even when the dependencies are ready, the Ready=False condition isn't updated, resulting in stale Ready value until atomic release reconciliation completes. But if the atomic reconciliation enters a drift detection and correction loop, the Ready=False with dependency error persists in the status. This gives the impression that something is wrong with dependency check but based on the logs and events, the controller could be stuck in drift detection and correction loop. Updating the Ready condition during drift detection shows the current state of reconciliation, avoiding the confusing scenarios described above. Signed-off-by: Sunny <github@darkowlzz.space>
ac9e62a
to
9def86e
Compare
I just did another manual testing for this to see what happens to the ready condition with this change once the drift gets corrected successfully, since the focus before was on getting the condition correct during the failure. Status during the drift correction failure: status:
conditions:
- lastTransitionTime: "2024-04-29T10:44:21Z"
message: correcting cluster drift
observedGeneration: 1
reason: ProgressingWithRetry
status: "True"
type: Reconciling
- lastTransitionTime: "2024-04-29T10:43:54Z"
message: correcting cluster drift
observedGeneration: 1
reason: Progressing
status: Unknown
type: Ready
- lastTransitionTime: "2024-04-29T10:43:54Z"
message: Helm install succeeded for release default/metallb.v1 with chart metallb@0.13.12
observedGeneration: 1
reason: InstallSucceeded
status: "True"
type: Released
helmChart: default/default-metallb
history:
- chartName: metallb
chartVersion: 0.13.12
configDigest: sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
digest: sha256:92dd7f2e21d9481ee317110bd887fd218cc6972cb3717b52a32e448ce6204f4b
firstDeployed: "2024-04-29T10:43:20Z"
lastDeployed: "2024-04-29T10:43:20Z"
name: metallb
namespace: default
status: deployed
version: 1
lastAttemptedConfigDigest: sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
lastAttemptedGeneration: 1
lastAttemptedReleaseAction: install
lastAttemptedRevision: 0.13.12
observedGeneration: -1
storageNamespace: default Ready condition status is After fixing the configuration to allow drift correction: status:
conditions:
- lastTransitionTime: "2024-04-29T10:46:39Z"
message: Helm install succeeded for release default/metallb.v1 with chart metallb@0.13.12
observedGeneration: 2
reason: InstallSucceeded
status: "True"
type: Ready
- lastTransitionTime: "2024-04-29T10:43:54Z"
message: Helm install succeeded for release default/metallb.v1 with chart metallb@0.13.12
observedGeneration: 1
reason: InstallSucceeded
status: "True"
type: Released
helmChart: default/default-metallb
history:
- chartName: metallb
chartVersion: 0.13.12
configDigest: sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
digest: sha256:92dd7f2e21d9481ee317110bd887fd218cc6972cb3717b52a32e448ce6204f4b
firstDeployed: "2024-04-29T10:43:20Z"
lastDeployed: "2024-04-29T10:43:20Z"
name: metallb
namespace: default
status: deployed
version: 1
lastAppliedRevision: 0.13.12
lastAttemptedConfigDigest: sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
lastAttemptedGeneration: 2
lastAttemptedReleaseAction: install
lastAttemptedRevision: 0.13.12
observedGeneration: 2
storageNamespace: default Ready status condition gets updated with correct values, status is |
9def86e
to
56478cf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks @darkowlzz 🏅
Related to the issue described in fluxcd/flux2#4524 and observations from #855. And another instance of the issue discussed on slack that revealed that stale Ready=False value results in such scenarios (described in #884 and more below).
This change builds on top of #884, to solve the issue at different levels. Please read #884 for more details about the issue.
Update the Ready condition during drift correction to reflect the current state of reconciliation. Without this, any previous Ready condition value continues to persist on the object. If there was a previous failure due to which Ready=False condition is present on the object, the same value continues to persist if the atomic release reconciliation enters a drift detection and correction loop. Resulting in the status to show inaccurate state of the reconciliation.
Examples of two different scenarios that arise due to this issue:
helm-controller/internal/reconcile/atomic_release.go
Lines 222 to 225 in c2c1064
driftDetection: mode: enabled
failed to apply revision #855 and it results in the following status:Updating the Ready condition during drift detection shows the current state of reconciliation, avoiding the confusing scenarios described above. Following status is observed with the fix:
The Ready=Unknown and Reconciling=True with their messages show correctly about the current state of reconciliation.