Argo CD can fire PostSync hooks prematurely (before Rollout is Healthy) #771
Comments
This sounds like a bug, though given the versions it may have been addressed in later releases. It is affected by the health.lua script that is baked into Argo CD. The hooks should only fire once the Rollout becomes healthy.
Hello. I'm using B/G deployment and struggling with the same issue: I need to perform a cache purge after the new version successfully rolls out. Do we have any workaround for this? I'm using ArgoCD 1.7.5 and Argo Rollouts 0.9.1. Thanks
@jessesuen and @hoantran3108 thank you very much for your quick responses. I was thinking of upgrading both ArgoCD and Argo Rollouts, assuming the new releases would contain the fix for this issue, but after @hoantran3108's comment I'm starting to doubt it. Any recommendations?
Thanks for the data points. I will need some time to reproduce this.
I've been doing some tests and I noticed that the PostSync job gets triggered BEFORE the Rollout becomes healthy.
We can confirm that this also happens with ArgoCD version 1.7.7.
We found a workaround. We run the following bash script as a PostSync job with sync-wave: 1. The script waits until the preview service and the active service point to the same replica set (we noticed that when the first PostSync job is triggered, they still point to different replica sets). Once this script finishes, it means the new rollout is healthy, so the subsequent PostSync jobs (sync-wave: 2 and so forth) are triggered.
Maybe you can use this information to detect where the issue is.
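For illustration only, here is a minimal sketch of the kind of wait script described above (not the original script). It assumes kubectl and jq are available in the hook image, the namespace and Service names are placeholders, and it relies on the rollouts-pod-template-hash selector that Argo Rollouts injects into the blue-green active and preview Services:

```bash
#!/usr/bin/env bash
# Sketch of the wait-for-promotion idea (hypothetical names; not the original script).
set -euo pipefail

NAMESPACE="${NAMESPACE:-default}"            # assumption: namespace of the Rollout
ACTIVE_SVC="${ACTIVE_SVC:-my-app-active}"    # assumption: active Service name
PREVIEW_SVC="${PREVIEW_SVC:-my-app-preview}" # assumption: preview Service name

selector_hash() {
  # Argo Rollouts injects a rollouts-pod-template-hash selector into both Services;
  # it identifies the ReplicaSet each Service currently routes to.
  kubectl -n "$NAMESPACE" get svc "$1" -o json \
    | jq -r '.spec.selector["rollouts-pod-template-hash"] // empty'
}

# Block until both Services route to the same ReplicaSet, i.e. the new version
# has been promoted and the Rollout is effectively healthy.
while true; do
  active="$(selector_hash "$ACTIVE_SVC")"
  preview="$(selector_hash "$PREVIEW_SVC")"
  if [ -n "$active" ] && [ "$active" = "$preview" ]; then
    echo "active and preview point to ReplicaSet hash $active; promotion complete"
    break
  fi
  echo "waiting: active=$active preview=$preview"
  sleep 5
done
```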
That helps. I think I may have identified a flaw in the health.lua which returns healthy too early. I can try to provide what I think is a workaround health.lua until it is fixed in Argo CD.
@jessesuen thank you very much for your help. I've checked the health.lua script and, according to that code, the healthy status seems to be set appropriately. Could the issue exist in the code that listens for this change?
@hoantran3108 can you try the script above and confirm that it fixes your issue?
I can reproduce this, and I believe I understand the problem. Right after applying the Rollout, Argo CD performs the health check of the Rollout too quickly, before the rollout controller has had a chance to update the status with the new information. So the sequence of events is: Argo CD applies the updated Rollout spec, immediately evaluates health.lua against the still-stale status (which reflects the previous, healthy revision), considers the Rollout Healthy, and fires the PostSync hooks before the rollout controller has marked the Rollout as Progressing.
The way Kubernetes solves this for its built-in workloads can be seen in their generation/observedGeneration handling. A similar fix needs to happen in Rollouts: the rollout controller should record the metadata.generation it has acted on in status (as observedGeneration). With that controller fix in place, the Rollout health check can treat any Rollout whose status.observedGeneration lags behind metadata.generation as still Progressing.
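As a small, hedged illustration of that comparison (the rollout name below is a placeholder, and it assumes a controller that records the observed metadata.generation in status.observedGeneration, as described above):

```bash
# Hypothetical rollout name; requires kubectl and jq.
ROLLOUT="my-rollout"

# metadata.generation is bumped by the API server on every spec change, while
# status.observedGeneration is only updated once the controller has reconciled
# that spec. If they differ, the status is stale and the Rollout should be
# treated as Progressing rather than Healthy.
if kubectl get rollout "$ROLLOUT" -o json \
   | jq -e '.metadata.generation == (.status.observedGeneration | tonumber? // -1)' > /dev/null; then
  echo "controller has observed the latest spec; status can be trusted"
else
  echo "status is stale; treat the Rollout as Progressing"
fi
```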
Looked into this. I've verified that as far back as Kubernetes v1.13, upstream Kubernetes has relied on this generation/observedGeneration comparison when assessing rollout status.
An easy way to reproduce this is to shut down the argo-rollouts controller and then sync an Application that has an already-healthy Rollout and a PostSync hook in Argo CD. Since the rollout controller is offline and can't update status, Argo CD will always consider the Rollout Healthy and the hook will fire.
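For example, a sketch of that reproduction, assuming the default argo-rollouts install and a placeholder application name:

```bash
# Take the rollout controller offline so it can no longer refresh Rollout status
# (namespace/deployment names assume the default argo-rollouts installation).
kubectl -n argo-rollouts scale deployment argo-rollouts --replicas=0

# Sync an application whose Rollout is already Healthy and which defines a
# PostSync hook ("my-app" is a placeholder). Argo CD only sees the stale
# Healthy status, so the hook fires even though nothing was reconciled.
argocd app sync my-app

# Restore the controller afterwards.
kubectl -n argo-rollouts scale deployment argo-rollouts --replicas=1
```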
The fix for this is much more involved than I anticipated. In order to properly use metadata.generation, the controller will need to switch to using the status subresource in the CRD definition. When I attempted to record metadata.generation without using the status subresource, the controller entered an infinite reconciliation loop: it continually recorded metadata.generation in status, which caused metadata.generation to be incremented, which caused the controller to record the new metadata.generation in status again, and so on. The use of the status subresource feature in Kubernetes comes with some challenges of its own.
@jessesuen thank you for the research. Looking forward to news about the fix.
@jessesuen is there a fix in 0.9.2?
There won't be a fix in v0.9.x. The fix carries some risk as it requires us to use the status subresource, which forces us to deal with the challenges that I mentioned above, touching more code than I think is safe in v0.9.
@jessesuen thanks, I missed that. It would probably be better to mention that PostSync hooks are currently unusable with Argo Rollouts, because people might lose a lot of time before they find this issue (just like I did :) )
@jessesuen does this fix need to be implemented in Argo Rollouts, in Argo CD, or both?
I discussed this with @alexmt today. There is a combination of things we'll need to do.
Near term: have Argo CD introduce a short, configurable delay between sync waves, so that controllers have a chance to react and update resource status before the next wave fires.
Longer term: switch Argo Rollouts to the status subresource and record metadata.generation in status, so that health can be assessed reliably.
@jessesuen I see. Thank you for the update. Willing to test the near-term solution as soon as it is released.
The near term solution (argoproj/argo-cd#4669) is merged into Argo CD and will be available in Argo CD v1.8. Unfortunately it cannot be safely backported to v1.7 because of the risk posed by other changes made to gitops-engine.
This has been implemented in Argo CD as well. |
So is there anything left other than releasing Argo CD 1.8? |
The Argo CD v1.8 fix to delay waves by a few seconds should work most of the time. The 2-second delay between waves (which is configurable) is generally sufficient to allow the rollout controller (or other controllers) to react and update the Rollout into a Progressing state. That said, it is not 100% reliable. For 100% reliability, Argo Rollouts needs to start using the status subresource. The status subresource allows generation changes (i.e. spec changes) to be confirmed via the Rollout status once the rollout controller has acted on them. The status subresource may not make v0.10, as it is a riskier change.
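For anyone who wants a larger buffer than the default, here is a sketch of raising the delay. It assumes the delay is exposed as the ARGOCD_SYNC_WAVE_DELAY environment variable on the application controller, per the Argo CD sync-waves documentation; the workload kind and namespace may differ by version:

```bash
# Raise the delay between sync waves to 10 seconds (assumption: the
# ARGOCD_SYNC_WAVE_DELAY variable documented for the application controller;
# newer Argo CD versions run the controller as a StatefulSet instead).
kubectl -n argocd set env deployment/argocd-application-controller \
  ARGOCD_SYNC_WAVE_DELAY=10
```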
Awesome thanks for clarifying this! |
This will be in v0.10. However, if using Argo Rollouts v0.10 with Argo CD, it will need to be used with Argo CD v1.8+.
Hello. I'm using argo-cd version 1.6.2 with argo-rollouts version 0.7.2 to perform BlueGreen deployments. I've configured a bunch of PostSync hooks. My expectation is that these hooks should be triggered after the BlueGreen deploy finishes (the new version is already deployed, all the new pods have been created and the "active" service is pointing to the new version). However, I noticed that the hooks are triggered while the pods of the new application version are still being created.
Is this a bug, or was my expectation incorrect? If my expectation is incorrect, how could I implement something like a PostSync job but for argo-rollouts?
Thanks in advance!