
Argo CD can fire PostSync hooks prematurely (before Rollout is Healthy) #771

Closed · ferwasy opened this issue Oct 7, 2020 · 26 comments · Fixed by #789


ferwasy commented Oct 7, 2020

Hello. I'm using argo-cd version 1.6.2 with argo-rollouts version 0.7.2 to perform BlueGreen deployments. I've configured a bunch of PostSync hooks. My expectation is that these hooks should be triggered after the BlueGreen deploy finishes (the new version is already deployed, all the new pods have been created and the "active" service is pointing to the new version). However, I noticed that the hooks are triggered while the pods of the new application version are still being created.
Is this a bug, or was my expectation incorrect? If my expectation is incorrect, how could I implement something like a PostSync job, but for argo-rollouts?
Thanks in advance!

ferwasy added the "question" label Oct 7, 2020
jessesuen (Member) commented:

However, I noticed that the hooks are triggered while the pods of the new application version are still being created.
Is this a bug, or was my expectation incorrect?

This sounds like a bug, though given the versions in use it may already be addressed in later releases. It's governed by the health.lua script that is baked into Argo CD: the hooks should only fire once the Rollout becomes Healthy.

hoantran3108 commented:

Hello. I'm using B/G deployment and struggling with the same issue: I need to perform a cache purge after the new version successfully rolls out. Do we have any workaround for this? I'm using Argo CD 1.7.5 and Argo Rollouts 0.9.1.

Thanks

ferwasy (Author) commented Oct 8, 2020

@jessesuen and @hoantran3108, thank you very much for your quick responses. I was thinking of upgrading both Argo CD and Argo Rollouts, assuming that the new releases would contain the fix for this issue, but after @hoantran3108's comment I'm starting to doubt it. Any recommendations?

jessesuen (Member) commented:

Thanks for the data points. I will need some time to reproduce this.

ferwasy (Author) commented Oct 8, 2020

I've been doing some tests and I noticed that the PostSync job gets triggered BEFORE status.blueGreen.activeSelector equals status.blueGreen.previewSelector, which is the condition for the "healthy" state in health.lua.

ferwasy (Author) commented Oct 9, 2020

We can confirm that this also happens with Argo CD version 1.7.7.

ferwasy (Author) commented Oct 9, 2020

We found a workaround. We run the following bash script as a PostSync job with sync-wave: 1. The script waits until the preview service and the active service point to the same replica set (we noticed that when the first PostSync job is triggered, they already point to different replica sets). Once this script finishes, it means the new rollout is healthy, so the subsequent PostSync jobs (sync-wave: 2 and so forth) are triggered.


#!/bin/bash

# Succeeds (exit 0) when the Rollout's preview and active services select the
# same replica set, i.e. the blue-green promotion has completed.
previewActiveSelectorsAreEqual() {
  ROLLOUT=$1
  NAMESPACE=$2
  # Read both selectors from the Rollout status in a single kubectl call.
  selectors=($(kubectl get rollout "$ROLLOUT" -o json -n "$NAMESPACE" | jq -r '.status.blueGreen.previewSelector + " " + .status.blueGreen.activeSelector'))
  previewSelector=${selectors[0]}
  activeSelector=${selectors[1]}
  [ "$previewSelector" = "$activeSelector" ]
}

if [ "$#" -ne 2 ]
then
  echo "Usage: ./wait-for-active-service-update.sh <ROLLOUT> <NAMESPACE>"
  exit 1
fi

ROLLOUT=$1
NAMESPACE=$2
echo -e "Starting to scan active and preview services ...\n"
# If the selectors already match, this sync did not deploy a new version.
if previewActiveSelectorsAreEqual "$ROLLOUT" "$NAMESPACE"; then
  echo -e "Preview and active services point to the same replica set. The sync operation does not involve an application deployment\n"
  exit 0
fi
# Otherwise poll every 2 seconds until the active service switches over.
while ! previewActiveSelectorsAreEqual "$ROLLOUT" "$NAMESPACE"
do
  echo -e "Preview and active services do NOT point to the same replica set. Waiting until both point to the same one ...\n"
  sleep 2
done
echo -e "Preview and active services point to the same replica set. Exiting."

Maybe you can use this information to detect where the issue is.

jessesuen (Member) commented:

That helps. I think I may have identified a flaw in health.lua that causes it to return Healthy too early. I can try to provide what I think is a workaround health.lua until it is fixed in Argo CD.
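
Something along these lines, registered as a custom health check under resource.customizations in the argocd-cm ConfigMap, might serve as a stopgap. This is only an untested sketch mirroring the bash workaround above; the extra previewSelector comparison is an assumption, not necessarily the final fix:

-- Sketch of a stricter health check for argoproj.io/Rollout.
hs = {}
if obj.status ~= nil and obj.status.blueGreen ~= nil then
  local bg = obj.status.blueGreen
  -- Only report Healthy once the active service selects the current pods AND
  -- the preview selector has converged to the active selector (assumption,
  -- mirroring the wait-for-active-service-update.sh workaround).
  if bg.activeSelector ~= nil and bg.activeSelector == obj.status.currentPodHash
      and (bg.previewSelector == nil or bg.previewSelector == bg.activeSelector) then
    hs.status = "Healthy"
    hs.message = "The active Service is serving traffic to the current pod spec"
    return hs
  end
end
hs.status = "Progressing"
hs.message = "Waiting for rollout to finish"
return hs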

ferwasy (Author) commented Oct 12, 2020

@jessesuen thank you very much for your help. I've checked the health.lua script, and according to this code:

if obj.status.blueGreen ~= nil and obj.status.blueGreen.activeSelector ~= nil and obj.status.blueGreen.activeSelector == obj.status.currentPodHash then
  hs.status = "Healthy"
  hs.message = "The active Service is serving traffic to the current pod spec"
  return hs
end

the Healthy status seems to be set appropriately. Could the issue exist in the code that "listens" for this change?

ferwasy (Author) commented Oct 13, 2020

Hello. I'm using B/G deployment and struggling with the same issue: I need to perform a cache purge after the new version successfully rolls out. Do we have any workaround for this? I'm using Argo CD 1.7.5 and Argo Rollouts 0.9.1.

Thanks

@hoantran3108 can you try the script above and confirm that it fixes your issue?

jessesuen self-assigned this Oct 19, 2020
jessesuen (Member) commented:

I can reproduce this, and I believe I understand the problem: right after applying the Rollout, Argo CD performs the health check of the Rollout too quickly, before the Rollout controller has had a chance to update the status with the new information. So the sequence of events is:

  1. argo-cd applies the new rollout spec
  2. argo-cd immediately assesses the health of the rollout (before the rollout-controller has finished reconciling/updating the rollout)
  3. argo-cd thinks the Rollout is Healthy because it is assessing it based on stale data
  4. since the app is Healthy, the PostSync job fires too early

The way Kubernetes solves this can be seen in kubectl rollout status:
https://github.com/kubernetes/kubernetes/blob/5232ad4a00ec93942d0b2c6359ee6cd1201b46bc/pkg/kubectl/rollout_status.go#L80

A similar fix needs to happen in Rollouts: the rollout controller should record status.observedGeneration from metadata.generation. Currently, for historical reasons, status.observedGeneration is based on a hash of the rollout spec rather than the numeric value of metadata.generation. This is because in earlier versions of Kubernetes (not sure which), metadata.generation of CRDs was not updated by the API server and was always 0.

With the above controller fix in place, the Rollout health.lua needs to be updated to check whether status.observedGeneration is less than metadata.generation. If so, the Rollout should be considered Progressing, not Healthy.
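
In health.lua terms, the guard would look roughly like this (a sketch, assuming the controller fix above so that status.observedGeneration is the integer metadata.generation):

-- Sketch: treat the Rollout as Progressing until the controller has observed
-- the latest spec generation.
hs = {}
local observed = nil
if obj.status ~= nil then
  -- A missing or legacy hash-style observedGeneration parses to nil and is
  -- treated as "not yet observed".
  observed = tonumber(obj.status.observedGeneration)
end
if observed == nil or observed < obj.metadata.generation then
  hs.status = "Progressing"
  hs.message = "Waiting for rollout spec update to be observed"
  return hs
end
-- ...the existing blue-green/canary checks would continue here...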

jessesuen added the "bug" label and removed the "question" label Oct 19, 2020
jessesuen added this to the v0.10 milestone Oct 19, 2020
jessesuen changed the title from "Should argo-cd PostSync hooks be triggered after BlueGreen deploy finishes?" to "Argo CD can fire PostSync hooks prematurely (before Rollout is Healthy)" Oct 19, 2020
jessesuen (Member) commented Oct 19, 2020

This is because in earlier versions of Kubernetes (not sure which), metadata.generation of CRDs was not updated by the API server and was always 0.

Looked into this. I've verified that as far back as Kubernetes v1.13, metadata.generation is populated with values. With that piece of information, it should be safe to switch completely to an integer status.observedGeneration. I didn't bother trying v1.12.

The problem is that right after applying the Rollout, Argo CD performs the health check of a Rollout too quickly, before the Rollout controller has a chance to update the status with the new information.

An easy way to reproduce this is to shut down the argo-rollouts controller and then, in Argo CD, sync an Application with an already-Healthy Rollout and a PostSync hook. Since the rollout controller is offline and can't update status, Argo CD will always consider the Rollout Healthy and the hook will fire.
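
Concretely, something like this (the app name and namespaces are hypothetical; assumes a default argo-rollouts install):

# Take the rollout controller offline so it cannot update Rollout status.
kubectl -n argo-rollouts scale deployment argo-rollouts --replicas=0
# Sync an Application containing an already-Healthy Rollout and a PostSync hook.
argocd app sync my-app
# The PostSync hook Job fires even though the health was assessed on stale data.
kubectl -n my-app-namespace get jobs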

jessesuen (Member) commented:

The fix for this is much more involved than I anticipated. In order to properly use metadata.generation, the controller will need to switch to using the status subresource in the CRD definition. When I attempted to record metadata.generation without using the status subresource, the controller entered an infinite reconciliation loop: it records metadata.generation in status, that write increments metadata.generation, the controller records the new metadata.generation in status again, and rinse and repeat.

The use of the status subresource feature in Kubernetes has some challenges:

  1. spec/status cannot be updated in the same HTTP request (which we currently do in the plugin's promote CLI command)
  2. status would no longer be updatable by Argo CD, which currently doesn't have a feature to leverage the status subresource endpoint

ferwasy (Author) commented Oct 19, 2020

@jessesuen thank you for the research. Looking forward to news about the fix.

dobegor commented Oct 26, 2020

@jessesuen is there a fix in 0.9.2?

jessesuen (Member) commented:

There won't be a fix in v0.9.x. The fix carries some risk: it requires us to use the status subresource, which forces us to deal with the challenges I mentioned above, touching more code than I think is safe to change in v0.9.

dobegor commented Oct 26, 2020

@jessesuen thanks, I missed that. It would probably be better to mention that PostSync hooks are unusable with Argo Rollouts ATM because people might lose a lot of time before they find this issue (just like I did :) )

ferwasy (Author) commented Oct 26, 2020

@jessesuen does this fix need to be implemented in argo-cd or argo-rollouts?
Thanks!

jessesuen (Member) commented Oct 26, 2020

I discussed this with @alexmt today. There is a combination of things we'll need to do:

Near term:

  1. Introduce an artificial delay between sync waves/phases in Argo CD. This gives 3rd-party controllers the opportunity to respond to spec changes and update status. It will benefit all custom resources with health checks, not just Rollouts. This addresses the problem in the near term, but it is not foolproof. Tracked as argoproj/argo-cd#4669 (Argo CD might progress through waves/PostSync prematurely).

Longer term:

  1. Rollouts should switch to the status subresource and record metadata.generation to status.observedGeneration.

  2. The health check (in both Argo CD and the plugin) will be updated to consider a Rollout as Progressing if observedGeneration != metadata.generation.

  3. After Rollouts switches to the status subresource, this will break the "Resume" action in Argo CD, because Argo CD does not support patching the /status endpoint. Argo CD resource actions need to transparently support CRDs that use the status subresource by knowing to patch the /status endpoint. Tracked as argoproj/argo-cd#4670 (Support actions for CRDs that use status subresource).

ferwasy (Author) commented Oct 27, 2020

@jessesuen I see. Thank you for the update. Willing to test the near-term solution as soon as it is released.

jessesuen (Member) commented Nov 2, 2020

Introduce an artificial delay between sync waves/phases in Argo CD. This gives 3rd-party controllers the opportunity to respond to spec changes and update status. It will benefit all custom resources with health checks, not just Rollouts. This addresses the problem in the near term, but it is not foolproof. Tracked as argoproj/argo-cd#4669.

The near-term solution (argoproj/argo-cd#4669) is merged into Argo CD and will be available in Argo CD v1.8. Unfortunately, it cannot safely be backported to v1.7 because of the risk from other changes made to gitops-engine.

jessesuen (Member) commented:

Argo CD resource actions need to transparently support CRDs that use the status subresource by knowing to patch the /status endpoint. Argo CD needs to implement argoproj/argo-cd#4670.

This has been implemented in Argo CD as well.

dobegor commented Nov 5, 2020

So is there anything left other than releasing Argo CD 1.8?

jessesuen (Member) commented Nov 5, 2020

The Argo CD v1.8 fix to delay waves by a few seconds should work most of the time. The 2-second delay between waves (which is configurable) is generally sufficient to allow the rollout controller (or other controllers) to react and update the rollout into a Progressing state.
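
For reference, the inter-wave delay can be tuned via an environment variable on the application controller. A sketch, assuming ARGOCD_SYNC_WAVE_DELAY is the relevant variable and a default "argocd" namespace install (the controller's workload kind varies by Argo CD version):

# Raise the delay between sync waves from 2s to 10s (assumption: the
# ARGOCD_SYNC_WAVE_DELAY variable; verify against your Argo CD version's docs).
kubectl -n argocd set env statefulset/argocd-application-controller ARGOCD_SYNC_WAVE_DELAY=10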

That said, it is not 100% reliable. For 100% reliability, Argo Rollouts needs to start using the status subresource. Using the status subresource allows generation changes (i.e. spec changes) to be confirmed via the rollout status, once the rollout controller has acted on them.

The status subresource may not make v0.10, as it is a riskier change.

jessesuen modified the milestones: v0.10 → v0.11 Nov 6, 2020
dobegor commented Nov 6, 2020

Awesome, thanks for clarifying this!

jessesuen (Member) commented:

The status subresource may not make v0.10, as it is a riskier change.

This will be in v0.10. However, if you use Argo Rollouts v0.10 with Argo CD, it will need to be paired with Argo CD v1.8+.
