
Argo CD can fire PostSync hooks prematurely (before Rollout is Healthy) #771

Closed · ferwasy opened this issue Oct 7, 2020 · 26 comments · Fixed by #789


ferwasy commented Oct 7, 2020

Hello. I'm using argo-cd version 1.6.2 with argo-rollouts version 0.7.2 to perform BlueGreen deployments. I've configured a bunch of PostSync hooks. My expectation is that these hooks should be triggered after the BlueGreen deploy finishes (the new version is already deployed, all the new pods have been created and the "active" service is pointing to the new version). However, I noticed that the hooks are triggered while the pods of the new application version are still being created.
Is this a bug, or was my expectation incorrect? If my expectation is incorrect, how could I implement something like a PostSync job, but for argo-rollouts?
Thanks in advance!

ferwasy added the "question" label Oct 7, 2020
jessesuen (Member) commented:

However, I noticed that the hooks are triggered while the pods of the new application version are still being created.
Is this a bug, or was my expectation incorrect?

This sounds like a bug, though given the versions in use it may already be addressed in later releases. It's governed by the health.lua script that is baked into Argo CD: the hooks should only fire once the Rollout becomes Healthy.

hoantran3108 commented:

Hello. I'm using B/G deployment and struggling with the same issue: I need to perform a cache purge after the new version successfully rolls out. Do we have any workaround for this? I'm using Argo CD 1.7.5 and Argo Rollouts 0.9.1.

Thanks

ferwasy (Author) commented Oct 8, 2020

@jessesuen and @hoantran3108, thank you very much for your quick responses. I was thinking of upgrading both Argo CD and Argo Rollouts, assuming that the new releases would contain the fix for this issue, but after @hoantran3108's comment I'm starting to doubt it. Any recommendations?

jessesuen (Member) commented:

Thanks for the data points. I will need some time to reproduce this.

ferwasy (Author) commented Oct 8, 2020

I've been doing some tests and I noticed that the PostSync job gets triggered BEFORE status.blueGreen.activeSelector equals status.blueGreen.previewSelector, which is the condition for the "healthy" state in health.lua.

ferwasy (Author) commented Oct 9, 2020

We can confirm that this also happens with Argo CD version 1.7.7.

ferwasy (Author) commented Oct 9, 2020

We found a workaround. We run the following bash script as a PostSync job with sync-wave: 1. The script waits until the preview service and the active service point to the same replica set (we noticed that when the first PostSync job is triggered, they already point to different replica sets). Once this script finishes, it means the new rollout is healthy, so the subsequent PostSync jobs (sync-wave: 2 and so forth) are triggered.


#!/bin/bash

# Succeeds (exit 0) when the Rollout's preview and active services select the
# same replica set, i.e. the blue-green promotion has completed.
previewActiveSelectorsAreEqual() {
  ROLLOUT=$1
  NAMESPACE=$2
  # Read both selectors from the Rollout status in a single kubectl call.
  selectors=($(kubectl get rollout "$ROLLOUT" -o json -n "$NAMESPACE" | jq -r '.status.blueGreen.previewSelector + " " + .status.blueGreen.activeSelector'))
  previewSelector=${selectors[0]}
  activeSelector=${selectors[1]}
  [ "$previewSelector" = "$activeSelector" ]
}

if [ "$#" -ne 2 ]
then
  echo "Usage: ./wait-for-active-service-update.sh <ROLLOUT> <NAMESPACE>"
  exit 1
fi

ROLLOUT=$1
NAMESPACE=$2
echo -e "Starting to scan active and preview services ...\n"
# If the selectors already match, this sync did not deploy a new version.
if previewActiveSelectorsAreEqual "$ROLLOUT" "$NAMESPACE"; then
  echo -e "Preview and active services point to the same replica set. The sync operation does not involve an application deployment\n"
  exit 0
fi
# Otherwise poll every 2 seconds until the active service switches over.
while ! previewActiveSelectorsAreEqual "$ROLLOUT" "$NAMESPACE"
do
  echo -e "Preview and active services do NOT point to the same replica set. Waiting until both point to the same one ...\n"
  sleep 2
done
echo -e "Preview and active services point to the same replica set. Exiting."

Maybe you can use this information to detect where the issue is.

jessesuen (Member) commented:

That helps. I think I may have identified a flaw in health.lua that causes it to return Healthy too early. I can try to provide what I think is a workaround health.lua until it is fixed in Argo CD.
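
Something along these lines, registered as a custom health check under resource.customizations in the argocd-cm ConfigMap, might serve as a stopgap. This is only an untested sketch mirroring the bash workaround above; the extra previewSelector comparison is an assumption, not necessarily the final fix:

-- Sketch of a stricter health check for argoproj.io/Rollout.
hs = {}
if obj.status ~= nil and obj.status.blueGreen ~= nil then
  local bg = obj.status.blueGreen
  -- Only report Healthy once the active service selects the current pods AND
  -- the preview selector has converged to the active selector (assumption,
  -- mirroring the wait-for-active-service-update.sh workaround).
  if bg.activeSelector ~= nil and bg.activeSelector == obj.status.currentPodHash
      and (bg.previewSelector == nil or bg.previewSelector == bg.activeSelector) then
    hs.status = "Healthy"
    hs.message = "The active Service is serving traffic to the current pod spec"
    return hs
  end
end
hs.status = "Progressing"
hs.message = "Waiting for rollout to finish"
return hs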

ferwasy (Author) commented Oct 12, 2020

@jessesuen thank you very much for your help. I've checked the health.lua script, and according to this code:

if obj.status.blueGreen ~= nil and obj.status.blueGreen.activeSelector ~= nil and obj.status.blueGreen.activeSelector == obj.status.currentPodHash then
  hs.status = "Healthy"
  hs.message = "The active Service is serving traffic to the current pod spec"
  return hs
end

the Healthy status seems to be set appropriately. Could the issue exist in the code that "listens" for this change?

ferwasy (Author) commented Oct 13, 2020

Hello. I'm using B/G deployment and struggling with the same issue: I need to perform a cache purge after the new version successfully rolls out. Do we have any workaround for this? I'm using Argo CD 1.7.5 and Argo Rollouts 0.9.1.

Thanks

@hoantran3108 can you try the script above and confirm that it fixes your issue?

jessesuen self-assigned this Oct 19, 2020
jessesuen (Member) commented:

I can reproduce this, and I believe I understand the problem: right after applying the Rollout, Argo CD performs the health check of the Rollout too quickly, before the Rollout controller has had a chance to update the status with the new information. So the sequence of events is:

  1. argo-cd applies the new rollout spec
  2. argo-cd immediately assesses the health of the rollout (before the rollout-controller has finished reconciling/updating the rollout)
  3. argo-cd thinks the Rollout is Healthy because it is assessing it based on stale data
  4. since the app is Healthy, the PostSync job fires too early

The way Kubernetes solves this can be seen in kubectl rollout status:
https://github.com/kubernetes/kubernetes/blob/5232ad4a00ec93942d0b2c6359ee6cd1201b46bc/pkg/kubectl/rollout_status.go#L80

A similar fix needs to happen in Rollouts: the rollout controller should record status.observedGeneration from metadata.generation. Currently, for historical reasons, status.observedGeneration is based on a hash of the rollout spec rather than the numeric value of metadata.generation. This is because in earlier versions of Kubernetes (not sure which), metadata.generation of CRDs was not updated by the API server and was always 0.

With the above controller fix in place, the Rollout health.lua needs to be updated to check whether status.observedGeneration is less than metadata.generation. If so, the Rollout should be considered Progressing, not Healthy.
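
In health.lua terms, the guard would look roughly like this (a sketch, assuming the controller fix above so that status.observedGeneration is the integer metadata.generation):

-- Sketch: treat the Rollout as Progressing until the controller has observed
-- the latest spec generation.
hs = {}
local observed = nil
if obj.status ~= nil then
  -- A missing or legacy hash-style observedGeneration parses to nil and is
  -- treated as "not yet observed".
  observed = tonumber(obj.status.observedGeneration)
end
if observed == nil or observed < obj.metadata.generation then
  hs.status = "Progressing"
  hs.message = "Waiting for rollout spec update to be observed"
  return hs
end
-- ...the existing blue-green/canary checks would continue here...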

jessesuen added the "bug" label and removed the "question" label Oct 19, 2020
jessesuen added this to the v0.10 milestone Oct 19, 2020
jessesuen changed the title from "Should argo-cd PostSync hooks be triggered after BlueGreen deploy finishes?" to "Argo CD can fire PostSync hooks prematurely (before Rollout is Healthy)" Oct 19, 2020
jessesuen (Member) commented Oct 19, 2020

This is because in earlier versions of Kubernetes (not sure which), metadata.generation of CRDs was not updated by the API server and was always 0.

Looked into this. I've verified that as far back as Kubernetes v1.13, metadata.generation is populated with values. With that piece of information, it should be safe to switch completely to an integer status.observedGeneration. I didn't bother trying v1.12.

The problem is that right after applying the Rollout, Argo CD performs the health check of a Rollout too quickly, before the Rollout controller has a chance to update the status with the new information.

An easy way to reproduce this is to shut down the argo-rollouts controller and then, in Argo CD, sync an Application with an already-Healthy Rollout and a PostSync hook. Since the rollout controller is offline and can't update status, Argo CD will always consider the Rollout Healthy and the hook will fire.
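
Concretely, something like this (the app name and namespaces are hypothetical; assumes a default argo-rollouts install):

# Take the rollout controller offline so it cannot update Rollout status.
kubectl -n argo-rollouts scale deployment argo-rollouts --replicas=0
# Sync an Application containing an already-Healthy Rollout and a PostSync hook.
argocd app sync my-app
# The PostSync hook Job fires even though the health was assessed on stale data.
kubectl -n my-app-namespace get jobs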

jessesuen (Member) commented:

The fix for this is much more involved than I anticipated. In order to properly use metadata.generation, the controller will need to switch to using the status subresource in the CRD definition. When I attempted to record metadata.generation without using the status subresource, the controller entered an infinite reconciliation loop: it records metadata.generation in status, that write increments metadata.generation, the controller records the new metadata.generation in status again, and rinse and repeat.

The use of the status subresource feature in Kubernetes has some challenges:

  1. spec/status cannot be updated in the same HTTP request (which we currently do in the plugin's promote CLI command)
  2. status would no longer be updatable by Argo CD, which currently doesn't have a feature to leverage the status subresource endpoint

ferwasy (Author) commented Oct 19, 2020

@jessesuen thank you for the research. Looking forward to news about the fix.

dobegor commented Oct 26, 2020

@jessesuen is there a fix in 0.9.2?

jessesuen (Member) commented:

There won't be a fix in v0.9.x. The fix carries some risk: it requires us to use the status subresource, which forces us to deal with the challenges I mentioned above, touching more code than I think is safe to change in v0.9.

dobegor commented Oct 26, 2020

@jessesuen thanks, I missed that. It would probably be better to mention that PostSync hooks are unusable with Argo Rollouts ATM because people might lose a lot of time before they find this issue (just like I did :) )

ferwasy (Author) commented Oct 26, 2020

@jessesuen does this fix need to be implemented in argo-cd or argo-rollouts?
Thanks!

jessesuen (Member) commented Oct 26, 2020

I discussed this with @alexmt today. There is a combination of things we'll need to do:

Near term:

  1. Introduce an artificial delay between sync waves/phases in Argo CD. This gives 3rd-party controllers the opportunity to respond to spec changes and update status. It will benefit all custom resources with health checks, not just Rollouts. This addresses the problem in the near term, but it is not foolproof. Tracked as argoproj/argo-cd#4669 (Argo CD might progress through waves/PostSync prematurely).

Longer term:

  1. Rollouts should switch to the status subresource and record metadata.generation to status.observedGeneration.

  2. The health check (in both Argo CD and the plugin) will be updated to consider a Rollout as Progressing if observedGeneration != metadata.generation.

  3. After Rollouts switches to the status subresource, this will break the "Resume" action in Argo CD, because Argo CD does not support patching the /status endpoint. Argo CD resource actions need to transparently support CRDs that use the status subresource by knowing to patch the /status endpoint. Tracked as argoproj/argo-cd#4670 (Support actions for CRDs that use status subresource).

ferwasy (Author) commented Oct 27, 2020

@jessesuen I see. Thank you for the update. Willing to test the near-term solution as soon as it is released.

jessesuen (Member) commented Nov 2, 2020

Introduce an artificial delay between sync waves/phases in Argo CD. This gives 3rd-party controllers the opportunity to respond to spec changes and update status. It will benefit all custom resources with health checks, not just Rollouts. This addresses the problem in the near term, but it is not foolproof. Tracked as argoproj/argo-cd#4669.

The near-term solution (argoproj/argo-cd#4669) is merged into Argo CD and will be available in Argo CD v1.8. Unfortunately, it cannot safely be backported to v1.7 because of the risk from other changes made to gitops-engine.

jessesuen (Member) commented:

Argo CD resource actions need to transparently support CRDs that use the status subresource by knowing to patch the /status endpoint. Argo CD needs to implement argoproj/argo-cd#4670.

This has been implemented in Argo CD as well.

dobegor commented Nov 5, 2020

So is there anything left other than releasing Argo CD 1.8?

jessesuen (Member) commented Nov 5, 2020

The Argo CD v1.8 fix to delay waves by a few seconds should work most of the time. The 2-second delay between waves (which is configurable) is generally sufficient to allow the rollout controller (or other controllers) to react and update the rollout into a Progressing state.
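
For reference, the inter-wave delay can be tuned via an environment variable on the application controller. A sketch, assuming ARGOCD_SYNC_WAVE_DELAY is the relevant variable and a default "argocd" namespace install (the controller's workload kind varies by Argo CD version):

# Raise the delay between sync waves from 2s to 10s (assumption: the
# ARGOCD_SYNC_WAVE_DELAY variable; verify against your Argo CD version's docs).
kubectl -n argocd set env statefulset/argocd-application-controller ARGOCD_SYNC_WAVE_DELAY=10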

That said, it is not 100% reliable. For 100% reliability, Argo Rollouts needs to start using the status subresource. Using the status subresource allows generation changes (i.e. spec changes) to be confirmed via the rollout status, once the rollout controller has acted on them.

The status subresource may not make v0.10, as it is a riskier change.

jessesuen modified the milestones: v0.10 → v0.11 Nov 6, 2020
dobegor commented Nov 6, 2020

Awesome, thanks for clarifying this!

jessesuen (Member) commented:

The status subresource may not make v0.10, as it is a riskier change.

This will be in v0.10. However, if you use Argo Rollouts v0.10 with Argo CD, it will need to be paired with Argo CD v1.8+.
