Downtime On initial deployment #1731

Closed
nissy34 opened this issue Dec 26, 2021 · 11 comments · Fixed by #1777
Labels
bug Something isn't working

Comments

nissy34 commented Dec 26, 2021

Summary

On the first migration from a Deployment to a Rollout, I get a few seconds of downtime when using trafficRouting.
I assume that Argo Rollouts doesn't wait for the stable ReplicaSet (StableRS) to be ready before adding the pod hash to the stable service selector.

This can be seen from the logs and events.

In the events, we can see that the selector switch happened before the scale-up.

Events:
  Type    Reason                 Age   From                 Message
  ----    ------                 ----  ----                 -------
  Normal  RolloutUpdated         25m   rollouts-controller  Rollout updated to revision 1
  Normal  NewReplicaSetCreated   25m   rollouts-controller  Created ReplicaSet vod-stitcher-f7cc77d48 (revision 1)
  Normal  CreatingCanaryIngress  25m   rollouts-controller  Creating canary ingress `vod-stitcher-vod-stitcher-canary` with weight `0`
  Normal  RolloutCompleted       25m   rollouts-controller  Rollout completed update to revision 1 (f7cc77d48): Initial deploy
  Normal  SwitchService          25m   rollouts-controller  Switched selector for service 'vod-stitcher' from '' to 'f7cc77d48'
  Normal  ScalingReplicaSet      25m   rollouts-controller  Scaled up ReplicaSet vod-stitcher-f7cc77d48 (revision 1) from 0 to 1

In the logs, we first see:

time="2021-12-26T17:03:37Z" level=info msg="Switched selector for service 'linear-stitcher-public' from '' to '78cf8cfb69'" event_reason=SwitchService namespace=dai rollout=linear-stitcher

and a few log lines later we see:

time="2021-12-26T17:03:37Z" level=info msg="Scaled up ReplicaSet linear-stitcher-78cf8cfb69 (revision 1) from 0 to 1" event_reason=ScalingReplicaSet namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="Not finished reconciling new ReplicaSet 'linear-stitcher-78cf8cfb69'" namespace=dai rollout=linear-stitcher
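A minimal sketch of why this ordering causes downtime (hypothetical types and names for illustration, not the controller's actual code): a Service's ready endpoints are computed from pods that match its selector, so the instant the selector is switched to a pod-template hash whose pods are not yet ready, the Service has zero ready endpoints.

```go
package main

import "fmt"

// Pod models only the fields relevant here: its pod-template-hash label and readiness.
type Pod struct {
	Hash  string
	Ready bool
}

// readyEndpoints counts pods that match the service's pod-hash selector and are
// ready, mirroring how Endpoints are derived from a Service selector.
func readyEndpoints(selectorHash string, pods []Pod) int {
	n := 0
	for _, p := range pods {
		if p.Hash == selectorHash && p.Ready {
			n++
		}
	}
	return n
}

func main() {
	// At the moment the controller sets the stable selector to the new hash,
	// the new ReplicaSet has only just been scaled up and its pod is not ready:
	pods := []Pod{{Hash: "78cf8cfb69", Ready: false}}
	fmt.Println(readyEndpoints("78cf8cfb69", pods)) // 0 -> no healthy endpoints: downtime
}
```

Once the pod passes its readiness probe, the same call returns 1 and traffic flows again, which matches the "few seconds of downtime" observed above.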

Diagnostics

What version of Argo Rollouts are you running? 1.1.1

Paste the logs from the rollout controller

time="2021-12-26T16:03:20Z" level=info msg="rollout deleted" namespace=dai rollout=linear-stitcher
time="2021-12-26T16:03:20Z" level=info msg="syncing service" namespace=dai rollout=linear-stitcher service=linear-stitcher-canary
time="2021-12-26T16:03:20Z" level=info msg="syncing service" namespace=dai rollout=linear-stitcher service=linear-stitcher-public
time="2021-12-26T16:03:20Z" level=info msg="cleaned service" namespace=dai rollout=linear-stitcher service=linear-stitcher-public
time="2021-12-26T16:03:20Z" level=info msg="cleaned service" namespace=dai rollout=linear-stitcher service=linear-stitcher-canary
time="2021-12-26T17:03:26Z" level=info msg="Started syncing rollout" generation=1 namespace=dai resourceVersion=100362525 rollout=linear-stitcher
time="2021-12-26T17:03:26Z" level=info msg="Defaulting .spec.replica to 1" generation=1 namespace=dai resourceVersion=100362525 rollout=linear-stitcher
time="2021-12-26T17:03:26Z" level=info msg="persisted to informer" generation=2 namespace=dai resourceVersion=100362526 rollout=linear-stitcher
time="2021-12-26T17:03:26Z" level=info msg="Reconciliation completed" generation=1 namespace=dai resourceVersion=100362525 rollout=linear-stitcher time_ms=21.285433
time="2021-12-26T17:03:26Z" level=info msg="Started syncing rollout" generation=2 namespace=dai resourceVersion=100362526 rollout=linear-stitcher
time="2021-12-26T17:03:26Z" level=info msg="Assuming 78cf8cfb69 for new replicaset pod hash" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:26Z" level=info msg="Patched: {\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2021-12-26T17:03:26Z\",\"lastUpdateTime\":\"2021-12-26T17:03:26Z\",\"message\":\"Rollout \\\"linear-stitcher\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"},{\"lastTransitionTime\":\"2021-12-26T17:03:26Z\",\"lastUpdateTime\":\"2021-12-26T17:03:26Z\",\"message\":\"Rollout does not have minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"False\",\"type\":\"Available\"}],\"currentPodHash\":\"78cf8cfb69\",\"currentStepHash\":\"869d64f5c6\",\"currentStepIndex\":0,\"message\":\"more replicas need to be updated\",\"observedGeneration\":\"2\",\"phase\":\"Progressing\",\"selector\":\"app.kubernetes.io/component=linear-stitcher,app.kubernetes.io/name=dai-ott-bundle,componentname=linear-stitcher\"}}" generation=2 namespace=dai resourceVersion=100362526 rollout=linear-stitcher
time="2021-12-26T17:03:26Z" level=info msg="persisted to informer" generation=2 namespace=dai resourceVersion=100362527 rollout=linear-stitcher
time="2021-12-26T17:03:26Z" level=info msg="Reconciliation completed" generation=2 namespace=dai resourceVersion=100362526 rollout=linear-stitcher time_ms=22.358567999999998
time="2021-12-26T17:03:26Z" level=info msg="Started syncing rollout" generation=2 namespace=dai resourceVersion=100362527 rollout=linear-stitcher
time="2021-12-26T17:03:26Z" level=info msg="Updating replica set 'linear-stitcher-78cf8cfb69' revision from 0 to 1" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:26Z" level=info msg="Created ReplicaSet linear-stitcher-78cf8cfb69" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:27Z" level=info msg="Rollout updated to revision 1" event_reason=RolloutUpdated namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:27Z" level=info msg="Created ReplicaSet linear-stitcher-78cf8cfb69 (revision 1)" event_reason=NewReplicaSetCreated namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:27Z" level=info msg="Set rollout condition: &RolloutCondition{Type:Progressing,Status:True,LastUpdateTime:2021-12-26 17:03:27.957074912 +0000 UTC m=+16019.962491502,LastTransitionTime:2021-12-26 17:03:27.957075002 +0000 UTC m=+16019.962491582,Reason:NewReplicaSetCreated,Message:Created new replica set \"linear-stitcher-78cf8cfb69\",}" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:27Z" level=info msg="Found 1 TrafficRouting Reconcilers" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:27Z" level=info msg="Reconciling TrafficRouting with type 'Nginx'" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:27Z" level=info msg="canary ingress not found" ingress=linear-stitcher-linear-stitcher-canary namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:27Z" level=info msg="Creating canary ingress `linear-stitcher-linear-stitcher-canary` with weight `0`" event_reason=CreatingCanaryIngress namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:28Z" level=info msg="Previous weights: nil" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:28Z" level=info msg="New weights: &TrafficWeights{Canary:WeightDestination{Weight:0,ServiceName:linear-stitcher-canary,PodTemplateHash:78cf8cfb69,},Stable:WeightDestination{Weight:100,ServiceName:linear-stitcher-public,PodTemplateHash:,},Additional:[]WeightDestination{},Verified:nil,}" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:28Z" level=info msg="Skipping analysis: isAborted: false, promoteFull: false, rollbackToScaleDownDelay: false, initialDeploy: true" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:28Z" level=info msg="No StableRS exists to reconcile or matches newRS" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:28Z" level=info msg="Reconciling canary pause step (stepIndex: 0/19)" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:28Z" level=info msg="Not finished reconciling Canary Pause" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="Rollout completed update to revision 1 (78cf8cfb69): Initial deploy" event_reason=RolloutCompleted namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="Patched: {\"status\":{\"canary\":{\"weights\":{\"canary\":{\"podTemplateHash\":\"78cf8cfb69\",\"serviceName\":\"linear-stitcher-canary\",\"weight\":0},\"stable\":{\"serviceName\":\"linear-stitcher-public\",\"weight\":100}}},\"conditions\":[{\"lastTransitionTime\":\"2021-12-26T17:03:26Z\",\"lastUpdateTime\":\"2021-12-26T17:03:26Z\",\"message\":\"Rollout does not have minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"False\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2021-12-26T17:03:26Z\",\"lastUpdateTime\":\"2021-12-26T17:03:37Z\",\"message\":\"ReplicaSet \\\"linear-stitcher-78cf8cfb69\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"currentStepIndex\":19,\"stableRS\":\"78cf8cfb69\"}}" generation=2 namespace=dai resourceVersion=100362538 rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="persisted to informer" generation=2 namespace=dai resourceVersion=100362643 rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="Reconciliation completed" generation=2 namespace=dai resourceVersion=100362527 rollout=linear-stitcher time_ms=10818.726643
time="2021-12-26T17:03:37Z" level=info msg="Started syncing rollout" generation=2 namespace=dai resourceVersion=100362643 rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="Switched selector for service 'linear-stitcher-public' from '' to '78cf8cfb69'" event_reason=SwitchService namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="Found 1 TrafficRouting Reconcilers" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="Reconciling TrafficRouting with type 'Nginx'" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="syncing service" namespace=dai rollout=linear-stitcher service=linear-stitcher-public
time="2021-12-26T17:03:37Z" level=info msg="No changes to canary ingress - skipping patch" ingress=linear-stitcher-linear-stitcher-canary namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="Previous weights: &TrafficWeights{Canary:WeightDestination{Weight:0,ServiceName:linear-stitcher-canary,PodTemplateHash:78cf8cfb69,},Stable:WeightDestination{Weight:100,ServiceName:linear-stitcher-public,PodTemplateHash:,},Additional:[]WeightDestination{},Verified:nil,}" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="New weights: &TrafficWeights{Canary:WeightDestination{Weight:0,ServiceName:linear-stitcher-canary,PodTemplateHash:78cf8cfb69,},Stable:WeightDestination{Weight:100,ServiceName:linear-stitcher-public,PodTemplateHash:78cf8cfb69,},Additional:[]WeightDestination{},Verified:nil,}" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="No StableRS exists to reconcile or matches newRS" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="Scaled up ReplicaSet linear-stitcher-78cf8cfb69 (revision 1) from 0 to 1" event_reason=ScalingReplicaSet namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="Not finished reconciling new ReplicaSet 'linear-stitcher-78cf8cfb69'" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="Not finished reconciling ReplicaSets" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="Timed out (false) [last progress check: 2021-12-26 17:03:37 +0000 UTC - now: 2021-12-26 17:03:37.546175448 +0000 UTC m=+16029.551592058]" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="Patched: {\"status\":{\"canary\":{\"weights\":{\"stable\":{\"podTemplateHash\":\"78cf8cfb69\"}}}}}" generation=2 namespace=dai resourceVersion=100362643 rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="persisted to informer" generation=2 namespace=dai resourceVersion=100362651 rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="Reconciliation completed" generation=2 namespace=dai resourceVersion=100362643 rollout=linear-stitcher time_ms=55.992812

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@nissy34 nissy34 added the bug Something isn't working label Dec 26, 2021
@nissy34 nissy34 changed the title Down time On initial deployment Downtime On initial deployment Dec 26, 2021
@harikrongali

@nissy34 this is expected per the design. To avoid downtime, the recommendation is to run the Deployment and the Rollout side by side and, after the Rollout becomes stable, remove the Deployment.

harikrongali commented Jan 3, 2022

https://argoproj.github.io/argo-rollouts/migrating/
Check the warning message in the doc.

nissy34 commented Jan 3, 2022

@harikrongali
My point is that, in order to avoid any downtime, it's not enough to have the Rollout and Deployment running side by side; we also need a separate service for the Deployment and for the Rollout, because the controller adds the stable hash to the service selector before the Rollout is ready.

From the docs (migration section)

If you are using this feature then Rollout switches production traffic to Pods that it manages. The switch happens only when the required number of Pod is running and healthy so it is safe in production as well.

This is not the actual behavior from my experience.

@harikrongali

I see your point: you need to have three services and can't reuse the services from the old Deployment. We can update the docs to point this out.

@harikrongali harikrongali added the docs Needs documentation/release notes label Jan 3, 2022
nissy34 commented Jan 3, 2022

@harikrongali thanks for the clarification.

Just to understand: why is this behavior by design?
Wouldn't it make sense to wait for all stable pods to be ready on the initial deployment before adding the label selector?

@harikrongali

Achieving readiness gates is one reason but I would let @jessesuen provide detailed info on this.

jessesuen commented Jan 15, 2022

> Achieving readiness gates is one reason but I would let @jessesuen provide detailed info on this.

Yes, it is true that making sure service selectors are in place before pods are created is important for some ingress controller implementations, like the AWS ALB LoadBalancer controller, which injects readiness gates based on labels matching the pods. That said, for the ALB use case we will steer users to the upcoming ping-pong service management feature (as opposed to using canaryService vs. stableService).

I agree we can improve this behavior for the migration case. In fact, we currently avoid changing the label selector on the canary service until the pods are ready, by calling replicasetutil.IsReplicaSetReady(c.newRS). But we don't do this for the stable service:

func (c *rolloutContext) reconcileStableAndCanaryService() error {
	if c.rollout.Spec.Strategy.Canary == nil {
		return nil
	}
	err := c.ensureSVCTargets(c.rollout.Spec.Strategy.Canary.StableService, c.stableRS)
	if err != nil {
		return err
	}

	if replicasetutil.IsReplicaSetReady(c.newRS) { // ensures we don't switch canary selector until pods are ready for canary service
		err = c.ensureSVCTargets(c.rollout.Spec.Strategy.Canary.CanaryService, c.newRS)
		if err != nil {
			return err
		}
	}
	return nil
}

Similarly, for blue-green, we will not modify the active service until replicasetutil.ReadyForPause is true:

func (c *rolloutContext) reconcileActiveService(activeSvc *corev1.Service) error {
...
	if !replicasetutil.ReadyForPause(c.rollout, c.newRS, c.allRSs) || !annotations.IsSaturated(c.rollout, c.newRS) {
		c.log.Infof("skipping active service switch: New RS '%s' is not fully saturated", c.newRS.Name)
		return nil
	}

So with that said, I agree that one thing we can improve is to never add the selector to the service until the ReplicaSet is ready. Adding the selector before any pod with that label is ready guarantees downtime for that Service.
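A minimal, self-contained sketch of that idea (hypothetical types and function names, not the actual controller code or the real replicasetutil API): gate the stable-service selector switch on ReplicaSet readiness, mirroring the IsReplicaSetReady check already applied to the canary service.

```go
package main

import "fmt"

// ReplicaSet models just the counts needed for a readiness check.
type ReplicaSet struct {
	Name          string
	Desired       int32
	ReadyReplicas int32
}

// isReplicaSetReady is a stand-in for replicasetutil.IsReplicaSetReady:
// the RS must exist and have all desired replicas ready.
func isReplicaSetReady(rs *ReplicaSet) bool {
	return rs != nil && rs.Desired > 0 && rs.ReadyReplicas >= rs.Desired
}

// ensureStableSelector applies the same gating to the stable service that is
// applied to the canary service: only set the pod-hash selector once the
// ReplicaSet backing it is ready.
func ensureStableSelector(selector map[string]string, stableRS *ReplicaSet, hash string) {
	if !isReplicaSetReady(stableRS) {
		fmt.Println("skipping stable service selector switch: ReplicaSet not ready")
		return
	}
	selector["rollouts-pod-template-hash"] = hash
}

func main() {
	svc := map[string]string{}
	rs := &ReplicaSet{Name: "linear-stitcher-78cf8cfb69", Desired: 1, ReadyReplicas: 0}

	ensureStableSelector(svc, rs, "78cf8cfb69")
	fmt.Println(svc["rollouts-pod-template-hash"]) // "" -> selector left untouched

	rs.ReadyReplicas = 1
	ensureStableSelector(svc, rs, "78cf8cfb69")
	fmt.Println(svc["rollouts-pod-template-hash"]) // "78cf8cfb69"
}
```

With this guard, traffic keeps hitting the old Deployment's pods until the Rollout's ReplicaSet can actually serve, closing the downtime window described in this issue.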

@MarkSRobinson

I am also being impacted by this.

dearboll23 commented Jan 5, 2023

Hi team, unfortunately we faced a similar issue while initializing a deployment with Istio traffic routing.

When we initialized a Rollout for a Deployment with Istio trafficRouting, we found that the rollout controller created a new ReplicaSet from the current ReplicaSet template with a new label key, 'rollouts-pod-template-hash', and immediately updated the Istio DestinationRule with that label, without considering whether the new ReplicaSet was available. This leads to HTTP 503 errors: no healthy upstreams.

Could we add a check to ensure the new ReplicaSet is available before updating the Istio DestinationRule, the same way ensureSVCTargets does?
@jessesuen @zachaller Thank you for your attention.

@zachaller

@dearboll23 can you create a new issue and describe how to reproduce it?

@dearboll23
@zachaller Sure, I have created it.
