Downtime On initial deployment #1731

Closed
nissy34 opened this issue Dec 26, 2021 · 11 comments · Fixed by #1777
Labels
bug Something isn't working

Comments

nissy34 commented Dec 26, 2021

Summary

On the first migration from a Deployment to a Rollout, I get a few seconds of downtime when using trafficRouting.
I assume that Argo Rollouts doesn't wait for the stable ReplicaSet (StableRS) to be ready before adding the pod hash to the stable service selector.

This can be seen from the logs and events.

In the events, we can see that the selector switch happened before the scale-up.

Events:
  Type    Reason                 Age   From                 Message
  ----    ------                 ----  ----                 -------
  Normal  RolloutUpdated         25m   rollouts-controller  Rollout updated to revision 1
  Normal  NewReplicaSetCreated   25m   rollouts-controller  Created ReplicaSet vod-stitcher-f7cc77d48 (revision 1)
  Normal  CreatingCanaryIngress  25m   rollouts-controller  Creating canary ingress `vod-stitcher-vod-stitcher-canary` with weight `0`
  Normal  RolloutCompleted       25m   rollouts-controller  Rollout completed update to revision 1 (f7cc77d48): Initial deploy
  Normal  SwitchService          25m   rollouts-controller  Switched selector for service 'vod-stitcher' from '' to 'f7cc77d48'
  Normal  ScalingReplicaSet      25m   rollouts-controller  Scaled up ReplicaSet vod-stitcher-f7cc77d48 (revision 1) from 0 to 1

In the logs, we first see:

time="2021-12-26T17:03:37Z" level=info msg="Switched selector for service 'linear-stitcher-public' from '' to '78cf8cfb69'" event_reason=SwitchService namespace=dai rollout=linear-stitcher

and a few log lines later we see:

time="2021-12-26T17:03:37Z" level=info msg="Scaled up ReplicaSet linear-stitcher-78cf8cfb69 (revision 1) from 0 to 1" event_reason=ScalingReplicaSet namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="Not finished reconciling new ReplicaSet 'linear-stitcher-78cf8cfb69'" namespace=dai rollout=linear-stitcher
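A minimal sketch of why this ordering causes downtime (hypothetical types and names for illustration, not the controller's actual code): a Service's ready endpoints are computed from pods that match its selector, so the instant the selector is switched to a pod-template hash whose pods are not yet ready, the Service has zero ready endpoints.

```go
package main

import "fmt"

// Pod models only the fields relevant here: its pod-template-hash label and readiness.
type Pod struct {
	Hash  string
	Ready bool
}

// readyEndpoints counts pods that match the service's pod-hash selector and are
// ready, mirroring how Endpoints are derived from a Service selector.
func readyEndpoints(selectorHash string, pods []Pod) int {
	n := 0
	for _, p := range pods {
		if p.Hash == selectorHash && p.Ready {
			n++
		}
	}
	return n
}

func main() {
	// At the moment the controller sets the stable selector to the new hash,
	// the new ReplicaSet has only just been scaled up and its pod is not ready:
	pods := []Pod{{Hash: "78cf8cfb69", Ready: false}}
	fmt.Println(readyEndpoints("78cf8cfb69", pods)) // 0 -> no healthy endpoints: downtime
}
```

Once the pod passes its readiness probe, the same call returns 1 and traffic flows again, which matches the "few seconds of downtime" observed above.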

Diagnostics

What version of Argo Rollouts are you running? 1.1.1

Paste the logs from the rollout controller

time="2021-12-26T16:03:20Z" level=info msg="rollout deleted" namespace=dai rollout=linear-stitcher
time="2021-12-26T16:03:20Z" level=info msg="syncing service" namespace=dai rollout=linear-stitcher service=linear-stitcher-canary
time="2021-12-26T16:03:20Z" level=info msg="syncing service" namespace=dai rollout=linear-stitcher service=linear-stitcher-public
time="2021-12-26T16:03:20Z" level=info msg="cleaned service" namespace=dai rollout=linear-stitcher service=linear-stitcher-public
time="2021-12-26T16:03:20Z" level=info msg="cleaned service" namespace=dai rollout=linear-stitcher service=linear-stitcher-canary
time="2021-12-26T17:03:26Z" level=info msg="Started syncing rollout" generation=1 namespace=dai resourceVersion=100362525 rollout=linear-stitcher
time="2021-12-26T17:03:26Z" level=info msg="Defaulting .spec.replica to 1" generation=1 namespace=dai resourceVersion=100362525 rollout=linear-stitcher
time="2021-12-26T17:03:26Z" level=info msg="persisted to informer" generation=2 namespace=dai resourceVersion=100362526 rollout=linear-stitcher
time="2021-12-26T17:03:26Z" level=info msg="Reconciliation completed" generation=1 namespace=dai resourceVersion=100362525 rollout=linear-stitcher time_ms=21.285433
time="2021-12-26T17:03:26Z" level=info msg="Started syncing rollout" generation=2 namespace=dai resourceVersion=100362526 rollout=linear-stitcher
time="2021-12-26T17:03:26Z" level=info msg="Assuming 78cf8cfb69 for new replicaset pod hash" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:26Z" level=info msg="Patched: {\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2021-12-26T17:03:26Z\",\"lastUpdateTime\":\"2021-12-26T17:03:26Z\",\"message\":\"Rollout \\\"linear-stitcher\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"},{\"lastTransitionTime\":\"2021-12-26T17:03:26Z\",\"lastUpdateTime\":\"2021-12-26T17:03:26Z\",\"message\":\"Rollout does not have minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"False\",\"type\":\"Available\"}],\"currentPodHash\":\"78cf8cfb69\",\"currentStepHash\":\"869d64f5c6\",\"currentStepIndex\":0,\"message\":\"more replicas need to be updated\",\"observedGeneration\":\"2\",\"phase\":\"Progressing\",\"selector\":\"app.kubernetes.io/component=linear-stitcher,app.kubernetes.io/name=dai-ott-bundle,componentname=linear-stitcher\"}}" generation=2 namespace=dai resourceVersion=100362526 rollout=linear-stitcher
time="2021-12-26T17:03:26Z" level=info msg="persisted to informer" generation=2 namespace=dai resourceVersion=100362527 rollout=linear-stitcher
time="2021-12-26T17:03:26Z" level=info msg="Reconciliation completed" generation=2 namespace=dai resourceVersion=100362526 rollout=linear-stitcher time_ms=22.358567999999998
time="2021-12-26T17:03:26Z" level=info msg="Started syncing rollout" generation=2 namespace=dai resourceVersion=100362527 rollout=linear-stitcher
time="2021-12-26T17:03:26Z" level=info msg="Updating replica set 'linear-stitcher-78cf8cfb69' revision from 0 to 1" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:26Z" level=info msg="Created ReplicaSet linear-stitcher-78cf8cfb69" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:27Z" level=info msg="Rollout updated to revision 1" event_reason=RolloutUpdated namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:27Z" level=info msg="Created ReplicaSet linear-stitcher-78cf8cfb69 (revision 1)" event_reason=NewReplicaSetCreated namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:27Z" level=info msg="Set rollout condition: &RolloutCondition{Type:Progressing,Status:True,LastUpdateTime:2021-12-26 17:03:27.957074912 +0000 UTC m=+16019.962491502,LastTransitionTime:2021-12-26 17:03:27.957075002 +0000 UTC m=+16019.962491582,Reason:NewReplicaSetCreated,Message:Created new replica set \"linear-stitcher-78cf8cfb69\",}" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:27Z" level=info msg="Found 1 TrafficRouting Reconcilers" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:27Z" level=info msg="Reconciling TrafficRouting with type 'Nginx'" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:27Z" level=info msg="canary ingress not found" ingress=linear-stitcher-linear-stitcher-canary namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:27Z" level=info msg="Creating canary ingress `linear-stitcher-linear-stitcher-canary` with weight `0`" event_reason=CreatingCanaryIngress namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:28Z" level=info msg="Previous weights: nil" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:28Z" level=info msg="New weights: &TrafficWeights{Canary:WeightDestination{Weight:0,ServiceName:linear-stitcher-canary,PodTemplateHash:78cf8cfb69,},Stable:WeightDestination{Weight:100,ServiceName:linear-stitcher-public,PodTemplateHash:,},Additional:[]WeightDestination{},Verified:nil,}" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:28Z" level=info msg="Skipping analysis: isAborted: false, promoteFull: false, rollbackToScaleDownDelay: false, initialDeploy: true" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:28Z" level=info msg="No StableRS exists to reconcile or matches newRS" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:28Z" level=info msg="Reconciling canary pause step (stepIndex: 0/19)" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:28Z" level=info msg="Not finished reconciling Canary Pause" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="Rollout completed update to revision 1 (78cf8cfb69): Initial deploy" event_reason=RolloutCompleted namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="Patched: {\"status\":{\"canary\":{\"weights\":{\"canary\":{\"podTemplateHash\":\"78cf8cfb69\",\"serviceName\":\"linear-stitcher-canary\",\"weight\":0},\"stable\":{\"serviceName\":\"linear-stitcher-public\",\"weight\":100}}},\"conditions\":[{\"lastTransitionTime\":\"2021-12-26T17:03:26Z\",\"lastUpdateTime\":\"2021-12-26T17:03:26Z\",\"message\":\"Rollout does not have minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"False\",\"type\":\"Available\"},{\"lastTransitionTime\":\"2021-12-26T17:03:26Z\",\"lastUpdateTime\":\"2021-12-26T17:03:37Z\",\"message\":\"ReplicaSet \\\"linear-stitcher-78cf8cfb69\\\" is progressing.\",\"reason\":\"ReplicaSetUpdated\",\"status\":\"True\",\"type\":\"Progressing\"}],\"currentStepIndex\":19,\"stableRS\":\"78cf8cfb69\"}}" generation=2 namespace=dai resourceVersion=100362538 rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="persisted to informer" generation=2 namespace=dai resourceVersion=100362643 rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="Reconciliation completed" generation=2 namespace=dai resourceVersion=100362527 rollout=linear-stitcher time_ms=10818.726643
time="2021-12-26T17:03:37Z" level=info msg="Started syncing rollout" generation=2 namespace=dai resourceVersion=100362643 rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="Switched selector for service 'linear-stitcher-public' from '' to '78cf8cfb69'" event_reason=SwitchService namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="Found 1 TrafficRouting Reconcilers" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="Reconciling TrafficRouting with type 'Nginx'" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="syncing service" namespace=dai rollout=linear-stitcher service=linear-stitcher-public
time="2021-12-26T17:03:37Z" level=info msg="No changes to canary ingress - skipping patch" ingress=linear-stitcher-linear-stitcher-canary namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="Previous weights: &TrafficWeights{Canary:WeightDestination{Weight:0,ServiceName:linear-stitcher-canary,PodTemplateHash:78cf8cfb69,},Stable:WeightDestination{Weight:100,ServiceName:linear-stitcher-public,PodTemplateHash:,},Additional:[]WeightDestination{},Verified:nil,}" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="New weights: &TrafficWeights{Canary:WeightDestination{Weight:0,ServiceName:linear-stitcher-canary,PodTemplateHash:78cf8cfb69,},Stable:WeightDestination{Weight:100,ServiceName:linear-stitcher-public,PodTemplateHash:78cf8cfb69,},Additional:[]WeightDestination{},Verified:nil,}" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="No StableRS exists to reconcile or matches newRS" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="Scaled up ReplicaSet linear-stitcher-78cf8cfb69 (revision 1) from 0 to 1" event_reason=ScalingReplicaSet namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="Not finished reconciling new ReplicaSet 'linear-stitcher-78cf8cfb69'" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="Not finished reconciling ReplicaSets" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="Timed out (false) [last progress check: 2021-12-26 17:03:37 +0000 UTC - now: 2021-12-26 17:03:37.546175448 +0000 UTC m=+16029.551592058]" namespace=dai rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="Patched: {\"status\":{\"canary\":{\"weights\":{\"stable\":{\"podTemplateHash\":\"78cf8cfb69\"}}}}}" generation=2 namespace=dai resourceVersion=100362643 rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="persisted to informer" generation=2 namespace=dai resourceVersion=100362651 rollout=linear-stitcher
time="2021-12-26T17:03:37Z" level=info msg="Reconciliation completed" generation=2 namespace=dai resourceVersion=100362643 rollout=linear-stitcher time_ms=55.992812

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@nissy34 nissy34 added the bug Something isn't working label Dec 26, 2021
@nissy34 nissy34 changed the title Down time On initial deployment Downtime On initial deployment Dec 26, 2021
@harikrongali

@nissy34 this is expected per the design. To avoid downtime, the recommendation is to run the Deployment and the Rollout side by side and, after the Rollout becomes stable, remove the Deployment.

harikrongali commented Jan 3, 2022

https://argoproj.github.io/argo-rollouts/migrating/
Check the warning message in the doc.

nissy34 commented Jan 3, 2022

@harikrongali
My point is that, in order to avoid any downtime, it's not enough to have the Rollout and Deployment running side by side; we also need a separate service for the Deployment and for the Rollout, because the controller adds the stable hash to the service selector before the Rollout is ready.

From the docs (migration section)

If you are using this feature then Rollout switches production traffic to Pods that it manages. The switch happens only when the required number of Pod is running and healthy so it is safe in production as well.

This is not the actual behavior from my experience.

@harikrongali

I see your point: you need to have three services and can't reuse the services from the old Deployment. We can update the docs to point this out.

@harikrongali harikrongali added the docs Needs documentation/release notes label Jan 3, 2022
nissy34 commented Jan 3, 2022

@harikrongali thanks for the clarification.

Just to understand: why is this behavior by design?
Wouldn't it make sense to wait for all stable pods to be ready on the initial deployment before adding the label selector?

@harikrongali

Achieving readiness gates is one reason but I would let @jessesuen provide detailed info on this.

jessesuen commented Jan 15, 2022

> Achieving readiness gates is one reason but I would let @jessesuen provide detailed info on this.

Yes, it is true that making sure service selectors are in place before pods are created is important for some ingress controller implementations, like the AWS ALB LoadBalancer controller, which injects readiness gates based on labels matching the pods. That said, for the ALB use case we will steer users to the upcoming ping-pong service management feature (as opposed to using canaryService vs. stableService).

I agree we can improve this behavior for the migration case. In fact, we currently avoid changing the label selector on the canary service until the pods are ready, by calling replicasetutil.IsReplicaSetReady(c.newRS). But we don't do this for the stable service:

func (c *rolloutContext) reconcileStableAndCanaryService() error {
	if c.rollout.Spec.Strategy.Canary == nil {
		return nil
	}
	err := c.ensureSVCTargets(c.rollout.Spec.Strategy.Canary.StableService, c.stableRS)
	if err != nil {
		return err
	}

	if replicasetutil.IsReplicaSetReady(c.newRS) { // ensures we don't switch canary selector until pods are ready for canary service
		err = c.ensureSVCTargets(c.rollout.Spec.Strategy.Canary.CanaryService, c.newRS)
		if err != nil {
			return err
		}
	}
	return nil
}

Similarly, for blue-green, we will not modify the active service until replicasetutil.ReadyForPause is true:

func (c *rolloutContext) reconcileActiveService(activeSvc *corev1.Service) error {
...
	if !replicasetutil.ReadyForPause(c.rollout, c.newRS, c.allRSs) || !annotations.IsSaturated(c.rollout, c.newRS) {
		c.log.Infof("skipping active service switch: New RS '%s' is not fully saturated", c.newRS.Name)
		return nil
	}

So with that said, I agree that one thing we can improve is to never add the selector to the service until the ReplicaSet is ready. Adding the selector before any pod with that label is ready guarantees downtime for that Service.
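A minimal, self-contained sketch of that idea (hypothetical types and function names, not the actual controller code or the real replicasetutil API): gate the stable-service selector switch on ReplicaSet readiness, mirroring the IsReplicaSetReady check already applied to the canary service.

```go
package main

import "fmt"

// ReplicaSet models just the counts needed for a readiness check.
type ReplicaSet struct {
	Name          string
	Desired       int32
	ReadyReplicas int32
}

// isReplicaSetReady is a stand-in for replicasetutil.IsReplicaSetReady:
// the RS must exist and have all desired replicas ready.
func isReplicaSetReady(rs *ReplicaSet) bool {
	return rs != nil && rs.Desired > 0 && rs.ReadyReplicas >= rs.Desired
}

// ensureStableSelector applies the same gating to the stable service that is
// applied to the canary service: only set the pod-hash selector once the
// ReplicaSet backing it is ready.
func ensureStableSelector(selector map[string]string, stableRS *ReplicaSet, hash string) {
	if !isReplicaSetReady(stableRS) {
		fmt.Println("skipping stable service selector switch: ReplicaSet not ready")
		return
	}
	selector["rollouts-pod-template-hash"] = hash
}

func main() {
	svc := map[string]string{}
	rs := &ReplicaSet{Name: "linear-stitcher-78cf8cfb69", Desired: 1, ReadyReplicas: 0}

	ensureStableSelector(svc, rs, "78cf8cfb69")
	fmt.Println(svc["rollouts-pod-template-hash"]) // "" -> selector left untouched

	rs.ReadyReplicas = 1
	ensureStableSelector(svc, rs, "78cf8cfb69")
	fmt.Println(svc["rollouts-pod-template-hash"]) // "78cf8cfb69"
}
```

With this guard, traffic keeps hitting the old Deployment's pods until the Rollout's ReplicaSet can actually serve, closing the downtime window described in this issue.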

@MarkSRobinson

I am also being impacted by this.

dearboll23 commented Jan 5, 2023

Hi team, unfortunately we faced a similar issue while initializing a deployment with Istio traffic routing.

When we initialized a Rollout for a Deployment with Istio trafficRouting, we found that the rollout controller created a new ReplicaSet from the current ReplicaSet template with a new label key, 'rollouts-pod-template-hash', and immediately updated the Istio DestinationRule with that label, without considering whether the new ReplicaSet was available. This leads to HTTP 503 errors: no healthy upstreams.

Could we add a check to ensure the new ReplicaSet is available before updating the Istio DestinationRule, the same way ensureSVCTargets does?
@jessesuen @zachaller Thank you for your attention.

@zachaller

@dearboll23 can you create a new issue and describe how to reproduce it?

@dearboll23
@zachaller Sure, I have created it.
