Downtime on initial deployment #1731
@nissy34 this is expected per the design. To avoid downtime, the recommendation is to run the Deployment and the Rollout side by side, and after the Rollout becomes stable, remove the Deployment.
https://argoproj.github.io/argo-rollouts/migrating/
@harikrongali From the docs (migration section):
This is not the actual behavior in my experience.
I see your point: you need to have three services and can't reuse the services from the old Deployment. We can update the docs to point this out.
@harikrongali thanks for the clarification. Just to understand, why is this behavior by design? |
Achieving readiness gates is one reason, but I will let @jessesuen provide detailed info on this.
Yes, it is true that ensuring service selectors are in place before pods are created is important for some ingress controller implementations, such as the AWS ALB load balancer controller, which injects readiness gates based on the labels matching the pods. With that said, for the ALB use case we will steer users to the upcoming ping-pong service management feature (as opposed to using canaryService vs. stableService). I agree we can improve this behavior for the migration case. In fact, we currently avoid changing the label selector of the canary service until the pods are ready, by calling:

```go
func (c *rolloutContext) reconcileStableAndCanaryService() error {
	if c.rollout.Spec.Strategy.Canary == nil {
		return nil
	}
	err := c.ensureSVCTargets(c.rollout.Spec.Strategy.Canary.StableService, c.stableRS)
	if err != nil {
		return err
	}
	if replicasetutil.IsReplicaSetReady(c.newRS) { // ensures we don't switch canary selector until pods are ready for canary service
		err = c.ensureSVCTargets(c.rollout.Spec.Strategy.Canary.CanaryService, c.newRS)
		if err != nil {
			return err
		}
	}
	return nil
}
```

Similarly, for blue-green, we will not modify the active service until:

```go
func (c *rolloutContext) reconcileActiveService(activeSvc *corev1.Service) error {
	...
	if !replicasetutil.ReadyForPause(c.rollout, c.newRS, c.allRSs) || !annotations.IsSaturated(c.rollout, c.newRS) {
		c.log.Infof("skipping active service switch: New RS '%s' is not fully saturated", c.newRS.Name)
		return nil
	}
```

So with that said, I agree that one thing we can improve is to never add a selector to the service until the ReplicaSet is ready. Adding the selector before any pods with that label are ready is guaranteed downtime for that Service.
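To make the proposal concrete, here is a minimal sketch (not the actual implementation) of what gating the stable-service selector on ReplicaSet readiness could look like, reusing the names from the snippets above:

```go
// Hypothetical variant of reconcileStableAndCanaryService: both selector
// updates are gated on ReplicaSet readiness, so a selector is never added
// to a Service before pods matching it are ready.
func (c *rolloutContext) reconcileStableAndCanaryService() error {
	if c.rollout.Spec.Strategy.Canary == nil {
		return nil
	}
	// Proposed change: on initial migration the stable RS may still be
	// scaling up, so wait for it to become ready before pointing the
	// stable service selector at it.
	if replicasetutil.IsReplicaSetReady(c.stableRS) {
		if err := c.ensureSVCTargets(c.rollout.Spec.Strategy.Canary.StableService, c.stableRS); err != nil {
			return err
		}
	}
	// Unchanged: don't switch the canary selector until canary pods are ready.
	if replicasetutil.IsReplicaSetReady(c.newRS) {
		if err := c.ensureSVCTargets(c.rollout.Spec.Strategy.Canary.CanaryService, c.newRS); err != nil {
			return err
		}
	}
	return nil
}
```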
I am also being impacted by this.
Hi team, unfortunately we faced a similar issue when initializing a Rollout for a Deployment with Istio traffic routing. When we initialized the Rollout, we found that the controller created a new ReplicaSet from the current ReplicaSet template with a new label key named 'rollouts-pod-template-hash', and then immediately updated the Istio DestinationRule with that label, without considering whether the new ReplicaSet was available or not. This leads to 503 HTTP errors (no healthy upstreams). Could we add checks to ensure the new ReplicaSet is available before updating the Istio DestinationRule, the same way ensureSVCTargets is gated?
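For illustration, a guard analogous to the IsReplicaSetReady check above could be applied before the DestinationRule update. This is only a sketch: updateIstioDestinationRule is a hypothetical placeholder, not the controller's real API, and the surrounding names mirror the snippets earlier in the thread.

```go
// Hypothetical guard before updating the Istio DestinationRule: skip the
// update until the new ReplicaSet has ready pods, mirroring the readiness
// check used for the canary service selector.
if !replicasetutil.IsReplicaSetReady(c.newRS) {
	c.log.Infof("skipping DestinationRule update: new RS '%s' is not ready", c.newRS.Name)
	return nil
}
// Placeholder for whatever code applies the rollouts-pod-template-hash
// subset labels to the DestinationRule.
return c.updateIstioDestinationRule(c.newRS)
```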
@dearboll23 can you create a new issue and describe how to reproduce it, etc.?
@zachaller Sure, I have created it.
Summary
On the first migration from a Deployment to a Rollout, I get a few seconds of downtime when using trafficRouting.
I assume that Argo Rollouts doesn't wait for the StableRS to be ready before adding the pod hash to the stable service selector.
This can be seen from the logs and events.
In the events, we can see that the switch happened before the scale-up.
In the logs, first we have
and a few log lines later we see
Diagnostics
What version of Argo Rollouts are you running? 1.1.1
Paste the logs from the rollout controller
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.