HPA scaling while in scale down delay window causes perpetual "progressing" state on rollout #3848

Closed
yohanb opened this issue Sep 24, 2024 · 1 comment

yohanb commented Sep 24, 2024

Checklist:

  • I've included steps to reproduce the bug.
  • I've included the version of Argo Rollouts.

Describe the bug

I’ve noticed a bug in the Rollout behaviour when these specific conditions are met:

  • the old revision is still active (within the scale down delay window)
  • the HPA changes the rollout’s replica count (e.g., from 10 to 20 pods)
  • a canary rollout is triggered

It seems the HPA only scales the stable ReplicaSet and not the old revision. When a new canary rollout is then triggered, the Rollout ends up in a perpetual Progressing state with the message "more replicas need to be updated". Looking at the code, this appears to be because UpdatedReplicas doesn’t match spec.replicas.
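For illustration, a hedged sketch of what the stuck status looks like (the replica counts and the ReplicaSetUpdated reason are hypothetical; the condition message is the one observed):

status:
  replicas: 2
  updatedReplicas: 1  # hypothetical: lags behind spec.replicas, so the rollout never completes
  conditions:
    - type: Progressing
      status: "True"
      reason: ReplicaSetUpdated  # assumed reason; the message below is the observed one
      message: more replicas need to be updated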

To Reproduce

  1. Create a Rollout with a scale down delay and an attached HorizontalPodAutoscaler:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: app
spec:
  revisionHistoryLimit: 1
  rollbackWindow:
    revisions: 1
  strategy:
    canary:
      scaleDownDelaySeconds: 3600 # 1 hour
      scaleDownDelayRevisionLimit: 1
...
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app
spec:
  maxReplicas: 5
  metrics:
    - resource:
        name: cpu
        target:
          averageUtilization: 80
          type: Utilization
      type: Resource
  minReplicas: 1
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: app
  2. Run a canary rollout to completion so that a running old revision exists within the scaleDownDelaySeconds window.
  3. Trigger a replica count change with the HorizontalPodAutoscaler, for example by changing minReplicas to 2 (a patch command is sketched after this list). Notice the HPA only affects the latest ReplicaSet revision and not the previous one.
  4. Trigger another canary rollout.
  5. Notice the Rollout is in a perpetual Progressing state with the message "more replicas need to be updated".
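A minimal sketch of step 3, assuming the HPA is named app as in the manifest above (rollouts-pod-template-hash is the label Argo Rollouts applies to the ReplicaSets it manages):

# Raise minReplicas on the HPA to force a rescale of the Rollout
kubectl patch hpa app --type merge -p '{"spec":{"minReplicas":2}}'

# Compare the rollout-managed ReplicaSets: only the stable (latest) revision
# picks up the new replica count; the delayed old revision keeps its old count
kubectl get rs -l rollouts-pod-template-hash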

Expected behavior

I think the HorizontalPodAutoscaler should scale all ReplicaSets so they stay in sync, and if a rollback is ever performed, the previous revision will be able to handle the load. If that's not possible, then it should at least not block the rollout's progression.
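In the meantime, the stuck state can be confirmed with the Argo Rollouts kubectl plugin, assuming it is installed and the Rollout is named app:

# Shows the Rollout held in Progressing, with the same
# "more replicas need to be updated" message reported above
kubectl argo rollouts get rollout app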

Screenshots

(screenshot attached in the original issue)

Version

v1.7.2+59e5bd3

Logs

None for the moment. Will try to reproduce and post them.

# Paste the logs from the rollout controller

# Logs for the entire controller:
kubectl logs -n argo-rollouts deployment/argo-rollouts

# Logs for a specific rollout:
kubectl logs -n argo-rollouts deployment/argo-rollouts | grep rollout=<ROLLOUTNAME>

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@yohanb yohanb added the bug Something isn't working label Sep 24, 2024
@zachaller zachaller self-assigned this Sep 24, 2024

yohanb commented Dec 16, 2024

I've tested locally, and it seems this issue is resolved in the upcoming v1.8.0 release by this commit 🎉.

@yohanb yohanb closed this as completed Dec 16, 2024