CA Issues invalid scale downs due to scale-down processing delay in the MCM #342

Open
elankath opened this issue Dec 22, 2024 · 1 comment · May be fixed by #341
Assignees
Labels
area/auto-scaling Auto-scaling (CA/HPA/VPA/HVPA, predominantly control plane, but also otherwise) related kind/bug Bug


elankath commented Dec 22, 2024

What happened:

CA requested an additional, undesired reduction of the MCM MachineDeployment replicas while the first request to remove the machines was still being processed. This occurred due to api-server timeouts.

Example log

As shown below, the machine set was scaled down at 21:40:04, but the scale-down was not processed until 21:40:38 because of client-side throttling.

2024-12-14 21:40:03 | {"log":"Processing the machinedeployment \"shoot--hc-eu10--prod-haas-edge-g-z3\" (with replicas 3)","pid":"1","severity":"INFO","source":"deployment.go:433"}
2024-12-14 21:40:03 | {"log":"Desired replicas annotation value: 6 on machineSet shoot--hc-eu10--prod-haas-edge-g-z3-85b5b, Spec Desired Replicas value: 3 on corresponding machineDeployment, so scaling has happened.","pid":"1","severity":"INFO","source":"deployment_sync.go:693"}
2024-12-14 21:40:03 | {"log":"Scaling detected for machineDeployment shoot--hc-eu10--prod-haas-edge-g-z3","pid":"1","severity":"INFO","source":"deployment.go:533"}
2024-12-14 21:40:03 | {"log":"Scaling latest/theOnlyActive machineSet shoot--hc-eu10--prod-haas-edge-g-z3-85b5b","pid":"1","severity":"INFO","source":"deployment_sync.go:412"}
2024-12-14 21:40:04 | {"log":"Waited for 980.323975ms due to client-side throttling, not priority and fairness, request: PUT:https://api.aws-eu2.garden.internal.live.k8s.ondemand.com:443/apis/machine.sapcloud.io/v1alpha1/namespaces/shoot--hc-eu10--prod-haas/machinesets/shoot--hc-eu10--prod-haas-edge-g-z3-85b5b?timeout=1m0s","pid":"1","severity":"INFO","source":"request.go:632"}
2024-12-14 21:40:04 | {"log":"shoot--hc-eu10--prod-haas-edge-g-z3-85b5b updated. Desired machine count change: 6-\u003e3","pid":"1","severity":"INFO","source":"machineset.go:155"}
2024-12-14 21:40:04 | {"log":"Event(v1.ObjectReference{Kind:\"MachineDeployment\", Namespace:\"shoot--hc-eu10--prod-haas\", Name:\"shoot--hc-eu10--prod-haas-edge-g-z3\", UID:\"e537417f-4659-4f95-80fe-5bc448a5992e\", APIVersion:\"machine.sapcloud.io/v1alpha1\", ResourceVersion:\"54387612340\", FieldPath:\"\"}): type: 'Normal' reason: 'ScalingMachineSet' Scaled down machine set shoot--hc-eu10--prod-haas-edge-g-z3-85b5b to 3","pid":"1","severity":"INFO","source":"event.go:377"}

During this time, CA again requested deletion of the nodes, which caused another scale-down.

2024-12-14 21:40:38 | {"log":"Processing the machineset \"shoot--hc-eu10--prod-haas-edge-g-z3-85b5b\" with replicas 0 associated with machine class: \"\"","pid":"1","severity":"INFO","source":"machineset.go:501"}

What you expected to happen:

CA scale-down should be idempotent and should occur only once, regardless of any timeouts or throttling. The CA should not reduce the replicas further just because MCM is delayed or has problems executing the scale-down.
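
To make the expectation concrete, here is a minimal sketch of an idempotent replica computation. It is not the CA or MCM code and not the approach taken in #341; `scaleDownTarget`, the `pending` set, and the node names are hypothetical. The idea is that a node whose removal was already requested does not reduce the replica count a second time.

```go
package main

import "fmt"

// scaleDownTarget returns the replica count to write after a scale-down
// request for the given nodes. Nodes already recorded in pending were
// counted by an earlier request and are skipped, so repeating the same
// request leaves the replica count unchanged. All names are hypothetical.
func scaleDownTarget(replicas int, pending map[string]bool, nodes []string) int {
	for _, n := range nodes {
		if pending[n] {
			continue // removal already requested in an earlier cycle
		}
		pending[n] = true
		if replicas > 0 {
			replicas--
		}
	}
	return replicas
}

func main() {
	pending := map[string]bool{}
	nodes := []string{"node-a", "node-b", "node-c"}

	first := scaleDownTarget(6, pending, nodes)
	second := scaleDownTarget(first, pending, nodes) // same request, repeated
	fmt.Println(first, second)                       // prints "3 3"
}
```

With this bookkeeping, a repeated request for the same three nodes keeps the count at 3 instead of dropping it to 0.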

How to reproduce it (as minimally and precisely as possible):

Stall scale-down processing in the MCM with an artificially extended delay so that the CA issues erroneous scale-downs in subsequent processing cycles.
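
The effect of such a delay can be illustrated with a toy model. This is only a sketch under stated assumptions: `simulate` and its parameters are hypothetical, and it assumes a scaler that subtracts the unneeded nodes it still sees in every cycle, while the machine controller needs `delayCycles` cycles to remove them. It does not reproduce the actual CA or MCM logic.

```go
package main

import "fmt"

// simulate returns the desired replica count at the end of each cycle.
// In every cycle the scaler subtracts the unneeded nodes that are still
// present; the controller removes them only once delayCycles cycles have
// passed since the first request. Hypothetical model, not CA/MCM code.
func simulate(replicas, unneeded, delayCycles, cycles int) []int {
	history := make([]int, 0, cycles)
	present := unneeded // unneeded nodes that still exist
	requestedAt := -1   // cycle in which the first scale-down was issued
	for c := 0; c < cycles; c++ {
		// The controller finishes the first request after the delay.
		if requestedAt >= 0 && c-requestedAt >= delayCycles {
			present = 0
		}
		// The scaler still sees the nodes and issues a scale-down for them.
		if present > 0 {
			replicas -= present
			if replicas < 0 {
				replicas = 0
			}
			if requestedAt < 0 {
				requestedAt = c
			}
		}
		history = append(history, replicas)
	}
	return history
}

func main() {
	fmt.Println(simulate(6, 3, 1, 3)) // prompt processing: [3 3 3]
	fmt.Println(simulate(6, 3, 2, 3)) // delayed processing: [3 0 0]
}
```

With a delay of two cycles, the model shows the same 6 to 3 to 0 pattern as the logs above: the second cycle subtracts the same three nodes again.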

@elankath elankath added the kind/bug Bug label Dec 22, 2024
@elankath elankath assigned rishabh-11 and elankath and unassigned elankath Dec 22, 2024
@elankath elankath added the area/auto-scaling Auto-scaling (CA/HPA/VPA/HVPA, predominantly control plane, but also otherwise) related label Dec 22, 2024

elankath commented Jan 1, 2025

Associated PR under review/testing: #341

4 participants