Apply machine deployment changes only to latest machine set, never both (e.g. during update) #754
Still under analysis for the root cause of the bug. The current code distributes replicas across old and new machine sets in dc.scale. This logic appears to be correct and was done to reduce the number of unavailable machines; it was apparently introduced to follow "Scale Deployments Proportionally".
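For illustration, here is a minimal Go sketch of what proportional scaling does when the deployment is resized while two machine sets are active. This is not the actual dc.scale implementation; the types and the function are made up for this example.

```go
package main

import "fmt"

type machineSet struct {
	name     string
	replicas int32
}

// scaleProportionally distributes the change in the deployment's replica
// count across all machine sets in proportion to their current sizes,
// mirroring the "Scale Deployments Proportionally" behaviour of the
// Kubernetes deployment controller. The real controller handles rounding
// leftovers with more care; this sketch just gives them to the first set.
func scaleProportionally(sets []machineSet, newTotal int32) {
	var current int32
	for _, ms := range sets {
		current += ms.replicas
	}
	if current == 0 {
		return // nothing to distribute proportionally
	}
	delta := newTotal - current
	var distributed int32
	for i := range sets {
		share := delta * sets[i].replicas / current // proportional share, truncated
		sets[i].replicas += share
		distributed += share
	}
	sets[0].replicas += delta - distributed // rounding leftover
}

func main() {
	sets := []machineSet{{"old", 3}, {"new", 3}}
	scaleProportionally(sets, 8) // deployment scaled 6 -> 8 mid-rollout
	fmt.Println(sets)            // prints [{old 4} {new 4}]: both machine sets grow
}
```

As the example shows, a scale change arriving mid-rollout ends up touching the old and the new machine set alike, which is exactly the behaviour observed in this issue.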
After internal discussion and going through the code, we found that the double scale-up/scale-down happens because the two MCM logics run one after the other.
Both logics respond correctly to the state of the system when taken individually, but both depend on d.Replicas. Reducing d.Replicas shifts the range mentioned above, and the rolling update then removes a machine (in the scale-down case) even though there is no spare machine to take over its workload. This has caused problems, as we know, with the way the autoscaler asks to remove machines from a machineDeployment.
The above problem of workload having no node to be placed on has two major causes:
To deal with the above, for now we have done the following:
Users of open-source MCM who want the desired result after reducing a machineDeployment's replicas should reduce d.Replicas in a way that the range is not affected (they may have to tweak maxSurge and maxUnavailable, as these are used to calculate the range).
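As a rough illustration of the range mentioned above, assuming (as described in this comment) that the allowed total is bounded by d.Replicas - maxUnavailable below and d.Replicas + maxSurge above; the field names here are illustrative, not the exact MCM types:

```go
package main

import "fmt"

// bounds returns the lower and upper limits that the rolling-update logic is
// assumed to enforce on the total number of machines: it may not drop below
// replicas-maxUnavailable and may not exceed replicas+maxSurge.
func bounds(replicas, maxSurge, maxUnavailable int32) (lower, upper int32) {
	return replicas - maxUnavailable, replicas + maxSurge
}

func main() {
	// With replicas=3, maxSurge=1, maxUnavailable=0 the total must stay in [3, 4].
	lo, hi := bounds(3, 1, 0)
	fmt.Printf("before scale-in: [%d, %d]\n", lo, hi)

	// Reducing d.Replicas to 2 shifts the range to [2, 3]: the rolling-update
	// logic is now allowed to remove a machine even though no spare machine
	// has joined yet to take over its workload.
	lo, hi = bounds(2, 1, 0)
	fmt.Printf("after scale-in:  [%d, %d]\n", lo, hi)
}
```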
What happened:
While investigating issues with pending workload during a rolling update that looked as if drain started on nodes without sufficient new capacity, we found that changes to the machine deployment (e.g. cluster autoscaler scaling out or in) are applied to both machine sets, the old one and the new one. This is at least a bug, but it is also the most likely culprit for the pending workload: an extra machine created for the old machine set becomes Ready but will be tainted immediately, and it may confuse the progress tracking and may result in a wrong node being drained.

Scale-Out
3 x 3 nodes, 2 zones completed their rolling update, 1 zone hangs; maxSurge is 6 / 3 = 2 (maxUnavailable is 0).
After adding more workload, all machine deployments scale 3 -> 4, but the backing machine sets for the blocked zone are both incremented, leading to not 3 (as requested per machine deployment), but 4 new machines:
Scale-In
Same situation as above, now manually reducing the machine deployment replica count 3 -> 2:
Outcome after the 3 -> 2 change: the old machine set replica count was reduced, now violating the rolling update contract (only 2 and no longer the old 3 nodes are Ready, even though no new machine has joined the cluster yet). Now further reducing the machine deployment replica count 2 -> 1:
Outcome after the 2 -> 1 change: now even both machine set replica counts were reduced.
What was expected:
Changes to the machine deployment should only be applied to the new machine set and never to the old machine set while a rolling update is under way. This is definitely true for scale-out. As for scale-in, that is something we should generally avoid while performing a rolling update; that's why we annotate nodes, I believe. I am not sure whether we do that also for new nodes? Either way, as an additional safeguard, we should at a minimum only apply the change to the new machine set, if at all.
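To make the expectation concrete, here is a minimal sketch of the proposed behaviour, with illustrative types and names rather than the actual MCM controller code:

```go
package main

import "fmt"

// Illustrative stand-in for the real machine set type.
type machineSet struct {
	name     string
	replicas int32
}

// applyDeploymentScale sketches the expected behaviour: while a rolling
// update is in progress, any change to the machine deployment's replica
// count is absorbed by the new machine set only; the old machine set is
// left for the rolling-update logic to scale down at its own pace.
func applyDeploymentScale(newMS, oldMS *machineSet, deploymentReplicas int32, rollingUpdateActive bool) {
	if !rollingUpdateActive {
		newMS.replicas = deploymentReplicas
		return
	}
	delta := deploymentReplicas - (newMS.replicas + oldMS.replicas)
	newMS.replicas += delta // the old machine set is never touched here
	if newMS.replicas < 0 {
		newMS.replicas = 0
	}
}

func main() {
	oldMS := &machineSet{"old", 3}
	newMS := &machineSet{"new", 1}
	applyDeploymentScale(newMS, oldMS, 5, true) // deployment scaled 4 -> 5 mid-rollout
	fmt.Println(*oldMS, *newMS)                 // prints {old 3} {new 2}: only the new set changed
}
```

For scale-in the text above suggests avoiding changes during a rolling update altogether; the sketch only shows the "if at all, then only to the new machine set" fallback.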
How to reproduce it:
Have a cluster with a worker pool. Either block new machine creation (verified) or have blocking PDBs (not verified) that slow down the drain. Then initiate a cluster update (e.g. new Kubernetes minor version or OS version). While the process is in progress, modify the machine deployment (scale-out can also be tested indirectly with more workload that needs capacity). You will see that both machine sets are modified.
Anything else we need to know?
As this may break sensitive workload during rolling updates, the priority is critical.