Fix Race conditions in Targeted Deletion of machines by CA #341
base: machine-controller-manager-provider
Conversation
klog.Infof("machine deployment %s is under rolling update, skipping", machineDeployment.Name) | ||
return nil | ||
} | ||
markedMachines := sets.New(strings.Split(mcd.Annotations[machinesMarkedByCAForDeletion], ",")...) |
Please move this to a small function mcm.getMachinesMarkedByCAForDeletion(mcd) (machineNames sets.Set[string]) which is unit-testable and can be consumed by both mcm_cloud_provider.go and mcm_manager.go.
klog.Errorf("[Refresh] failed to get machines for machine deployment %s, hence skipping it. Err: %v", machineDeployment.Name, err.Error()) | ||
return err | ||
} | ||
var incorrectlyMarkedMachines []*Ref |
I believe we can omit the Ref struct and just use types.NamespacedName, which is already being used anyway in this file.
@@ -539,12 +567,17 @@ func buildMachineDeploymentFromSpec(value string, mcmManager *McmManager) (*Mach
return machinedeployment, nil
}

func getMachinesMarkedByCAForDeletion(mcd *v1alpha1.MachineDeployment) sets.Set[string] { |
Perhaps using the "comma, ok" idiom here for mcd.Annotations[machinesMarkedByCAForDeletion] would be good for clarity, to avoid an unnecessary split of an empty string.
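Something along these lines, just as a sketch (assuming the annotation constant and the strings/sets imports already used in this file):

```go
// getMachineNamesMarkedByCAForDeletion returns the machine names stored in the
// machines-marked-by-ca-for-deletion annotation of the MachineDeployment.
// The "comma, ok" idiom avoids splitting an empty string into [""].
func getMachineNamesMarkedByCAForDeletion(mcd *v1alpha1.MachineDeployment) sets.Set[string] {
	if value, ok := mcd.Annotations[machinesMarkedByCAForDeletion]; ok && value != "" {
		return sets.New(strings.Split(value, ",")...)
	}
	return sets.New[string]()
}
```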
Isn't it allowed to read from a nil map? Only writes will cause panic, right?
getMachinesMarkedByCAForDeletion -> getMachineNamesMarkedByCAForDeletion
// ignore the machine deployment if it is in rolling update
if !isRollingUpdateFinished(mcd) {
klog.Infof("machine deployment %s is under rolling update, skipping", machineDeployment.Name)
return nil |
Shouldn't we return an error here? We are doing it for other cases of !isRollingUpdateFinished.
I think we should remove this check. Even if a machineDeployment is under rolling update, we should allow the annotation update if needed. wdyt?
// addNodeGroup adds node group defined in string spec. Format:
// minNodes:maxNodes:namespace.machineDeploymentName
func (m *McmManager) addNodeGroup(spec string) error {
machineDeployment, err := buildMachineDeploymentFromSpec(spec, m) |
buildMachineDeploymentFromSpec should also be moved to mcm_manager.go.
func (machineDeployment *MachineDeployment) Refresh() error {
machineDeployment.scalingMutex.Lock()
defer machineDeployment.scalingMutex.Unlock()
mcd, err := machineDeployment.mcmManager.machineDeploymentLister.MachineDeployments(machineDeployment.Namespace).Get(machineDeployment.Name) |
This is repeated nearly a dozen times, including the common error handling: machineDeployment.mcmManager.machineDeploymentLister.MachineDeployments(machineDeployment.Namespace).Get(machineDeployment.Name). Move it to a method on mcmManager called GetMachineDeploymentResource which returns a formatted error that can simply be returned, so that the error message is fully consistent across all uses. We already have methods like mcmManager.getMachinesForMachineDeployment, so this matches the existing convention.
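Roughly something like this (only a sketch; the exact error wording is illustrative):

```go
// GetMachineDeploymentResource fetches the MachineDeployment object via the lister and
// wraps a lookup failure into a consistently formatted error that callers can return as-is.
func (m *McmManager) GetMachineDeploymentResource(mdName string) (*v1alpha1.MachineDeployment, error) {
	md, err := m.machineDeploymentLister.MachineDeployments(m.namespace).Get(mdName)
	if err != nil {
		return nil, fmt.Errorf("unable to fetch MachineDeployment object %s/%s: %w", m.namespace, mdName, err)
	}
	return md, nil
}
```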
if machine.Status.CurrentStatus.Phase == v1alpha1.MachineTerminating || machine.Status.CurrentStatus.Phase == v1alpha1.MachineFailed {
continue
}
if annotValue, ok := machine.Annotations[machinePriorityAnnotation]; ok && annotValue == priorityValueForCandidateMachines && !markedMachines.Has(machine.Name) { |
Out of curiosity, why is this called priorityValueForCandidateMachines? Shouldn't it be defined as const PriorityDeletionValueForCandidateMachine = 1?
I think we can rename it to PriorityValueForDeletionCandidateMachine. wdyt?
We should also add a comment here to explain what we are doing, so the next person making a patch doesn't have to scratch their head.
}
}
clone := mcd.DeepCopy()
clone.Annotations[machinesMarkedByCAForDeletion] = strings.Join(updatedMarkedMachines, ",") |
This strings.Join to construct the annotation value is repeated elsewhere. Make a utility method called getMarkedForDeletionAnnotationValue(machineNames []string) string.
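For example (a sketch, using the name suggested above):

```go
// getMarkedForDeletionAnnotationValue builds the comma-separated annotation value
// from the given machine names, mirroring the strings.Split used when reading it back.
func getMarkedForDeletionAnnotationValue(machineNames []string) string {
	return strings.Join(machineNames, ",")
}
```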
klog.Errorf("[Refresh] failed to get machines for machine deployment %s, hence skipping it. Err: %v", machineDeployment.Name, err.Error()) | ||
return err | ||
} | ||
var incorrectlyMarkedMachines []*types.NamespacedName |
minor (need not be corrected): types.NamespacedName is a simple struct - a value object. Modern programming practice discourages the use of pointers to value objects, keeping data inline for processor cache performance. Pointers should be used only for structs holding active resources (like locks/files/etc.) or class/service objects. Using a simple []types.NamespacedName, with types.NamespacedName instead of *types.NamespacedName, should be preferred for newer code.
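For illustration, the value-typed version could look like this (the predicate is a hypothetical stand-in for the existing checks):

```go
// Collect machines whose priority annotation must be reset, storing the
// NamespacedName values inline instead of as pointers.
incorrectlyMarkedMachines := make([]types.NamespacedName, 0, len(machines))
for _, machine := range machines {
	if machineNeedsPriorityReset(machine) { // hypothetical stand-in for the real conditions
		incorrectlyMarkedMachines = append(incorrectlyMarkedMachines, types.NamespacedName{Name: machine.Name, Namespace: machine.Namespace})
	}
}
```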
@@ -539,12 +567,17 @@ func buildMachineDeploymentFromSpec(value string, mcmManager *McmManager) (*Mach
return machinedeployment, nil
}

func getMachinesMarkedByCAForDeletion(mcd *v1alpha1.MachineDeployment) sets.Set[string] { |
Rename the function to getMachineNamesMarkedByCAForDeletion
@@ -508,27 +474,32 @@ func (m *McmManager) DeleteMachines(targetMachineRefs []*Ref) error {
if !isRollingUpdateFinished(md) {
return fmt.Errorf("MachineDeployment %s is under rolling update , cannot reduce replica count", commonMachineDeployment.Name)
}
markedMachines := getMachinesMarkedByCAForDeletion(md) |
markedMachines -> machineNamesMarkedByCA
incorrectlyMarkedMachines = append(incorrectlyMarkedMachines, &types.NamespacedName{Name: machine.Name, Namespace: machine.Namespace})
}
}
var updatedMarkedMachines []string |
updatedMarkedMachines -> updatedMarkedMachineNames. Also please add a comment explaining that we do this to ensure the annotation does not contain non-existent machine names.
}
}
var updatedMarkedMachines []string
for machineName := range markedMachines { |
minor: Technically, you can create allMachineNames and use Set.Intersection here, which is cheaper than a nested for loop, but it's OK.
AFAIK k8s.io/apimachinery/pkg/util/sets does not have an Intersection method.
Yes, but it is not part of the apimachinery version we currently use. Anyway, we do not need sets; I have replaced them with slices.
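A possible slice-based version of that filtering, roughly equivalent to a set intersection (variable names are illustrative):

```go
// Keep only the marked machine names that still correspond to existing machines,
// so the annotation never references machines that no longer exist.
existingMachineNames := make(map[string]struct{}, len(machines))
for _, machine := range machines {
	existingMachineNames[machine.Name] = struct{}{}
}
var updatedMarkedMachineNames []string
for _, name := range machineNamesMarkedByCA {
	if _, ok := existingMachineNames[name]; ok {
		updatedMarkedMachineNames = append(updatedMarkedMachineNames, name)
	}
}
```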
collectiveError = errors.Join(collectiveError, m.resetPriorityForMachines(targetRefs))
}
for _, machineDeployment := range m.machineDeployments {
collectiveError = errors.Join(collectiveError, machineDeployment.Refresh()) |
Non-optimal use of errors.Join, which will create repeated struct instances. errors.Join creates a new error object by combining multiple errors, so if it is used in a loop it allocates memory on every iteration. The right usage is to first declare var errs []error outside the loop, append individual errors to this slice within the loop, and then call errors.Join once outside the loop to create the final error.
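In other words, something along these lines (sketch only):

```go
// Accumulate individual errors and join them once after the loop, instead of
// calling errors.Join on every iteration.
var errs []error
for _, machineDeployment := range m.machineDeployments {
	if err := machineDeployment.Refresh(); err != nil {
		errs = append(errs, err)
	}
}
return errors.Join(errs...) // returns nil when errs is empty
```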
// update priorities of machines to be deleted except the ones already in termination to 1
scaleDownAmount, err := m.prioritizeMachinesForDeletion(targetMachineRefs)
machinesWithPrio1, err := m.prioritizeMachinesForDeletion(targetMachineRefs) |
machinesWithPrio1 -> machineNamesWithPrio1
var incorrectlyMarkedMachines []*types.NamespacedName
for _, machine := range machines {
// no need to reset priority for machines already in termination or failed phase
if machine.Status.CurrentStatus.Phase == v1alpha1.MachineTerminating || machine.Status.CurrentStatus.Phase == v1alpha1.MachineFailed { |
We already have a utility method isMachineFailedOrTerminating. Is that not suitable, or can it be made suitable?
if err != nil {
return err
}
markedMachines.Insert(machinesWithPrio1...)
// Trying to update the machineDeployment till the deadline
err = m.retry(func(ctx context.Context) (bool, error) { |
Please consider whether one of the apimachinery wait.Poll* utility functions can be used, or whether retry can delegate to one of the apimachinery wait.Poll* functions. wait.PollUntilContextTimeout seems suitable.
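For reference, a variant based on k8s.io/apimachinery/pkg/util/wait might look roughly like this (interval and maxRetryTimeout are placeholders; note that, unlike the custom retry, Poll* stops as soon as the condition func returns a non-nil error):

```go
// Retry the MachineDeployment update until it succeeds or the timeout expires.
// The condition func has the same (done bool, err error) shape as the current retry callback.
err = wait.PollUntilContextTimeout(context.TODO(), 5*time.Second, maxRetryTimeout, true,
	func(ctx context.Context) (bool, error) {
		return m.scaleDownMachineDeployment(ctx, commonMachineDeployment.Name, scaleDownAmount)
	})
```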
We are using our own retry since wait.PollUntilContextTimeout does not give logs. We were able to debug the issue thanks to these logs, so I think we should still keep it.
var expectedToTerminateMachineNodePairs = make(map[string]string)
var machinesMarkedWithPrio1 []string |
machinesMarkedWithPrio1 -> prio1MarkedMachineNames. (Optionally, move this variable to a named return value on this method.)
MINOR: Since you already have a map expectedToTerminateMachineNodePairs, which should be called prio1MarkedMachineNodeNamePairs, you don't need another slice. You can use Go's maps.Keys(prio1MarkedMachineNodeNamePairs).Collect() as the return value instead of a separate heap allocation for a duplicate slice.
maps.Keys was introduced in Go 1.23 - https://pkg.go.dev/maps@go1.22.10 does not have a Keys function.
Hmm... my bad, this function was added later to the maps package, which was itself newly introduced in Go 1.21.
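For the record, with Go 1.23+ the iterator-based variant would be written with slices.Collect, since maps.Keys returns an iterator rather than a slice (map name taken from the suggestion above; requires the maps and slices imports):

```go
// Collect the prio-1 machine names straight from the map keys (Go 1.23+).
prio1MarkedMachineNames := slices.Collect(maps.Keys(prio1MarkedMachineNodeNamePairs))
```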
if kube_errors.IsNotFound(err) {
klog.Warningf("Machine %s not found, skipping resetting priority annotation", mcRef.Name)
continue
}
if err != nil {
collectiveError = errors.Join(collectiveError, fmt.Errorf("unable to get Machine object %s, Error: %v", mcRef, err)) |
Same point regarding errors.Join - it should be invoked just once outside the loop to avoid repeated memory allocation inside the for loop.
@rishabh-11 thanks for the PR. Added review comments. Can you make a list of all test cases that you can think of - including successes, failures, crashes, etc.? I will test them out and attach the test log.
This PR is currently under review/testing. Will update with further details later.
}

// MinSize returns minimum size of the node group.
func (machinedeployment *MachineDeployment) MinSize() int {
return machinedeployment.minSize
func (machineDeployment *MachineDeployment) MinSize() int { |
It is usual practice to use short receiver names. In this case you can perhaps just use:
func (md *MachineDeployment) MinSize() int { ... }
Reference: https://google.github.io/styleguide/go/decisions.html#receiver-names
Yes, I actually mentioned to rishabh that our machineDeployment receiver and variable names and the CA nodeimpl struct names are all overloaded. But he wanted to minimize the changes. I will make this change anyway.
if delta >= 0 {
return fmt.Errorf("size decrease size must be negative")
}
size, err := machinedeployment.mcmManager.GetMachineDeploymentSize(machinedeployment)
machineDeployment.scalingMutex.Lock() |
When I looked at the code, this method is called from fixNodeGroupSize, which in our fork is not called from anywhere. So having a mutex here is kind of useless, right? The issue states that the problem is duplicate scale down, but I am unable to relate this code change to the actual issue.

// Refresh resets the priority annotation for the machines that are not present in machines-marked-by-ca-for-deletion annotation on the machineDeployment
func (machineDeployment *MachineDeployment) Refresh() error {
machineDeployment.scalingMutex.Lock() |
Why should a scaling mutex be used in Refresh?
A scaling mutex, by its very name, signifies a mutex meant to be taken for any scaling event, not when we are only trying to read/get machine deployments.
Also, can this result in the next reconcile of RunOnce being stuck because it cannot go beyond the CloudProvider.Refresh call?
We need a mutex when running scaling operations or when modifying attributes that directly affect the scaling operation, like the newly introduced annotation. There is no doubt about this: we HAVE to prevent concurrent scale-ups and scale-downs, because our MCM scaling is async and multiple goroutines can perform scale-downs, leading to non-deterministic behaviour.
You have a point that RunOnce should return an error and not block if the mutex is already acquired. Will change the code to check if the mutex is acquired and, if so, return an informative error. Will also test this change.
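One way to make that non-blocking, assuming scalingMutex is a sync.Mutex (sync.Mutex.TryLock has been available since Go 1.18); only a sketch of the idea:

```go
// Refresh resets priority annotations for machines no longer marked for deletion.
// Instead of blocking while a scaling operation holds the mutex, it returns an
// informative error so that the autoscaler's RunOnce loop is never stuck here.
func (machineDeployment *MachineDeployment) Refresh() error {
	if !machineDeployment.scalingMutex.TryLock() {
		return fmt.Errorf("machine deployment %s is under scaling operation, skipping refresh", machineDeployment.Name)
	}
	defer machineDeployment.scalingMutex.Unlock()
	// ... existing refresh logic ...
	return nil
}
```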
// Trying to update the machineDeployment till the deadline
err = m.retry(func(ctx context.Context) (bool, error) {
return m.scaleDownMachineDeployment(ctx, commonMachineDeployment.Name, scaleDownAmount)
return m.scaleDownAndAnnotateMachineDeployment(ctx, commonMachineDeployment.Name, len(machineNamesWithPrio1), createMachinesMarkedForDeletionAnnotationValue(machineNamesMarkedByCA)) |
Right now we make one call to KAPI to update the priorities and another call to change the replicas + add scale down annotation. Why can't these two be combined to reduce the number of calls to KAPI and further reduce the complexity of handling failure cases between the two calls?
md, err := m.machineDeploymentLister.MachineDeployments(m.namespace).Get(mdName)
// scaleDownAndAnnotateMachineDeployment scales down the machine deployment by the provided scaleDownAmount and returns the updated spec.Replicas after scale down.
// It also updates the machines-marked-by-ca-for-deletion annotation on the machine deployment with the list of existing machines marked for deletion.
func (m *McmManager) scaleDownAndAnnotateMachineDeployment(ctx context.Context, mdName string, scaleDownAmount int, markedMachines string) (bool, error) { |
- Should lock the mutex before updating the mcd
- There is a check in this function to skip the mcd update if there is no change to the number of replicas. I guess this should also be updated to consider the annotation as well
@@ -98,6 +100,9 @@ const (
machineDeploymentPausedReason = "DeploymentPaused"
// machineDeploymentNameLabel key for Machine Deployment name in machine labels
machineDeploymentNameLabel = "name"
// machinesMarkedByCAForDeletion is the annotation set by CA on machine deployment. Its value denotes the machines that
// CA marked for deletion by updating the priority annotation to 1 and scaling down the machine deployment.
machinesMarkedByCAForDeletion = "cluster-autoscaler.kubernetes.io/machines-marked-by-ca-for-deletion" |
machinesMarkedByCAForDeletion = "cluster-autoscaler.kubernetes.io/machines-marked-by-ca-for-deletion"
machinesMarkedByCAForDeletionAnnotation = "cluster-autoscaler.kubernetes.io/machines-marked-by-ca-for-deletion"
Need not be considered, but this is usually how annotation constants are named.
min,
max,
}, nil
return machineDeployment, nil
}

// Refresh method, for each machine deployment, will reset the priority of the machines if the number of annotated machines is more than desired.
// It will select the machines to reset the priority based on the descending order of creation timestamp.
func (m *McmManager) Refresh() error { |
Docstring should be updated to be more accurate
@@ -916,6 +899,16 @@ func (m *McmManager) GetMachineDeploymentNodeTemplate(machinedeployment *Machine
return nodeTmpl, nil
}

// GetMachineDeploymentResource returns the MachineDeployment object for the provided machine deployment name
func (m *McmManager) GetMachineDeploymentResource(mdName string) (*v1alpha1.MachineDeployment, error) { |
func (m *McmManager) GetMachineDeploymentResource(mdName string) (*v1alpha1.MachineDeployment, error) {
func (m *McmManager) GetMachineDeploymentObject(mdName string) (*v1alpha1.MachineDeployment, error) {
I guess Object is more accurate than Resource?
What this PR does / why we need it:
This PR fixes the issues noticed in live issues 6120 and 6101. We introduce a mutex in the machineDeployment struct (node group implementation) which must be acquired before performing a scale-down or scale-up operation.
We also introduce an annotation on the machine deployment
cluster-autoscaler.kubernetes.io/machines-marked-by-ca-for-deletion
whose value denotes the machines CA wants to remove. This will help recognize machines for which CA has already reduced the replicas of the machine deployment, and prevent the scale-down from being duplicated.
Which issue(s) this PR fixes:
Fixes #342
Special notes for your reviewer:
Release note: