fix race condition #52

hyang200 · 2022-10-07T00:18:39Z

This PR fixes race conditions where nodes are terminated (e.g. by cluster-autoscaler) while node cycling is in progress. They were identified in two phases:

Cordon Phase - Trying to cordon a node that has already been removed from the k8s API; the CNR failed with:

Node "xxxx" not found

Healing Phase - Trying to re-attach an EC2 instance that is in the terminated/shutting-down state to the ASG
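The two failure modes above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code in this PR: `errNodeNotFound`, `cordonFunc`, `cordonAll`, `canReattach` and `fakeCordon` are all hypothetical names, and the real change may detect these conditions differently (for example through the Kubernetes or AWS error types). The idea is only that a node which has already vanished is skipped instead of failing the CNR, and that an instance in the `shutting-down` or `terminated` state is not re-attached.

```go
package main

import (
	"errors"
	"fmt"
)

// errNodeNotFound is a hypothetical sentinel standing in for the API's "not found" error.
var errNodeNotFound = errors.New("node not found")

// cordonFunc is a hypothetical hook that cordons a single node by name.
type cordonFunc func(name string) error

// cordonAll cordons every node, skipping nodes that no longer exist
// and returning early on any other error.
func cordonAll(names []string, cordon cordonFunc) (skipped []string, err error) {
	for _, name := range names {
		if cerr := cordon(name); cerr != nil {
			if errors.Is(cerr, errNodeNotFound) {
				// The node was removed underneath us (e.g. by an autoscaler): not fatal.
				skipped = append(skipped, name)
				continue
			}
			return skipped, fmt.Errorf("cordon %s: %w", name, cerr)
		}
	}
	return skipped, nil
}

// canReattach reports whether an instance in the given EC2 state
// is still worth re-attaching to its group.
func canReattach(state string) bool {
	switch state {
	case "shutting-down", "terminated":
		return false
	default:
		return true
	}
}

// fakeCordon simulates an API where the node named "gone" has been removed.
func fakeCordon(name string) error {
	if name == "gone" {
		return fmt.Errorf("Node %q not found: %w", name, errNodeNotFound)
	}
	return nil
}

func main() {
	skipped, err := cordonAll([]string{"node-a", "gone"}, fakeCordon)
	fmt.Println(skipped, err)
	fmt.Println(canReattach("running"), canReattach("terminated"))
}
```

Wrapping the sentinel with `%w` keeps the original message readable in logs while still letting `errors.Is` recognise the condition.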

To simulate the scenario, we can insert the snippet below into each of the transition phases:

// Simulate unexpected termination: terminate every node currently being cycled.
for _, node := range t.cycleNodeRequest.Status.CurrentNodes {
	err := t.rm.CloudProvider.TerminateInstance(node.ProviderID)
	t.rm.LogEvent(t.cycleNodeRequest, "TestTerminate", "Terminating: %v, err: %v", node.Name, err)
	// Look the node up again; the result is discarded in this test snippet.
	_, err = t.rm.GetNode(node.Name)
}

- cordon removed nodes
- reattach terminated nodes

atlassian-cla-bot · 2022-10-07T00:18:42Z

Hooray! All contributors have signed the CLA.

hyang200 · 2022-10-07T00:22:53Z

pkg/controller/cyclenoderequest/transitioner/transitions.go

@@ -303,8 +303,8 @@ func (t *CycleNodeRequestTransitioner) transitionScalingUp() (reconcile.Result,

 	// Check we have waited long enough - give the node some time to start up
 	if time.Since(scaleUpStarted.Time) <= scaleUpWait {
-		t.rm.LogEvent(t.cycleNodeRequest, "ScalingUpWaiting", "Waiting for new nodes to be ready")
-		return reconcile.Result{Requeue: true, RequeueAfter: requeueDuration}, nil
+		t.rm.LogEvent(t.cycleNodeRequest, "ScalingUpWaiting", "Waiting for new nodes to be warmed up")


Reduce the number of unnecessary requeues.
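To give a feel for why this matters, here is a small standalone sketch that counts how many reconcile passes are scheduled while waiting out the scale-up delay. The two constants are hypothetical values chosen for illustration (the real `scaleUpWait` and `requeueDuration` are not shown in this thread), and `reconcilesDuringWait` is a made-up helper, not part of the controller.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical values for illustration only; the real settings are not shown in this thread.
const (
	hypotheticalScaleUpWait     = 5 * time.Minute
	hypotheticalRequeueDuration = 30 * time.Second
)

// reconcilesDuringWait returns how many requeued reconcile passes are scheduled
// while waiting for wait to elapse when re-queuing every interval.
func reconcilesDuringWait(wait, interval time.Duration) int {
	if wait <= 0 || interval <= 0 {
		return 0
	}
	passes := int(wait / interval)
	if wait%interval != 0 {
		passes++ // a partial interval still costs one more pass
	}
	return passes
}

func main() {
	// Polling on the short interval versus sleeping for the whole wait in one go.
	fmt.Println(reconcilesDuringWait(hypotheticalScaleUpWait, hypotheticalRequeueDuration))
	fmt.Println(reconcilesDuringWait(hypotheticalScaleUpWait, hypotheticalScaleUpWait))
}
```

Under these assumed values, re-queuing once after the full wait replaces ten short polls that could never pass the `time.Since(...) <= scaleUpWait` check anyway.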

pkg/controller/cyclenoderequest/transitioner/transitions.go

hyang200 · 2022-10-07T00:24:10Z

pkg/controller/cyclenoderequest/transitioner/util.go

@@ -160,7 +160,7 @@ func (t *CycleNodeRequestTransitioner) finalReapChildren() (shouldRequeue bool,
 	}

 	switch t.cycleNodeRequest.Status.Phase {
-	case v1.CycleNodeRequestInitialised:
+	case v1.CycleNodeRequestInitialised, v1.CycleNodeRequestFailed:


Fix a Failed CNR getting stuck in an infinite requeue loop.
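As a simplified illustration of the idea, the sketch below treats a failed request the same way as an initialised one when deciding whether to requeue. The `phase` type, its string values and `shouldRequeue` are hypothetical stand-ins, and it assumes the shared case ends without requeuing, as the commit message "set shouldRequeue to false on Failed" suggests; the real `finalReapChildren` does more than this.

```go
package main

import "fmt"

// phase is a hypothetical stand-in for the CNR status phase type.
type phase string

// Hypothetical phase values; the real constants live in the project's v1 API package.
const (
	phaseInitialised phase = "Initialised"
	phasePending     phase = "Pending"
	phaseFailed      phase = "Failed"
)

// shouldRequeue reports whether the reconciler should schedule another pass.
// Failed is terminal here, so it must not requeue, otherwise it would loop forever.
func shouldRequeue(p phase) bool {
	switch p {
	case phaseInitialised, phaseFailed:
		return false
	default:
		return true
	}
}

func main() {
	for _, p := range []phase{phaseInitialised, phasePending, phaseFailed} {
		fmt.Printf("%s requeue=%v\n", p, shouldRequeue(p))
	}
}
```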

vincentportella · 2022-10-07T00:28:05Z

Change the version in the makefile to 1.8.2

dtnyn · 2022-10-07T06:32:47Z

Change the version in the makefile to 1.8.2

We can wait and cut the release in a separate PR, so that this issue's changes are not mixed with the release-cut trigger.

vincentportella · 2022-10-07T08:37:42Z

You are correct 🤦‍♂️

pkg/cloudprovider/aws/aws.go

pkg/k8s/node.go

hyang200 · 2022-10-11T23:11:00Z

Closing to investigate an intermittent failure in the pre-termination trigger.

- soft fail pretermination trigger and checks

pkg/controller/cyclenoderequest/transitioner/transitions.go

dtnyn

LGTM

pkg/controller/cyclenoderequest/transitioner/transitions.go

hyang200 · 2022-10-13T21:58:49Z

@vincentportella there are many other places where %v is used for string values. I don't really know what difference %s would make, but feel free to raise another PR if you wish to replace all of them. Please let me know if you have any other feedback.

vincentportella

lgtm

- improvement for race condition handling #52

hyang200 added 5 commits October 5, 2022 16:18

requeue in scaleUpWait instead of requeueDuration on initial check

7923533

ignore incorrect instance state in attach

4f7aefc

set shouldRequeue to false on Failed

f7dc1ff

fix race condition in cordon and healing

156a782

- cordon removed nodes - reattach terminated nodes

clean up go mod

3b10dcf

hyang200 commented Oct 7, 2022

pkg/controller/cyclenoderequest/transitioner/transitions.go

hyang200 commented Oct 7, 2022

hyang200 added 2 commits October 10, 2022 15:39

handle node not found error explicitly

59eb8ae

add verifyIfErrorOccuredWithDefaults func

cf47b29

hyang200 marked this pull request as ready for review October 10, 2022 05:09

vincentportella reviewed Oct 10, 2022

pkg/cloudprovider/aws/aws.go

vincentportella reviewed Oct 10, 2022

pkg/k8s/node.go

update error handling

ce62e4f

dtnyn previously approved these changes Oct 11, 2022

remove error handling

104c8eb

hyang200 dismissed dtnyn’s stale review via 104c8eb October 11, 2022 04:23

include respond body in termTrigger err

84820e4

hyang200 closed this Oct 11, 2022

fix race condition in pretermination handlings

91ebd81

- soft fail pretermination trigger and checks

hyang200 reopened this Oct 12, 2022

hyang200 requested review from vincentportella and dtnyn October 12, 2022 05:53

dtnyn reviewed Oct 12, 2022

pkg/controller/cyclenoderequest/transitioner/transitions.go

revert softfail term trigger/hcs

33e2a2a

dtnyn previously approved these changes Oct 13, 2022

vincentportella reviewed Oct 13, 2022

pkg/controller/cyclenoderequest/transitioner/transitions.go

vincentportella reviewed Oct 13, 2022

pkg/controller/cyclenoderequest/transitioner/transitions.go

vincentportella reviewed Oct 13, 2022

pkg/controller/cyclenoderequest/transitioner/transitions.go

vincentportella reviewed Oct 13, 2022

pkg/controller/cyclenoderequest/transitioner/transitions.go

hyang200 dismissed dtnyn’s stale review via 92b4646 October 13, 2022 21:50

hyang200 requested review from vincentportella and dtnyn October 13, 2022 21:59

update error message

7078d46

dtnyn approved these changes Oct 13, 2022

vincentportella approved these changes Oct 14, 2022

hyang200 merged commit eaf643c into atlassian-labs:master Oct 16, 2022

hyang200 deleted the fix-race-condition branch October 16, 2022 22:49

hyang200 added a commit that referenced this pull request Oct 25, 2022

release 1.8.2

1371401

- improvement for race condition handling #52

hyang200 mentioned this pull request Oct 25, 2022

release 1.8.2 #53

Merged

hyang200 added a commit that referenced this pull request Oct 25, 2022

release 1.8.2 (#53)

2df7147

- improvement for race condition handling #52
