Don't deref nil nodegroup in deleteCreatedNodesWithErrors #4926

Conversation

@bpineau bpineau commented May 30, 2022

Various cloudproviders' NodeGroupForNode() implementations (including aws, azure, and gce) can return a nil error and a nil nodegroup. For instance, we're seeing AWS return that on failed upscales on live clusters, with a recent cluster-autoscaler build.

So in deleteCreatedNodesWithErrors, checking that NodeGroupForNode() doesn't return an error is not enough to safely dereference the nodegroup it returns by calling nodegroup.Id().

In that situation, logging and returning early seems the safest option: it gives the various caches (e.g. clusterstateregistry's and the cloud provider's) the opportunity to eventually converge, rather than resuming with an inconsistent internal state.
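
To illustrate the failure mode, a minimal standalone sketch (not the actual cluster-autoscaler code; the CloudProvider and NodeGroup interfaces here are stripped-down stand-ins for the real ones):

package main

import "fmt"

type NodeGroup interface {
	Id() string
}

type CloudProvider interface {
	NodeGroupForNode(providerID string) (NodeGroup, error)
}

type fakeProvider struct{}

// NodeGroupForNode returns (nil, nil) for nodes the provider does not know
// about, mimicking the aws/azure/gce behavior described above.
func (p *fakeProvider) NodeGroupForNode(providerID string) (NodeGroup, error) {
	return nil, nil
}

func main() {
	var cp CloudProvider = &fakeProvider{}
	ng, err := cp.NodeGroupForNode("aws:///us-east-1b/i-placeholder")
	if err != nil {
		fmt.Println("cannot determine nodegroup:", err)
		return
	}
	// Without this guard, ng.Id() would panic with the same
	// "invalid memory address or nil pointer dereference" seen in the trace below.
	if ng == nil {
		fmt.Println("node has no known nodegroup; logging and returning early")
		return
	}
	fmt.Println("nodegroup:", ng.Id())
}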

===

With regard to the AWS cloudprovider triggering that issue with recent CA builds:

What we're seeing is:

I0523 16:37:53.847080     228 aws_manager.go:262] Refreshed ASG list, next refresh after 2022-05-23 16:38:53.847076185 +0000 UTC m=+559.468661724
W0523 16:37:53.905286     228 clusterstate.go:594] Nodegroup is nil for aws:///us-east-1b/i-placeholder-redacted-766c0982e71f-2
I0523 16:37:53.905533     228 clusterstate.go:1008] Found 1 instances with errorCode OutOfResource.placeholder-cannot-be-fulfilled in nodeGroup redacted-d63729665fff
I0523 16:37:53.905546     228 clusterstate.go:1026] Failed adding 1 nodes (0 unseen previously) to group redacted-d63729665fff due to OutOfResource.placeholder-cannot-be-fulfilled; errorMessages=[]string{"AWS cannot provision any more instances for this node group"}
[...]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x58 pc=0x3639435]
goroutine 63 [running]:
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).deleteCreatedNodesWithErrors(0xc03e307d00)
        /go/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:677 +0x275
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).RunOnce(0xc03e307d00, {0x4, 0x0, 0x7b9b840})
        /go/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:360 +0x12a5
main.run(0xc00088ae00, {0x4dc8998, 0xc000855230})
        /go/src/k8s.io/autoscaler/cluster-autoscaler/main.go:392 +0x2ad
main.main.func2({0x0, 0x0})
        /go/src/k8s.io/autoscaler/cluster-autoscaler/main.go:479 +0x25
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
        /go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:211 +0x154

I can't easily reproduce the issue to verify this hypothesis, but it seems likely that the clusterstateregistry and the AWS caches are out of sync. That failure path is new, because flagging "failed at creation" instances was only recently introduced for the AWS provider (8ac87b3).

We might get the following sequence:

  1. RunOnce 1: an AWS upscale is attempted (but the AWS ASG fails to create an instance)
  2. RunOnce 2: a Refresh() call prompts the AWS cloudprovider to regenerate the instance lists and ASG mappings, including a new fake/placeholder instance for the failed one
  3. RunOnce 2: clusterstateregistry gathers and caches (for 2 min) the nodes list, which includes the placeholder instance
  4. RunOnce 2: deleteCreatedNodesWithErrors() is called to garbage-collect the failed instance
  5. RunOnce 3: a Refresh() call regenerates the nodes/ASG mapping; there are no dangling placeholder nodes anymore
  6. RunOnce 3: deleteCreatedNodesWithErrors() is called, uses the outdated clusterstateregistry cache, tries to deref a nil nodegroup for the removed placeholder instance, and segfaults

/kind bug
/kind regression

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. kind/regression Categorizes issue or PR as related to a regression from a prior release. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 30, 2022
Contributor

@mwielgus mwielgus left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 30, 2022
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bpineau, mwielgus

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 30, 2022
@k8s-ci-robot k8s-ci-robot merged commit 6558bed into kubernetes:master May 30, 2022
@@ -676,6 +681,9 @@ func (a *StaticAutoscaler) deleteCreatedNodesWithErrors() bool {
 			klog.Warningf("Cannot determine nodeGroup for node %v; %v", id, err)
 			continue
 		}
+		if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
+			return false, fmt.Errorf("node %s has no known nodegroup", node.GetName())
+		}
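
For context on why the patch checks both conditions: in Go, an interface value that wraps a typed nil pointer does not compare equal to nil, so nodeGroup == nil alone would miss a provider returning a nil concrete pointer; reflect.ValueOf(nodeGroup).IsNil() catches that case. A runnable illustration (concreteGroup is a hypothetical stand-in for a provider's concrete nodegroup type):

package main

import (
	"fmt"
	"reflect"
)

type NodeGroup interface{ Id() string }

type concreteGroup struct{ id string }

func (g *concreteGroup) Id() string { return g.id }

func main() {
	var p *concreteGroup // nil pointer of a concrete type
	var ng NodeGroup = p // interface holds (type=*concreteGroup, value=nil)

	fmt.Println(ng == nil)                   // false: the interface itself is non-nil
	fmt.Println(reflect.ValueOf(ng).IsNil()) // true: the wrapped pointer is nil
	// ng.Id() would panic here: the method dereferences its nil receiver.
}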
Contributor

@qianlei90 qianlei90 commented Feb 16, 2023

With this change, RunOnce will always return early when the nodegroup is nil, and CA cannot make progress until CloudProviderNodeInstancesCache is refreshed, which may take up to 2 min.
Should we refresh CloudProviderNodeInstancesCache here to sync up with the cloud provider?

@bpineau @mwielgus

@qianlei90 (Contributor)

We might get the following sequence:

  1. RunOnce 1: an AWS upscale is attempted (but the AWS ASG fails to create an instance)
  2. RunOnce 2: a Refresh() call prompts the AWS cloudprovider to regenerate the instance lists and ASG mappings, including a new fake/placeholder instance for the failed one
  3. RunOnce 2: clusterstateregistry gathers and caches (for 2 min) the nodes list, which includes the placeholder instance
  4. RunOnce 2: deleteCreatedNodesWithErrors() is called to garbage-collect the failed instance
  5. RunOnce 3: a Refresh() call regenerates the nodes/ASG mapping; there are no dangling placeholder nodes anymore
  6. RunOnce 3: deleteCreatedNodesWithErrors() is called, uses the outdated clusterstateregistry cache, tries to deref a nil nodegroup for the removed placeholder instance, and segfaults

I think the correct sequence is:

  1. RunOnce 1: a scale-up is attempted and fails
  2. RunOnce 2: cloudprovider Refresh() generates a new fake/placeholder for the failed instance
  3. RunOnce 2: deleteCreatedNodesWithErrors() is called to garbage-collect the failed instance, and the instance's entry in the clusterstateregistry cache is removed
  4. Between RunOnce 2 and RunOnce 3: clusterstateregistry refreshes its instance cache and picks up the now out-of-sync instance list from the provider; note that CloudProviderNodeInstancesCache only refreshes every 2 min (see the cache sketch after this list)
  5. RunOnce 3: a Refresh() call regenerates the nodes/ASG mapping; there are no dangling placeholder nodes anymore
  6. RunOnce 3: the outdated clusterstateregistry cache is used and still reports a failed instance, so deleteCreatedNodesWithErrors() is called again, tries to deref a nil nodegroup for the removed placeholder instance, and segfaults
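
To make the timing hazard concrete, here is a toy time-bound cache (purely illustrative; the real CloudProviderNodeInstancesCache is more involved). A snapshot fetched before the provider's Refresh() keeps being served until the TTL expires, so a consumer can still see instances the provider no longer knows about:

package main

import (
	"fmt"
	"time"
)

// ttlCache serves a cached snapshot until it is older than ttl,
// loosely mimicking a periodically refreshed instance cache.
type ttlCache struct {
	snapshot  []string
	fetchedAt time.Time
	ttl       time.Duration
	fetch     func() []string
}

func (c *ttlCache) get() []string {
	if time.Since(c.fetchedAt) > c.ttl {
		c.snapshot = c.fetch()
		c.fetchedAt = time.Now()
	}
	return c.snapshot
}

func main() {
	providerInstances := []string{"i-abc", "i-placeholder-failed"}
	cache := &ttlCache{
		ttl:   2 * time.Minute,
		fetch: func() []string { return providerInstances },
	}

	fmt.Println(cache.get()) // snapshot includes the failed placeholder

	// The provider's next Refresh() drops the placeholder...
	providerInstances = []string{"i-abc"}

	// ...but within the TTL the cache still serves the stale snapshot,
	// so a later RunOnce can act on an instance the provider no longer maps.
	fmt.Println(cache.get())
}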
