Don't deref nil nodegroup in deleteCreatedNodesWithErrors #4926

Conversation

@bpineau bpineau commented May 30, 2022

Various cloudproviders' NodeGroupForNode() implementations (including aws, azure, and gce) can return a nil error and a nil nodegroup. For instance, we're seeing AWS return that on failed upscales on live clusters, with a recent cluster-autoscaler build.

So in deleteCreatedNodesWithErrors, checking that NodeGroupForNode() doesn't return an error is not enough to safely dereference the nodegroup it returns by calling nodegroup.Id().

In that situation, logging and returning early seems the safest option: it gives the various caches (e.g. clusterstateregistry's and the cloud provider's) the opportunity to eventually converge, rather than resuming with an inconsistent internal state.
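
To illustrate the failure mode, a minimal standalone sketch (not the actual cluster-autoscaler code; the CloudProvider and NodeGroup interfaces here are stripped-down stand-ins for the real ones):

package main

import "fmt"

type NodeGroup interface {
	Id() string
}

type CloudProvider interface {
	NodeGroupForNode(providerID string) (NodeGroup, error)
}

type fakeProvider struct{}

// NodeGroupForNode returns (nil, nil) for nodes the provider does not know
// about, mimicking the aws/azure/gce behavior described above.
func (p *fakeProvider) NodeGroupForNode(providerID string) (NodeGroup, error) {
	return nil, nil
}

func main() {
	var cp CloudProvider = &fakeProvider{}
	ng, err := cp.NodeGroupForNode("aws:///us-east-1b/i-placeholder")
	if err != nil {
		fmt.Println("cannot determine nodegroup:", err)
		return
	}
	// Without this guard, ng.Id() would panic with the same
	// "invalid memory address or nil pointer dereference" seen in the trace below.
	if ng == nil {
		fmt.Println("node has no known nodegroup; logging and returning early")
		return
	}
	fmt.Println("nodegroup:", ng.Id())
}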

===

With regard to the AWS cloudprovider triggering that issue with recent CA builds:

What we're seeing is:

I0523 16:37:53.847080     228 aws_manager.go:262] Refreshed ASG list, next refresh after 2022-05-23 16:38:53.847076185 +0000 UTC m=+559.468661724
W0523 16:37:53.905286     228 clusterstate.go:594] Nodegroup is nil for aws:///us-east-1b/i-placeholder-redacted-766c0982e71f-2
I0523 16:37:53.905533     228 clusterstate.go:1008] Found 1 instances with errorCode OutOfResource.placeholder-cannot-be-fulfilled in nodeGroup redacted-d63729665fff
I0523 16:37:53.905546     228 clusterstate.go:1026] Failed adding 1 nodes (0 unseen previously) to group redacted-d63729665fff due to OutOfResource.placeholder-cannot-be-fulfilled; errorMessages=[]string{"AWS cannot provision any more instances for this node group"}
[...]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x58 pc=0x3639435]
goroutine 63 [running]:
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).deleteCreatedNodesWithErrors(0xc03e307d00)
        /go/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:677 +0x275
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).RunOnce(0xc03e307d00, {0x4, 0x0, 0x7b9b840})
        /go/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:360 +0x12a5
main.run(0xc00088ae00, {0x4dc8998, 0xc000855230})
        /go/src/k8s.io/autoscaler/cluster-autoscaler/main.go:392 +0x2ad
main.main.func2({0x0, 0x0})
        /go/src/k8s.io/autoscaler/cluster-autoscaler/main.go:479 +0x25
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
        /go/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:211 +0x154

I can't easily reproduce the issue to verify this hypothesis, but it seems likely that the clusterstateregistry and the AWS caches are out of sync. That failure path is new, because flagging "failed at creation" instances was only recently introduced for the AWS provider (8ac87b3).

We might get the following sequence:

  1. RunOnce 1: an AWS upscale is attempted (but the AWS ASG fails to create an instance)
  2. RunOnce 2: a Refresh() call prompts the AWS cloudprovider to regenerate the instance lists and ASG mappings, including a new fake/placeholder instance for the failed one
  3. RunOnce 2: clusterstateregistry gathers and caches (for 2 min) the nodes list, which includes the placeholder instance
  4. RunOnce 2: deleteCreatedNodesWithErrors() is called to garbage-collect the failed instance
  5. RunOnce 3: a Refresh() call regenerates the nodes/ASG mapping; there are no dangling placeholder nodes anymore
  6. RunOnce 3: deleteCreatedNodesWithErrors() is called, uses the outdated clusterstateregistry cache, tries to deref a nil nodegroup for the removed placeholder instance, and segfaults

/kind bug
/kind regression

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. kind/regression Categorizes issue or PR as related to a regression from a prior release. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 30, 2022
Contributor

@mwielgus mwielgus left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 30, 2022
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bpineau, mwielgus

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 30, 2022
@k8s-ci-robot k8s-ci-robot merged commit 6558bed into kubernetes:master May 30, 2022
@@ -676,6 +681,9 @@ func (a *StaticAutoscaler) deleteCreatedNodesWithErrors() bool {
 			klog.Warningf("Cannot determine nodeGroup for node %v; %v", id, err)
 			continue
 		}
+		if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
+			return false, fmt.Errorf("node %s has no known nodegroup", node.GetName())
+		}
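
For context on why the patch checks both conditions: in Go, an interface value that wraps a typed nil pointer does not compare equal to nil, so nodeGroup == nil alone would miss a provider returning a nil concrete pointer; reflect.ValueOf(nodeGroup).IsNil() catches that case. A runnable illustration (concreteGroup is a hypothetical stand-in for a provider's concrete nodegroup type):

package main

import (
	"fmt"
	"reflect"
)

type NodeGroup interface{ Id() string }

type concreteGroup struct{ id string }

func (g *concreteGroup) Id() string { return g.id }

func main() {
	var p *concreteGroup // nil pointer of a concrete type
	var ng NodeGroup = p // interface holds (type=*concreteGroup, value=nil)

	fmt.Println(ng == nil)                   // false: the interface itself is non-nil
	fmt.Println(reflect.ValueOf(ng).IsNil()) // true: the wrapped pointer is nil
	// ng.Id() would panic here: the method dereferences its nil receiver.
}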
Contributor

@qianlei90 qianlei90 commented Feb 16, 2023

With this change, RunOnce will always return early when the nodegroup is nil, and CA cannot make progress until CloudProviderNodeInstancesCache is refreshed, which may take up to 2 min.
Should we refresh CloudProviderNodeInstancesCache here to sync up with the cloud provider?

@bpineau @mwielgus

@qianlei90 (Contributor)

We might get the following sequence:

  1. RunOnce 1: an AWS upscale is attempted (but the AWS ASG fails to create an instance)
  2. RunOnce 2: a Refresh() call prompts the AWS cloudprovider to regenerate the instance lists and ASG mappings, including a new fake/placeholder instance for the failed one
  3. RunOnce 2: clusterstateregistry gathers and caches (for 2 min) the nodes list, which includes the placeholder instance
  4. RunOnce 2: deleteCreatedNodesWithErrors() is called to garbage-collect the failed instance
  5. RunOnce 3: a Refresh() call regenerates the nodes/ASG mapping; there are no dangling placeholder nodes anymore
  6. RunOnce 3: deleteCreatedNodesWithErrors() is called, uses the outdated clusterstateregistry cache, tries to deref a nil nodegroup for the removed placeholder instance, and segfaults

I think the correct sequence is:

  1. RunOnce 1: a scale-up is attempted and fails
  2. RunOnce 2: cloudprovider Refresh() generates a new fake/placeholder for the failed instance
  3. RunOnce 2: deleteCreatedNodesWithErrors() is called to garbage-collect the failed instance, and the instance's entry in the clusterstateregistry cache is removed
  4. Between RunOnce 2 and RunOnce 3: clusterstateregistry refreshes its instance cache and picks up the now out-of-sync instance list from the provider; note that CloudProviderNodeInstancesCache only refreshes every 2 min (see the cache sketch after this list)
  5. RunOnce 3: a Refresh() call regenerates the nodes/ASG mapping; there are no dangling placeholder nodes anymore
  6. RunOnce 3: the outdated clusterstateregistry cache is used and still reports a failed instance, so deleteCreatedNodesWithErrors() is called again, tries to deref a nil nodegroup for the removed placeholder instance, and segfaults
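
To make the timing hazard concrete, here is a toy time-bound cache (purely illustrative; the real CloudProviderNodeInstancesCache is more involved). A snapshot fetched before the provider's Refresh() keeps being served until the TTL expires, so a consumer can still see instances the provider no longer knows about:

package main

import (
	"fmt"
	"time"
)

// ttlCache serves a cached snapshot until it is older than ttl,
// loosely mimicking a periodically refreshed instance cache.
type ttlCache struct {
	snapshot  []string
	fetchedAt time.Time
	ttl       time.Duration
	fetch     func() []string
}

func (c *ttlCache) get() []string {
	if time.Since(c.fetchedAt) > c.ttl {
		c.snapshot = c.fetch()
		c.fetchedAt = time.Now()
	}
	return c.snapshot
}

func main() {
	providerInstances := []string{"i-abc", "i-placeholder-failed"}
	cache := &ttlCache{
		ttl:   2 * time.Minute,
		fetch: func() []string { return providerInstances },
	}

	fmt.Println(cache.get()) // snapshot includes the failed placeholder

	// The provider's next Refresh() drops the placeholder...
	providerInstances = []string{"i-abc"}

	// ...but within the TTL the cache still serves the stale snapshot,
	// so a later RunOnce can act on an instance the provider no longer maps.
	fmt.Println(cache.get())
}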
