-
Notifications
You must be signed in to change notification settings - Fork 4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't deref nil nodegroup in deleteCreatedNodesWithErrors #4926
Don't deref nil nodegroup in deleteCreatedNodesWithErrors #4926
Conversation
Various cloudproviders' `NodeGroupForNode()` implementations (including aws, azure, and gce) can returns a `nil` error _and_ a `nil` nodegroup. Eg. we're seeing AWS returning that on failed upscales on live clusters. Checking that `deleteCreatedNodesWithErrors` doesn't return an error is not enough to safely dereference the nodegroup (as returned by `NodeGroupForNode()`) by calling nodegroup.Id(). In that situation, logging and returning early seems the safest option, to give various caches (eg. clusterstateregistry's and cloud provider's) the opportunity to eventually converge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bpineau, mwielgus The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
…hErrors-no-segfault Don't deref nil nodegroup in deleteCreatedNodesWithErrors
…hErrors-no-segfault Don't deref nil nodegroup in deleteCreatedNodesWithErrors
…hErrors-no-segfault - Don't deref nil nodegroup in deleteCreatedNodesWithErrors - Bugfix: Expander Priority warns misleading log
…hErrors-no-segfault Don't deref nil nodegroup in deleteCreatedNodesWithErrors
…hErrors-no-segfault Don't deref nil nodegroup in deleteCreatedNodesWithErrors
…hErrors-no-segfault Don't deref nil nodegroup in deleteCreatedNodesWithErrors
…hErrors-no-segfault Don't deref nil nodegroup in deleteCreatedNodesWithErrors
…hErrors-no-segfault Don't deref nil nodegroup in deleteCreatedNodesWithErrors
@@ -676,6 +681,9 @@ func (a *StaticAutoscaler) deleteCreatedNodesWithErrors() bool { | |||
klog.Warningf("Cannot determine nodeGroup for node %v; %v", id, err) | |||
continue | |||
} | |||
if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() { | |||
return false, fmt.Errorf("node %s has no known nodegroup", node.GetName()) | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the correct sequence is:
|
Various cloudproviders'
NodeGroupForNode()
implementations (including aws, azure, and gce) can returns anil
error and anil
nodegroup. For instance we're seeing AWS returning that on failed upscales on live clusters, with a recent cluster-autoscaler build.So checking that
deleteCreatedNodesWithErrors
doesn't return an error is not enough to safely dereference the nodegroup (as returned byNodeGroupForNode()
) by calling nodegroup.Id().In that situation, logging and returning early seems the safest option, to give various caches (eg. clusterstateregistry's and cloud provider's) the opportunity to eventually converge, rather than resuming with an inconsistent internal state.
===
With regards to the AWS cloudprovider triggering that issue with recent CA builds:
What we're seeing is:
I can't easily reproduce the issue to verify that hypothesis, but it seems likely that this is due to the clusterstateregistry and the AWS caches being out-of-sync. That failure path is new, because flagging "failed at creation" instances was recently introduced for AWS provider (8ac87b3).
We might get the following sequence:
/kind bug
/kind regression