Extract criteria for removing unneeded nodes to a separate package #5147
Conversation
/hold
Force-pushed from 95de888 to d269e39
Force-pushed from 080e48c to b4925ea
}
toRemove := simulator.RemoveNodeFromTracker(sd.usageTracker, node, unneeded)
for _, n := range toRemove {
	sd.unneededNodes.Drop(n)
nit: I'm guessing the strange split of responsibilities (identifying nodes to filter out in one place and filtering them out elsewhere) is there to avoid an import cycle between unneeded and simulator? A local interface in tracker.go that exposes AsList() and Drop(), combined with renaming the function to simulator.FilterOutUsedNodes() or similar, would IMO be a more elegant solution, but I'm not sure it's worth the effort.
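For illustration only, a rough sketch of the shape this suggestion describes. The AsList()/Drop() method names and FilterOutUsedNodes come from the comment above; everything else, the placeholder UsageTracker in particular, is an assumption standing in for the real types rather than the actual code:

```go
package simulator

import apiv1 "k8s.io/api/core/v1"

// UsageTracker is a stand-in for the real usage tracker: here it simply maps a
// removed node's name to the names of nodes whose pods still depend on it.
type UsageTracker map[string][]string

// nodeRegistry is the minimal surface simulator would need from unneeded.Nodes,
// which avoids an import cycle between the two packages.
type nodeRegistry interface {
	AsList() []*apiv1.Node
	Drop(nodeName string)
}

// FilterOutUsedNodes both identifies and drops, in one place, every registered
// node that the tracker says is still needed once `removed` goes away.
func FilterOutUsedNodes(tracker UsageTracker, removed *apiv1.Node, registry nodeRegistry) {
	stillUsed := map[string]bool{}
	for _, name := range tracker[removed.Name] {
		stillUsed[name] = true
	}
	for _, n := range registry.AsList() {
		if stillUsed[n.Name] {
			registry.Drop(n.Name)
		}
	}
}
```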
I mainly wanted to get rid of the weird semantics RemoveNodeFromTracker had: silently modifying the map passed as an argument. Now this is done explicitly here, which I think makes reasoning about the code easier. I wouldn't invest too much in making this part pretty though, since the tracker itself should go away together with the legacy scale down implementation.
// RemovableAt returns all nodes that can be removed at a given time, divided
// into empty and non-empty node lists, as well as a list of nodes that were
// unneeded, but are not removable, annotated by reason.
func (n *Nodes) RemovableAt(context *context.AutoscalingContext, ts time.Time, resourcesLeft resource.Limits, resourcesWithLimits []string, as scaledown.ActuationStatus) (empty, needDrain []*apiv1.Node, unremovable []*simulator.UnremovableNode) {
I'm not really sure this function should be responsible for dividing nodes into empty/non-empty:
- The checks performed by Nodes() all relate to the node itself, not to the pods running on it (which, after your refactor, is the domain of eligibility, I think?). Grouping nodes by their pods here seems inconsistent with what the rest of this package is doing and, even more so, outside of what I'd expect to be the responsibility of this function.
- The grouping is based on the cached result passed to Update(). In the current implementation that's a fresh result, but that's a pretty big assumption. Obviously we want to avoid stale data leading to a node being incorrectly considered empty.
- The current implementation doesn't use that grouping anyway (it puts both groups into one list) and relies on re-checking this in getEmptyNodesToRemove(). Do we expect the new scale-down implementation to act differently?

If we drop this grouping, we should also update this to cache nodes rather than ntbr, since as far as I can tell that's the only place where we use ntbr.
I actually planned to avoid running the simulation twice in the new scale down implementation, so the split done here would be final. We can get it wrong, and sometimes already do, which actually proves that re-checking doesn't solve the problem, it just reduces the chance of a race condition. The proper solution, I think, is to first apply the taints and only then look at the state of the cluster to determine which nodes are empty. This was discussed and should be done in #3886. If you disagree, we can discuss it there.
/hold cancel
}

type node struct {
	ntbr simulator.NodeToBeRemoved
After thinking about this some more, I have to say I don't really like the idea of caching this. We end up caching ntbr across CA loops, but I think the data it contains should not be considered valid outside the lifetime of the ClusterSnapshot used in the simulation (i.e. it's local to a CA loop). The most obvious example is PodsToReschedule, which can definitely change in the time between loops. Even more importantly, ClusterSnapshot is subject to various possible in-memory modifications (e.g. FilterOutSchedulable or any custom shenanigans in a PodListProcessor) that can be reflected in PodsToReschedule, and the lifetime of those in-memory changes definitely should not be assumed to extend past a single loop.
I realize we expect to always call Update() and RemovableAt() in the same loop, and the only thing we really cache between loops is 'since', so in practice it shouldn't break anything. But the fact that we rely on an unwritten assumption that cached ntbr is ignored outside the loop it was cached in is not great.
TBH I don't think ntbr should be cached at all. It's never valid outside of a single simulation context, so to me it makes much more sense to just pass it around as a variable with a clearly limited lifetime.
I think this ties back to the problem I had before with RemovableAt grouping nodes into empty / non-empty. After discussing it with you, I agree it will work, but I just think it's completely unrelated to everything else this package is doing, and limiting the package scope would be a better design choice. The fact that we're mixing responsibilities leads to secondary issues like the inconsistent cache lifetime I'm now complaining about.
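For illustration, a minimal sketch of the alternative being argued for here: only the node name and the 'since' timestamp are cached across loops, and the loop-local simulation result is passed in explicitly. All names except simulator.NodeToBeRemoved are assumptions, and the single time-based check stands in for the full set of eligibility checks:

```go
package unneeded

import (
	"time"

	"k8s.io/autoscaler/cluster-autoscaler/simulator"
)

// byName caches only how long each node has been unneeded; nothing derived
// from a ClusterSnapshot (pods to reschedule, node objects) outlives a loop.
type byName map[string]time.Time

// removableAt filters the fresh, loop-local simulation results against the
// cached timestamps, so stale ntbr data is never reused.
func (b byName) removableAt(fresh []simulator.NodeToBeRemoved, ts time.Time, unneededFor time.Duration) []simulator.NodeToBeRemoved {
	var out []simulator.NodeToBeRemoved
	for _, ntbr := range fresh {
		since, ok := b[ntbr.Node.Name]
		if ok && ts.Sub(since) >= unneededFor {
			out = append(out, ntbr)
		}
	}
	return out
}
```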
nodeGroupSize := utils.GetNodeGroupSizeMap(context.CloudProvider)
for nodeName, v := range n.byName {
	klog.V(2).Infof("%s was unneeded for %s", nodeName, ts.Sub(v.since).String())
	node := v.ntbr.Node
This is really the same comment as the one about the cache above. I think this will work in this particular implementation, but I also think caching nodeName and getting a fresh Node from the ClusterSnapshot is inherently safer than caching Node objects.
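A rough sketch of that safer pattern, under the assumption that ClusterSnapshot exposes the scheduler framework's NodeInfos().Get() lister; the helper name is made up for illustration:

```go
package unneeded

import (
	apiv1 "k8s.io/api/core/v1"
	"k8s.io/autoscaler/cluster-autoscaler/simulator"
	klog "k8s.io/klog/v2"
)

// freshNode resolves a cached node name against the current loop's snapshot,
// so the returned Node can never be a stale copy from a previous loop.
func freshNode(snapshot simulator.ClusterSnapshot, nodeName string) (*apiv1.Node, bool) {
	nodeInfo, err := snapshot.NodeInfos().Get(nodeName)
	if err != nil {
		// The node may have been deleted since it was marked unneeded.
		klog.Warningf("Node %s no longer in snapshot: %v", nodeName, err)
		return nil, false
	}
	return nodeInfo.Node(), true
}
```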
limitations under the License.
*/

package unneeded
I think a unit test for RemovableAt / unremovableReason would be a nice bonus. The logic in unremovableReason seems much more testable now. I realize this PR is just moving existing code around, so I'm not going to block on this, but if you decide to make more changes in this PR, maybe that could be an extra target of opportunity.
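Purely as an illustration of the shape such a test could take, here is a table-driven skeleton. The fixture and setup calls are left as comments because the package's real constructors aren't shown in this diff; everything here is an assumption, not the package's actual test API:

```go
package unneeded

import (
	"testing"
	"time"
)

// TestRemovableAt sketches a table-driven test: each case would seed the cache
// with nodes that became unneeded at known times, then check which of them
// RemovableAt reports as removable vs. unremovable (and with which reason).
func TestRemovableAt(t *testing.T) {
	testCases := []struct {
		name            string
		unneededFor     time.Duration
		wantRemovable   []string
		wantUnremovable []string
	}{
		{name: "unneeded long enough", unneededFor: 15 * time.Minute, wantRemovable: []string{"n1"}},
		{name: "unneeded too recently", unneededFor: time.Minute, wantUnremovable: []string{"n1"}},
	}
	for _, tc := range testCases {
		t.Run(tc.name, func(t *testing.T) {
			// With the package's real fixtures this would look roughly like:
			//   nodes := NewNodes(...)            // hypothetical setup
			//   nodes.Update(...)                 // seed "n1" as unneeded for tc.unneededFor
			//   empty, drain, unremovable := nodes.RemovableAt(...)
			// followed by comparing the returned node names against
			// tc.wantRemovable and tc.wantUnremovable.
		})
	}
}
```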
I think this will work and I'm not going to block it anymore, but I don't really like this refactor for the reasons I listed in the comments above. I'm leaving the hold so you have a chance to read my comments and decide whether you want to address them or keep the current version. Please feel free to remove the hold. /lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: MaciekPytel, x13n
The full list of commands accepted by this bot can be found here. The pull request process is described here.
I understand where the concern is coming from, but I disagree that caching this information is wrong in this case. /hold cancel
Extract criteria for removing unneeded nodes to a separate package
Which component does this PR apply to?
cluster-autoscaler
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
One more refactor of the existing scale down code to make parallel scale down (#5079) possible.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
This is a follow-up to #5133. Only the last 2 commits should be reviewed as part of this PR. It will be on hold until the previous one is merged.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: