Price Improvement Threshold #1440

wmgroot · 2024-07-17T22:37:26Z

Description

One piece of implementation for RFC #1433 (Consolidation Policies)

What problem are you trying to solve?
Make it possible to balance the cost of disrupting pods against the cost of using a more expensive instance type.

Currently, single-node consolidation will replace any instance with a cheaper version that can house the same pods, resulting in very frequent pod disruption. Disruption pods has its own cost, which for certain workloads can negate savings or result in a net loss of money.

This concept has already been documented as an improvement for spot consolidation, but I see no reason why it wouldn't be useful for generalized consolidation.
https://github.com/kubernetes-sigs/karpenter/blob/main/designs/spot-consolidation.md#2-price-improvement-factor

How important is this feature to you?
Very important. We are currently running a fork of Karpenter with single-node consolidation completely disabled due to the frequent node disruption we are seeing for small price improvements (individual pods are frequently being disrupted multiple times an hour).

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment

ellistarn · 2024-07-17T23:15:13Z

Just brainstorming:

I'm curious if we might want to reason about this as spec.disruption.cost. We factor in a percentage of the cost of the node when making consolidation decisions. We might use this to sort nodes, and favor smaller nodes for consolidation. It also happens to directly map to priceImprovementFactor: 1.1 == disruption.cost: .1. Another thing to think about is that Kubernetes APIs don't allow floats, so we may need to do something like priceImprovementPercent: 110 or disruption.costPercent: 10.

wmgroot · 2024-07-19T17:56:46Z

To check my understanding of disruption.costPercent: 10, you're effectively giving users a way to say that "I'm losing 10% of what I'm paying for this node, so make sure the savings for the replacement node are more than 10% before replacing this node"?

If I'm understanding that correctly, it sounds to me like it would be better placed on an individual NodeClaim spec rather than in NodePool.spec.disruption.

ellistarn · 2024-07-19T17:59:16Z

If I'm understanding that correctly, it sounds to me like it would be better placed on an individual NodeClaim spec rather than in NodePool.spec.disruption.

This is a good point. Similar line of reasoning to TGP and expireAfter.

If we model it as priceImprovementFactor, then it might be considered a property of the NodePool. I think it makes more sense at the NodeClaim level, though. Given this, how would you expect it to interact with drift?

wmgroot · 2024-07-19T18:00:38Z

Additionally, it could potentially support fixed or percent values via IntOrString, such as:

NodeClaim.spec.disruptionCost: 10%
NodeClaim.spec.disruptionCost: 20 (effectively saying a one-time disruption of this node would cost $20)

The problem with fixed value disruption costs is that karpenter has no way to predict how long the node would be around in the future to gauge if a one-time $20 cost is worth it. Maybe it could use historical data on node age to make a prediction.

wmgroot · 2024-07-19T18:03:00Z

Given this, how would you expect it to interact with drift?

I expect drift to not take this feature into account at all. If the node needs to be replaced, it needs to be replaced regardless of cost. When a new nodeclaim is provisioned to replace each node, the usual provisioning logic should still choose the cheapest available, which could result in minor cost improvements that are below the configured threshold.

njtran · 2024-08-12T04:44:56Z

/triage accepted

njtran · 2024-08-12T04:45:23Z

I've marked this as accepted, but I think we definitely need to talk this out. Excited to discuss more about this

wmgroot added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 17, 2024

k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jul 17, 2024

wmgroot mentioned this issue Jul 17, 2024

Consolidation ttl: spec.disruption.consolidateAfter #735

Closed

wmgroot mentioned this issue Jul 22, 2024

feat: nodeclaim.spec.minimumPriceImprovementPercent #1454

Closed

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 12, 2024

andrewhibbert mentioned this issue Sep 18, 2024

Ability to define threshold usage percentage for Karpenter to consider nodes for consolidation when using WhenEmptyOrUnderutilized #1686

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Price Improvement Threshold #1440

Price Improvement Threshold #1440

wmgroot commented Jul 17, 2024 •

edited

Loading

ellistarn commented Jul 17, 2024

wmgroot commented Jul 19, 2024

ellistarn commented Jul 19, 2024

wmgroot commented Jul 19, 2024

wmgroot commented Jul 19, 2024 •

edited

Loading

njtran commented Aug 12, 2024

njtran commented Aug 12, 2024

Price Improvement Threshold #1440

Price Improvement Threshold #1440

Comments

wmgroot commented Jul 17, 2024 • edited Loading

Description

ellistarn commented Jul 17, 2024

wmgroot commented Jul 19, 2024

ellistarn commented Jul 19, 2024

wmgroot commented Jul 19, 2024

wmgroot commented Jul 19, 2024 • edited Loading

njtran commented Aug 12, 2024

njtran commented Aug 12, 2024

wmgroot commented Jul 17, 2024 •

edited

Loading

wmgroot commented Jul 19, 2024 •

edited

Loading