
Auto Node Sizing #642

Merged
merged 1 commit into openshift:master on Jun 18, 2021

Conversation

harche
Contributor

@harche harche commented Feb 11, 2021

This enhancement attempts to add a capability to dynamically select node sizing values for the kubelet.

Signed-off-by: Harshal Patil harpatil@redhat.com

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 11, 2021
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign adambkaplan after the PR has been reviewed.
You can assign the PR to them by writing /assign @adambkaplan in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


## Summary

Kubelet should have an automatic sizing calculation mechanism, which could give kubelet an ability to dynamically set sizing values for memory and cpu reservations.
Contributor

memory and cpu system reserved*

@harche
Contributor Author

harche commented Feb 11, 2021

/cc @rphillips @mrunalp


We have observed that varying the value of `system reserved` and `kube reserved` with respect to the installed capacity of the node helps to deduce optimal values.

Currently, the only way to customize the `system reserved` and `kube reserved` limits is to pre-calculate the values manually prior to kubelet start.
Contributor

nit: manually
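
For reference, a minimal sketch of the manual approach described in the excerpt above, with purely illustrative values: the administrator pre-computes the reservations and passes them to the kubelet as flags.

```bash
# Hypothetical example only: pre-computed reservations passed to the kubelet
# via its --system-reserved and --kube-reserved flags (values are illustrative).
kubelet \
  --system-reserved=cpu=500m,memory=1Gi \
  --kube-reserved=cpu=500m,memory=1Gi
```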

metadata:
  creationTimestamp: null
  name: testcluster1
  autoNodeSizing: true
Contributor

+1 nice


Node sizing values will be calculated by the script `/etc/kubernetes/node-sizing`. This script will set the environment variables `CPU_RESERVED`, `MEMORY_RESERVED` and `EPHEMERAL_RESERVED`, which will be used for setting the kubelet parameter `system-reserved`.

The script `/etc/kubernetes/node-sizing` will be executed using `ExecStartPre`.
Contributor

I did a quick audit of MCO... let's write the script to /usr/local/bin. _base/files/configure-ovs-network.yaml is one existing example that writes to /usr/local/bin.

Contributor Author

Thanks, updated.

@harche harche force-pushed the dynamic-node-sizing branch 3 times, most recently from 5643f43 to cef149b Compare February 11, 2021 14:42

### Installer Changes

If the user wishes to enable `auto node sizing` during the installation of the cluster itself, they will have to indicate that using the field `metadata.autoNodeSizing` in the installation configuration, e.g.
Member

What percentage of clusters would you expect to enable this versus disable it? Does it require in depth analysis to decide? If we have 80% one way or the other I don't think we want to promote this into install-config.
CC @staebler

Contributor

We will eventually want this in the default install, because cluster memory reservations can change the workload capacity within the system. It'll be defaulted to off (for now), but on some clouds, like Azure, it may be enabled by default.


### Installer Changes

If the user wishes to enable `auto node sizing` during the installation of the cluster itself, they will have to indicate that using the field `metadata.autoNodeSizing` in the installation configuration, e.g.
Contributor

The field should not be under metadata.

Contributor

My first thought is that this should be in the machine pool.

Contributor Author

We want to have a cluster-wide option that's applicable to all nodes: either all nodes (workers and masters) have this enabled or none do. We didn't want the user to specify this option for compute or controlPlane individually.

Contributor

What is the reasoning for that? Is it required that it be applied cluster-wide for some technical reason? Or is it just for user convenience?

Contributor Author

This enhancement will set the optimal values of system-reserved for the kubelet on the node. Having the optimal value of system-reserved makes a huge difference when the node is under a severe resource crunch, such as a pod running with extremely high memory requirements.

Without the right value of system-reserved, the risk of the node locking up in a NotReady state increases. You can set the system-reserved value only on specific nodes, since it's just a parameter to the kubelet, but that would not be ideal from a customer support point of view. Customers would see non-uniform performance across their cluster nodes: nodes with an optimal system-reserved will perform substantially better than those without optimal values.

We want to avoid any ambiguity that may arise if customers see their workload run fine on some nodes but not others. This is the reason we want to make this a cluster-wide option. That way, should we receive a customer support ticket related to node lockups in the future, it is easy for us to tell whether system-reserved was set properly across the cluster.

## Alternatives

1. Enhance the kubelet itself to be smarter about calculating node sizing values. We have an actively debated [KEP](https://github.com/kubernetes/enhancements/pull/2370) in sig-node around this idea.
2. Modify the way the MCO handles kubeletconfig. Instead of passing the `--system-reserved` argument to the kubelet, it may be possible to make the MCO more tolerant of changes to the kubelet config file. That way we would modify the config file to add the system reserved values instead of passing them as `--system-reserved`.
Contributor

This is how I would have expected node sizing to be handled. I would have expected a KubeletConfig resource added by the user to either set explicit values or set it to use auto-sizing.

Contributor

👍 Rather than laying down a script to calculate this on the node, the MCO should already know what the node capacity is; could we add logic there to dynamically set the system reserved for nodes in each pool based on some formula? Otherwise the node will potentially be fighting changes in the MCO.

Contributor Author

@harche harche Feb 19, 2021

For the day-1 operation, until the kubelet is deployed and the node has joined the cluster, the MCO controller does not know the node size.

```
WantedBy=multi-user.target
```

Node sizing values will be calculated by the script `/usr/local/bin/node-sizing.sh`. This script will set the environment variables `CPU_RESERVED`, `MEMORY_RESERVED` and `EPHEMERAL_RESERVED` in `/etc/kubernetes/node-sizing-env`, which will be used for setting the kubelet parameter `system-reserved`.
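
A minimal sketch of how that hand-off could work; the exact unit wiring and flag assembly shown here are assumptions, not the enhancement's final content.

```bash
# Hypothetical sketch only. The sizing script runs before the kubelet
# (e.g. via ExecStartPre) and writes the env file:
/usr/local/bin/node-sizing.sh        # writes /etc/kubernetes/node-sizing-env

# The kubelet start-up can then source those values and pass them along:
. /etc/kubernetes/node-sizing-env
exec kubelet \
  --system-reserved=cpu=${CPU_RESERVED},memory=${MEMORY_RESERVED},ephemeral-storage=${EPHEMERAL_RESERVED}
```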
Contributor

Can you provide a bit more detail about how the actual values used will be decided in the script?

Contributor Author

Thanks, I have added a subsection Node sizing script that talks more about how the script will use the guidance values to determine optimal system reserved.


Irrespective of whether `auto node sizing` is enabled or not, the MCO will always place the script file `/usr/local/bin/node-sizing.sh` on the node.

However, the content of the script file will vary depending upon whether `auto node sizing` is enabled or disabled. When `auto node sizing` is disabled, `/usr/local/bin/node-sizing.sh` will set the variables `CPU_RESERVED`, `MEMORY_RESERVED` and `EPHEMERAL_RESERVED` to their existing default values; when it is enabled, the script will dynamically select optimal values based on the resources available on the node.
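
A minimal sketch of what that branching could look like; the default values and the formulas applied when auto sizing is enabled are illustrative assumptions, not the enhancement's actual guidance values.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of /usr/local/bin/node-sizing.sh; formulas and defaults
# below are illustrative only.
set -euo pipefail

ENV_FILE=/etc/kubernetes/node-sizing-env
AUTO_NODE_SIZING=${1:-false}   # assumed to be rendered/passed in by the MCO

if [ "${AUTO_NODE_SIZING}" = "true" ]; then
  # Derive reservations from the node's installed capacity.
  cpu_count=$(nproc)
  mem_kib=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
  CPU_RESERVED="$(( cpu_count * 10 ))m"            # e.g. 10 millicores per core
  MEMORY_RESERVED="$(( mem_kib / 1024 / 100 ))Mi"  # e.g. roughly 1% of RAM
else
  # Keep the existing static defaults when auto sizing is disabled.
  CPU_RESERVED=500m
  MEMORY_RESERVED=1Gi
fi

cat > "${ENV_FILE}" <<EOF
CPU_RESERVED=${CPU_RESERVED}
MEMORY_RESERVED=${MEMORY_RESERVED}
EPHEMERAL_RESERVED=1Gi
EOF
```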
Contributor

What is the plan for how the MCO will know whether auto node sizing is enabled? Is this something that we will allow the user to change as a day-2 operation?


## Summary

Kubelet should have an automatic sizing calculation mechanism, which could give kubelet an ability to dynamically set sizing values for memory and cpu system reserved.
Contributor

Suggested change:
- Kubelet should have an automatic sizing calculation mechanism, which could give kubelet an ability to dynamically set sizing values for memory and cpu system reserved.
+ Nodes should have an automatic sizing calculation mechanism, which could give kubelet an ability to scale values for memory and cpu system reserved based on machine size.

I don't think the mechanism necessarily needs to live inside the kubelet; some other component of a cluster could be responsible for doing the calculation.


### Installer Changes

If the user wishes to enable `auto node sizing` during the installation of the cluster itself, they will have to indicate that using the field `metadata.autoNodeSizing` in the installation configuration, e.g.
Contributor

Because this name is end user facing, it's going to be extremely confusing I think; I would expect this to mean "the node automatically will scale in size with respect to my workloads", and not "the system reserved will scale to size of the node". I would suggest another name, like autoSystemReserved.

Contributor Author

Thanks, I will update the name to autoSystemReserved

## Alternatives

1. Enhance the kubelet itself to be smarter about calculating node sizing values. We have an actively debated [KEP](https://github.com/kubernetes/enhancements/pull/2370) in sig-node around this idea.
2. Modify the way the MCO handles kubeletconfig. Instead of passing the `--system-reserved` argument to the kubelet, it may be possible to make the MCO more tolerant of changes to the kubelet config file. That way we would modify the config file to add the system reserved values instead of passing them as `--system-reserved` (see the sketch below).
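
A minimal sketch of what alternative 2 could look like on disk, assuming the reservations are written into the kubelet config file rather than passed as a flag; the file path and values are assumptions, and `systemReserved` is the corresponding field in the upstream KubeletConfiguration type.

```bash
# Hypothetical sketch: express the reservations as the kubelet config file's
# systemReserved field instead of the --system-reserved command-line flag.
# The config file path is an assumption; values are illustrative.
cat <<'EOF' >> /etc/kubernetes/kubelet.conf
systemReserved:
  cpu: "500m"
  memory: "1Gi"
  ephemeral-storage: "1Gi"
EOF
```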
Contributor

👍 Rather than laying down a script to calculate this on the node, the MCO should already know what the node capacity is; could we add logic there to dynamically set the system reserved for nodes in each pool based on some formula? Otherwise the node will potentially be fighting changes in the MCO.

@harche harche force-pushed the dynamic-node-sizing branch 2 times, most recently from 7c84577 to 8357ab6 Compare February 19, 2021 11:04

This solution relies on kubelet command-line flags. Command-line flags have been deprecated in favour of the config file, so there is a risk to this solution if those flags are ever actually removed. Having said that, the flags are still quite widely used today, so there has not been much traction on removing them even though they have been marked deprecated.

## Alternatives


Doesn't IPI have this knowledge in advance? The installer is the one that spawns the VMs in the cloud, so it knows the size. Same for bare metal with the assisted installer, where it knows the values in advance. Not sure about Metal3/ironic though.

Contributor

We need a solution that works across all the clouds and install flavors. Metal would be problematic here to figure out up front, and sometimes machines have different hardware layouts. We plan on having this script run on each node which should alleviate cloud to bare metal discrepancies.

@harche harche changed the title WIP: Dynamic node sizing Auto Node Sizing Mar 26, 2021
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 26, 2021
@rphillips
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 26, 2021
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 18, 2021
@harche
Contributor Author

harche commented Jun 18, 2021

Not sure why the job ci/prow/markdownlint is failing, when I run it locally it doesn't show any error.

$ hack/markdownlint.sh 
+ '[' -z '' ']'
+ case "$-" in
+ __lmod_vx=x
+ '[' -n x ']'
+ set +x
Shell debugging temporarily silenced: export LMOD_SH_DBG_ON=1 for this output (/usr/share/lmod/lmod/init/bash)
Shell debugging restarted
+ unset __lmod_vx
+ markdownlint-cli2 '**/*.md'
markdownlint-cli2 v0.0.15 (markdownlint v0.23.1)
Finding: **/*.md
Linting: 218 file(s)
Summary: 0 error(s)

@harche
Contributor Author

harche commented Jun 18, 2021

Oh got it,

enhancements/kubelet/kubelet-node-sizing.md missing "### User Stories"
enhancements/kubelet/kubelet-node-sizing.md missing "### Risks and Mitigations"
enhancements/kubelet/kubelet-node-sizing.md missing "## Design Details"
enhancements/kubelet/kubelet-node-sizing.md missing "### Graduation Criteria"
enhancements/kubelet/kubelet-node-sizing.md missing "#### Dev Preview -> Tech Preview"
enhancements/kubelet/kubelet-node-sizing.md missing "#### Tech Preview -> GA"
enhancements/kubelet/kubelet-node-sizing.md missing "#### Removing a deprecated feature"
enhancements/kubelet/kubelet-node-sizing.md missing "### Upgrade / Downgrade Strategy"
enhancements/kubelet/kubelet-node-sizing.md missing "## Implementation History"

@harche harche force-pushed the dynamic-node-sizing branch 4 times, most recently from 1f7f6bb to 302b8d8 Compare June 18, 2021 11:28
Signed-off-by: Harshal Patil <harpatil@redhat.com>
@rphillips
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 18, 2021
@openshift-ci
Contributor

openshift-ci bot commented Jun 18, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mrunalp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 18, 2021
@openshift-merge-robot openshift-merge-robot merged commit 46d7be3 into openshift:master Jun 18, 2021