Auto Node Sizing #642
Merged
openshift-merge-robot
merged 1 commit into
openshift:master
from
harche:dynamic-node-sizing
Jun 18, 2021
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,268 @@ | ||
--- | ||
title: auto-node-sizing | ||
authors: | ||
- "@harche" | ||
reviewers: | ||
- "@rphillips" | ||
approvers: | ||
- "@rphillips" | ||
creation-date: 2021-02-11 | ||
last-updated: 2021-02-11 | ||
status: implementable | ||
see-also: | ||
- https://bugzilla.redhat.com/show_bug.cgi?id=1857446 | ||
replaces: | ||
superseded-by: | ||
--- | ||
|
||
# Kubelet Auto Node Sizing | ||
|
||
## Release Signoff Checklist | ||
|
||
- [x] Enhancement is `implementable` | ||
- [x] Design details are appropriately documented from clear requirements | ||
- [ ] Test plan is defined | ||
- [ ] Graduation criteria for dev preview, tech preview, GA | ||
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/) | ||
|
||
|
||
## Summary | ||
|
||
Nodes should have an automatic sizing mechanism that lets the kubelet scale its system reserved memory and CPU values based on machine size. | ||
|
||
Today the sizing values are passed manually to the kubelet using the `--kube-reserved` and `--system-reserved` flags. Many cloud providers publish reference values to help their customers select optimal values based on node size, for example [GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-architecture#memory_cpu) and [AKS](https://docs.microsoft.com/en-us/azure/aks/concepts-clusters-workloads#resource-reservations). | ||
|
||
This enhancement proposes a mechanism to automatically determine the optimal sizing values for any node size irrespective of the cloud provider. | ||
|
||
## Motivation | ||
|
||
The kubelet’s `system reserved` and `kube reserved` settings play a crucial role in how resource-intensive pods get OOM-killed. Without adequate `system reserved` and `kube reserved` values, we risk freezing the node and making it completely unavailable to other pods. | ||
|
||
We have observed that scaling the value of `system reserved` and `kube reserved` with respect to the installed capacity of the node helps to deduce optimal values. Larger nodes have capacity for more pods and will require larger system reserved values. | ||
|
||
Currently, the only way to customize the `system reserved` and `kube reserved` limits is to pre-calculate the values manually prior to Kubelet start. | ||
|
||
### Goals | ||
|
||
* Enable the kubelet systemd service to determine the `system reserved` value automatically during startup. | ||
|
||
### Non-Goals | ||
|
||
* For now, the systemd service will only be used to calculate the `system reserved` values. A similar approach could be taken to dynamically determine other kubelet parameters (e.g. `evictionHard`), but those are out of scope for this enhancement. | ||
* Strictly from OpenShift's point of view, we only need to take care of `system reserved`, not `kube reserved`. Hence this proposal will not deal with generating optimal values for `kube reserved`. | ||
|
||
### User Stories | ||
|
||
* User wants to enable auto node sizing on nodes of the cluster to start the kubelet with optimal system reserved values. | ||
|
||
## Proposal | ||
|
||
* New script that will be placed on the node that can calculate the system reserved values based on the node capacity | ||
* New auto node sizing service to execute that script which will result in storing the system reserved values in a file. | ||
* Modify kubelet service to read that file and use the generated values to start kubelet daemon. | ||
|
||
### Graduation Criteria | ||
|
||
#### Dev Preview -> Tech Preview | ||
* Successfully calculate and set the optimal system reserved values. | ||
* End user documentation | ||
|
||
#### Tech Preview -> GA | ||
* More testing (upgrade, downgrade, scale) | ||
* Optionally make it available during installation | ||
|
||
#### Removing a deprecated feature | ||
|
||
N/A | ||
|
||
## Design Details | ||
|
||
### Auto Node Sizing Enabler | ||
|
||
During cluster installation, a file will be placed at `/etc/node-sizing-enabled.env` with the following content: | ||
|
||
```bash | ||
NODE_SIZING_ENABLED=false | ||
SYSTEM_RESERVED_MEMORY=1Gi | ||
SYSTEM_RESERVED_CPU=500m | ||
``` | ||
Initially, `Auto Node Sizing` will be an optional feature, so `NODE_SIZING_ENABLED` will be set to `false` during installation, alongside the existing default values for system reserved memory and CPU. To enable the feature, set `NODE_SIZING_ENABLED` to `true` by applying the following `KubeletConfig`: | ||
|
||
```yaml | ||
apiVersion: machineconfiguration.openshift.io/v1 | ||
kind: KubeletConfig | ||
metadata: | ||
  name: dynamic-node | ||
spec: | ||
  autoSizingReserved: true | ||
  machineConfigPoolSelector: | ||
    matchLabels: | ||
      pools.operator.machineconfiguration.openshift.io/worker: "" | ||
``` | ||
This will enable `Auto Node Sizing` on all the worker nodes. A similar approach can be taken to enable it on the `master` nodes or on a custom machine config pool. | ||
|
||
### Auto Node Sizing Script | ||
|
||
This script can be found on the node at the location, `/usr/local/sbin/dynamic-system-reserved-calc.sh` | ||
|
||
When `Auto Node Sizing` is enabled, the script will probe the host for its installed resource capacity (such as the installed amount of RAM) and apply well-tested guidance to derive the corresponding system reserved values. | ||
|
||
Examples of such guidance values for system reserved are published by [GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-architecture#memory_cpu) and [AKS](https://docs.microsoft.com/en-us/azure/aks/concepts-clusters-workloads#resource-reservations). | ||
|
||
And when the `Auto Node Sizing` is disabled, the script will output the existing default values for the system reserved. | ||
|
||
The script will output the values in the following format at the location `/etc/node-sizing.env`, | ||
|
||
```bash | ||
$ cat /etc/node-sizing.env | ||
SYSTEM_RESERVED_MEMORY=3.5Gi | ||
SYSTEM_RESERVED_CPU=0.09 | ||
``` | ||
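The guidance-based calculation described above could be sketched as follows. This is a minimal illustration, not the script shipped on the node. It assumes GKE-style tiers: for memory, 25% of the first 4 GiB, 20% of the next 4 GiB, 10% of the next 8 GiB, 6% of the next 112 GiB, and 2% of anything above 128 GiB; for CPU, 6% of the first core, 1% of the second, 0.5% of the next two, and 0.25% of the rest. The function names `reserved_memory_mib` and `reserved_cpu_millicores` are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a tiered system-reserved calculation.
# The tiers below are an assumption (GKE-style); the real script may differ.

# reserved_memory_mib TOTAL_MIB
# Prints the memory to reserve, in MiB (each tier is truncated to whole MiB).
reserved_memory_mib() {
    local total=$1 reserved=0 prev=0
    # Each entry is "tier upper bound in MiB : percent reserved in that tier".
    local tiers=("4096:25" "8192:20" "16384:10" "131072:6")
    local tier upper pct span
    for tier in "${tiers[@]}"; do
        upper=${tier%%:*}
        pct=${tier##*:}
        if (( total <= prev )); then
            break
        fi
        span=$(( (total < upper ? total : upper) - prev ))
        reserved=$(( reserved + span * pct / 100 ))
        prev=$upper
    done
    # Anything above the last tier (128 GiB) is reserved at 2%.
    if (( total > prev )); then
        reserved=$(( reserved + (total - prev) * 2 / 100 ))
    fi
    echo "$reserved"
}

# reserved_cpu_millicores CORES
# Prints the CPU to reserve, in millicores (truncated to a whole millicore).
reserved_cpu_millicores() {
    local cores=$1 tenths=0 prev=0
    # Each entry is "tier upper bound in cores : tenths of a millicore per core".
    # 6% of a core = 60m = 600 tenths; 1% = 100 tenths; 0.5% = 50 tenths.
    local tiers=("1:600" "2:100" "4:50")
    local tier upper rate span
    for tier in "${tiers[@]}"; do
        upper=${tier%%:*}
        rate=${tier##*:}
        if (( cores <= prev )); then
            break
        fi
        span=$(( (cores < upper ? cores : upper) - prev ))
        tenths=$(( tenths + span * rate ))
        prev=$upper
    done
    # Cores beyond the fourth are reserved at 0.25% each (25 tenths).
    if (( cores > prev )); then
        tenths=$(( tenths + (cores - prev) * 25 ))
    fi
    echo $(( tenths / 10 ))
}
```

Under these assumed tiers, `reserved_memory_mib 32768` prints `3645` and `reserved_cpu_millicores 8` prints `90`, which would be written to the env file as `3645Mi` and `90m`. Integer arithmetic keeps the sketch free of external tools, at the cost of truncating each tier to a whole unit.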
### Kubelet Auto Node Sizing Service | ||
|
||
A new service, `kubelet-auto-node-size.service`, will run before the existing kubelet service to calculate the optimal system reserved values. | ||
|
||
```toml | ||
[Unit] | ||
Description=Dynamically sets the system reserved for the kubelet | ||
Wants=network-online.target | ||
After=network-online.target ignition-firstboot-complete.service | ||
Before=kubelet.service crio.service | ||
[Service] | ||
# Need oneshot to delay kubelet | ||
Type=oneshot | ||
RemainAfterExit=yes | ||
EnvironmentFile=/etc/node-sizing-enabled.env | ||
ExecStart=/bin/bash /usr/local/sbin/dynamic-system-reserved-calc.sh ${NODE_SIZING_ENABLED} | ||
[Install] | ||
RequiredBy=kubelet.service | ||
``` | ||
This service writes the recommended system reserved values to `/etc/node-sizing.env`. It reads the systemd environment file `/etc/node-sizing-enabled.env` described above to determine whether the user has enabled `Auto Node Sizing`. If the user has not enabled it, the service writes today's default system reserved values to `/etc/node-sizing.env` instead. | ||
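The enabled/disabled behaviour of the service could look roughly like the sketch below. It is an illustration under assumptions, not the actual script: the function name `write_node_sizing` and its argument order are hypothetical, the automatically calculated values are passed in by the caller, and the fallback defaults `1Gi` and `500m` are taken from the `/etc/node-sizing-enabled.env` example earlier in this document.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the enabled/disabled branch; not the shipped script.

# write_node_sizing ENABLED OUT_FILE AUTO_MEMORY AUTO_CPU
# Writes SYSTEM_RESERVED_MEMORY and SYSTEM_RESERVED_CPU to OUT_FILE.
write_node_sizing() {
    local enabled=$1 out_file=$2 auto_memory=$3 auto_cpu=$4

    # Defaults come from the environment (as systemd would load them from
    # /etc/node-sizing-enabled.env), falling back to the documented values.
    local memory=${SYSTEM_RESERVED_MEMORY:-1Gi}
    local cpu=${SYSTEM_RESERVED_CPU:-500m}

    # Only override the defaults when the feature is explicitly enabled.
    if [[ "$enabled" == "true" ]]; then
        memory=$auto_memory
        cpu=$auto_cpu
    fi

    printf 'SYSTEM_RESERVED_MEMORY=%s\nSYSTEM_RESERVED_CPU=%s\n' \
        "$memory" "$cpu" > "$out_file"
}
```

For example, `write_node_sizing "${NODE_SIZING_ENABLED}" /etc/node-sizing.env 3645Mi 90m` would write the calculated values when the feature is on and the defaults otherwise, so the kubelet unit always finds both variables in the file.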
|
||
### Changes to Existing Kubelet Service | ||
|
||
```toml | ||
[Unit] | ||
Description=Kubernetes Kubelet | ||
Wants=rpc-statd.service network-online.target | ||
Requires=crio.service kubelet-auto-node-size.service | ||
After=network-online.target crio.service kubelet-auto-node-size.service | ||
After=ostree-finalize-staged.service | ||
[Service] | ||
Type=notify | ||
ExecStartPre=/bin/mkdir --parents /etc/kubernetes/manifests | ||
ExecStartPre=/bin/rm -f /var/lib/kubelet/cpu_manager_state | ||
EnvironmentFile=/etc/os-release | ||
EnvironmentFile=-/etc/kubernetes/kubelet-workaround | ||
EnvironmentFile=-/etc/kubernetes/kubelet-env | ||
EnvironmentFile=/etc/node-sizing.env | ||
|
||
ExecStart=/usr/bin/hyperkube \ | ||
kubelet \ | ||
--config=/etc/kubernetes/kubelet.conf \ | ||
--bootstrap-kubeconfig=/etc/kubernetes/kubeconfig \ | ||
--kubeconfig=/var/lib/kubelet/kubeconfig \ | ||
--container-runtime=remote \ | ||
--container-runtime-endpoint=/var/run/crio/crio.sock \ | ||
--runtime-cgroups=/system.slice/crio.service \ | ||
--node-labels=node-role.kubernetes.io/worker,node.openshift.io/os_id=${ID} \ | ||
{{- if eq .IPFamilies "DualStack"}} | ||
--node-ip=${KUBELET_NODE_IPS} \ | ||
{{- else}} | ||
--node-ip=${KUBELET_NODE_IP} \ | ||
{{- end}} | ||
--address=${KUBELET_NODE_IP} \ | ||
--minimum-container-ttl-duration=6m0s \ | ||
--volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec \ | ||
--cloud-provider={{cloudProvider .}} \ | ||
{{cloudConfigFlag . }} \ | ||
--pod-infra-container-image={{.Images.infraImageKey}} \ | ||
--system-reserved=cpu=${SYSTEM_RESERVED_CPU},memory=${SYSTEM_RESERVED_MEMORY} \ | ||
--v=${KUBELET_LOG_LEVEL} | ||
|
||
Restart=always | ||
RestartSec=10 | ||
|
||
[Install] | ||
WantedBy=multi-user.target | ||
``` | ||
|
||
The node sizing values `SYSTEM_RESERVED_CPU` and `SYSTEM_RESERVED_MEMORY` used above are read from the environment file `/etc/node-sizing.env`. | ||
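When debugging a node, it can help to preview the flag the kubelet unit would receive. The sketch below is a hypothetical helper (`system_reserved_flag` is not part of the proposal) that sources an env file and prints the resulting `--system-reserved` argument. It assumes the file contains only plain `KEY=VALUE` lines; systemd's `EnvironmentFile=` parsing is not identical to shell sourcing, but the two agree for that simple case.

```bash
#!/usr/bin/env bash
# Hypothetical debugging helper; not part of the proposed services.

# system_reserved_flag ENV_FILE
# Prints the --system-reserved flag built from the values in ENV_FILE.
system_reserved_flag() {
    local env_file=$1
    (
        # The subshell keeps the sourced variables out of the caller's shell.
        # shellcheck disable=SC1090
        . "$env_file"
        echo "--system-reserved=cpu=${SYSTEM_RESERVED_CPU},memory=${SYSTEM_RESERVED_MEMORY}"
    )
}
```

With the example file shown earlier, `system_reserved_flag /etc/node-sizing.env` would print `--system-reserved=cpu=0.09,memory=3.5Gi`.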
|
||
### Test Plan | ||
The following workload can be used to test the automatically generated node sizing values. | ||
|
||
```yaml | ||
apiVersion: v1 | ||
kind: ReplicationController | ||
metadata: | ||
  name: badmem | ||
spec: | ||
  replicas: 1 | ||
  selector: | ||
    app: badmem | ||
  template: | ||
    metadata: | ||
      labels: | ||
        app: badmem | ||
    spec: | ||
      containers: | ||
      - args: | ||
        - python | ||
        - -c | ||
        - | | ||
          x = [] | ||
          while True: | ||
            x.append("x" * 1048576) | ||
        image: registry.redhat.io/rhel7:latest | ||
        name: badmem | ||
|
||
``` | ||
After submitting this ReplicationController the node should not end up in `NotReady` state. See https://bugzilla.redhat.com/show_bug.cgi?id=1857446 for more information. | ||
|
||
|
||
### Version Skew Strategy | ||
|
||
How will the component handle version skew with other components? | ||
What are the guarantees? Make sure this is in the test plan. | ||
|
||
Consider the following in developing a version skew strategy for this | ||
enhancement: | ||
- During an upgrade, we will always have skew among components, how will this impact your work? | ||
|
||
This functionality only modifies the kubelet's systemd service file so that it supplies values for the `--system-reserved` kubelet flag. As long as the kubelet keeps the `--system-reserved` flag, version skew should not affect this work. | ||
|
||
- Does this enhancement involve coordinating behavior in the control plane and | ||
in the kubelet? How does an n-2 kubelet without this feature available behave | ||
when this feature is used? | ||
|
||
N/A | ||
|
||
- Will any other components on the node change? For example, changes to CSI, CRI | ||
or CNI may require updating that component before the kubelet. | ||
|
||
No | ||
|
||
|
||
|
||
### Risks and Mitigations | ||
|
||
When auto node sizing is enabled, any bug in the script that calculates the optimal system reserved values can yield incorrect results. This could lead to degraded node performance or even a complete outage. | ||
|
||
Users can mitigate this by disabling auto node sizing. | ||
|
||
### Upgrade / Downgrade Strategy | ||
|
||
Since this feature is controlled using the `KubeletConfig`, upgrade/downgrade strategies applicable for the `KubeletConfig` are applicable here too. | ||
|
||
## Drawbacks | ||
|
||
This solution relies on kubelet command line flags. These flags have been deprecated in favour of the config file, so the solution is at risk if they are actually removed. That said, the flags are still widely used today, and there has been little traction on removing them even though they are marked deprecated. | ||
|
||
## Alternatives | ||
|
||
1. Enhance the kubelet itself to be smarter about calculating node sizing values. There is an actively debated [KEP](https://github.com/kubernetes/enhancements/pull/2370) in sig-node around this idea. | ||
2. Change the way MCO handles kubeletconfig. Instead of passing the `--system-reserved` argument to the kubelet, it may be possible to make MCO more tolerant of changes to the kubelet config file. The system reserved values would then be added to the config file rather than passed as `--system-reserved`. | ||
|
||
## Implementation History | ||
|
||
See https://github.com/openshift/machine-config-operator/pull/2466 |
Doesn't IPI have this knowledge in advance? The installer is the one that spawns the VMs in the cloud, so it knows the size. The same goes for bare metal with the assisted installer, which knows the values in advance. Not sure about Metal3/Ironic though.
We need a solution that works across all the clouds and install flavors. Metal would be problematic here to figure out up front, and sometimes machines have different hardware layouts. We plan on having this script run on each node which should alleviate cloud to bare metal discrepancies.