Auto Node Sizing #642

Merged
merged 1 commit into from
Jun 18, 2021
268 changes: 268 additions & 0 deletions enhancements/kubelet/kubelet-node-sizing.md
@@ -0,0 +1,268 @@
---
title: auto-node-sizing
authors:
- "@harche"
reviewers:
- "@rphillips"
approvers:
- "@rphillips"
creation-date: 2021-02-11
last-updated: 2021-02-11
status: implementable
see-also:
- https://bugzilla.redhat.com/show_bug.cgi?id=1857446
replaces:
superseded-by:
---

# Kubelet Auto Node Sizing

## Release Signoff Checklist

- [x] Enhancement is `implementable`
- [x] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)


## Summary

Nodes should have an automatic sizing mechanism that lets the kubelet scale its system reserved values for memory and CPU based on machine size.

Today the sizing values are passed manually to the kubelet using the `--kube-reserved` and `--system-reserved` flags. Many cloud providers publish reference values to help their customers select optimal values based on node size, for example [GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-architecture#memory_cpu) and [AKS](https://docs.microsoft.com/en-us/azure/aks/concepts-clusters-workloads#resource-reservations).

This enhancement proposes a mechanism to automatically determine the optimal sizing values for any node size irrespective of the cloud provider.

## Motivation

Kubelet’s `system reserved` and `kube reserved` settings play a crucial role in determining when resource-intensive pods are OOM-killed. Without adequate `system reserved` and `kube reserved` values, we risk freezing the node, making it completely unavailable to other pods.

We have observed that scaling the value of `system reserved` and `kube reserved` with respect to the installed capacity of the node helps to deduce optimal values. Larger nodes have capacity for more pods and will require larger system reserved values.

Currently, the only way to customize the `system reserved` and `kube reserved` limits is to pre-calculate the values manually prior to Kubelet start.

### Goals

* Enable Kubelet systemd service to determine the value of the `system reserved` automatically during start up.

### Non-Goals

* For now, the systemd service will only be used to calculate the values of `system reserved`. A similar approach could be taken to dynamically compute other kubelet parameters (e.g. `evictionHard`), but those are out of scope for this enhancement.
* Strictly from OpenShift's point of view, we only need to take care of `system reserved`, not `kube reserved`. Hence this proposal does not deal with generating optimal values for `kube reserved`.

### User Stories

* User wants to enable auto node sizing on nodes of the cluster to start the kubelet with optimal system reserved values.

## Proposal

* A new script, placed on the node, that calculates the system reserved values based on the node capacity.
* A new auto node sizing service that executes the script and stores the resulting system reserved values in a file.
* A modified kubelet service that reads that file and uses the generated values to start the kubelet daemon.

### Graduation Criteria

#### Dev Preview -> Tech Preview
* Successfully calculate and set the optimal system reserved values.
* End user documentation

#### Tech Preview -> GA
* More testing (upgrade, downgrade, scale)
* Optionally make it available during installation

#### Removing a deprecated feature

N/A

## Design Details

### Auto Node Sizing Enabler

During cluster installation, a file with the following content will be placed at `/etc/node-sizing-enabled.env`:

```bash
NODE_SIZING_ENABLED=false
SYSTEM_RESERVED_MEMORY=1Gi
SYSTEM_RESERVED_CPU=500m
```
Initially, `Auto Node Sizing` will be an optional feature, so `NODE_SIZING_ENABLED` will be set to `false` during installation, alongside the existing default values for system reserved memory and CPU. To enable the feature, `NODE_SIZING_ENABLED` can be set to `true` by applying the following `KubeletConfig`:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: dynamic-node
spec:
  autoSizingReserved: true
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
```
This will enable `Auto Node Sizing` on all the worker nodes. A similar approach can be taken to enable it on the `master` nodes or on a custom machine config pool.

### Auto Node Sizing Script

The script is located on the node at `/usr/local/sbin/dynamic-system-reserved-calc.sh`.

When `Auto Node Sizing` is enabled, the script probes the host for its installed resource capacity (such as the amount of installed RAM) and applies well-tested guidance to derive the corresponding system reserved values.

Examples of such guidance values for system reserved are provided by [GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-architecture#memory_cpu) and [AKS](https://docs.microsoft.com/en-us/azure/aks/concepts-clusters-workloads#resource-reservations).

When `Auto Node Sizing` is disabled, the script outputs the existing default values for system reserved.

The script writes the values to `/etc/node-sizing.env` in the following format:

```bash
$ cat /etc/node-sizing.env
SYSTEM_RESERVED_MEMORY=3.5Gi
SYSTEM_RESERVED_CPU=0.09
```
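
As an illustration of capacity-based guidance, the sketch below applies the memory tiers GKE publishes (25% of the first 4 GiB, 20% of the next 4 GiB, 10% of the next 8 GiB, 6% of the next 112 GiB, 2% of anything above 128 GiB). It is not the contents of `dynamic-system-reserved-calc.sh`; the function names `reserved_memory_mib` and `probe_total_mib` are hypothetical, the tiers are an assumption about what guidance the real script might follow, and GKE's special case for machines with less than 1 GiB is ignored.

```bash
# Hypothetical sketch, not the actual dynamic-system-reserved-calc.sh.
# Assumes GKE-style memory tiers; integer arithmetic rounds each tier down.

# reserved_memory_mib TOTAL_MIB: prints the number of MiB to reserve.
reserved_memory_mib() {
  remaining=$1
  reserved=0
  # Each tier is "size_in_MiB:percent".
  for tier in 4096:25 4096:20 8192:10 114688:6; do
    size=${tier%%:*}
    pct=${tier##*:}
    # Take as much of this tier as the remaining capacity allows.
    if [ "$remaining" -lt "$size" ]; then chunk=$remaining; else chunk=$size; fi
    reserved=$(( reserved + chunk * pct / 100 ))
    remaining=$(( remaining - chunk ))
  done
  # 2% of any capacity above 128 GiB.
  reserved=$(( reserved + remaining * 2 / 100 ))
  echo "$reserved"
}

# probe_total_mib: prints installed RAM in MiB (MemTotal is reported in kB).
probe_total_mib() {
  awk '/^MemTotal:/ {print int($2 / 1024)}' /proc/meminfo
}
```

On a Linux node, `echo "SYSTEM_RESERVED_MEMORY=$(reserved_memory_mib "$(probe_total_mib)")Mi"` would then produce a line in the format shown above. CPU could be handled with an analogous tier table keyed on core count.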
### Kubelet Auto Node Sizing Service

A new service, `kubelet-auto-node-size.service`, will run before the existing kubelet service to calculate the optimal values of system reserved.

```toml
[Unit]
Description=Dynamically sets the system reserved for the kubelet
Wants=network-online.target
After=network-online.target ignition-firstboot-complete.service
Before=kubelet.service crio.service
[Service]
# Need oneshot to delay kubelet
Type=oneshot
RemainAfterExit=yes
EnvironmentFile=/etc/node-sizing-enabled.env
ExecStart=/bin/bash /usr/local/sbin/dynamic-system-reserved-calc.sh ${NODE_SIZING_ENABLED}
[Install]
RequiredBy=kubelet.service
```
This service writes the recommended system reserved values to `/etc/node-sizing.env`. It depends on the systemd environment file `/etc/node-sizing-enabled.env`, described above, to determine whether the user has enabled the `Auto Node Sizing` feature. If the user has not enabled it, the service writes the default system reserved values used today to `/etc/node-sizing.env`.
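
The enabled/disabled branching could look roughly like the sketch below. It assumes the script receives the flag as its first argument (as in the `ExecStart` line) and inherits the defaults from `EnvironmentFile`. The names `write_node_sizing_env`, `compute_reserved_memory` and `compute_reserved_cpu` are hypothetical, and the two `compute_*` stubs simply return the example values from this document rather than performing a real calculation.

```bash
# Hypothetical sketch of the enabled/disabled branching; not the shipped script.

# Stubs standing in for the capacity-based calculation (assumed, illustrative values).
compute_reserved_memory() { echo "3.5Gi"; }
compute_reserved_cpu() { echo "0.09"; }

# write_node_sizing_env ENABLED OUT_FILE
# Writes SYSTEM_RESERVED_MEMORY and SYSTEM_RESERVED_CPU to OUT_FILE.
write_node_sizing_env() {
  enabled=$1
  out=$2
  if [ "$enabled" = "true" ]; then
    memory=$(compute_reserved_memory)
    cpu=$(compute_reserved_cpu)
  else
    # Fall back to the defaults supplied via /etc/node-sizing-enabled.env,
    # or to today's defaults if the variables are unset.
    memory=${SYSTEM_RESERVED_MEMORY:-1Gi}
    cpu=${SYSTEM_RESERVED_CPU:-500m}
  fi
  printf 'SYSTEM_RESERVED_MEMORY=%s\nSYSTEM_RESERVED_CPU=%s\n' "$memory" "$cpu" > "$out"
}
```

Under these assumptions, the service would effectively run `write_node_sizing_env "${NODE_SIZING_ENABLED}" /etc/node-sizing.env`. Writing the file in both branches keeps the kubelet unit's non-optional `EnvironmentFile=/etc/node-sizing.env` satisfied whether or not the feature is enabled.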

### Changes to Existing Kubelet Service

```toml
[Unit]
Description=Kubernetes Kubelet
Wants=rpc-statd.service network-online.target
Requires=crio.service kubelet-auto-node-size.service
After=network-online.target crio.service kubelet-auto-node-size.service
After=ostree-finalize-staged.service
[Service]
Type=notify
ExecStartPre=/bin/mkdir --parents /etc/kubernetes/manifests
ExecStartPre=/bin/rm -f /var/lib/kubelet/cpu_manager_state
EnvironmentFile=/etc/os-release
EnvironmentFile=-/etc/kubernetes/kubelet-workaround
EnvironmentFile=-/etc/kubernetes/kubelet-env
EnvironmentFile=/etc/node-sizing.env

ExecStart=/usr/bin/hyperkube \
kubelet \
--config=/etc/kubernetes/kubelet.conf \
--bootstrap-kubeconfig=/etc/kubernetes/kubeconfig \
--kubeconfig=/var/lib/kubelet/kubeconfig \
--container-runtime=remote \
--container-runtime-endpoint=/var/run/crio/crio.sock \
--runtime-cgroups=/system.slice/crio.service \
--node-labels=node-role.kubernetes.io/worker,node.openshift.io/os_id=${ID} \
{{- if eq .IPFamilies "DualStack"}}
--node-ip=${KUBELET_NODE_IPS} \
{{- else}}
--node-ip=${KUBELET_NODE_IP} \
{{- end}}
--address=${KUBELET_NODE_IP} \
--minimum-container-ttl-duration=6m0s \
--volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec \
--cloud-provider={{cloudProvider .}} \
{{cloudConfigFlag . }} \
--pod-infra-container-image={{.Images.infraImageKey}} \
--system-reserved=cpu=${SYSTEM_RESERVED_CPU},memory=${SYSTEM_RESERVED_MEMORY} \
--v=${KUBELET_LOG_LEVEL}

Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```

The node sizing values `SYSTEM_RESERVED_CPU` and `SYSTEM_RESERVED_MEMORY` used above are read from the environment file `/etc/node-sizing.env`.
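
To check what the kubelet will receive, the expansion systemd performs can be reproduced by hand. The helper below is a hypothetical debugging aid, not part of the proposal: `system_reserved_flag` is an assumed name, and it only mirrors the `--system-reserved=cpu=...,memory=...` argument from the unit file above.

```bash
# Hypothetical debugging helper; mirrors the --system-reserved line in the unit file.

# system_reserved_flag ENV_FILE: prints the flag the kubelet would be started with.
system_reserved_flag() {
  env_file=$1
  # Extract KEY=VALUE lines without executing the file.
  cpu=$(sed -n 's/^SYSTEM_RESERVED_CPU=//p' "$env_file")
  memory=$(sed -n 's/^SYSTEM_RESERVED_MEMORY=//p' "$env_file")
  if [ -z "$cpu" ] || [ -z "$memory" ]; then
    echo "missing SYSTEM_RESERVED_CPU or SYSTEM_RESERVED_MEMORY in $env_file" >&2
    return 1
  fi
  echo "--system-reserved=cpu=${cpu},memory=${memory}"
}
```

Running `system_reserved_flag /etc/node-sizing.env` on a node prints the flag as it would be expanded. A non-zero return indicates that the sizing service did not produce both values, in which case the kubelet would be started with an empty quantity.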

### Test Plan
The following workload can be used to test the automatically generated node sizing values.

```yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: badmem
spec:
  replicas: 1
  selector:
    app: badmem
  template:
    metadata:
      labels:
        app: badmem
    spec:
      containers:
      - args:
        - python
        - -c
        - |
          x = []
          while True:
            x.append("x" * 1048576)
        image: registry.redhat.io/rhel7:latest
        name: badmem

```
After submitting this ReplicationController the node should not end up in `NotReady` state. See https://bugzilla.redhat.com/show_bug.cgi?id=1857446 for more information.


### Version Skew Strategy

How will the component handle version skew with other components?
What are the guarantees? Make sure this is in the test plan.

Consider the following in developing a version skew strategy for this
enhancement:
- During an upgrade, we will always have skew among components, how will this impact your work?

This functionality only modifies the kubelet's systemd service file so that it supplies values for the `--system-reserved` kubelet flag. As long as the kubelet keeps the `--system-reserved` flag, version skew should not have any impact on this work.

- Does this enhancement involve coordinating behavior in the control plane and
in the kubelet? How does an n-2 kubelet without this feature available behave
when this feature is used?

N/A

- Will any other components on the node change? For example, changes to CSI, CRI
or CNI may require updating that component before the kubelet.

No



### Risks and Mitigations

When auto node sizing is enabled, any bug in the script that calculates the optimal system reserved values can yield incorrect results. This could lead to degraded node performance or even a complete outage.

Users can mitigate this by disabling auto node sizing.

### Upgrade / Downgrade Strategy

Since this feature is controlled using the `KubeletConfig`, upgrade/downgrade strategies applicable for the `KubeletConfig` are applicable here too.

## Drawbacks

This solution relies on kubelet command line flags. These flags have been deprecated in favour of the config file, so the solution is at risk if they are actually removed. That said, the flags are still widely used today, and there has been little traction on removing them even though they are marked deprecated.

## Alternatives

Doesn't IPI have this knowledge in advance? The installer is the one that spawns the VMs in the cloud, so it knows the size. The same holds for BM with the assisted installer, where the values are known in advance. Not sure about Metal3/ironic though.


We need a solution that works across all the clouds and install flavors. Metal would be problematic here to figure out up front, and sometimes machines have different hardware layouts. We plan on having this script run on each node which should alleviate cloud to bare metal discrepancies.


1. Enhance the kubelet itself to be smarter about calculating node sizing values. There is an actively debated [KEP](https://github.com/kubernetes/enhancements/pull/2370) in sig-node around this idea.
2. Modify the way MCO handles kubeletconfig. Instead of passing the `--system-reserved` argument to the kubelet, it may be possible to make MCO more tolerant of changes to the kubelet config file. The system reserved values would then be added to the config file instead of being passed as `--system-reserved`.

## Implementation History

See https://github.com/openshift/machine-config-operator/pull/2466