From 9aeaf46caa074c01b85e8b264e0b74478ab96ecc Mon Sep 17 00:00:00 2001
From: ehila 
Date: Mon, 8 Aug 2022 23:45:05 -0400
Subject: [PATCH 01/18] feat: add wide cluster configuration for workload partitioning

initial commit of proposal

Signed-off-by: ehila 

doc: updated wording and updated reviewers and approvers

added clarity on risks and goals

Signed-off-by: ehila 
---
 ...wide-availability-workload-partitioning.md | 407 ++++++++++++++++++
 1 file changed, 407 insertions(+)
 create mode 100644 enhancements/workload-partitioning/wide-availability-workload-partitioning.md

diff --git a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
new file mode 100644
index 0000000000..8e2140fd1e
--- /dev/null
+++ b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
@@ -0,0 +1,407 @@
---
title: wide-availability-workload-partitioning
authors:
  - "@eggfoobar"
reviewers:
  - "@jerpeter1"
approvers:
  - "@jerpeter1"
api-approvers:
  - None
creation-date: 2022-08-03
last-updated: 2022-08-09
tracking-link:
  - https://issues.redhat.com/browse/CNF-5562
see-also:
  - "/enhancements/workload-partitioning"
  - "/enhancements/node-tuning/pao-in-nto.md"
---

# Wide Availability Workload Partitioning

## Summary

This enhancement builds on top of the [Management Workload
Partitioning](management-workload-partitioning.md) and the [move of PAO into
NTO](../node-tuning/pao-in-nto.md) enhancements to extend workload partitioning
to our wider cluster configurations. The previous workload partitioning work was
isolated to Single Node cluster configurations; this enhancement seeks to allow
customers to configure workload partitioning on HA as well as Compact (3NC)
clusters.

## Motivation

Customers who want us to reduce the resource consumption of management workloads
have a fixed budget of CPU cores in mind. We want to use the normal scheduling
capabilities of Kubernetes to manage the number of pods that can be placed onto
those cores, and we want to avoid mixing management and normal workloads there.
Expanding on the already built workload partitioning, we should be able to
supply the same functionality to HA and 3NC clusters.

### User Stories

As a cluster creator, I want to pin the management pods of OpenShift in
Compact (3NC) and HA clusters to specific CPU sets, so that the platform
workloads are isolated from my application workloads, which require high
performance and determinism.

### Goals

- This enhancement describes an approach for configuring OpenShift clusters to
  run with management workloads on a restricted set of CPUs.
- Clusters built in this way should pass the same Kubernetes and OpenShift
  conformance and functional end-to-end tests as similar deployments that are
  not isolating the management workloads.
- We want to run different workload partitioning on masters and workers.
- Customers will be advised to reserve 4 hyperthreaded cores for masters and 2
  hyperthreaded cores for workers.
- We want a general approach that can be applied to all OpenShift control plane
  and per-node components via the PerformanceProfile.
- We want to be clear with customers that this enhancement is a day 0 supported
  feature only. We do not support turning it off after the installation is done
  and the feature is on.
### Non-Goals

This enhancement expands on the existing [Management Workload
Partitioning](management-workload-partitioning.md) and as such shares similar
but slightly different non-goals.

- This enhancement is focused on CPU resources. Other compressible resource
  types may need to be managed in the future, and those are likely to need
  different approaches.
- This enhancement does not address mixing node partitioning; this feature will
  be enabled cluster wide and encapsulates both master and worker pools. If
  pinning is not desired for a node, the setting will still be turned on but the
  management workloads will run on the whole CPU set of that node.
- This enhancement does not address non-compressible resource requests, such as
  for memory.
- This enhancement does not address ways to disable operators or operands
  entirely.
- This enhancement does not address reducing actual utilization, beyond
  providing a way to have a predictable upper bound. There is no expectation
  that a cluster configured to use a small number of cores for management
  services would offer exactly the same performance as the default. It must be
  stable and continue to operate reliably, but may respond more slowly.
- This enhancement assumes that the configuration of a management CPU pool is
  done as part of installing the cluster. It can be changed after the fact, but
  we will need to stipulate that this is currently not supported. The intent
  here is for this to be supported as a day 0 feature, only.

## Proposal

In order to implement this enhancement, we are focused on changing 2 components.

1. Admission controller ([management cpus
   override](https://github.com/openshift/kubernetes/blob/a9d6306a701d8fa89a548aa7e132603f6cd89275/openshift-kube-apiserver/admission/autoscaling/managementcpusoverride/doc.go))
   in openshift/kubernetes.
1. The
   [PerformanceProfile](https://github.com/openshift/cluster-node-tuning-operator/blob/master/docs/performanceprofile/performance_profile.md)
   part of [Cluster Node Tuning
   Operator](https://github.com/openshift/cluster-node-tuning-operator)

We want to remove the checks in the admission controller which specifically
check that partitioning is only applied to the single node topology
configuration. The design and configuration for any pod modification will remain
the same; we will simply allow partitioning to be applied on non-single-node
topologies.

Workload pinning involves configuring CRI-O and Kubelet. Currently, this is done
through a machine config that contains both of those configurations. This can
pose problems, as the CPU set value has to be copied into both of those
configurations. We want to simplify the current implementation and apply both of
these configurations via the `PerformanceProfile` CRD.

We want to add a new `Workloads` field to the `CPU` field that contains the
configuration information for `enablePinning`. We are not sure where we would
want to take workload pinning in the future, and to allow flexibility we want to
place the configuration in `cpu.workloads`.
```yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: openshift-node-workload-partitioning-custom
spec:
  cpu:
    isolated: 2-3
    reserved: 0,1
    # New addition
    workloads:
      enablePinning: true
```

### Workflow Description

The end user will be expected to provide a `PerformanceProfile` manifest that
describes their desired `isolated` and `reserved` CPU sets, with the
`workloads.enablePinning` flag set to true. This manifest will be applied during
the installation process.

**High level sequence diagram:**

```mermaid
sequenceDiagram
Alice->>Installer: Provide PerformanceProfile manifest
Installer-->>NTO: Apply
NTO-->>MCO: Generated Machine Manifests
MCO-->>Node: Configure node
loop Apply
    Node->>Node: Set kubelet config
    Node->>Node: Set crio config
    Node->>Node: Kubelet advertises cores
end
Node-->>MCO: Finished Restart
MCO-->>NTO: Machine Manifests Applied
NTO-->>Installer: PerformanceProfile Applied
Installer-->>Alice: Cluster is Up!
```

- **Alice** is a human user who creates an OpenShift cluster.
- **Installer** is the assisted installer that applies the user manifests.
- **NTO** is the Node Tuning Operator.
- **MCO** is the Machine Config Operator.
- **Node** is the Kubernetes node.

1. Alice sits down and provides the desired performance profile as an extra
   manifest to the installer.
1. The installer applies the manifest.
1. The NTO will generate the appropriate machine configs that include the
   kubelet config and the CRI-O config to be applied, in addition to its
   existing operation.
1. Once the MCO applies the configs, the node is restarted and the cluster
   installation continues to completion.
1. Alice will now have a cluster that has been set up with workload pinning.

#### Variation [optional]

This section outlines an end-to-end workflow for deploying a cluster with
workload partitioning enabled and how pods are correctly scheduled to run on the
management CPU pool.

1. User sits down at their computer.
1. The user creates a `PerformanceProfile` resource with the desired `isolated`
   and `reserved` CPUSet with the `cpu.workloads.enablePinning` set to true.
1. The user runs the installer to create the standard manifests, adds their
   extra manifests from step 2, then creates the cluster.
1. NTO will generate the machine config manifests and apply them.
1. The kubelet starts up and finds the configuration file enabling the new
   feature.
1. The kubelet advertises `management.workload.openshift.io/cores` extended
   resources on the node based on the number of CPUs in the host.
1. The kubelet reads static pod definitions. It replaces the `cpu` requests with
   `management.workload.openshift.io/cores` requests of the same value and adds
   the `resources.workload.openshift.io/{container-name}: {"cpushares": 400}`
   annotations for CRI-O with the same values.
1. Something schedules a regular pod with the
   `target.workload.openshift.io/management` annotation in a namespace with the
   `workload.openshift.io/allowed: management` annotation.
1. The admission hook modifies the pod, replacing the CPU requests with
   `management.workload.openshift.io/cores` requests and adding the
   `resources.workload.openshift.io/{container-name}: {"cpushares": 400}`
   annotations for CRI-O.
1. The scheduler sees the new pod and finds available
   `management.workload.openshift.io/cores` resources on the node. The scheduler
   places the pod on the node.
1. Repeat steps 8-10 until all pods are running.
1. Cluster deployment comes up with management components constrained to a
   subset of available CPUs.

### API Extensions

- We want to extend the `PerformanceProfile` API to include the addition of a
  new `workloads` configuration under the `cpu` field.
- The behavior of existing resources should not change with this addition.
- New resources that make use of this new field will have the additional
  configuration added to the currently generated machine config manifests.
  - Uses the `isolated` field to add the CRI-O and Kubelet configuration files
    to the currently generated machine config.
  - If no `isolated` CPU set is provided and `enablePinning` is set to true, the
    default behavior is to use the full CPU set, as if workloads were not
    pinned.

Example change:

```yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: openshift-node-workload-partitioning-custom
spec:
  cpu:
    isolated: 2-3
    reserved: 0,1
    # New addition
    workloads:
      enablePinning: true
```

### Implementation Details/Notes/Constraints [optional]

#### Changes to NTO

The NTO PerformanceProfile will be updated to support a new flag which will
toggle the workload pinning to the `isolated` cores. The idea here is to
simplify the approach for how customers set this configuration. With PAO being
part of NTO now ([see here for more info](../node-tuning/pao-in-nto.md)), this
affords us the chance to consolidate the configuration for `kubelet` and `crio`.

We will modify the code path that generates the [new machine
config](https://github.com/openshift/cluster-node-tuning-operator/blob/a780dfe07962ad07e4d50c852047ef8cf7b287da/pkg/performanceprofile/controller/performanceprofile/components/machineconfig/machineconfig.go#L91-L127)
using the performance profile. With the new `workloads.enablePinning` flag we
will add the configuration for `crio` and `kubelet` to the final machine config
manifest. Then the existing code path will apply the change as normal.

#### API Server Admission Hook

The existing admission hook has 4 checks when it comes to workload pinning.

1. Check if `pod` is a static pod
   - Skips modification attempt if it is static.
1. Checks if the currently running cluster topology is Single Node
   - Skips modification if it is anything other than Single Node
1. Checks if all running nodes are managed
   - Skips modification if any of the nodes are not managed
1. Checks what resource limits and requests are set on the pod
   - Skips modification if QoS is guaranteed or both limits and requests are set
   - Skips modification if the update would change the QoS

We will need to alter the code in the admission controller to remove the check
for Single Node topology; the other checks should remain untouched.

### Risks and Mitigations

The same risks and mitigations highlighted in [Management Workload
Partitioning](management-workload-partitioning.md) apply to this enhancement as
well.

We need to make it very clear to customers that this feature is supported as a
day 0 configuration and that day n+1 alterations are not supported with this
enhancement. Part of that messaging should involve a clear indication that this
should be a cluster wide feature.

A risk we can run into is that a customer applies a CPU set that is too small or
out of bounds, which can cause problems such as extremely poor performance or
start-up errors. Mitigation of this scenario will be to provide proper guidance
and guidelines for customers who enable this enhancement.
Furthermore, the
performance team would need to be consulted for more specific information
around the upper and lower bounds of CPU sets for running an OpenShift cluster.

It is possible to build a cluster with the feature enabled and then add a node
in a way that leaves workload partitioning unconfigured only on that node. We do
not support this configuration, as all nodes must have the feature turned on.
However, there might be a race condition where a node is added and is in the
process of being restarted with workload partitioning; during that process, pods
being admitted will trigger a warning. We expect the resulting error message
described in [failure modes](#failure-modes) to explain the problem well enough
for admins to recover.

A possible risk is cluster upgrades; this is the first time this enhancement
will apply to multi-node clusters, and we need to run more tests on upgrade
cycles to make sure things run as expected.

### Drawbacks

This feature contains the same drawbacks as [Management Workload
Partitioning](management-workload-partitioning.md).

Several of the changes described above are patches that we may end up carrying
downstream indefinitely. Some version of a more general "CPU pool" feature may
be acceptable upstream, and we could reimplement management workload
partitioning to use that new implementation.

## Design Details

### Open Questions [optional]

N/A

### Test Plan

We will add a CI job with a cluster configuration that reflects the minimum of
2 CPU/4 vCPU masters and 1 CPU/2 vCPU workers. This job should ensure that
cluster deployments configured with management workload partitioning pass the
compliance tests.

We will add a CI job to ensure that all release payload workloads have the
`target.workload.openshift.io/management` annotation and their namespaces have
the `workload.openshift.io/allowed` annotation.

### Graduation Criteria

#### Dev Preview -> Tech Preview

- Ability to utilize the enhancement end to end
- End user documentation, relative API stability
- Sufficient test coverage
- Gather feedback from users rather than just developers
- Enumerate service level indicators (SLIs), expose SLIs as metrics
- Write symptoms-based alerts for the component(s)

#### Tech Preview -> GA

- More testing (upgrade, downgrade, scale)
- Sufficient time for feedback
- Available by default
- Backhaul SLI telemetry
- Document SLOs for the component
- Conduct load testing
- User facing documentation created in
  [openshift-docs](https://github.com/openshift/openshift-docs/)

#### Removing a deprecated feature

- Announce deprecation and support policy of the existing feature
- Deprecate the feature

### Upgrade / Downgrade Strategy

This new behavior will be added in 4.12 as part of the installation
configurations for customers to utilize.

Enabling the feature after installation is not supported in 4.12, so we do not
need to address what happens if an older cluster upgrades and then the feature
is turned on.

### Version Skew Strategy

N/A

### Operational Aspects of API Extensions

The addition to the API is an optional field, which should not require any
conversion or admission webhook changes. This change will only be used to allow
the user to explicitly define their intent, and it simplifies the machine
manifests by generating the extra machine config manifests that are currently
created independently of the `PerformanceProfile` CRD.
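To make the shape of the addition concrete, here is a minimal sketch of how the
optional field might look in the PerformanceProfile Go types; the type and field
names are assumptions based on the YAML examples in this proposal, not the final
API:

```go
package performanceprofile

// CPUSet mirrors the string-based CPU set notation used by the
// PerformanceProfile API, e.g. "0,1" or "2-3".
type CPUSet string

// CPU is the existing cpu section, extended with the proposed workloads field.
type CPU struct {
	// Reserved defines the CPU set reserved for housekeeping and management.
	Reserved *CPUSet `json:"reserved,omitempty"`
	// Isolated defines the CPU set isolated for application workloads.
	Isolated *CPUSet `json:"isolated,omitempty"`
	// Workloads is optional; omitting it leaves existing behavior unchanged.
	// +optional
	Workloads *CPUWorkloads `json:"workloads,omitempty"`
}

// CPUWorkloads holds the proposed workload partitioning configuration.
type CPUWorkloads struct {
	// EnablePinning pins management workloads to the partitioned CPU set.
	EnablePinning bool `json:"enablePinning,omitempty"`
}
```

Because the new field is a pointer and optional, existing profiles that omit
`workloads` deserialize exactly as before, which is what makes this a
non-breaking addition.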
Furthermore, the design and scope of this enhancement mean that the existing
admission webhook will continue to apply the same warnings and error messages to
pods as described in the [failure modes](#failure-modes).

#### Failure Modes

In a failure situation, we want to try to keep the cluster operational.
Therefore, there are a few conditions under which the admission hook will strip
the workload annotations and add an annotation `workload.openshift.io/warning`
with a message warning the user that their partitioning instructions were
ignored. These conditions are:

1. When a pod has the Guaranteed QoS class
1. When mutation would change the QoS class for the pod
1. When the feature is inactive because not all nodes are reporting the
   management resource

#### Support Procedures

N/A

## Implementation History

WIP

## Alternatives

N/A

## Infrastructure Needed [optional]

N/A

From 6a08295f2cf8792126338e504d5e25ca97341624 Mon Sep 17 00:00:00 2001
From: ehila 
Date: Tue, 9 Aug 2022 19:34:44 -0400
Subject: [PATCH 02/18] updated approvers and reviewers

Signed-off-by: ehila 
---
 .../wide-availability-workload-partitioning.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
index 8e2140fd1e..b35a7b8ff4 100644
--- a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
+++ b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
@@ -4,8 +4,13 @@ authors:
  - "@eggfoobar"
reviewers:
  - "@jerpeter1"
  - "@mrunalp"
  - "@rphillips"
  - "@browsell"
  - "@haircommander"
approvers:
  - "@jerpeter1"
  - "@mrunalp"
api-approvers:
  - None

From d5c9cedc7334a77317713a3e232169c2ef4fd1e1 Mon Sep 17 00:00:00 2001
From: ehila 
Date: Tue, 9 Aug 2022 19:45:37 -0400
Subject: [PATCH 03/18] updated wording for performance profile for clarity

Signed-off-by: ehila 
---
 ...wide-availability-workload-partitioning.md | 56 ++++++++++++-------
 1 file changed, 36 insertions(+), 20 deletions(-)

diff --git a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
index b35a7b8ff4..9a9bc70ba9 100644
--- a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
+++ b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
@@ -60,8 +60,9 @@ determinism required of my applications.
- We want to run different workload partitioning on masters and workers.
- Customers will be advised to reserve 4 hyperthreaded cores for masters and 2
  hyperthreaded cores for workers.
- We want to consolidate the current approach and extend the PerformanceProfile
  API to avoid any possible errors when configuring a cluster for workload
  partitioning.
- We want to be clear with customers that this enhancement is a day 0 supported
  feature only. We do not support turning it off after the installation is done
  and the feature is on.
@@ -70,15 +71,14 @@

This enhancement expands on the existing [Management Workload
Partitioning](management-workload-partitioning.md) and as such shares similar
but slightly different non-goals.
> Note: Items in bold are modifications/additions from the previous enhancement,
> [Management Workload Partitioning](management-workload-partitioning.md)

- This enhancement is focused on CPU resources. Other compressible resource
  types may need to be managed in the future, and those are likely to need
  different approaches.
- This enhancement does not address non-compressible resource requests, such as
  for memory.
- This enhancement does not address ways to disable operators or operands
  entirely.
- This enhancement does not address reducing actual utilization, beyond
  providing a way to have a predictable upper bound. There is no expectation
  that a cluster configured to use a small number of cores for management
  services would offer exactly the same performance as the default. It must be
  stable and continue to operate reliably, but may respond more slowly.
- **This enhancement does not address mixing nodes with pinning and without;
  this feature will be enabled cluster wide and encapsulates both master and
  worker pools. If pinning is not desired for a pool, the setting will still be
  turned on but the management workloads will run on the whole CPU set for that
  pool.**
- **This enhancement assumes that the configuration of a management CPU pool is
  done as part of installing the cluster. It can be changed after the fact, but
  we will need to stipulate that this is currently not supported. The intent
  here is for this to be supported as a day 0 feature, only.**

## Proposal

In order to implement this enhancement, we are focused on changing 2 components.

1. Admission Controller ([management cpus
   override](https://github.com/openshift/kubernetes/blob/a9d6306a701d8fa89a548aa7e132603f6cd89275/openshift-kube-apiserver/admission/autoscaling/managementcpusoverride/doc.go))
   in openshift/kubernetes.
1. The
   [Performance Profile Controller](https://github.com/openshift/cluster-node-tuning-operator/blob/master/docs/performanceprofile/performance_profile.md)
   part of [Cluster Node Tuning
   Operator](https://github.com/openshift/cluster-node-tuning-operator)

### Admission Controller

We want to remove the checks in the admission controller which specifically
verify that partitioning is only applied to the single node topology
configuration. The design and configuration for any pod modification will remain
the same; we will simply allow partitioning to be applied on non-single-node
topologies.

### Performance Profile Controller

We want to consolidate the current approach to setting up workload partitioning.
Currently workload partitioning involves configuring CRI-O and Kubelet earlier
in the process as a separate machine config manifest that requires the same
information present in the `PerformanceProfile` resource, namely the
`isolated` CPU set. Because configuring multiple resources with the right CPU
sets consistently is error prone, we want to extend the PerformanceProfile API
to include settings for workload partitioning. We want to simplify the current
implementation and apply both of these configurations via the
`PerformanceProfile` resource.

#### Variation [optional]

This section outlines an end-to-end workflow for deploying a cluster with
workload partitioning enabled and how pods are correctly scheduled to run on the
management CPU pool.

> Note: Items in bold are modifications/additions from the previous enhancement,
> [Management Workload Partitioning](management-workload-partitioning.md)

1. User sits down at their computer.
1. **The user creates a `PerformanceProfile` resource with the desired `isolated`
   and `reserved` CPUSet with the `cpu.workloads.enablePinning` set to true.**
1. The user runs the installer to create the standard manifests, adds their
   extra manifests from step 2, then creates the cluster.
1. **NTO will generate the machine config manifests and apply them.**

From 06d1c1387385c106ee705c837e63eb0644dc6541 Mon Sep 17 00:00:00 2001
From: ehila 
Date: Mon, 15 Aug 2022 15:07:42 -0400
Subject: [PATCH 04/18] update: added partition resize information

added variation for partition size
added explicit goal to maintain partition resize

Signed-off-by: ehila 
---
 ...wide-availability-workload-partitioning.md | 26 +++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
index 9a9bc70ba9..85ff77d9df 100644
--- a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
+++ b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
@@ -66,6 +66,9 @@ determinism required of my applications.
- We want to be clear with customers that this enhancement is a day 0 supported
  feature only. We do not support turning it off after the installation is done
  and the feature is on.
- We want to maintain the ability to configure the partition size after
  installation; we do not support turning off the partition feature, but we do
  support changing the CPU partition size post day 0.
@@ -192,6 +195,8 @@ Installer-->>Alice: Cluster is Up!

#### Variation [optional]

##### E2E Workflow deployment

This section outlines an end-to-end workflow for deploying a cluster with
@@ -227,6 +232,22 @@ management CPU pool.
1. Cluster deployment comes up with management components constrained to a
   subset of available CPUs.
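To make the admission mutation in the E2E workflow above concrete, here is an
illustrative before/after of a management pod. The pod name, namespace, and
values are examples, while the annotation and extended-resource names are the
ones defined by the existing Management Workload Partitioning enhancement:

```yaml
# Before admission (illustrative): a management pod with a normal CPU request.
apiVersion: v1
kind: Pod
metadata:
  name: example-operator
  # The namespace carries the workload.openshift.io/allowed: management annotation.
  namespace: openshift-example
  annotations:
    target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
spec:
  containers:
  - name: operator
    resources:
      requests:
        cpu: 400m
        memory: 100Mi
---
# After admission (illustrative): the CPU request is replaced with the extended
# resource, and a CRI-O annotation carries the original cpushares value.
apiVersion: v1
kind: Pod
metadata:
  name: example-operator
  namespace: openshift-example
  annotations:
    target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
    resources.workload.openshift.io/operator: '{"cpushares": 400}'
spec:
  containers:
  - name: operator
    resources:
      requests:
        management.workload.openshift.io/cores: "400"
        memory: 100Mi
```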
##### Partition Resize workflow

This section outlines an end-to-end workflow for resizing the CPUSet partition.

> Note: Items in bold are modifications/additions from the previous enhancement,
> [Management Workload Partitioning](management-workload-partitioning.md)

1. User sits down at their computer.
1. **The user updates the `PerformanceProfile` resource with the new desired
   `isolated` and new `reserved` CPUSet with the `cpu.workloads.enablePinning`
   set to true.**
1. **NTO will re-generate the machine config manifests and apply them.**
1. ... Steps same as [E2E Workflow deployment](#e2e-workflow-deployment) ...
1. Cluster deployment comes up with management components constrained to a
   subset of available CPUs.

A risk we can run into is that a customer applies a CPU set that is too small or
out of bounds, which can cause problems such as extremely poor performance or
start-up errors. Mitigation of this scenario will be to provide proper guidance
and guidelines for customers who enable this enhancement. As mentioned in our
goals, we do support re-configuring the CPUSet partition size after
installation. The performance team would need to be consulted for more specific
information around the upper and lower bounds of CPU sets for running an
OpenShift cluster.

From a62ba06a5132355b58f6fa7afe8fb30c1cc4a71b Mon Sep 17 00:00:00 2001
From: ehila 
Date: Wed, 31 Aug 2022 03:36:47 -0400
Subject: [PATCH 05/18] updated with reworked approach

added information for new approach to avoid race condition
added information about how to limit users' ability to turn off feature and
 install time configuration

Signed-off-by: ehila 
---
 ...wide-availability-workload-partitioning.md | 322 ++++++++++++++----
 1 file changed, 247 insertions(+), 75 deletions(-)

diff --git a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
index 85ff77d9df..3b39f73962 100644
--- a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
+++ b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
@@ -63,12 +63,15 @@ determinism required of my applications.
- We want to consolidate the current approach and extend the PerformanceProfile
  API to avoid any possible errors when configuring a cluster for workload
  partitioning.
- We want to make this enhancement a day 0 supported feature only. We do not
  support turning it off after the installation is done and the feature is on.
- We want to make sure that when this feature is on and no CPU set limit is
  defined via the Performance Profile, the default behavior is to allow full use
  of the CPU set.
- We want to maintain the ability to configure the partition size after
  installation; we do not support turning off the partition feature, but we do
  support changing the CPU partition size post day 0. The ability to turn on
  this feature will be part of the installation phase only.

## Proposal

We will need to maintain a global identifier that is set during installation and
cannot be easily removed after the fact. This approach avoids exposing this
feature via an API and limits the chances that a misconfiguration can cause
unrecoverable scenarios for our customers. At install time we will also apply an
initial machine config for workload partitioning that sets the default CPU set
to the whole CPU set. Effectively, this will behave as if workload partitioning
is not turned on. With this approach we eliminate the race condition that can
occur if we apply a machine config after the fact as nodes join the cluster.
When a customer wishes to pin the management workloads, they will be able to do
that via the existing Performance Profile. Resizing the partition after
installation will not cause any issues.

In order to implement this enhancement we are proposing changing 4 components
defined below.

1. Openshift API - ([Infrastructure
   Status](https://github.com/openshift/api/blob/81fadf1ca0981f800029bc5e2fe2dc7f47ce698b/config/v1/types_infrastructure.go#L51))

   - The change in this component is to store a global identifier if we have
     partitioning turned on.

2. [Openshift Installer](https://github.com/openshift/installer)

   - The change in the installer is to support turning on this feature only
     during installation, as well as to apply the partitioning machine configs
     so that we avoid the race conditions when running in multi-node
     environments.

3. Admission Controller ([management cpus
   override](https://github.com/openshift/kubernetes/blob/a9d6306a701d8fa89a548aa7e132603f6cd89275/openshift-kube-apiserver/admission/autoscaling/managementcpusoverride/doc.go))
   in openshift/kubernetes.
   - This change will be in support of checking the global identifier in order
     to modify the pod spec with the correct `requests`.
4. The
   [Performance Profile Controller](https://github.com/openshift/cluster-node-tuning-operator/blob/master/docs/performanceprofile/performance_profile.md)
   part of [Cluster Node Tuning
   Operator](https://github.com/openshift/cluster-node-tuning-operator)
   - This change will support adding the ability to explicitly pin
     `Infrastructure` workloads.

### Openshift API - Infrastructure Status

We will need to maintain a status flag to be able to identify if the cluster we
are operating in has been set up for partitioning or not. Since this identifier
signifies that a cluster's infrastructure is set up for workload partitioning,
we feel that this information should be part of the `Infrastructure Status`
resource. This identifier will be an enum that is set during installation; for
existing single-node deployments, we will set it during an upgrade via the Node
Tuning Operator.

We propose that in the current implementation we only support either `None` or
`AllNodes`.

```go
type InfrastructureStatus struct {
  ...
  // cpuPartitioning expresses if the cluster nodes have CPU Set partitioning
  // turned on. The default of None means that no CPU Set partitioning is set.
  // If AllNodes is set, that indicates that all the nodes have partitioning
  // turned on, and workloads might be pinned to specific CPU Sets depending on
  // the configuration set via the Node Tuning Operator and the Performance
  // Profile API.
  // +kubebuilder:default=None
  // +kubebuilder:validation:Enum=None;AllNodes
  CPUPartitioning CPUPartitioningMode `json:"cpuPartitioning"`
}

// CPUPartitioningMode defines the CPU partitioning mode of the nodes.
type CPUPartitioningMode string

const (
  // No partitioning is applied.
  CPUPartitioningNone CPUPartitioningMode = "None"
  // All nodes are configured for partitioning.
  CPUPartitioningAllNodes CPUPartitioningMode = "AllNodes"
)
```

### Openshift Installer

We will need to modify the Openshift installer to set and generate the machine
configs for the initial setup. The generated machine config manifests will be
set to be wide open to the whole CPU set. However, because these manifests are
applied early on in the process, we avoid race condition situations that might
arise if they are applied after installation.

In a similar approach to the `openshift/api` change, we propose adding a new
field to the install configuration that will flag a cluster for CPU
partitioning during installation.

```go
// CPUPartitioningMode defines how the nodes of the cluster are set up for CPU
// partitioning.
// +kubebuilder:validation:Enum=None;AllNodes
type CPUPartitioningMode string

const (
  CPUPartitioningNone     CPUPartitioningMode = "None"
  CPUPartitioningAllNodes CPUPartitioningMode = "AllNodes"
)

type InstallConfig struct {
  // CPUPartitioning configures whether the cluster will be set up with CPU
  // partitioning.
  //
  // +kubebuilder:default=None
  // +optional
  CPUPartitioning CPUPartitioningMode `json:"cpuPartitioning,omitempty"`
}
```

### Admission Controller

We want to remove the checks in the admission controller which specifically
verify that partitioning is only applied to the single node topology
configuration. The design and configuration for any pod modification will remain
the same; we will simply allow partitioning to be applied on non-single-node
topologies.

We will use the global identifier to correctly modify the pod spec, converting
the `requests.cpu` value into the new
`requests[management.workload.openshift.io/cores]` used by the workload
partitioning feature.

### Performance Profile Controller

Currently workload partitioning involves configuring CRI-O and Kubelet earlier
in the process as a separate machine config manifest that requires the same
information present in the `PerformanceProfile` resource, namely the
`isolated` and `reserved` CPU sets.
Because configuring multiple resources with
the right CPU sets consistently is error prone, we want to extend the
PerformanceProfile API to include settings for workload partitioning.

We want to consolidate the current approach to setting up workload partitioning
and utilize the changes suggested via the `openshift/installer`. When
installation is done and workload partitioning is set, then from that point on
the `kubelet` and `crio` configs only need to be given the desired CPU set to
use. We currently express this to customers via the `reserved` CPU set as part
of the PerformanceProfile API.

We want to add a new `workloads` field to the `cpu` field that contains a list
of enums for defining which workloads to pin. This should allow us to expand
this in the future if desired, but in this enhancement we will only support
`Infrastructure`, which covers all of the OpenShift management workloads.

```yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: openshift-node-workload-partitioning-custom
spec:
  cpu:
    isolated: 2-3
    reserved: 0,1
    # New addition
    workloads:
    - Infrastructure
```

### Workflow Description

The end user will be expected to set the installer config to
`installConfig.cpuPartitioning: AllNodes` to turn on the feature for the whole
cluster, as well as to provide a `PerformanceProfile` manifest that describes
their desired `isolated` and `reserved` CPU sets, with `Infrastructure` added to
the `workloads` enum list.

**High level sequence diagram:**

Install Time Sequence

```mermaid
sequenceDiagram
Alice->>Installer: Apply Install Config
loop Generate
    Installer->>Installer: Machine Config
end
Installer-->>APIServer: Apply Machine Configs
APIServer-->>Installer: Applied!
APIServer-->>MCO: Machine Manifests
MCO-->>Node: Configure and restart Node
loop Apply
    Node->>Node: Set kubelet config
    Node->>Node: Set crio config
    Node->>Node: Kubelet advertises cores
end
Node-->>MCO: Finished Restart
```

- **Alice** is a human user who creates an OpenShift cluster.
- **Installer** is the assisted installer that applies the user manifests.
- **MCO** is the Machine Config Operator.
- **Node** is the Kubernetes node.

1. Alice sits down and provides the desired installer config with CPU
   partitioning turned on, `cpuPartitioning: AllNodes`.
2. The installer generates the manifests for a wide open CPU set and applies
   them.
3. Once the MCO applies the configs, the node is restarted and the cluster
   installation continues to completion.
4. Alice will now have a cluster that has been set up with workload pinning, but
   the workloads are not limited to any CPU set until Alice applies the
   performance profile.

Applying CPU Partitioning Size Change

```mermaid
sequenceDiagram
Alice->>Installer: Provide PerformanceProfile manifest
Installer-->>NTO: Apply
NTO-->>MCO: Generated Machine Manifests
MCO-->>Node: Configure node
loop Apply
    Node->>Node: Set kubelet config
    Node->>Node: Set crio config
    Node->>Node: Kubelet advertises cores
end
Node-->>MCO: Finished Restart
MCO-->>NTO: Machine Manifests Applied
NTO-->>Installer: PerformanceProfile Applied
Installer-->>Alice: Cluster is Up!
```

- **Alice** is a human user who creates an OpenShift cluster.
- **Installer** is the assisted installer that applies the user manifests.
- **NTO** is the Node Tuning Operator.
- **MCO** is the Machine Config Operator.
- **Node** is the Kubernetes node.

1. Alice sits down and provides the desired performance profile as an extra
   manifest to the installer.
2. The installer applies the manifest.
3. The NTO will generate the appropriate machine configs that include the
   kubelet config and the crio config to be applied, in addition to its
   existing operation.
4. Once the MCO applies the configs, the node is restarted and the cluster
   installation continues to completion.
5. Alice will now have a cluster that has been set up with workload pinning.

#### Variation [optional]

##### E2E Workflow deployment

This section outlines an end-to-end workflow for deploying a cluster with
workload partitioning enabled and how pods are correctly scheduled to run on the
management CPU pool.

> Note: Items in bold are modifications/additions from the previous enhancement,
> [Management Workload Partitioning](management-workload-partitioning.md)

1. User sits down at their computer.
2. **The user creates a `PerformanceProfile` resource with the desired `isolated`
   and `reserved` CPUSet with the `cpu.workloads[Infrastructure]` added to the
   enum list.**
3. **The user will set the installer configuration for CPU partitioning to
   AllNodes, `cpuPartitioning: AllNodes`.**
4. The user runs the installer to create the standard manifests, adds their
   extra manifests from step 2, then creates the cluster.
5. **NTO will generate the machine config manifests and apply them.**
6. The kubelet starts up and finds the configuration file enabling the new
   feature.
7. The kubelet advertises `management.workload.openshift.io/cores` extended
   resources on the node based on the number of CPUs in the host.
8. The kubelet reads static pod definitions. It replaces the `cpu` requests with
   `management.workload.openshift.io/cores` requests of the same value and adds
   the `resources.workload.openshift.io/{container-name}: {"cpushares": 400}`
   annotations for CRI-O with the same values.
9. Something schedules a regular pod with the
   `target.workload.openshift.io/management` annotation in a namespace with the
   `workload.openshift.io/allowed: management` annotation.
10. The admission hook modifies the pod, replacing the CPU requests with
    `management.workload.openshift.io/cores` requests and adding the
    `resources.workload.openshift.io/{container-name}: {"cpushares": 400}`
    annotations for CRI-O.
11. The scheduler sees the new pod and finds available
    `management.workload.openshift.io/cores` resources on the node. The
    scheduler places the pod on the node.
12. Repeat steps 9-11 until all pods are running.
13. Cluster deployment comes up with management components constrained to a
    subset of available CPUs.

##### Partition Resize workflow

This section outlines an end-to-end workflow for resizing the CPUSet partition.
> [Management Workload Partitioning](management-workload-partitioning.md) 1. User sits down at their computer. -1. **The user updates the `PerformanceProfile` resource with the new desired - `isolated` and new `reserved` CPUSet with the `cpu.workloads.enablePinning` - set to true.** -1. **NTO will re-generate the machine config manifests and apply them.** -1. ... Steps same as [E2E Workflow deployment](#e2e-workflow-deployment) ... -1. Cluster deployment comes up with management components constrained to subset +2. **The user updates the `PerformanceProfile` resource with the new desired + `isolated` and new `reserved` CPUSet with the `cpu.workloads[Infrastructure]` + in the enum list.** +3. **NTO will re-generate the machine config manifests and apply them.** +4. ... Steps same as [E2E Workflow deployment](#e2e-workflow-deployment) ... +5. Cluster deployment comes up with management components constrained to subset of available CPUs. ### API Extensions - We want to extend the `PerformanceProfile` API to include the addition of a - new `workloads` configuration under the `cpu` field. -- The behavior of existing resources should not change with this addition. + new `workloads[Infrastructure]` configuration under the `cpu` field. +- The behavior of existing API should not change with this addition. - New resources that make use of this new field will have the current machine - config generated with the additional machine config manifests. - - Uses the `isolated` to add the CRI-O and Kubelet configuration files to the - currently generated machine config. - - If no `isolated` and `enablePinning` is set to true, the default behavior is - to use the full CPUSet as if workloads were not pinned. + config generated with the additional configurations added to the manifest. + - Uses the `reserved` field to add the correct CPU set to the CRI-O and + Kubelet configuration files to the currently generated machine config. + - If no `workloads[Infrastructure]` is provided then no workload partitioning + configurations are left wide open to all CPU sets for the Kubelet and CRI-O + configurations. Example change: @@ -271,9 +421,9 @@ spec: cpu: isolated: 2-3 reserved: 0,1 - # New addition + # New enum addition workloads: - enablePinning: true + - Infrastructure ``` ### Implementation Details/Notes/Constraints [optional] @@ -288,26 +438,39 @@ affords us the chance to consolidate the configuration for `kubelet` and `crio`. We will modify the code path that generates the [new machine config](https://github.com/openshift/cluster-node-tuning-operator/blob/a780dfe07962ad07e4d50c852047ef8cf7b287da/pkg/performanceprofile/controller/performanceprofile/components/machineconfig/machineconfig.go#L91-L127) -using the performance profile. With the new `workloads.enablePinning` flag we +using the performance profile. With the new `spec.workloads[Infrastructure]` enum we will add the configuration for `crio` and `kubelet` to the final machine config manifest. Then the existing code path will apply the change as normal. #### API Server Admission Hook +We will need to alter the code in the admission controller to remove the check +for Single Node Topology, and modify the check for running nodes to check the +global identifier which will be set at install time. + The existing admission hook has 4 checks when it comes to workload pinning. +Old Path: + 1. Check if `pod` is a static pod - Skips modification attempt if it is static. -1. Checks if currently running cluster topology is Single Node +2. 
2. Checks if the currently running cluster topology is Single Node - **Will Change**
   - Skips modification if it is anything other than Single Node
3. Checks if all running nodes are managed - **Will Change**
   - Skips modification if any of the nodes are not managed
4. Checks what resource limits and requests are set on the pod
   - Skips modification if QoS is guaranteed or both limits and requests are set
   - Skips modification if the update would change the QoS

Changed Path:

1. Check if `pod` is a static pod
   - Skips modification attempt if it is static.
2. Checks if the currently running cluster has the global identifier for
   partitioning set
   - Skips modification if the partitioning identifier is set to `None`
3. Checks what resource limits and requests are set on the pod
   - Skips modification if QoS is guaranteed or both limits and requests are set
   - Skips modification if the update would change the QoS

### Risks and Mitigations

The same risks and mitigations highlighted in [Management Workload
Partitioning](management-workload-partitioning.md) apply to this enhancement as
well.

We need to make it very clear to customers that this feature is supported as a
day 0 configuration and that day n+1 alterations are not supported with this
enhancement. Part of that messaging should involve a clear indication that this
currently will be a cluster wide feature.

A risk we can run into is that a customer applies a CPU set that is too small or
out of bounds, which can cause problems such as extremely poor performance or
start-up errors. Mitigation of this scenario will be to provide proper guidance
and guidelines for customers who enable this enhancement. As mentioned in our
goals, we do support re-configuring the CPUSet partition size after
installation. The performance team would need to be consulted for more specific
information around the upper and lower bounds of CPU sets for running an
OpenShift cluster.

It is possible to build a cluster with the feature enabled and then add a node
in a way that leaves workload partitioning unconfigured only on that node. We do
not support this configuration, as all nodes must have the feature turned on.
The risk that a customer will run into here is that if that node is not in a
pool configured with workload partitioning, then it might not be able to
function correctly at all. Things such as networking pods might not work, as
those pods will carry the custom `management.workload.openshift.io/cores`
`requests`. In this situation the mitigation is for the customer to add the node
to a pool that contains the configuration for workload partitioning.

A possible risk is cluster upgrades; this is the first time this enhancement
will apply to multi-node clusters, and we need to run more tests on upgrade
cycles to make sure things run as expected.

### Upgrade / Downgrade Strategy

This new behavior will be added in 4.12 as part of the installation
configurations for customers to utilize.

Enabling the feature after installation for HA/3NC is not supported in 4.12, so
we do not need to address what happens if an older cluster upgrades and then the
feature is turned on.

When upgrades occur for current single node deployments, we will need to set the
global identifier during the upgrade. We will do this via the NTO, and the
trigger for this event will be:

- If we are in a `SingleNodeTopology`
- If the `capacity` field is set on the node
- If the identifier is not currently set

We will not change the current machine configs for single node deployments if
they are already set; this will be done to avoid extra restarts.
We will need to be clear with customers, however, that if they add
`spec.workloads[Infrastructure]`, we will then take that opportunity to
consolidate the machine configs and clean up the old way of deploying things.

From 1f34501c185e0a3b0c8e08877e2118a88ac41730 Mon Sep 17 00:00:00 2001
From: ehila 
Date: Thu, 1 Sep 2022 17:56:57 -0400
Subject: [PATCH 06/18] updated to use manifest instead of installer config

updated small wording changes
gave more detail for upgrade path

Signed-off-by: ehila 
---
 ...wide-availability-workload-partitioning.md | 143 +++++++++++-------
 1 file changed, 85 insertions(+), 58 deletions(-)

diff --git a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
index 3b39f73962..5ebc90e95a 100644
--- a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
+++ b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
@@ -58,16 +58,13 @@ determinism required of my applications.
- We want to run different workload partitioning on masters and workers.
- We want to consolidate the current approach and extend the PerformanceProfile
  API to avoid any possible errors when configuring a cluster for workload
  partitioning.
- We want to make this enhancement a day 0 supported feature only. We do not
  support turning it off after the installation is done and the feature is on.
- We want to make sure that the default behavior of this feature is to allow
  full use of the CPU set.
- We want to maintain the ability to configure the partition size after
  installation; we do not support turning off the partition feature, but we do
  support changing the CPU partition size post day 0. The ability to turn on
  this feature will be part of the installation phase only.

In order to implement this enhancement we are proposing changing 3 components
defined below.

1. Openshift API - ([Infrastructure
   Status](https://github.com/openshift/api/blob/81fadf1ca0981f800029bc5e2fe2dc7f47ce698b/config/v1/types_infrastructure.go#L51))

   - The change in this component is to store a global identifier if we have
     partitioning turned on.
[Openshift Installer](https://github.com/openshift/installer)) - - - The change in the installer is to support turning on this feature only - during installation as well as apply the partitioning machine configs so - that we avoid the race conditions when running in multi-node environments. - -3. Admission Controller ([management cpus +2. Admission Controller ([management cpus override](https://github.com/openshift/kubernetes/blob/a9d6306a701d8fa89a548aa7e132603f6cd89275/openshift-kube-apiserver/admission/autoscaling/managementcpusoverride/doc.go)) in openshift/kubernetes. - This change will be in support of checking the global identifier in order to modify the pod spec with the correct `requests`. -4. The +3. The [Performance Profile Controller](https://github.com/openshift/cluster-node-tuning-operator/blob/master/docs/performanceprofile/performance_profile.md) part of [Cluster Node Tuning Operator](https://github.com/openshift/cluster-node-tuning-operator) - This change will support adding the ability to explicitly pin `Infrastructure` workloads. + - This change will support updating a global identifier when workload + partitioning is detected on the nodes. ### Openshift API - Infrastructure Status @@ -233,12 +226,10 @@ information present in the `PerformanceProfile` resource, that being the the right CPU sets consistently is error prone, we want to extend the PerformanceProfile API to include settings for workload partitioning. -We want to consolidate the current approach to setting up workload partitioning -and utilize the changes suggested via the `openshift/installer`. When -installation is done and workload partitioning is set then from that point on -the `kubelet` and `crio` only need to be configured with the desired CPU set to -use. We currently express this to customers via the `reserved` CPU set as part -of the performance profile api. +When installation is done and workload partitioning is set then from that point +on the `kubelet` and `crio` only need to be configured with the desired CPU set +to use. We currently express this to customers via the `reserved` CPU set as +part of the performance profile api. We want to add a new `workloads` field to the `cpu` field that contains a list of enums for defining which workloads to pin. This should allow us to expand @@ -259,13 +250,21 @@ spec: - Infrastructure ``` +To support upgrades and maintain better signaling for the cluster, the +Performance Profile Controller will also inspect the Nodes to update a global +identifier at start up. We will only update the identifier to true if our +criteria is met, otherwise we will not change the state to "off" as disabling this +feature is not supported. We will determine to be in workload pinning mode if +all the master nodes are configured with a capacity +(`management.workload.openshift.io/cores`) for workload partitioning. If that is +true we will set the global identifier otherwise we leave it as is. + ### Workflow Description -The end user will be expected to set the installer config to -`installConfig.cpuPartitioning: AllNodes` to turn on the feature for the whole -cluster. As well as provide a `PerformanceProfile` manifest that describes their -desired `isolated` and `reserved` CPUSet and the `Infrastructure` enum provided -to the list in the `workloads` enum list. +The end user will be expected to provide the default machine configs to turn on +the feature for the whole cluster. 
As well as provide a `PerformanceProfile`
+manifest that describes their desired `isolated` and `reserved` CPUSet and the
+`Infrastructure` enum added to the `workloads` enum list.

**High level sequence diagram:**

@@ -273,18 +272,32 @@ Install Time Sequence

```mermaid
sequenceDiagram
-Alice->>Installer: Apply Install Config
-loop Generate
-    Installer->>Installer: Machine Config
+Alice->>Installer: Generate Manifests
+Installer->>Alice: Create Dir with Cluster Manifests
+loop Apply
+    Alice->>Alice: Add Custom Machine Configs
+    Alice->>Alice: Add PerformanceProfile Manifest
+    Alice->>Alice: Alter Infrastructure Status
end
+Alice->>Installer: Create Cluster
Installer-->>APIServer: Apply Machine Configs
APIServer-->>Installer: Applied!
APIServer-->>MCO: Machine Manifests
-MCO-->>Node: Configure and restart Node
+MCO-->>Node: Configure and Restart Node
loop Apply
-    Node->>Node: Set kubelet config
-    Node->>Node: Set crio config
-    Node->>Node: Kubelet advertises cores
+    Node->>Node: Set Kubelet Config
+    Node->>Node: Set CRI-O Config
+    Node->>Node: Kubelet Advertises Cores
+end
+Node-->>MCO: Finished Restart
+loop Generate
+    NTO->>NTO: Machine Config from PerformanceProfile
+end
+NTO-->>MCO: Apply Machine Config
+MCO-->>Node: Configure and Restart Node
+loop Apply
+    Node->>Node: Set Kubelet Config
+    Node->>Node: Set CRI-O Config
end
Node-->>MCO: Finished Restart
```

- **Alice** is a human user who creates an Openshift cluster.
- **Installer** is assisted installer that applies the user manifest.
- **MCO** is the machine config operator.
- **Node** is the kubernetes node.

-1. Alice sits down and provides the desired installer config with cpu
-   partitioning turned on, `cpuPartitioning: AllNodes`
-2. The installer generates the manifest for a wide open CPU set and applies
-   them.
-3. Once the MCO applies the configs, the node is restarted and the cluster
-   installation continues to completion.
-4. Alice will now have a cluster that has been setup with workload pinning, but
-   the workloads are not limited to any CPU set until Alice applies the
-   performance profile.
+1. Alice sits down and uses the installer to generate the manifests
+   - `openshift-install create manifests`
+2. The installer generates the manifests to create the cluster.
+3. Alice adds the default machine configs and desired PerformanceProfile
+   manifest for workload partitioning to the `openshift` folder that was
+   generated by the installer.
+4. Alice updates the `Infrastructure` CR status
+   to denote that workload partitioning is turned on (see the sketch below).
+5. Alice then creates the cluster via the installer.
+6. The installer will apply the manifests and during the bootstrapping process
+   the MCO will apply the default configurations for workload partitioning, and
+   restart the nodes.
+7. After the cluster is up the NTO will then generate the machine configurations
+   using the information provided in the PerformanceProfile manifest.
+8. The MCO applies the updated workload partitioning configurations and restarts
+   the relevant nodes.
+9. Alice will now have a cluster that has been setup with workload partitioning
+   and the desired workloads pinned to the specified CPUSet in the PerformanceProfile. 
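+
+Step 4 above alters the `Infrastructure` CR status in the generated manifests;
+a minimal sketch of that manifest, assuming only the status field from the
+Openshift API change described earlier, might look like:
+
+```yaml
+apiVersion: config.openshift.io/v1
+kind: Infrastructure
+metadata:
+  name: cluster
+status:
+  cpuPartitioning: AllNodes
+```
+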
Applying CPU Partitioning Size Change ```mermaid sequenceDiagram -Alice->>Installer: Provide PerformanceProfile manifest -Installer-->>NTO: Apply -NTO-->>MCO: Generated Machine Manifests +Alice->>API: Apply PerformanceProfile Manifest +API-->>NTO: Apply +loop Generate + NTO->>NTO: Machine Configs +end +NTO-->>MCO: Apply Machine Configs MCO-->>Node: Configure node loop Apply Node->>Node: Set kubelet config @@ -319,8 +344,7 @@ loop Apply end Node-->>MCO: Finished Restart MCO-->>NTO: Machine Manifests Applied -NTO-->>Installer: PerformanceProfile Applied -Installer-->>Alice: Cluster is Up! +NTO-->>API: PerformanceProfile Applied ``` - **Alice** is a human user who creates an Openshift cluster. @@ -329,15 +353,13 @@ Installer-->>Alice: Cluster is Up! - **MCO** is the machine config operator. - **Node** is the kubernetes node. -1. Alice sits down and provides the desired performance profile as an extra - manifest to the installer. -2. The installer applies the manifest. -3. The NTO will generate the appropriate machine configs that include the - kubelet config and the crio config to be applied as well as the existing - operation. -4. Once the MCO applies the configs, the node is restarted and the cluster - installation continues to completion. -5. Alice will now have a cluster that has been setup with workload pinning. +1. Alice sits down and applies the desired PerformanceProfile with the selected + workloads to pin. +2. The NTO will generate the appropriate machine configs that include the + Kubelet config and the CRIO config and apply the machine configs. +3. Once the MCO applies the configs, the node is restarted and the cluster + has been updated with the desired workload pinning. +4. Alice will now have a cluster that has been setup with workload pinning. #### Variation [optional] @@ -494,6 +516,13 @@ around upper and lower bounds of CPU sets for running an Openshift cluster It is possible to build a cluster with the feature enabled and then add a node in a way that does not configure the workload partitions only for that node. We do not support this configuration as all nodes must have the feature turned on. +The risk that a customer will run into here is that if that node is not in a pool +configured with workload partitioning, then it might not be able to correctly +function at all. Things such as networking pods might not work +as those pods will have the custom `requests` +`management.workload.openshift.io/cores`. In this situation the mitigation is +for the customer to add the node to a pool that contains the configuration for +workload partitioning. A possible risk are cluster upgrades, this is the first time this enhancement will be for multi-node clusters, we need to run more tests on upgrade cycles to @@ -566,9 +595,7 @@ When upgrades occur for current single node deployments we will need to set the identifier during the upgrade. We will do this via the NTO and the trigger for this event will be: -- If we are in a `SingleNodeTopology` -- If the `capacity` field set on the node -- If the identifier is not currently set +- If the `capacity` field set on the master node We will not change the current machine configs for single node deployments if they are already set, this will be done to avoid extra restarts. 
We will need to From 7d4ea2950d3ec741f9459eca82a2f870a4a608bf Mon Sep 17 00:00:00 2001 From: ehila Date: Tue, 6 Sep 2022 12:43:47 -0400 Subject: [PATCH 07/18] updated: added more context and wording clean up added context to capture proposal for machine config added more info for upgrades cleaned up older references Signed-off-by: ehila --- ...wide-availability-workload-partitioning.md | 187 +++++++++--------- 1 file changed, 89 insertions(+), 98 deletions(-) diff --git a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md index 5ebc90e95a..9def860911 100644 --- a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md +++ b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md @@ -93,9 +93,9 @@ but slightly different non-goals. stable and continue to operate reliably, but may respond more slowly. - **This enhancement does not address mixing nodes with pinning and without, this feature will be enabled cluster wide and encapsulate both master and - worker pools. If it's not desired then the setting will still be turned on but - the management workloads will run on the whole CPU set for that desired - pool.** + worker pools. If it's not desired then the default behavior will still be + turned on but the management workloads will run on the whole CPU set for that + desired pool.** - **This enhancement assumes that the configuration of a management CPU pool is done as part of installing the cluster. It can be changed after the fact but we will need to stipulate that, that is currently not supported. The intent @@ -105,16 +105,29 @@ but slightly different non-goals. We will need to maintain a global identifier that is set during installation and can not be easily removed after the fact. This approach will help remove -exposing this feature via an API and limiting the chances that a -misconfiguration can cause un-recoverable scenarios for our customers. At -install time we will also apply an initial machine config for workload -partitioning that sets a default CPUSet for the whole CPUSet. Effectively this -will behave as if workload partitioning is not turned on. With this approach we -eliminate the race condition that can occur if we apply a machine config after -the fact as nodes join the cluster. When a customer wishes to pin the management +exposing this feature via an API and limit the chances that a misconfiguration +can cause un-recoverable scenarios for our customers. At install time we will +also apply an initial machine config for workload partitioning that sets a +default CPUSet for the whole CPUSet. Effectively this will behave as if workload +partitioning is not turned on. When a customer wishes to pin the management workloads they will be able to do that via the existing Performance Profile. Resizing partition size will not cause any issues after installation. +With this approach we eliminate the race condition that can occur if we apply +the machine config after bootstrap via NTO. Since we create a "default" cri-o +and kubelet configuration that does not specify the CPUSet customers do not have +to worry about configuring correct bounds for their machines and risk +misconfiguration. + +Furthermore, as machines join the cluster they will have the feature turned on +before kubelet and the api-server come up as they query the +`machine-config-server` for their configurations before joining. 
This also +allows us more flexibility and an easier interface for the customer since +customers only need to interact with the Performance Profile to set their +`reserved` and `isolated` CPUSet. This makes things less prone to error as not +only can the CPUSets be different for workers and masters but the machines +themselves might have vastly different core counts. + In order to implement this enhancement we are proposing changing 3 components defined below. @@ -129,8 +142,8 @@ defined below. in openshift/kubernetes. - This change will be in support of checking the global identifier in order to modify the pod spec with the correct `requests`. -3. The - [Performance Profile Controller](https://github.com/openshift/cluster-node-tuning-operator/blob/master/docs/performanceprofile/performance_profile.md) +3. The [Performance Profile + Controller](https://github.com/openshift/cluster-node-tuning-operator/blob/master/docs/performanceprofile/performance_profile.md) part of [Cluster Node Tuning Operator](https://github.com/openshift/cluster-node-tuning-operator) - This change will support adding the ability to explicitly pin @@ -161,7 +174,7 @@ type InfrastructureStatus struct { // set via the Node Tuning Operator and the Performance Profile API // +kubebuilder:default=None // +kubebuilder:validation:Enum=None;AllNodes - CPUPartitioning PartitioningMode `json:"cpuPartitioning"` + CPUPartitioning CPUPartitioningMode `json:"cpuPartitioning"` } // PartitioningMode defines the CPU partitioning mode of the nodes. @@ -175,48 +188,22 @@ const ( ) ``` -### Openshift Installer - -We will need to modify the Openshift installer to set and generate the machine -configs for the initial setup. The generated machine config manifests will be -set to be wide open to the whole CPU set. However, because these manifests are -applied early on in the process we avoid race condition situations that might -arise if these are applied after installation. - -In the similar approach to the `openshift/api` change we propose adding a new -feature to the install configuration that will flag a cluster for CPU -partitioning during installation. - -```go -// CPUPartitioningMode is a strategy for how various endpoints for the cluster are exposed. -// +kubebuilder:validation:Enum=None;AllNodes;MasterNodes;WorkerNodes -type CPUPartitioningMode string - -const ( - CPUPartitioningNone CPUPartitioningMode = "None" - CPUPartitioningAllNodes CPUPartitioningMode = "AllNodes" -) - -type InstallConfig struct { - // CPUPartitioning configures if a cluster will be used with CPU partitioning - // - // +kubebuilder:default=None - // +optional - CPUPartitioning CPUPartitioningMode `json:"cpuPartitioning,omitempty"` -} -``` - ### Admission Controller We want to remove the checks in the admission controller which specifically -verifies that partitioning is only applied to single node topology configuration. -The design and configuration for any pod modification will remain the same, we -simply will allow you to apply partitioning on non single node topologies. +verifies that partitioning is only applied to single node topology +configuration. The design and configuration for any pod modification will remain +the same, we simply will allow you to apply partitioning on non single node +topologies. We will use the global identifier to correctly modify the pod spec with the `requests.cpu` for the new `requests[management.workload.openshift.io/cores]` that are used by the workload partitioning feature. 
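+
+For illustration, a rough sketch of this mutation on a management pod (the
+container name and values here are made up; the annotation and resource names
+follow the existing management workload partitioning behavior):
+
+```yaml
+# Pod spec fragment before admission:
+metadata:
+  annotations:
+    target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
+spec:
+  containers:
+    - name: example-operator
+      resources:
+        requests:
+          cpu: 400m
+---
+# After admission, the CPU request is swapped for the management resource:
+metadata:
+  annotations:
+    target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
+    resources.workload.openshift.io/example-operator: '{"cpushares": 400}'
+spec:
+  containers:
+    - name: example-operator
+      resources:
+        requests:
+          management.workload.openshift.io/cores: "400"
+```
+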
+However, for Single-Node we will continue to check the conventional way to be +able to support the upgrade flow from 4.11 -> 4.12. After 4.12 release that +logic should no longer be needed and will be removed. + ### Performance Profile Controller Currently workload partitioning involves configuring CRI-O and Kubelet earlier @@ -252,12 +239,11 @@ spec: To support upgrades and maintain better signaling for the cluster, the Performance Profile Controller will also inspect the Nodes to update a global -identifier at start up. We will only update the identifier to true if our -criteria is met, otherwise we will not change the state to "off" as disabling this -feature is not supported. We will determine to be in workload pinning mode if -all the master nodes are configured with a capacity -(`management.workload.openshift.io/cores`) for workload partitioning. If that is -true we will set the global identifier otherwise we leave it as is. +identifier at start up. We will only update the identifier to `AllNodes` if we +are running in Single Node and our Node has the capacity resource +(`management.workload.openshift.io/cores`) for our 4.11 -> 4.12 upgrades. This +should not be needed after 4.12 for all clusters. This should have no baring on +4.11 HA/3NC clusters as this feature will not be back ported. ### Workflow Description @@ -313,8 +299,8 @@ Node-->>MCO: Finished Restart 3. Alice adds the default machine configs and desired PerformanceProfile manifest for workload partitioning to the `openshift` folder that was generated by the installer. -4. Alice updates the `Infrastructure` CR status - to denote that workload partitioning is turned on. +4. Alice updates the `Infrastructure` CR status to denote that workload + partitioning is turned on. 5. Alice then creates the cluster via the installer. 6. The installer will apply the manifests and during the bootstrapping process the MCO will apply the default configurations for workload partitioning, and @@ -324,7 +310,8 @@ Node-->>MCO: Finished Restart 8. The MCO applies the updated workload partitioning configurations and restarts the relevant nodes. 9. Alice will now have a cluster that has been setup with workload partitioning - and the desired workloads pinned to the specified CPUSet in the PerformanceProfile. + and the desired workloads pinned to the specified CPUSet in the + PerformanceProfile. Applying CPU Partitioning Size Change @@ -357,8 +344,8 @@ NTO-->>API: PerformanceProfile Applied workloads to pin. 2. The NTO will generate the appropriate machine configs that include the Kubelet config and the CRIO config and apply the machine configs. -3. Once the MCO applies the configs, the node is restarted and the cluster - has been updated with the desired workload pinning. +3. Once the MCO applies the configs, the node is restarted and the cluster has + been updated with the desired workload pinning. 4. Alice will now have a cluster that has been setup with workload pinning. #### Variation [optional] @@ -373,34 +360,36 @@ management CPU pool. > [Management Workload Partitioning](management-workload-partitioning.md) 1. User sits down at their computer. -2. **The user creates a `PerformanceProfile` resource with the desired `isolated` - and `reserved` CPUSet with the `cpu.workloads[Infrastructure]` added to the - enum list.** -3. **The user will set the installer configuration for CPU partitioning to - AllNodes, `cpuPartitioning: AllNodes`.** +2. 
**The user creates a `PerformanceProfile` resource with the desired + `isolated` and `reserved` CPUSet with the `cpu.workloads[Infrastructure]` + added to the enum list.** +3. **Alice updates the `Infrastructure` CR status to denote that workload + partitioning is turned on.** 4. The user runs the installer to create the standard manifests, adds their extra manifests from steps 2, then creates the cluster. -5. **NTO will generate the machine config manifests and apply them.** -6. The kubelet starts up and finds the configuration file enabling the new +5. The kubelet starts up and finds the configuration file enabling the new feature. -7. The kubelet advertises `management.workload.openshift.io/cores` extended +6. The kubelet advertises `management.workload.openshift.io/cores` extended resources on the node based on the number of CPUs in the host. -8. The kubelet reads static pod definitions. It replaces the `cpu` requests with +7. The kubelet reads static pod definitions. It replaces the `cpu` requests with `management.workload.openshift.io/cores` requests of the same value and adds the `resources.workload.openshift.io/{container-name}: {"cpushares": 400}` annotations for CRI-O with the same values. -9. Something schedules a regular pod with the - `target.workload.openshift.io/management` annotation in a namespace with the - `workload.openshift.io/allowed: management` annotation. -10. The admission hook modifies the pod, replacing the CPU requests with +8. **NTO will generate the machine config manifests and apply them.** +9. **MCO modifies kubelet and cri-o configurations of the relevant machine pools + to the updated `reserved` CPU cores and restarts the nodes** +10. Something schedules a regular pod with the + `target.workload.openshift.io/management` annotation in a namespace with the + `workload.openshift.io/allowed: management` annotation. +11. The admission hook modifies the pod, replacing the CPU requests with `management.workload.openshift.io/cores` requests and adding the `resources.workload.openshift.io/{container-name}: {"cpushares": 400}` annotations for CRI-O. -11. The scheduler sees the new pod and finds available - `management.workload.openshift.io/cores` resources on the node. The scheduler - places the pod on the node. -12. Repeat steps 8-10 until all pods are running. -13. Cluster deployment comes up with management components constrained to subset +12. The scheduler sees the new pod and finds available + `management.workload.openshift.io/cores` resources on the node. The + scheduler places the pod on the node. +13. Repeat steps 10-12 until all pods are running. +14. Cluster deployment comes up with management components constrained to subset of available CPUs. ##### Partition Resize workflow @@ -453,16 +442,16 @@ spec: #### Changes to NTO The NTO PerformanceProfile will be updated to support a new flag which will -toggle the workload pinning to the `isolated` cores. The idea here being to +toggle the workload pinning to the `reserved` cores. The idea here being to simplify the approach for how customers set this configuration. With PAO being part of NTO now ([see here for more info](../node-tuning/pao-in-nto.md)) this affords us the chance to consolidate the configuration for `kubelet` and `crio`. 
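+
+As a rough sketch (the file paths and payload shapes here are assumptions
+carried over from the single node implementation, shown Butane-style for
+readability), the consolidated machine config would carry the `reserved` set
+into both configuration files:
+
+```yaml
+# Hypothetical rendering for reserved: 0,1
+storage:
+  files:
+    - path: /etc/crio/crio.conf.d/01-workload-partitioning
+      contents:
+        inline: |
+          [crio.runtime.workloads.management]
+          activation_annotation = "target.workload.openshift.io/management"
+          annotation_prefix = "resources.workload.openshift.io"
+          resources = { "cpushares" = 0, "cpuset" = "0,1" }
+    - path: /etc/kubernetes/openshift-workload-pinning
+      contents:
+        inline: |
+          {
+            "management": {
+              "cpuset": "0,1"
+            }
+          }
+```
+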
We will modify the code path that generates the [new machine config](https://github.com/openshift/cluster-node-tuning-operator/blob/a780dfe07962ad07e4d50c852047ef8cf7b287da/pkg/performanceprofile/controller/performanceprofile/components/machineconfig/machineconfig.go#L91-L127) -using the performance profile. With the new `spec.workloads[Infrastructure]` enum we -will add the configuration for `crio` and `kubelet` to the final machine config -manifest. Then the existing code path will apply the change as normal. +using the performance profile. With the new `spec.workloads[Infrastructure]` +enum we will add the configuration for `crio` and `kubelet` to the final machine +config manifest. Then the existing code path will apply the change as normal. #### API Server Admission Hook @@ -488,8 +477,11 @@ Changed Path: 1. Check if `pod` is a static pod - Skips modification attempt if it is static. -2. Checks if currently running cluster has global identifier for partitioning set - - Skips modification if identifier partitioning set to `None` +2. Checks if currently running cluster has global identifier for partitioning + set + - Skips modification if identifier partitioning set to `None` unless Single + Node, will check with old logic to maintain upgrade for Single-Node 4.11 -> + 4.12. 3. Checks what resource limits and requests are set on the pod - Skips modification if QoS is guaranteed or both limits and requests are set - Skips modification if after update the QoS is changed @@ -509,17 +501,15 @@ A risk we can run into is that a customer can apply a CPU set that is too small or out of bounds can cause problems such as extremely poor performance or start up errors. Mitigation of this scenario will be to provide proper guidance and guidelines for customers who enable this enhancement. As mentioned in our goal -we do support re-configuring the CPUSet partition size after installation. The -performance team would need to be reached out to for more specific information -around upper and lower bounds of CPU sets for running an Openshift cluster +we do support re-configuring the CPUSet partition size after installation. It is possible to build a cluster with the feature enabled and then add a node in a way that does not configure the workload partitions only for that node. We do not support this configuration as all nodes must have the feature turned on. -The risk that a customer will run into here is that if that node is not in a pool -configured with workload partitioning, then it might not be able to correctly -function at all. Things such as networking pods might not work -as those pods will have the custom `requests` +The risk that a customer will run into here is that if that node is not in a +pool configured with workload partitioning, then it might not be able to +correctly function at all. Things such as networking pods might not work as +those pods will have the custom `requests` `management.workload.openshift.io/cores`. In this situation the mitigation is for the customer to add the node to a pool that contains the configuration for workload partitioning. @@ -587,21 +577,22 @@ the `workload.openshift.io/allowed` annotation. This new behavior will be added in 4.12 as part of the installation configurations for customers to utilize. -Enabling the feature after installation for HA/3NC is not supported in 4.12, so we do not -need to address what happens if an older cluster upgrades and then the feature -is turned on. 
+Enabling the feature after installation for HA/3NC is not supported in 4.12, so +we do not need to address what happens if an older cluster upgrades and then the +feature is turned on. -When upgrades occur for current single node deployments we will need to set the global -identifier during the upgrade. We will do this via the NTO and the trigger for -this event will be: +When upgrades occur for current single node deployments we will need to set the +global identifier during the upgrade. We will do this via the NTO and the +trigger for this event will be: -- If the `capacity` field set on the master node +- If the `capacity` field set on the master node and is running in Single Node We will not change the current machine configs for single node deployments if they are already set, this will be done to avoid extra restarts. We will need to be clear with customers however, if they add the -`spec.workloads[Infrastructure]` we will then take that opportunity to -consolidate the machine configs and clean up the old way of deploying things. +`spec.workloads[Infrastructure]` we will then generate the new machine config +and an extra restart will happen. They will need to delete the old machine +configs afterwards. ### Version Skew Strategy From 989bd6519fe8db2131a006ee331046f79f26d5f1 Mon Sep 17 00:00:00 2001 From: ehila Date: Tue, 13 Sep 2022 12:12:19 -0400 Subject: [PATCH 08/18] feat: added api approvers Signed-off-by: ehila --- .../wide-availability-workload-partitioning.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md index 9def860911..36dc4b72a5 100644 --- a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md +++ b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md @@ -12,7 +12,8 @@ approvers: - "@jerpeter1" - "@mrunalp" api-approvers: - - None + - "@deads2k" + - "@JoelSpeed" creation-date: 2022-08-03 last-updated: 2022-08-09 tracking-link: From 60cfbed408263122599b9e8f8671c8915525e30a Mon Sep 17 00:00:00 2001 From: ehila Date: Tue, 27 Sep 2022 15:50:39 -0400 Subject: [PATCH 09/18] docs: added information about node admission plugin Signed-off-by: ehila --- ...wide-availability-workload-partitioning.md | 30 +++++++++++++++++-- 1 file changed, 28 insertions(+), 2 deletions(-) diff --git a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md index 36dc4b72a5..c728c28408 100644 --- a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md +++ b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md @@ -129,7 +129,7 @@ customers only need to interact with the Performance Profile to set their only can the CPUSets be different for workers and masters but the machines themselves might have vastly different core counts. -In order to implement this enhancement we are proposing changing 3 components +In order to implement this enhancement we are proposing working on the 4 components defined below. 1. Openshift API - ([Infrastructure @@ -141,9 +141,16 @@ defined below. 2. Admission Controller ([management cpus override](https://github.com/openshift/kubernetes/blob/a9d6306a701d8fa89a548aa7e132603f6cd89275/openshift-kube-apiserver/admission/autoscaling/managementcpusoverride/doc.go)) in openshift/kubernetes. 
+
   - This change will be in support of checking the global identifier in order
     to modify the pod spec with the correct `requests`.
+
+3. Node Admission Plugin in openshift/kubernetes
+
+   - We will add an admission plugin for nodes to prevent nodes from joining a
+     cluster that are not correctly setup for CPU Partitioning.
+
-3. The [Performance Profile
+4. The [Performance Profile
   Controller](https://github.com/openshift/cluster-node-tuning-operator/blob/master/docs/performanceprofile/performance_profile.md)
   part of [Cluster Node Tuning
   Operator](https://github.com/openshift/cluster-node-tuning-operator)
@@ -205,6 +212,25 @@ However, for Single-Node we will continue to check the conventional way to be
able to support the upgrade flow from 4.11 -> 4.12. After 4.12 release that
logic should no longer be needed and will be removed.

+### Node Admission Plugin
+
+We want to ensure that nodes that are not setup for CPU Partitioning do not get
+added to a cluster that is designated for CPU Partitioning. We will create an
+admission plugin for Nodes that will validate that, if a cluster is setup for
+CPU Partitioning, the node MUST contain a Capacity resource for
+`workload.openshift.io/cores`. All node creation requests from Kubelet currently add
+that information on boot up when registering with the API Server; we will
+leverage that to ensure a CPU Partitioned cluster only contains nodes created
+for CPU Partitioning.
+
+We will also keep in mind upgrades from Single Node clusters which already
+contain this feature. During initial upgrade, Single Node clusters will not
+contain the CPUPartitioningMode; for this reason we will fall back to checking
+with the old logic for Single Node to ensure we do not cause issues when
+upgrading. This check for Single Node should be something that happens on initial
+upgrades, as NTO will update the `Infra.Status.CPUPartitioningMode` to the correct
+value after initial boot.
+

From 9bb1901f0ac8e9d604477f63fefe3a276e92e6c6 Mon Sep 17 00:00:00 2001
From: ehila 
Date: Fri, 30 Sep 2022 12:50:13 -0400
Subject: [PATCH 10/18] doc: updated to reflect current approach

Signed-off-by: ehila 
---
 ...wide-availability-workload-partitioning.md | 133 ++++++++++++------
 1 file changed, 92 insertions(+), 41 deletions(-)

diff --git a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
index c728c28408..7bf54a2c28 100644
--- a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
+++ b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
@@ -134,23 +134,23 @@ defined below.

 1. Openshift API - ([Infrastructure
    Status](https://github.com/openshift/api/blob/81fadf1ca0981f800029bc5e2fe2dc7f47ce698b/config/v1/types_infrastructure.go#L51))
-
   - The change in this component is to store a global identifier if we have
    partitioning turned on.
-
-2. Admission Controller ([management cpus
+2. Openshift Installer
+   - Add an `InstallConfig` flag to the installer to enable this feature from
+     install time only.
+3. Machine Config Operator
+   - Add the ability of MCO to generate the needed machine configs for the
+     worker pools from bootstrap and maintain it.
+4. 
Admission Controller ([management cpus override](https://github.com/openshift/kubernetes/blob/a9d6306a701d8fa89a548aa7e132603f6cd89275/openshift-kube-apiserver/admission/autoscaling/managementcpusoverride/doc.go)) in openshift/kubernetes. - - This change will be in support of checking the global identifier in order to modify the pod spec with the correct `requests`. - -3. Node Admission Plugin in openshift/kubernetes - +5. Node Admission Plugin in openshift/kubernetes - We will add an admission plugin for nodes to prevent nodes from joining a cluster that are not correctly setup for CPU Partitioning. - -4. The [Performance Profile +6. The [Performance Profile Controller](https://github.com/openshift/cluster-node-tuning-operator/blob/master/docs/performanceprofile/performance_profile.md) part of [Cluster Node Tuning Operator](https://github.com/openshift/cluster-node-tuning-operator) @@ -196,6 +196,62 @@ const ( ) ``` +### Openshift Installer + +We want to be able to allow a concise way to turn this feature on and easily +enable other consumers of the openshift-installer to utilize it via the +`InstallConfig`. This allows our other supported installer methods, such as +assisted installer or ztp, a straightforward way to expose this feature. + +In order to set the correct `Infrastructure Status` at install time, we'll +modify the `InstallConfig` to include enums to correctly set the status. + +```go +type InstallConfig struct { + ... + // CPUPartitioning determines if a cluster should be setup for CPU workload partitioning at install time. + // When this field is set the cluster will be flagged for CPU Partitioning allowing users to segregate workloads to + // specific CPU Sets. This does not make any decisions on workloads it only configures the nodes for it. + // + // This feature will alter the infrastructure nodes and prepare them for CPU Partitioning and as such can not be changed after being set. + // + // +kubebuilder:default="None" + // +optional + CPUPartitioning CPUPartitioningMode `json:"cpuPartitioningMode,omitempty"` + ... +} + + +// CPUPartitioningMode defines how the nodes should be setup for partitioning the CPU Sets. +// +kubebuilder:validation:Enum=None;AllNodes +type CPUPartitioningMode string + +const ( + // CPUPartitioningNone means that no CPU Partitioning is on in this cluster infrastructure + CPUPartitioningNone CPUPartitioningMode = "None" + // CPUPartitioningAllNodes means that all nodes are configured with CPU Partitioning in this cluster + CPUPartitioningAllNodes CPUPartitioningMode = "AllNodes" +) +``` + +### Machine Config Operator + +Once we have a global flag, and a way to set it at install time, we'll need to +apply the needed configurations at install time during bootstrap. We will add +this ability to MCO to generate and maintain the needed configurations before +`kubelet` and the `api-server` stands up. + +We will add to the `kubelet` controller the ability to watch the +`Infrastructure` resource and if CPU Partitioning is set to `AllNodes` we will +generate the bootstrap and the controller will maintain the MCs from that point +on. Things to note, this feature is explicitly designed to not be turned off, as +such once set we will not remove the MCs. + +We will need to support upgrades for Single Node since this feature already +exists for them but this implementation differs slightly. 
To avoid needless +restarts, we will not alter the current configuration if we detect that we are +in a single node topology and the nodes are already configured for CPU Partitioning. + ### Admission Controller We want to remove the checks in the admission controller which specifically @@ -285,17 +341,24 @@ Install Time Sequence ```mermaid sequenceDiagram -Alice->>Installer: Generate Manifests +Alice->>Installer: Generate Manifests Install Config With CPU Partitioning Installer->>Alice: Create Dir with Cluster Manifests loop Apply - Alice->>Alice: Add Custom Machine Configs Alice->>Alice: Add PerformanceProfile Manifest - Alice->>Alice: Alter Infrastructure Status end -Alice->>Installer: Create Cluster -Installer-->>APIServer: Apply Machine Configs -APIServer-->>Installer: Applied! -APIServer-->>MCO: Machine Manifests +Alice->>Installer: Install Config With CPU Partitioning +loop Generate + Installer->>Installer: Set Infrastructure Status +end +Installer-->>APIServer: Create Cluster +APIServer-->>MCO: Get Infrastructure Status +loop Generate + MCO->>MCO: Generate MC Configs +end +MCO-->>APIServer: Apply New MC +loop Generate + MCO->>MCO: Generate Rendered MC +end MCO-->>Node: Configure and Restart Node loop Apply Node->>Node: Set Kubelet Config @@ -320,23 +383,22 @@ Node-->>MCO: Finished Restart - **MCO** is the machine config operator. - **Node** is the kubernetes node. -1. Alice sits down and uses the installer to generate the manifests +1. Alice sits down and uses the installer to generate the manifests with the + InstallConfig specifying `cpuPartitioningMode: AllNodes` - `openshift-install create manifests` -2. The installer generates the manifests to create the cluster -3. Alice adds the default machine configs and desired PerformanceProfile - manifest for workload partitioning to the `openshift` folder that was - generated by the installer. -4. Alice updates the `Infrastructure` CR status to denote that workload - partitioning is turned on. -5. Alice then creates the cluster via the installer. -6. The installer will apply the manifests and during the bootstrapping process - the MCO will apply the default configurations for workload partitioning, and - restarts the nodes. -7. After the cluster is up the NTO will then generate the machine configurations +2. The installer generates the manifests to create the cluster, with the new + `Infrastructure.Status.CPUPartitioning: AllNodes` +3. Alice adds the desired PerformanceProfile manifest for workload partitioning + to the `openshift` folder that was generated by the installer. +4. Alice then creates the cluster via the installer. +5. The installer will apply the manifests and during the bootstrapping process + the MCO will generate the default configurations for workload partitioning + based off of the `Infrastructure.Status`, and restarts the nodes. +6. After the cluster is up the NTO will then generate the machine configurations using the information provided in the PerformanceProfile manifest. -8. The MCO applies the updated workload partitioning configurations and restarts +7. The MCO applies the updated workload partitioning configurations and restarts the relevant nodes. -9. Alice will now have a cluster that has been setup with workload partitioning +8. Alice will now have a cluster that has been setup with workload partitioning and the desired workloads pinned to the specified CPUSet in the PerformanceProfile. @@ -530,17 +592,6 @@ up errors. 
Mitigation of this scenario will be to provide proper guidance and
guidelines for customers who enable this enhancement. As mentioned in our goal
we do support re-configuring the CPUSet partition size after installation.

-It is possible to build a cluster with the feature enabled and then add a node
-in a way that does not configure the workload partitions only for that node. We
-do not support this configuration as all nodes must have the feature turned on.
-The risk that a customer will run into here is that if that node is not in a
-pool configured with workload partitioning, then it might not be able to
-correctly function at all. Things such as networking pods might not work as
-those pods will have the custom `requests`
-`management.workload.openshift.io/cores`. In this situation the mitigation is
-for the customer to add the node to a pool that contains the configuration for
-workload partitioning.
-
A possible risk is cluster upgrades; this is the first time this enhancement
will be for multi-node clusters, so we need to run more tests on upgrade cycles
to make sure things run as expected.

From f9c0d20099fcbf35286f18202f81d5fea07faa45 Mon Sep 17 00:00:00 2001
From: ehila 
Date: Tue, 4 Oct 2022 15:01:41 -0400
Subject: [PATCH 11/18] doc: added information to capture required machine config pool node sizes

Signed-off-by: ehila 
---
 .../wide-availability-workload-partitioning.md | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
index 7bf54a2c28..9c239efc72 100644
--- a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
+++ b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
@@ -101,6 +101,10 @@ but slightly different non-goals.
  done as part of installing the cluster. It can be changed after the fact but
  we will need to stipulate that, that is currently not supported. The intent
  here is for this to be supported as a day 0 feature, only.**
+- **This enhancement does not address mixing nodes of different CPU sizes in
+  Machine Configuration Pools. Pools must contain machines of the same CPU set size;
+  currently the Performance Profile defines and targets specific machine pools
+  to apply the configuration to, as such they must have the same CPU set size.**

## Proposal

@@ -592,6 +596,15 @@ up errors.
Mitigation of this scenario will be to provide proper guidance and guidelines
for customers who enable this enhancement. As mentioned in our goal we do
support re-configuring the CPUSet partition size after installation.

+Another similar risk to CPU set sizes being too small or out of bounds is if
+customers mix nodes of differing CPU set sizes in the same machine config pool.
+This is not supported with this feature as it is required that machine pools
+contain machines of the same size to correctly apply the CPUSet affinities. If
+customers mix node sizes in the same pool, then the ones where the CPUSet is out
+of bounds will fail, while the others will not. They will need to evict those
+machines and add ones that fall within bounds of the `reserved` and `isolated`
+CPUSets defined in their PerformanceProfile.
+
A possible risk is cluster upgrades; this is the first time this enhancement
will be for multi-node clusters, so we need to run more tests on upgrade cycles
to make sure things run as expected. 
From e6f2fa7bd66e689bc1dcfa93f409a1aa0f12b083 Mon Sep 17 00:00:00 2001 From: ehila Date: Tue, 1 Nov 2022 13:52:12 -0400 Subject: [PATCH 12/18] upkeep: add reviewers added reviewers for api/mco/installer Signed-off-by: ehila --- .../wide-availability-workload-partitioning.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md index 9c239efc72..49629b53de 100644 --- a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md +++ b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md @@ -8,6 +8,11 @@ reviewers: - "@rphillips" - "@browsell" - "@haircommander" + - "@JoelSpeed" + - "@deads2k" + - "@patrickdillon" + - "@zaneb" + - "@sinnykumari" approvers: - "@jerpeter1" - "@mrunalp" From 3bdfc7d85a7b7bd73c836f255109c2d6bad237f2 Mon Sep 17 00:00:00 2001 From: ehila Date: Wed, 2 Nov 2022 12:49:19 -0400 Subject: [PATCH 13/18] alter: changed Infrastructure enum to Management Signed-off-by: ehila --- ...wide-availability-workload-partitioning.md | 24 +++++++++---------- 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md index 49629b53de..4592eba3ac 100644 --- a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md +++ b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md @@ -164,7 +164,7 @@ defined below. part of [Cluster Node Tuning Operator](https://github.com/openshift/cluster-node-tuning-operator) - This change will support adding the ability to explicitly pin - `Infrastructure` workloads. + `Management` workloads. - This change will support updating a global identifier when workload partitioning is detected on the nodes. @@ -313,7 +313,7 @@ part of the performance profile api. We want to add a new `workloads` field to the `cpu` field that contains a list of enums for defining which workloads to pin. This should allow us to expand this in the future if desired, but in this enhancement we will only support -`Infrastructure` which defines all of the Openshift workloads. +`Management` which defines all of the Openshift workloads. ```yaml apiVersion: performance.openshift.io/v2 @@ -342,7 +342,7 @@ should not be needed after 4.12 for all clusters. This should have no baring on The end user will be expected to provide the default machine configs to turn on the feature for the whole cluster. As well as provide a `PerformanceProfile` manifest that describes their desired `isolated` and `reserved` CPUSet and the -`Infrastructure` enum provided to the list in the `workloads` enum list. +`Management` enum provided to the list in the `workloads` enum list. **High level sequence diagram:** @@ -459,10 +459,10 @@ management CPU pool. 1. User sits down at their computer. 2. **The user creates a `PerformanceProfile` resource with the desired - `isolated` and `reserved` CPUSet with the `cpu.workloads[Infrastructure]` + `isolated` and `reserved` CPUSet with the `cpu.workloads[Management]` added to the enum list.** -3. **Alice updates the `Infrastructure` CR status to denote that workload - partitioning is turned on.** +3. **User updates the `InstallConfig` and sets `cpuPartitioningMode: AllNodes` + to denote that workload partitioning is turned on.** 4. 
The user runs the installer to create the standard manifests, adds their extra manifests from steps 2, then creates the cluster. 5. The kubelet starts up and finds the configuration file enabling the new @@ -499,7 +499,7 @@ This section outlines an end-to-end workflow for resizing the CPUSet partition. 1. User sits down at their computer. 2. **The user updates the `PerformanceProfile` resource with the new desired - `isolated` and new `reserved` CPUSet with the `cpu.workloads[Infrastructure]` + `isolated` and new `reserved` CPUSet with the `cpu.workloads[Management]` in the enum list.** 3. **NTO will re-generate the machine config manifests and apply them.** 4. ... Steps same as [E2E Workflow deployment](#e2e-workflow-deployment) ... @@ -509,13 +509,13 @@ This section outlines an end-to-end workflow for resizing the CPUSet partition. ### API Extensions - We want to extend the `PerformanceProfile` API to include the addition of a - new `workloads[Infrastructure]` configuration under the `cpu` field. + new `workloads[Management]` configuration under the `cpu` field. - The behavior of existing API should not change with this addition. - New resources that make use of this new field will have the current machine config generated with the additional configurations added to the manifest. - Uses the `reserved` field to add the correct CPU set to the CRI-O and Kubelet configuration files to the currently generated machine config. - - If no `workloads[Infrastructure]` is provided then no workload partitioning + - If no `workloads[Management]` is provided then no workload partitioning configurations are left wide open to all CPU sets for the Kubelet and CRI-O configurations. @@ -532,7 +532,7 @@ spec: reserved: 0,1 # New enum addition workloads: - - Infrastructure + - Management ``` ### Implementation Details/Notes/Constraints [optional] @@ -547,7 +547,7 @@ affords us the chance to consolidate the configuration for `kubelet` and `crio`. We will modify the code path that generates the [new machine config](https://github.com/openshift/cluster-node-tuning-operator/blob/a780dfe07962ad07e4d50c852047ef8cf7b287da/pkg/performanceprofile/controller/performanceprofile/components/machineconfig/machineconfig.go#L91-L127) -using the performance profile. With the new `spec.workloads[Infrastructure]` +using the performance profile. With the new `spec.workloads[Management]` enum we will add the configuration for `crio` and `kubelet` to the final machine config manifest. Then the existing code path will apply the change as normal. @@ -686,7 +686,7 @@ trigger for this event will be: We will not change the current machine configs for single node deployments if they are already set, this will be done to avoid extra restarts. We will need to be clear with customers however, if they add the -`spec.workloads[Infrastructure]` we will then generate the new machine config +`spec.workloads[Management]` we will then generate the new machine config and an extra restart will happen. They will need to delete the old machine configs afterwards. 
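+
+For reference, the upgrade trigger above amounts to checking the master Node
+resources for the advertised capacity; a sketch of what NTO would look for
+(the node name and value here are illustrative):
+
+```yaml
+apiVersion: v1
+kind: Node
+metadata:
+  name: master-0
+status:
+  capacity:
+    management.workload.openshift.io/cores: "8"
+```
+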
From fbf1a16839125ebd7fadbb9ad82651139c7d9658 Mon Sep 17 00:00:00 2001
From: ehila 
Date: Wed, 2 Nov 2022 12:51:56 -0400
Subject: [PATCH 14/18] upkeep: updated to reference 4.13 and added more context

cleared up some wording and added an explainer on alternatives for the
installer

Signed-off-by: ehila 
---
 ...wide-availability-workload-partitioning.md | 43 +++++++++++++------
 1 file changed, 30 insertions(+), 13 deletions(-)

diff --git a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
index 4592eba3ac..9d481b1045 100644
--- a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
+++ b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
@@ -270,11 +270,14 @@ the same, we simply will allow you to apply partitioning on non single node
topologies.

We will use the global identifier to correctly modify the pod spec with the
-`requests.cpu` for the new `requests[management.workload.openshift.io/cores]`
-that are used by the workload partitioning feature.
+`requests.cpu` for the new
+`requests[<workload-type>.workload.openshift.io/cores]` that are used by the
+workload partitioning feature, where `<workload-type>` is driven from the
+Deployment pod spec annotation
+`spec.template.metadata.annotations[target.workload.openshift.io/<workload-type>]`.

However, for Single-Node we will continue to check the conventional way to be
-able to support the upgrade flow from 4.11 -> 4.12. After 4.12 release that
+able to support the upgrade flow from 4.12 -> 4.13. After 4.13 release that
logic should no longer be needed and will be removed.

@@ -326,16 +329,16 @@ spec:
    reserved: 0,1
  # New addition
  workloads:
    - Management
```

To support upgrades and maintain better signaling for the cluster, the
Performance Profile Controller will also inspect the Nodes to update a global
identifier at start up. We will only update the identifier to `AllNodes` if we
are running in Single Node and our Node has the capacity resource
-(`management.workload.openshift.io/cores`) for our 4.11 -> 4.12 upgrades. This
-should not be needed after 4.12 for all clusters. This should have no baring on
-4.11 HA/3NC clusters as this feature will not be back ported.
+(`management.workload.openshift.io/cores`) for our 4.12 -> 4.13 upgrades. This
+should not be needed after 4.13 for all clusters. This should have no bearing on
+4.12 HA/3NC clusters as this feature will not be backported.

### Workflow Description

@@ -578,8 +581,8 @@ Changed Path:
2. Checks if currently running cluster has global identifier for partitioning
   set
   - Skips modification if identifier partitioning set to `None` unless Single
-     Node, will check with old logic to maintain upgrade for Single-Node 4.11 ->
-     4.12.
+     Node, will check with old logic to maintain upgrade for Single-Node 4.12 ->
+     4.13.
3. Checks what resource limits and requests are set on the pod
@@ -635,7 +638,9 @@ N/A

We will add a CI job with a cluster configuration that reflects the minimum of
2CPU/4vCPU masters and 1CPU/2vCPU worker configuration. This job should ensure
that cluster deployments configured with management workload partitioning pass
-the compliance tests.
+the compliance tests. We will run a periodic informing job that will test and
+verify that workload partitioning is working as expected. 
The intention is to +run this job at least once a day and as needed on NTO/PAO PRs. We will add a CI job to ensure that all release payload workloads have the `target.workload.openshift.io/management` annotation and their namespaces have @@ -670,10 +675,10 @@ the `workload.openshift.io/allowed` annotation. ### Upgrade / Downgrade Strategy -This new behavior will be added in 4.12 as part of the installation +This new behavior will be added in 4.13 as part of the installation configurations for customers to utilize. -Enabling the feature after installation for HA/3NC is not supported in 4.12, so +Enabling the feature after installation for HA/3NC is not supported in 4.13, so we do not need to address what happens if an older cluster upgrades and then the feature is turned on. @@ -729,7 +734,19 @@ WIP ## Alternatives -N/A +When we first discussed the global identifier we looked for ways to add it with +out involving any changes to the `Installer`, however, a few things made that +more difficult. Originally we had planned to generate the manifests via the +installer and allow the user to modify the `Infrastructure` resource to add the +`AllNodes` option for CPU Partitioning. It quickly became a cumbersome process to +automate and would require more effort on customers and be prone to error, especially +since this feature has to be turned on at install time. Furthermore, the primary +decision to involve the installer directly became evident when most of our other +tooling such as `ztp`, `assisted-installer`, and `agent-installer` did not have +a way to modify the `Infrastructure` resource after being generated. If we expose +this feature through a configuration flag on the `install-config.yaml` we +provide a straightforward path for these consumers to support workload +partitioning in their offering. ## Infrastructure Needed [optional] From 536751bc11af6c37b325e81ccb5e2ec5d10bacc8 Mon Sep 17 00:00:00 2001 From: ehila Date: Tue, 8 Nov 2022 15:59:19 -0500 Subject: [PATCH 15/18] feat: added nto reviewers and updated to use nto for bootstrap Signed-off-by: ehila --- ...wide-availability-workload-partitioning.md | 28 ++++++++++++------- 1 file changed, 18 insertions(+), 10 deletions(-) diff --git a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md index 9d481b1045..8707b73dad 100644 --- a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md +++ b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md @@ -13,6 +13,8 @@ reviewers: - "@patrickdillon" - "@zaneb" - "@sinnykumari" + - "@dagrayvid" + - "@jmencak" approvers: - "@jerpeter1" - "@mrunalp" @@ -148,9 +150,10 @@ defined below. 2. Openshift Installer - Add an `InstallConfig` flag to the installer to enable this feature from install time only. -3. Machine Config Operator - - Add the ability of MCO to generate the needed machine configs for the - worker pools from bootstrap and maintain it. + - Add the ability to call the NTO `render` command during bootstrap. +3. Node Tuning Operator + - Add the ability of NTO to generate the needed machine configs for the + worker pools from bootstrap. 4. Admission Controller ([management cpus override](https://github.com/openshift/kubernetes/blob/a9d6306a701d8fa89a548aa7e132603f6cd89275/openshift-kube-apiserver/admission/autoscaling/managementcpusoverride/doc.go)) in openshift/kubernetes. 
@@ -243,18 +246,23 @@ const ( ) ``` -### Machine Config Operator +We will also need to be able to support instantiating the cpu partitioning files at +bootstrap. Currently, NTO has support to render most of the configurations for a given +`PerformanceProfile`, what we are missing is that this is not being called in +the installer yet. We will add the NTO render command as part of the installer +bootstrap flow to generate the default configs. + +### Node Tuning Operator Once we have a global flag, and a way to set it at install time, we'll need to apply the needed configurations at install time during bootstrap. We will add -this ability to MCO to generate and maintain the needed configurations before +this ability to NTO to generate the needed configurations before `kubelet` and the `api-server` stands up. -We will add to the `kubelet` controller the ability to watch the -`Infrastructure` resource and if CPU Partitioning is set to `AllNodes` we will -generate the bootstrap and the controller will maintain the MCs from that point -on. Things to note, this feature is explicitly designed to not be turned off, as -such once set we will not remove the MCs. +We will add to the `render` command the ability to ingest the `Infrastructure` +resource and if CPU Partitioning is set to `AllNodes` we will generate the +bootstrap configurations. Things to note, this feature is explicitly designed to +not be turned off. We will need to support upgrades for Single Node since this feature already exists for them but this implementation differs slightly. To avoid needless From f9aedb86cc605ec4e4e68cb8e35e73300b4b4b15 Mon Sep 17 00:00:00 2001 From: ehila Date: Mon, 12 Dec 2022 09:08:47 -0500 Subject: [PATCH 16/18] feat: updated to include NTO rendering Updated ep to account for agreed upon change of NTO taking renering responsibilities during bootstrap Signed-off-by: ehila --- ...wide-availability-workload-partitioning.md | 268 ++++++++++-------- 1 file changed, 156 insertions(+), 112 deletions(-) diff --git a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md index 8707b73dad..1289fc9a77 100644 --- a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md +++ b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md @@ -166,8 +166,8 @@ defined below. Controller](https://github.com/openshift/cluster-node-tuning-operator/blob/master/docs/performanceprofile/performance_profile.md) part of [Cluster Node Tuning Operator](https://github.com/openshift/cluster-node-tuning-operator) - - This change will support adding the ability to explicitly pin - `Management` workloads. + - This change will support adding the ability to implicitly pin + `management` workloads based on `reserved` and `isolated` cores. - This change will support updating a global identifier when workload partitioning is detected on the nodes. @@ -262,7 +262,10 @@ this ability to NTO to generate the needed configurations before We will add to the `render` command the ability to ingest the `Infrastructure` resource and if CPU Partitioning is set to `AllNodes` we will generate the bootstrap configurations. Things to note, this feature is explicitly designed to -not be turned off. +not be turned off. 
+not be turned off. After the files have been generated, NTO/PAO will take
+ownership of the default generated files, so even if the customer deletes the
+PerformanceProfile resource the cluster will remain in compliance under the
+default configuration, which is CPU Partitioned but with the whole CPUSet used.

 We will need to support upgrades for Single Node since this feature already
 exists for them but this implementation differs slightly. To avoid needless

@@ -304,7 +307,7 @@ contain this feature. During initial upgrade, Single Node clusters will not
 contain the CPUPartitioningMode; for this reason we will fall back to checking with
 the old logic for Single Node to ensure we do not cause issues when upgrading.
 This check for Single Node should be something that happens on initial
-upgrades, as NTO will update the `Infra.Status.CPUPartitioningMode` to the correct
+upgrades, as NTO will update the `infra.status.cpuPartitioning` to the correct
 value after initial boot.

 ### Performance Profile Controller

@@ -314,18 +317,16 @@ in the processes as a separate machine config manifest that requires the same
 information present in the `PerformanceProfile` resource, that being the
 `isolated` and `reserved` CPU sets. Because configuring multiple resources with
 the right CPU sets consistently is error prone, we want to extend the
-PerformanceProfile API to include settings for workload partitioning.
+Performance Profile Operator to manage settings for workload partitioning. This
+does not include an API change; rather, those files will now be implicitly
+set by the Performance Profile Operator when the cluster is set up for workload
+partitioning via the global flag in `infra.status.cpuPartitioning`.

 When installation is done and workload partitioning is set, then from that point
 on the `kubelet` and `crio` only need to be configured with the desired CPU set
 to use. We currently express this to customers via the `reserved` CPU set as
 part of the performance profile API.

-We want to add a new `workloads` field to the `cpu` field that contains a list
-of enums for defining which workloads to pin. This should allow us to expand
-this in the future if desired, but in this enhancement we will only support
-`Management` which defines all of the Openshift workloads.


```yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
@@ -333,11 +334,10 @@ metadata:
   name: openshift-node-workload-partitioning-custom
 spec:
   cpu:
+    # These values will be used to derive the pinned CPUSets
+    # for management workloads
     isolated: 2-3
     reserved: 0,1
-    # New addition
-    workloads:
-    - Management
 ```

 To support upgrades and maintain better signaling for the cluster, the
@@ -357,7 +357,7 @@ manifest that describes their desired `isolated` and `reserved` CPUSet and the

 **High level sequence diagram:**

-Install Time Sequence
+#### Install Time Sequence (With Performance Profile)

 ```mermaid
 sequenceDiagram
@@ -366,41 +366,29 @@ Installer->>Alice: Create Dir with Cluster Manifests
 loop Apply
     Alice->>Alice: Add PerformanceProfile Manifest
 end
-Alice->>Installer: Install Config With CPU Partitioning
 loop Generate
     Installer->>Installer: Set Infrastructure Status
 end
-Installer-->>APIServer: Create Cluster
-APIServer-->>MCO: Get Infrastructure Status
-loop Generate
-    MCO->>MCO: Generate MC Configs
+loop Bootstrap Cycle
+    Installer->>Installer: NTO Render
 end
-MCO-->>APIServer: Apply New MC
+Installer->>APIServer: Apply Manifests
+MCO-->>APIServer: Get MCs
 loop Generate
     MCO->>MCO: Generate Rendered MC
 end
-MCO-->>Node: Configure and Restart Node
+MCO-->>Node: Configure and Restart
 loop Apply
     Node->>Node: Set Kubelet Config
     Node->>Node: Set CRI-O Config
     Node->>Node: Kubelet Advertises Cores
 end
 Node-->>MCO: Finished Restart
-loop Generate
-    NTO->>NTO: Machine Config from PerformanceProfile
-end
-NTO-->>MCO: Apply Machine Config
-MCO-->>Node: Configure and Restart Node
-loop Apply
-    Node->>Node: Set Kubelet Config
-    Node->>Node: Set CRI-O Config
-end
-Node-->>MCO: Finished Restart
 ```

 - **Alice** is a human user who creates an Openshift cluster.
 - **Installer** is the assisted installer that applies the user manifest.
 - **MCO** is the machine config operator.
+- **NTO** is the node tuning operator.
 - **Node** is the kubernetes node.

 1. Alice sits down and uses the installer to generate the manifests with the
@@ -411,18 +399,60 @@ Node-->>MCO: Finished Restart
 3. Alice adds the desired PerformanceProfile manifest for workload partitioning
    to the `openshift` folder that was generated by the installer.
 4. Alice then creates the cluster via the installer.
-5. The installer will apply the manifests and during the bootstrapping process
-   the MCO will generate the default configurations for workload partitioning
-   based off of the `Infrastructure.Status`, and restarts the nodes.
-6. After the cluster is up the NTO will then generate the machine configurations
-   using the information provided in the PerformanceProfile manifest.
-7. The MCO applies the updated workload partitioning configurations and restarts
+5. The installer will call the NTO `render` command to generate the manifests
+   during the bootstrapping process.
+6. The MCO applies the updated workload partitioning configurations and restarts
    the relevant nodes.
-8. Alice will now have a cluster that has been setup with workload partitioning
+7. Alice will now have a cluster that has been set up with workload partitioning
   and the desired workloads pinned to the specified CPUSet in the
   PerformanceProfile.

-Applying CPU Partitioning Size Change
+#### Install Time Sequence (Without Performance Profile)
+
+This sequence describes the scenario where you would want to activate Workload
+Pinning on a cluster, but not pin any workloads to CPUSets just yet.
+
+```mermaid
+sequenceDiagram
+Alice->>Installer: Start Install with CPU Partitioning
+loop Generate
+    Installer->>Installer: Set Infrastructure Status
+end
+loop Bootstrap Cycle
+    Installer->>Installer: NTO Render
+end
+Installer->>APIServer: Apply Manifests
+MCO-->>APIServer: Get MCs
+loop Generate
+    MCO->>MCO: Generate Rendered MC
+end
+MCO-->>Node: Configure and Restart
+loop Apply
+    Node->>Node: Set Kubelet Config
+    Node->>Node: Kubelet Advertises Cores
+end
+Node-->>MCO: Finished Restart
+```
+
+- **Alice** is a human user who creates an Openshift cluster.
+- **Installer** is the assisted installer that applies the user manifest.
+- **NTO** is the node tuning operator.
+- **MCO** is the machine config operator.
+- **Node** is the kubernetes node.
+
+1. Alice sits down and uses the installer with the InstallConfig specifying
+   `cpuPartitioningMode: AllNodes` to create the cluster.
+   - `openshift-install create cluster`
+2. The installer generates the manifests to create the cluster, with the new
+   `Infrastructure.Status.CPUPartitioning: AllNodes`.
+3. The installer will call the NTO `render` command to generate the manifests
+   during the bootstrapping process.
+4. The MCO applies the updated workload partitioning configurations and restarts
+   the relevant nodes.
+5. Alice will now have a cluster that has been set up with workload partitioning
+   with no workloads bound to CPUSets until she chooses to apply a
+   PerformanceProfile later on.
+
+#### Applying CPU Partitioning Size Change

 ```mermaid
 sequenceDiagram
@@ -470,35 +500,39 @@ management CPU pool.

 1. User sits down at their computer.
 2. **The user creates a `PerformanceProfile` resource with the desired
-   `isolated` and `reserved` CPUSet with the `cpu.workloads[Management]`
-   added to the enum list.**
+   `isolated` and `reserved` CPUSet.**
 3. **User updates the `InstallConfig` and sets `cpuPartitioningMode: AllNodes`
    to denote that workload partitioning is turned on.**
 4. The user runs the installer to create the standard manifests, adds their
    extra manifests from step 2, then creates the cluster.
-5. The kubelet starts up and finds the configuration file enabling the new
-   feature.
-6. The kubelet advertises `management.workload.openshift.io/cores` extended
-   resources on the node based on the number of CPUs in the host.
-7. The kubelet reads static pod definitions. It replaces the `cpu` requests with
-   `management.workload.openshift.io/cores` requests of the same value and adds
-   the `resources.workload.openshift.io/{container-name}: {"cpushares": 400}`
-   annotations for CRI-O with the same values.
-8. **NTO will generate the machine config manifests and apply them.**
-9. **MCO modifies kubelet and cri-o configurations of the relevant machine pools
-   to the updated `reserved` CPU cores and restarts the nodes**
-10. Something schedules a regular pod with the
+5. **During the installation bootstrap phase, the NTO's `render` command will be
+   called.**
+6. **The `render` command will generate the initial CPU Partitioning machine
+   config manifests signaled by the `Infrastructure.Status.CPUPartitioningMode`.**
+7. **The `render` command will then generate the subsequent manifest files
+   derived from the PerformanceProfile included by the user.**
+8. **All nodes will query the machine config server for their configuration prior
+   to joining the cluster.**
+9. **The kubelet starts up and finds the configuration file enabling the new
+   feature.**
+10. 
The kubelet advertises `management.workload.openshift.io/cores` extended
    resources on the node based on the number of CPUs in the host.
+11. The kubelet reads static pod definitions. It replaces the `cpu` requests with
+    `management.workload.openshift.io/cores` requests of the same value and adds
+    the `resources.workload.openshift.io/{container-name}: {"cpushares": 400}`
+    annotations for CRI-O with the same values.
+12. Something schedules a regular pod with the
     `target.workload.openshift.io/management` annotation in a namespace with
     the `workload.openshift.io/allowed: management` annotation.
-11. The admission hook modifies the pod, replacing the CPU requests with
+13. The admission hook modifies the pod, replacing the CPU requests with
     `management.workload.openshift.io/cores` requests and adding the
     `resources.workload.openshift.io/{container-name}: {"cpushares": 400}`
     annotations for CRI-O.
-12. The scheduler sees the new pod and finds available
+14. The scheduler sees the new pod and finds available
     `management.workload.openshift.io/cores` resources on the node. The
     scheduler places the pod on the node.
-13. Repeat steps 10-12 until all pods are running.
-14. Cluster deployment comes up with management components constrained to subset
-    of available CPUs.
+15. Repeat steps 12-14 until all pods are running.
+16. Cluster deployment comes up with management components constrained to a
+    subset of available CPUs.

 ##### Partition Resize workflow
@@ -510,41 +544,16 @@ This section outlines an end-to-end workflow for resizing the CPUSet partition.

 1. User sits down at their computer.
 2. **The user updates the `PerformanceProfile` resource with the new desired
-   `isolated` and new `reserved` CPUSet with the `cpu.workloads[Management]`
-   in the enum list.**
+   `isolated` and new `reserved` CPUSet.**
 3. **NTO will re-generate the machine config manifests and apply them.**
-4. ... Steps same as [E2E Workflow deployment](#e2e-workflow-deployment) ...
-5. Cluster deployment comes up with management components constrained to subset
-   of available CPUs.
+4. **MCO will re-render the change and propagate it to the machine pools.**
+5. ... Follows steps 9-15 defined in [E2E Workflow deployment](#e2e-workflow-deployment) ...
+6. Cluster deployment comes up with management components constrained to a
+   subset of available CPUs.

 ### API Extensions

-- We want to extend the `PerformanceProfile` API to include the addition of a
-  new `workloads[Management]` configuration under the `cpu` field.
-- The behavior of existing API should not change with this addition.
-- New resources that make use of this new field will have the current machine
-  config generated with the additional configurations added to the manifest.
-  - Uses the `reserved` field to add the correct CPU set to the CRI-O and
-    Kubelet configuration files to the currently generated machine config.
-  - If no `workloads[Management]` is provided then no workload partitioning
-    configurations are left wide open to all CPU sets for the Kubelet and CRI-O
-    configurations.
-
-Example change:
-
-```yaml
-apiVersion: performance.openshift.io/v2
-kind: PerformanceProfile
-metadata:
-  name: openshift-node-workload-partitioning-custom
-spec:
-  cpu:
-    isolated: 2-3
-    reserved: 0,1
-    # New enum addition
-    workloads:
-    - Management
-```
+N/A

 ### Implementation Details/Notes/Constraints [optional]

@@ -558,9 +567,10 @@ affords us the chance to consolidate the configuration for `kubelet` and
`crio`.
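As a rough sketch of what that consolidated output could look like (the CRI-O
drop-in path, file names, and encoded contents are assumptions here; only
`/etc/kubernetes/openshift-workload-pinning` is named elsewhere in this
document), the rendered machine config would carry both configuration files:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 02-worker-workload-partitioning # hypothetical name
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        # Flag file that kubelet reads to advertise management cores
        - path: /etc/kubernetes/openshift-workload-pinning
          mode: 420
          contents:
            source: data:text/plain;charset=utf-8;base64,... # reserved CPUSet
        # CRI-O workload drop-in pinning management pods to the
        # reserved CPUSet (path assumed)
        - path: /etc/crio/crio.conf.d/01-workload-partitioning
          mode: 420
          contents:
            source: data:text/plain;charset=utf-8;base64,... # reserved CPUSet
```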
We will modify the code
path that generates the [new machine
config](https://github.com/openshift/cluster-node-tuning-operator/blob/a780dfe07962ad07e4d50c852047ef8cf7b287da/pkg/performanceprofile/controller/performanceprofile/components/machineconfig/machineconfig.go#L91-L127)
-using the performance profile. With the new `spec.workloads[Management]`
-enum we will add the configuration for `crio` and `kubelet` to the final machine
-config manifest. Then the existing code path will apply the change as normal.
+using the performance profile. We will use the existing `reserved` and
+`isolated` CPU sets, signaled by the global `Infrastructure.Status.CPUPartitioningMode`,
+to add the configuration for `crio` and `kubelet` to the final machine config
+manifest. Then the existing code path will apply the change as normal.

 #### API Server Admission Hook

@@ -601,16 +611,25 @@ The same risks and mitigations highlighted in [Management Workload
 Partitioning](management-workload-partitioning.md) apply to this enhancement as
 well.

+#### Configuration
+
 We need to make it very clear to customers that this feature is supported as a
-day 0 configuration and day n+1 alterations are not be supported with this
+day 0 configuration and that day n+1 alterations are not supported with this
 enhancement. Part of that messaging should involve a clear indication that this
-currently will be a cluster wide feature.
+currently will be a cluster wide feature. As such, adding nodes that are not
+configured for CPUPartitioning is not supported. With the enhancement as
+described, we will add a node admission plugin to mitigate that possibility by
+rejecting nodes that are not configured properly.
+
+#### CPU Set Size

 A risk we can run into is that a customer applying a CPU set that is too small
 or out of bounds can cause problems such as extremely poor performance or
 startup errors. Mitigation of this scenario will be to provide proper guidance
 and guidelines for customers who enable this enhancement. As mentioned in our
 goals, we do support re-configuring the CPUSet partition size after installation.
+This matches the current messaging for the Single Node implementation of this
+feature.

 Another similar risk to CPU set sizes being too small or out of bounds is if
 customers mix nodes of differing CPU set sizes in the same machine config pool.
@@ -621,9 +640,42 @@ of bound will fail, while the others will not. They will need to evict those
 machines and add ones that fall within bounds of the `reserved` and `isolated`
 CPUSets defined in their PerformanceProfile.

-A possible risk are cluster upgrades, this is the first time this enhancement
-will be for multi-node clusters, we need to run more tests on upgrade cycles to
-make sure things run as expected.
+#### Deletions
+
+Once a cluster is set up for CPU Partitioning, if a customer were to delete the
+machine config defining the default CPU Partitioning, nothing would happen
+initially, since the Kubelet does not remove the Capacity and Allocatable
+resources for workload partitioning from the Node resource. Scheduling will
+still work, but the node's CRI-O and kubelet configuration would no longer
+match what the node advertises.
+
+A related risk is cluster upgrades: this is the first time this enhancement
+will be used for multi-node clusters, so we need to run more tests on upgrade
+cycles to make sure things run as expected.
+
+#### CRIO Config File
+
+The current CRIO config supports multiple workload configurations.
We do not
+support multiple workload CRIO configs, and customers need to be aware that when
+it comes to workload partitioning, we can only support the CRIO configurations
+we create and maintain on CPU partitioned clusters.
+
+#### Install Failures
+
+In terms of possible install failures, the most critical thing is that the
+`kubelet` workload partitioning configuration file exists before `kubelet` comes
+online to join the cluster. A cluster that is designated as CPU Partitioned via
+the global flag will not allow nodes not configured with CPU Partitioning to
+join. Since nodes query the `machine-config-server` for their configuration
+prior to joining the cluster, we expect any failures prior to this query to be
+resolved in whatever fashion they are resolved today. When it comes to
+errors that occur after this query, the thing that matters is that the `kubelet`
+configuration file `/etc/kubernetes/openshift-workload-pinning` exists. At the
+moment of writing, its content is irrelevant, as kubelet uses it as a flag to
+correctly apply the node `capacity/allocatable` information to the node resource
+when joining the cluster. If that file exists, then you can resolve any errors
+in the fashion they are resolved today. If that file does not exist due to
+some other error, we will need to access the node, create the file, and restart
+kubelet. Once kubelet applies the node `capacity/allocatable` information, it
+does not remove it unless that is done manually. Resolving errors from that
+point on should be done in whatever manner they are handled now.

 ### Drawbacks

@@ -696,12 +748,12 @@ trigger for this event will be:

 - If the `capacity` field is set on the master node and it is running in Single Node

-We will not change the current machine configs for single node deployments if
-they are already set, this will be done to avoid extra restarts. We will need to
-be clear with customers however, if they add the
-`spec.workloads[Management]` we will then generate the new machine config
-and an extra restart will happen. They will need to delete the old machine
-configs afterwards.
+One challenge we have with existing single node deployments of workload
+partitioning is that customers may have applied differently named files and
+different formatting for the crio and kubelet configuration files. This means
+that a one-time secondary reboot during upgrade might be unavoidable to bring
+those cluster deployments into compliance with this new method of configuring
+workload partitioning.

 ### Version Skew Strategy

 N/A

 ### Operational Aspects of API Extensions

-The addition to the API is an optional field which should not require any
-conversion admission webhook changes. This change will only be used to allow the
-user to explicitly define their intent and simplify the machine manifest by
-generating the extra machine manifests that are currently being created
-independently of the `PerformanceProfile` CRD.
-
-Futhermore the design and scope of this enhancement will mean that the existing
-Admission webhook will continue to apply the same warnings and error messages to
-Pods as described in the [failure modes](#failure-modes).
+N/A

 #### Failure Modes

From 683ed2b568f2e742a698b0cb16c96527e07d3435 Mon Sep 17 00:00:00 2001
From: ehila 
Date: Tue, 13 Dec 2022 12:00:49 -0500
Subject: [PATCH 17/18] clarified the authoritative design of the global
 identifier

Signed-off-by: ehila 

---
 .../wide-availability-workload-partitioning.md | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
index 1289fc9a77..ff715bddac 100644
--- a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
+++ b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
@@ -122,9 +122,16 @@ can cause un-recoverable scenarios for our customers. At install time we will
 also apply an initial machine config for workload partitioning that sets the
 default CPUSet to the whole CPUSet. Effectively this will behave as if workload
 partitioning is not turned on. When a customer wishes to pin the management
-workloads they will be able to do that via the existing Performance Profile.
+workloads they will be able to do that via the existing Performance Profile API.
 Resizing partition size will not cause any issues after installation.

+In short, the global identifier `Infrastructure.Status.CPUPartitioning:
+AllNodes/None` will be the authoritative value that determines whether a cluster
+is set up for workload partitioning. If a customer has a cluster set up with
+`CPUPartitioning: None`, applying a PerformanceProfile will not apply workload
+partitioning. `CPUPartitioning: AllNodes` is the only condition under which we
+will apply workload partitioning, with the CPUSet size driven by a Performance
+Profile instance.
+
 With this approach we eliminate the race condition that can occur if we apply
 the machine config after bootstrap via NTO. Since we create a "default" cri-o
 and kubelet configuration that does not specify the CPUSet, customers do not have

From 9315debe57d4f64785cceccf7556dff6fecc2b3c Mon Sep 17 00:00:00 2001
From: ehila 
Date: Wed, 25 Jan 2023 10:55:19 -0500
Subject: [PATCH 18/18] upkeep: wording change for day 0 support

added explicit language that this is supported only as a day 0 feature

Signed-off-by: ehila 

---
 .../wide-availability-workload-partitioning.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
index ff715bddac..7372fd818b 100644
--- a/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
+++ b/enhancements/workload-partitioning/wide-availability-workload-partitioning.md
@@ -105,9 +105,9 @@ but slightly different non-goals.
 - **This enhancement assumes that the configuration of a management CPU pool is
-  done as part of installing the cluster. It can be changed after the fact but
-  we will need to stipulate that, that is currently not supported. The intent
-  here is for this to be supported as a day 0 feature, only.**
+  done as part of installing the cluster since we only support this feature as a
+  day 0 feature. This enhancement does not implement or provide a path to toggle
+  this feature after day 0.**
 - **This enhancement does not address mixing nodes of different CPU sizes in
   Machine Configuration Pools.
Pools must contain machines of the same CPU
  set size; currently, the Performance Profile defines and targets specific
  machine pools.**
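
For illustration, a minimal sketch of how a Performance Profile targets one
specific pool today; the selector value is illustrative and the CPU ranges are
borrowed from the example earlier in this document:

```yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: openshift-node-workload-partitioning-workers
spec:
  cpu:
    # Every machine in the targeted pool must share this CPU layout
    isolated: 2-3
    reserved: 0,1
  # Targets exactly one pool, which is why mixing CPU set sizes
  # within a pool is a non-goal (selector value is illustrative)
  nodeSelector:
    node-role.kubernetes.io/worker: ""
```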