
KEP: Make kubelet's CPU manager respect Linux kernel isolcpus setting #2435

Closed
Levovar wants to merge 5 commits from the isolcpus_kep branch

Conversation


@Levovar Levovar commented Jul 30, 2018

Opening a KEP on this subject was previously agreed with @ConnorDoyle

@k8s-ci-robot added the size/L label on Jul 30, 2018
@k8s-ci-robot added the needs-ok-to-test label on Jul 30, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: calebamiles

If they are not already assigned, you can assign the PR to them by writing /assign @calebamiles in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot added the cncf-cla: no, sig/architecture, and sig/node labels on Jul 30, 2018
@Levovar
Author

Levovar commented Jul 30, 2018

/cc @ConnorDoyle

@jeremyeder
Contributor

@vikaschoudhary16 @zvonkok

@ConnorDoyle
Contributor

/ok-to-test

@k8s-ci-robot removed the needs-ok-to-test label on Jul 30, 2018
@Levovar force-pushed the isolcpus_kep branch 2 times, most recently from 02bdcfe to f787a3b on July 31, 2018 at 15:15
@k8s-ci-robot added the cncf-cla: yes label and removed the cncf-cla: no label on Jul 31, 2018
For example:
- outsourcing the management of a subset of specialized, or optimized CPUs to an external CPU manager without any (other) change in Kubelet's CPU manager
- ensure proper resource accounting and separation within a hybrid infrastructure (e.g. Openstack + Kubernetes running on the same node)

Contributor

@jeremyeder jeremyeder Jul 31, 2018


Some more color: Current CPU Manager tuning only affects things that the kubelet manages. There are threads not managed by the kubelet that affect performance of things running under the kubelet. To handle that, the kernel offers isolcpus or systemd configurations, features which are leveraged by a large ecosystem of software. We need the kubelet to coordinate with the kernel in order to offer applications fully de-jittered cores on which to be scheduled by the kubelet.

Author


I fully agree, so I created this KEP :)

The aim of this KEP is to align kubelet with the isolcpus kernel parameter. It could be seen as the first step on a longer journey, but right now I think this small step is a good place to start.

Was your comment just an elaboration on the subject for the reviewing community, or would you like to see some change in this KEP in order to get it accepted?
From my POV I would like to restrict its scope to this specific kernel parameter, but of course I'm curious about what you have in mind!

Author


So, just to be on the same page, please take into account the Non-goals of this KEP:
"It is also outside the scope of this KEP to enhance the CPU manager itself with more fine-grained management policies, or introduce topology awareness into the CPU manager."

What I would like to achieve here is to leave CPU manager policies intact and unchanged, but make it possible to explicitly restrict the cores these policies can manage.

I definitely don't want to enhance kubelet's native manager so it can offer applications fully de-jittered cores (at least not in the scope of this KEP). That is because multiple KEPs have already been created trying to achieve exactly that (referring to the CPU pooling proposals). And let's be honest, these were quasi-rejected by the SIG-Node and WG-ResourceManagement communities.

So, instead, what I would like to achieve here is to leave room for other (non K8s core) managers to do this pooling, while the community discusses how to support the same natively.

Even when that native support arrives, I feel that this enhancement would still be valid and useful, e.g. for the hybrid infrastructure use-cases.

Contributor


I wanted to make my POV here a bit clearer (representing a set of customers, of course) and provide some historical context. Nothing more.

"Isolcpus" is a boot-time Linux kernel parameter, which can be used to isolate CPU cores from the generic Linux scheduler.
This kernel setting is routinely used within the Linux community to manually isolate, and then assign CPUs to specialized workloads.
The CPU Manager implemented within kubelet currently ignores this kernel setting when creating cpusets for Pods.
This KEP proposes that CPU Manager should respect this kernel setting when assigning Pods to cpusets, through whichever supported CPU management policy.
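
For illustration only (not part of the KEP text): a minimal sketch, assuming the kernel exposes the isolated list at /sys/devices/system/cpu/isolated, of how a node agent could discover that set; on kernels without that file, the isolcpus= entry in /proc/cmdline would have to be parsed instead.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseCPUList turns a kernel-style CPU list such as "2-5,8" into a set of CPU IDs.
func parseCPUList(s string) (map[int]bool, error) {
	cpus := map[int]bool{}
	s = strings.TrimSpace(s)
	if s == "" {
		return cpus, nil // no isolated CPUs configured
	}
	for _, part := range strings.Split(s, ",") {
		if first, last, isRange := strings.Cut(part, "-"); isRange {
			start, err := strconv.Atoi(first)
			if err != nil {
				return nil, err
			}
			end, err := strconv.Atoi(last)
			if err != nil {
				return nil, err
			}
			for id := start; id <= end; id++ {
				cpus[id] = true
			}
		} else {
			id, err := strconv.Atoi(part)
			if err != nil {
				return nil, err
			}
			cpus[id] = true
		}
	}
	return cpus, nil
}

func main() {
	// Assumption: reasonably recent kernels expose the isolated set here.
	raw, err := os.ReadFile("/sys/devices/system/cpu/isolated")
	if err != nil {
		fmt.Println("could not read isolated CPU list:", err)
		return
	}
	isolated, err := parseCPUList(string(raw))
	if err != nil {
		fmt.Println("could not parse isolated CPU list:", err)
		return
	}
	fmt.Println("isolated CPUs:", isolated)
}
```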
Contributor


Not sure if I understood "through whichever supported CPU management policy" correctly. Can you please reword this?

Author


sure, sometimes I get carried away when writing :)

The meaning I wanted to convey is: "respecting isolcpus should not be tied to any specific CPU management policy; either all policies should respect it, or none of them."
I will elaborate on this point more when I expand the KEP, but the idea is that if this setting were policy-specific, we would need to introduce an alternative version of every policy that respects isolcpus. So, even today we would need a "default_with_isolcpus" and a "static_with_isolcpus" policy, and it would only get worse in the future.

Instead, what I had in mind is making this configurable node-wide via a kubelet flag / feature gate, and when it is turned on it is applied regardless of which CPU management policy is configured on the node.
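
As a rough illustration of that idea (a sketch only: neither the function nor the flag below exists in kubelet, the names are hypothetical), the isolated set discovered from the kernel would simply be removed from the CPUs any policy may hand out whenever the proposed node-wide setting is enabled:

```go
// assignableCPUs is a hypothetical helper: with the proposed node-wide setting
// enabled, isolated CPUs are filtered out before any CPU management policy
// ("none", "static", ...) sees them, so the behaviour is policy-agnostic.
func assignableCPUs(allCPUs, isolated map[int]bool, respectIsolcpus bool) map[int]bool {
	out := map[int]bool{}
	for cpu := range allCPUs {
		if respectIsolcpus && isolated[cpu] {
			continue // leave isolated cores to whatever manages them externally
		}
		out[cpu] = true
	}
	return out
}
```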


## Motivation

The CPU Manager always assumes that it is the alpha and omega on a node, when it comes to managing the CPU resources of the host.
Contributor


What is omega?

Author


"alpha and omega" is just a phrase. I wanted to again just convey that kubelet assumes it is both the beginning and the end of CPU management on a node, and no other SW is running next to it which might also dabble into resource management.

can reword if you wish


The CPU Manager always assumes that it is the alpha and omega on a node, when it comes to managing the CPU resources of the host.
However, in certain infrastructures this might not always be the case.
While it is already possible to effectively take away CPU cores from the CPU manager via the kube-reserved and system-reserved kubelet flags, this implicit way of expressing isolation needs is not dynamic enough to cover all use-cases.
Contributor


We may use kubelet flags to take away a few cores, but even in that implicit way there is no guarantee that these cores are really "isolated".

Author


Sure, you are right.
I will add another sentence explaining this.

@Levovar
Author

Levovar commented Aug 3, 2018

@ConnorDoyle @dchen1107 @derekwaynecarr
Could you take a look at the base idea presented in this review, and share your comments, if any?
If you agree with the general direction I would be happy to flesh out the proposal in a follow-up PR, but it would be good to reserve (or free, in case of a rejection) the KEP number going forward.

@Levovar force-pushed the isolcpus_kep branch 2 times, most recently from 1dea757 to ae1cc65 on August 7, 2018 at 16:37
@Levovar Levovar changed the title "Early" KEP for making CPU manager respect Linux kernel isolcpus setting Full KEP for making kubelet's CPU manager respect Linux kernel isolcpus setting Aug 7, 2018
@Levovar
Author

Levovar commented Aug 7, 2018

As the "early KEP merge" described by the process did not happen, I filled out the whole KEP.

Kindly review!

@Levovar Levovar changed the title Full KEP for making kubelet's CPU manager respect Linux kernel isolcpus setting KEP: Make kubelet's CPU manager respect Linux kernel isolcpus setting Aug 8, 2018

Kubelet's in-built CPU Manager always assumes that it is the primary software component managing the CPU cores of the host.
However, in certain infrastructures this might not always be the case.
While it is already possible to effectively take away CPU cores from the Kubernetes-managed workloads via the kube-reserved and system-reserved kubelet flags, this implicit way of declaring a Kubernetes-managed CPU pool is not flexible enough to cover all use-cases.
Contributor


What use cases aren't covered?

Author


All of the use-cases where non-Kubernetes-managed processes run on a node that also runs a kubelet.
Some of them are also mentioned in this document as thought-raisers.


Maybe a low-impact change would be to let system-reserved take in a cpuset. For example, the syntax

--system-reserved=cpu=cpuset:0-3

would mean that system-reserved cpu allocation would be calculated to be 4000m, and the CPUs which CPU manager would not give out (excluded from both default and explicitly assigned sets) would be 0-3.
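
A hedged sketch of how that suggested syntax could be interpreted; neither the cpuset: value nor this helper exists in kubelet today, and it builds on the parseCPUList helper from the first sketch above (needs "fmt" and "strings"):

```go
// parseReservedCPUSet is a hypothetical parser for a value such as
// "cpuset:0-3" taken from --system-reserved=cpu=... . It returns both the
// reserved CPU IDs and the derived millicore reservation.
func parseReservedCPUSet(value string) (cpus map[int]bool, milliCPU int, err error) {
	const prefix = "cpuset:"
	if !strings.HasPrefix(value, prefix) {
		return nil, 0, fmt.Errorf("not a cpuset value: %q", value)
	}
	cpus, err = parseCPUList(strings.TrimPrefix(value, prefix))
	if err != nil {
		return nil, 0, err
	}
	// "cpuset:0-3" covers CPUs 0,1,2,3, so it would be accounted as 4000m.
	return cpus, len(cpus) * 1000, nil
}
```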

Author


Oh yeah, I actually had something like that in mind (but via a newly introduced flag), I just forgot to put it into the alternatives section!

I went with the "isolcpus" way of declaration in the end, as it does not require additional manual adjustment from the operator, so K8s can "do the right thing by default", as Vish mentioned below.

But I will definitely add this to the alternatives section in my next commit! I'm fine either way; I guess it comes down to what the community is more comfortable with.

## Motivation

Kubelet's in-built CPU Manager always assumes that it is the primary software component managing the CPU cores of the host.
However, in certain infrastructures this might not always be the case.
Contributor


k8s makes an assumption that it owns the node not just from the perspective of cpu isolation. Trying to change that fundamental assumption would be hard.

Author

@Levovar Levovar Aug 9, 2018


Hence we should not do it all at once, but step by step!
This would be the first step. It makes sense to start with the CPU because we already have an existing kernel-level parameter which we only need to respect to achieve it. I agree that doing the same for, e.g., memory or hugepages would be more difficult, but those are outside the scope of this KEP, as described in the Non-goals section.

Therefore, the need arises to enhance existing CPU manager with a method of explicitly defining a discontinuous pool of CPUs it can manage.
Making kubelet respect the isolcpus kernel setting fulfills exactly that need, while also doing it in a de-facto standard way.

If Kubernetes' CPU manager would support this more granular node configuration, then infrastructure administrators could make multiple "CPU managers" seamlessly inter-work on the same node.
Contributor


Sounds too complicated. Why do admins have to manage multiple cpu managers? is there room for k8s to do the right thing by default?

Author


My personal opinion is that the whole aim of this KEP is exactly what you describe: doing the right thing by default when it comes to CPU.
Right now, if literally anything runs on a node next to kubelet, admins are forced either to evacuate their nodes or to configure their components accordingly. But the thing is that Kubernetes cannot even be configured right now not to overlap with other resource consumers.

However, isolcpus generally means exactly this: "don't touch these resources, I'm going to use them for something". If kubelet also respected this wish of operators, wouldn't that be exactly K8s doing the right thing by default?

IMHO it would, hence the KEP proposes simply respecting this setting, instead of requiring cluster admins to manually provide a list of cores K8s can manage.


If Kubernetes' CPU manager would support this more granular node configuration, then infrastructure administrators could make multiple "CPU managers" seamlessly inter-work on the same node.
Such feature could come in handy if one would like to:
- outsource the management of a subset of specialized, or optimized cores (e.g. real-time enabled CPUs, CPUs with different HT configuration etc.) to an external CPU manager without any (other) change in Kubelet's CPU manager
Contributor


Until we can automate and refine the existing CPU management policies, I'd like to avoid opening up extensions. We risk fragmenting the project quite a bit if we open up extensions prior to having a default solution that mostly just works.

Author


I would generally accept and respect this comment under other circumstances, but the thing is that two such KEPs were actually discussed and quasi-rejected recently by the community.
The first CPU pooling KEP tried to add a new de-facto CPU management policy to the CPU manager to achieve more fine-grained CPU management.
The re-worked CPU pooling KEP tried to make the CPU manager externally extendable.

Both of them got stopped in their tracks, in part because of your objections.
I'm not saying this maliciously, because I totally get the reasoning of the community leaders, and to a certain degree I also agree with it :)

This is exactly why this KEP was born: it does not open up the CPU manager in any new way, nor does it fragment the Kubernetes project.
It only wants to make the CPU manager respect boundaries which, IMHO, it should respect by default anyway.
Please, do consider that:

  • the need is real. Infra operators cannot wait until the community decides which direction it wants to go with CPU pooling, and recent examples show that this is still a long time away
  • Device Management is already an open interface, and it won't be closed. There are already external CPU managers out there (CMK, for instance, just to mention the most popular). You could say the "damage" is already done
    We might as well recognize it, make the most out of the situation (at least not double-bookkeep these resources), and take it as a motivation to finally come up with a plan for making the Kubernetes CPU manager so awesome that nobody would ever consider employing an external manager in their cluster

If Kubernetes' CPU manager would support this more granular node configuration, then infrastructure administrators could make multiple "CPU managers" seamlessly inter-work on the same node.
Such feature could come in handy if one would like to:
- outsource the management of a subset of specialized, or optimized cores (e.g. real-time enabled CPUs, CPUs with different HT configuration etc.) to an external CPU manager without any (other) change in Kubelet's CPU manager
- ensure proper resource accounting and separation within a hybrid infrastructure (e.g. Openstack + Kubernetes running on the same node)
Contributor


Why should k8s support such an architecture?

Author

@Levovar Levovar Aug 9, 2018


Because it is a customer need, I guess :)

In any case, I only wanted to put some production use-cases on display to show that this simple feature would be useful even if Kubernetes had the best CPU manager in the world.

But, for the sake of argument, let's pretend these production use-cases are not real and simply look at the most generic situation: people have "something" on their nodes next to kubelet, not managed by kubelet.

It can be a systemd service. It can be a legacy application; it can be really anything. The fact that kubelet already has a system-reserved flag shows that the resource management community has already recognized this use-case!
However, I think it is somewhat naive to assume that all of these non-kubelet-managed processes are running on the first couple of cores, all the time. Legacy applications are a handful; who knows how they were written back in the day, right?

I could say that even mighty Google faces this situation with Borg + Kubernetes co-running in their environment, right? Okay, maybe Google can afford to physically separate these systems from each other, but other companies, projects, customers etc. might not be this lucky, or resourceful :)

I think I will add this generic use-case to the next version as a third user-story, now that you made me think about it.

@lmdaly

lmdaly commented Aug 13, 2018

+1 for this feature or the alternative of allowing a cpuset to be specified in system-reserved.

Adding the draft version of the KEP, only containing the Overview section for early merge (as dictated by the KEP process)
07.31: Updated Reviewers metadata field to re-trigger the CNCF CLA check (which should be okay now)
Rephrasing a couple of sentences in the wake of the received comments.
- added a new use-case concentrating more on the everyday usage of this improvement
- added a new alternative to the proposed implementation method, changing the syntax of the system-reserved flag
- defined the inter-working between this proposal and existing capacity-manipulating features
@Levovar
Author

Levovar commented Aug 15, 2018

Comments incorporated.

KEP expanded with a new use-case, implementation alternative, and an inter-working scenario.

@k8s-ci-robot added the needs-rebase label on Aug 27, 2018
@k8s-ci-robot
Contributor

@Levovar: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@justaugustus
Member

/kind kep

@Levovar
Author

Levovar commented Oct 19, 2018

Bump.
Would appreciate some feedback from @vishh and @ConnorDoyle regarding still-outstanding blocking issues (if any) or open questions.

@justaugustus
Member

REMINDER: KEPs are moving to k/enhancements on November 30. Please attempt to merge this KEP before then to signal consensus.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.

@justaugustus
Member

KEPs have moved to k/enhancements.
This PR will be closed and any additional changes to this KEP should be submitted to k/enhancements.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.
/close

@k8s-ci-robot
Contributor

@justaugustus: Closed this PR.

In response to this:

KEPs have moved to k/enhancements.
This PR will be closed and any additional changes to this KEP should be submitted to k/enhancements.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
