KEP: New Resource API proposal #2265

vikaschoudhary16
Contributor

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jun 14, 2018
@vikaschoudhary16 vikaschoudhary16 changed the title Kep resource api proposal KEP: New Resource API proposal Jun 14, 2018
@idvoretskyi
Member

/uncc

@k8s-ci-robot k8s-ci-robot removed the request for review from idvoretskyi June 14, 2018 08:36
@vikaschoudhary16
Contributor Author

/assign @dchen1107 @derekwaynecarr

```yaml
operator: "GtEq"
values:
- "30G"
```
Contributor

Could you add another YAML example showing how a Pod references a ResourceClass?

Contributor

good idea. will do.

Contributor

@vishh left a comment

  1. Other than per-device quota, I feel most other use cases are ahead of their time.
  2. If Quota is the main problem to tackle right now, then does it require a new set of Resource APIs, or can it be solved via [admission extensions](https://docs.google.com/a/google.com/document/d/1Lpuw-tjm252W4oGqBvcFAFUZkwFlpZ2i0EgpCwialN0/edit?disco=AAAAB_xFEyY) or some other way?
  3. If you'd like to test the waters with other use cases, this proposal should ideally be implemented via extensions. If extensions are inadequate, we should try to address extension gaps.

## User Stories
### As a cluster operator:
- Nodes in my cluster have GPU HW from different generations. I want to classify GPU nodes into one of three categories (silver, gold, and platinum) depending on the launch timeline of the GPU family, e.g. Kepler K20, K80, Pascal P40, P100, Volta V100. I want to charge each of the three categories differently and offer my clients three GPU rates/classes to choose from.<br/>
**Motivation:** As a cluster ages, new, advanced, high-performance, expensive GPU variants get added to the cluster nodes while older variants remain. Some workloads strictly want the latest GPUs, and others are fine with older GPUs. Because there is a wide range of types, managing them at the granularity of each GPU type would be hard and confusing. Grouping them into a few broad categories is more convenient to manage.<br/>
Contributor

Are there any users today that need this feature from kubernetes?

Contributor

Also, for this user story, I think we could use NodeAffinity to choose different GPU types.

Contributor

I think NodeAffinity shares a similar problem with node taints: cluster administrators cannot apply proper access control to restrict a user pod from using them. As mentioned at the very beginning, the goal of this proposal is "to better support non-native compute resources on kubernetes". We want to allow users to request them as compute resources, and allow administrators to control their access through the resource quota.

Contributor Author

This is in line with the general QoS model. We might like to experiment with this model in OpenShift. /cc @derekwaynecarr
I am wondering how NodeAffinity can be tied to usage metrics, which will be needed to charge per usage.

Contributor

https://docs.google.com/a/google.com/document/d/1Lpuw-tjm252W4oGqBvcFAFUZkwFlpZ2i0EgpCwialN0/edit?disco=AAAAB_xFEyk

Without knowing for sure that real users will benefit from it, I don't see why we'd solve this problem.

Contributor

@adohe-zz Jul 1, 2018

I see your point, and I always think a more general resource API model would be better than label-based solutions. :)
👍

Member

if we are going to build a feature, we should have a clear user identified that we know will consume the feature to guide the use-case. i think we need to evaluate the feature relative to other ideas seen in the ecosystem.

for the similar use-case of "specify gpu attributes such as gpu type and memory requirements for deployment in heterogenous GPU clusters", nvidia appears to enable this by carrying two API fields on the pod spec.

see:
https://developer.nvidia.com/kubernetes-gpu
https://github.com/NVIDIA/kubernetes/blob/875873bec8f104dd87eea1ce123e4b81ff9691d7/pkg/apis/core/types.go#L2576
https://github.com/NVIDIA/kubernetes/blob/875873bec8f104dd87eea1ce123e4b81ff9691d7/pkg/apis/core/types.go#L2389

it would be good to evaluate resource class versus this other approach.

Contributor

I'd first like to understand if gpu type and memory requirements are a real user concern today in the first place even before considering possible solutions.
There are users who are sufficiently happy with using node selectors and most users today seem to bind pods to specific gpu types either for cost or specific memory requirements.

Contributor Author

@derekwaynecarr we have scalability concerns with the kubernetes-gpu approach. Selector computation would be done for each compute resource in the scheduler cache. With resource classes, on the other hand, there will be fewer resource classes than compute resources.
Another concern is portability.

Contributor

Thanks a lot for the pointers, @derekwaynecarr
As @vikaschoudhary16 mentioned, the main difference in NVIDIA's non-upstream solution is that it changes the current resource requirement API of the container spec. Our proposal explicitly lists this as a non-goal, for the following reasons:

  1. In a large cluster, evaluating operators like “greater than” and “less than” at pod creation can be very slow and does not scale. It can cause scaling issues on the scheduler side.
  2. Non-primary compute resources usually lack standard resource properties. Although there are benefits to letting users express resource metadata requirements directly in their container spec, doing so may compromise workload portability in the longer term.
  3. Resource quota control will become harder.
  4. We may consider the resource requirement API change as a possible extension, orthogonal to the ResourceClass proposal.

By introducing ResourceClass as an additional resource abstraction layer, users can express their special resource requirements through a high-level portable name, and cluster admins can configure compute resources appropriately in different environments to meet those requirements. We feel this promotes portability and separation of concerns while still maintaining API compatibility.

**How Resource classes can solve this:** I, operator/admin, create three resource classes: GPU-Platinum, GPU-Gold, GPU-Silver. Since resource classes are quota controlled, end users will be able to request a resource class only if quota is allocated.

- I want a mechanism where it is possible to offer a group of devices, which are co-located on a single node and share a common property, as a single resource that can be requested in the pod container spec. For example, N GPU units interconnected by NVLink or N CPU cores on the same NUMA node.<br/>
**Motivation:** Increased performance because of local reference. Local reference also helps better use of cache<br/>
Contributor

This use case is unclear to me. What does local reference mean?

Contributor

@vikaschoudhary16 s/local reference/local access?

Contributor Author

for example, "local" cache from the same NUMA node in case of cores.

Contributor

This intersects with topology awareness heavily. I think Resource Class (if it exists) should restrict itself to policy like allowing only certain shapes (2 GPU with max of 16 CPUs, ...). The topology aspect as currently planned is expected to be covered by QoS (or an additional application performance class API if necessary). Don't combine them both.

Contributor

What this proposal focuses on is a building block that allows guaranteed, metadata-aware resource scheduling by surfacing resource metadata to the scheduler. I think the kind of metadata people want to surface should be left to HW vendors, resource providers, or infrastructure admins, based on different HW properties, platform environments, and workload requirements. We can provide best-practice guidelines and scaling results so people can make the right decisions. Node-level, best-effort, topology-aware scheduling may allow better scaling, but I don't think we want to take an opinionated position here.

**How Resource classes can solve this:** Property/attribute which forms the grouping can be advertised in the device attributes and then a resource can be created to form a grouped super-resource based on that property.<br/>
**Can this be solved without resource classes:** No

- I want to have quota control on devices at the granularity of device properties. For example, I want to have a separate quota for ECC-enabled GPUs. I want to prevent a specific user from using more than ‘N’ ECC-enabled GPUs overall at the namespace level.<br/>
Contributor

ECC is probably not a good example. Device types might be more common.
I'd like to explicitly identify if the number of dimensions that we need to support for quota is one or N ideally based on concrete user feedback.

Contributor

It should be N. We want to provide a general framework to support different types of devices like GPU, high performance NIC, FPGA, etc.

Contributor Author

agree on Device type. Actually, ECC enabled GPU will be a different ComputeResource as mentioned in the sections below.

> I'd like to explicitly identify if the number of dimensions that we need to support for quota is one or N ideally based on concrete user feedback.

"Number of dimensions" is not clear to me. Could you explain?

Contributor

> It should be N. We want to provide a general framework to support different types of devices like GPU, high performance NIC, FPGA, etc.

I see the goal, but is there a real world use case for N? I only see the need for 1 dimension now.

- In my cluster, I have many different classes (different capabilities) of a device type (e.g. NICs). End users' expectations are met as long as a device has a very small subset of these capabilities. I want a mechanism where end users can request devices that satisfy their minimum expectations. A few nodes are connected to the data network over 40 Gig NICs and others over normal 1 Gig NICs. I want end-user pods to be able to request data network connectivity with high network performance, while by default data network connectivity is offered via normal 1 Gbps NICs.<br/>
Contributor

This feels like a niche use case. Why can't the existing labels+affinity features work for this use case?
Also, why not build policies to restrict access via admission plugins rather than adding a new core resource?

Contributor

I think if people need to build and deploy various admission plugins to restrict access on different HW with different properties, that indicates the need for a general framework to support that use case.

**Can this be solved without resource classes:** Taints and tolerations can help steer pods, but the problem is that there is no way today to have access control over the use of tolerations. Therefore, with multiple users, it is not possible to control which tolerations are allowed.<br/>
**How Resource classes can solve this:** I can define a ResourceClass for the high-performance NIC with minimum bandwidth requirements, and make sure only users with proper quota can use such resources.
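
As a rough illustration of the minimum-bandwidth idea above, here is a minimal Python sketch of evaluating a `GtEq` requirement against a device's advertised attributes. Only the `GtEq` operator and the `"30G"` value style come from the YAML fragment under review; the attribute key `bandwidth`, the `Requirement` type, the `LtEq`/`In` operators, and the simplified quantity parser are assumptions made for this example, not the proposed API.

```python
from dataclasses import dataclass

# Decimal suffixes only; real Kubernetes quantities support more forms.
_SUFFIXES = {"K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_quantity(text: str) -> int:
    """Parse a simplified quantity such as "30G" into an integer."""
    if text and text[-1] in _SUFFIXES:
        return int(text[:-1]) * _SUFFIXES[text[-1]]
    return int(text)


@dataclass(frozen=True)
class Requirement:
    key: str        # hypothetical attribute name, e.g. "bandwidth"
    operator: str   # "GtEq" (from the proposal), "LtEq" or "In" (assumed)
    values: tuple


def matches(attributes: dict, req: Requirement) -> bool:
    """Return True if the device attributes satisfy one requirement."""
    if req.key not in attributes:
        return False
    actual = attributes[req.key]
    if req.operator == "In":
        return actual in req.values
    threshold = parse_quantity(req.values[0])
    if req.operator == "GtEq":
        return parse_quantity(actual) >= threshold
    if req.operator == "LtEq":
        return parse_quantity(actual) <= threshold
    raise ValueError(f"unsupported operator: {req.operator}")


fast_nic = Requirement(key="bandwidth", operator="GtEq", values=("30G",))
print(matches({"bandwidth": "40G"}, fast_nic))  # True
print(matches({"bandwidth": "1G"}, fast_nic))   # False
```

With a class defined this way, a 40 Gig NIC qualifies and a 1 Gig NIC does not, so pods that name the class never see the slower devices.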

- I want to be able to utilize different 'types' of a HW resource without losing workload portability when moving from one cluster to another. There can be Nvidia GPUs on one cluster and AMD GPUs on another. This is an example of different ‘types’ of a HW resource (GPU). I want to offer GPUs to be consumed under the same portable name, as long as their capabilities are almost the same. If pods consume these GPUs through a generic resource class name, workloads can be migrated from one cluster to another transparently.<br/>
Contributor

This feels like a dream given the state of SW today based on my experience. For example, Tensorflow struggles working seamlessly across compute types (CPU, GPU, etc) and sub-architectures (Skylake, V100, AMD).
I feel we need to wait a bit for the world to evolve for this use case to become valid in k8s.

Contributor

For GPUs from different vendors, I agree their properties can currently be quite different, although I wonder whether the difference is less significant for certain workloads like video decoding. For high-performance NICs, I think user experience is perhaps less diversified. I also feel promoting portability is always a strong motivation on Kubernetes.

**Motivation:** I want minimum guaranteed compute performance<br/>
**Can this be solved without resource classes:**<br/>
- Yes, using node labels and NodeLabelSelectors.
Problem: Same problem of lack of access control on using labelselectors at user level as with the use of tolerations.
Contributor

Wouldn't quota take care of access control to an extent?

Contributor Author

Can you please elaborate on how quota can be used with labels?

Contributor

Yes. That is why we would like to introduce ResourceClass that fits naturally with resource quota.

Contributor

Member

@vishh if i understand the proposed alternative, it's basically treating it as an opaque resource by convention? the user still needs to couple the opaque resource consumption with the device consumption and that really can't be done until scheduling, no?

Contributor

@derekwaynecarr if we can assume 1-1 mapping between the opaque resource and actual device, then we don't have to be concerned with scheduling right?
I'm not sure if clobbering resource requests in a webhook is possible though.

- Yes, using node labels and NodeLabelSelectors.
Problem: Same problem of lack of access control on using labelselectors at user level as with the use of tolerations.
- OR, Instead of using resource class, provide flexibility to query resource properties directly in pod container resource requests.
Problem: In a large cluster, computing operators like “greater than”, “less than” at pod creation can be a very slow operation and is not scalable.
Contributor

What is the scale we are targeting? If generic scheduling features don't scale, then it's a problem that needs to be tackled separately.
cc @bsalamat

Contributor

Not sure I fully understand your comment here. What this paragraph means is that if we extend container resource request API to directly specify their metadata requirements, scheduler needs to do the label selection matching on all of the compute resources in the cluster. But with ResourceClass, scheduler can cache compute resource to ResourceClass matching in its NodeInfo cache, and so the current PodFitsNodeResource evaluation will mostly stay the same without introducing new scaling concerns.
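
To make the caching argument above concrete, here is a minimal Python sketch, assuming a hypothetical `NodeInfoCache` that evaluates ResourceClass selectors only when a node's device list changes, so the pod-time fit check reduces to counter comparisons. The class and method names, the selector-as-predicate shape, and the device attributes are all invented for illustration; the sketch also ignores allocation tracking and devices that match more than one class.

```python
from collections import Counter


class NodeInfoCache:
    """Hypothetical per-node cache of units available per ResourceClass name."""

    def __init__(self, class_selectors):
        # class_selectors: {class name: predicate over a device-attribute dict}
        self._selectors = class_selectors
        self._capacity = {}  # node name -> Counter(class name -> units)

    def update_node(self, node, devices):
        """Re-evaluate selectors once, when a node's device list changes."""
        counts = Counter()
        for attrs in devices:
            for name, selector in self._selectors.items():
                if selector(attrs):
                    counts[name] += 1
        self._capacity[node] = counts

    def fits(self, node, requests):
        """Pod-time check: counter comparisons only, no selector evaluation."""
        capacity = self._capacity.get(node, Counter())
        return all(capacity[name] >= qty for name, qty in requests.items())


cache = NodeInfoCache({
    "gpu-gold": lambda d: d.get("family") == "volta",
    "gpu-silver": lambda d: d.get("family") == "kepler",
})
cache.update_node("node-a", [
    {"family": "volta"}, {"family": "volta"}, {"family": "kepler"},
])
print(cache.fits("node-a", {"gpu-gold": 2}))  # True
print(cache.fits("node-a", {"gpu-gold": 3}))  # False
```

The selector cost is paid per device update rather than per pod, which is the scaling difference being described.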

Contributor

Ah got it. So this is not really a point justifying the need of better resource APIs. It is about the internal design of such a new API.

**How Resource classes can solve this:**
The Kubernetes scheduler is the central place to map container resource requests expressed through ResourceClass names to the underlying qualified physical resources, which automatically supports metadata aware resource scheduling.

- As a data scientist, I want my workloads to use advanced compute resources available in the running clusters without understanding the underlying hardware configuration details. I want the same workload to run on either on-prem Kubernetes clusters or on cloud, without changing its pod spec. When a new hardware driver comes out, I hope all the required resource configurations are handled properly by my cluster operators and things will just continue to work for any of my existing workloads.<br/>
Contributor

This use case seems too vague. TBH, the workloads that consume additional HW are specialized enough that it requires developer maintenance and cluster admins may not be able to homogenize different environments.

Contributor

I tend to disagree. I think a big value of Kubernetes is that it allows separation of concerns: application developers can focus on their own software, with the underlying infrastructure taken care of by cluster admins.

Contributor Author

@vishh quoting Henry from Ebay in the support of workload portability:

Hi Vish, Bobby, it is not exactly this requirement. Currently we ask developers to submit resource specifications for GPUs to our data center using the names of the cards:

"accelerator": {
  "type": "gpu",
  "quantity": "1",
  "labels": {
    "product": "nvidia",
    "family": "tesla",
    "model": "m40"
  }
}

But when we go to other cloud such as Google or AWS they may not have the same cards.

So I was wondering if we could offer resource such as CUDA cores and memory as resource specifications rather actual name and type of the cards.

However, different cards such as AMD vs NVIDIA was not the goal because we know program code against NVIDIA cards will not work well if run with AMD cards.

Contributor

I'll wait for Henry to respond to that thread. I think you all should solicit feedback from real users (I'm thinking ML WG, SIG BIG Data, etc.) to figure out if this is really feasible. No user that I have spoken to is ready to consume this level of sophistication today.

Contributor

I think we are proposing an infrastructure building block here whose target users are infrastructure admins and developers who want to make their systems easier to use by hiding the underlying hardware details from end users.

> As a data scientist, I want my workloads to use advanced compute resources available in the running clusters without understanding the underlying hardware configuration details.

Yes and no. But mostly no. As mentioned by @vishh, this is too vague, and a lot of what this statement covers is outside of K8s' scope.

  • It might make sense when your users only request one GPU (but in that case, is ECC on/off a HW config?)
  • When a user requests more than one GPU, they should at the very least be able to specify whether the GPUs are linked through NVLINK

> I want the same workload to run on either on-prem Kubernetes clusters or on cloud, without changing its pod spec.

What's blocking this today or what might be blocking this in the future?

> When a new hardware driver comes out, I hope all the required resource configurations are handled properly by my cluster operators and things will just continue to work for any of my existing workloads.

How is this in the scope of Resource Classes?

- I want an easy and extensible mechanism to export my resource to Kubernetes. I want to be able to roll out new hardware features to the users who require those features without breaking users who are using old versions of hardware.<br/>
**Motivation:** enables more compute resources and their advanced features on Kubernetes<br/>
**Can this be solved without resource classes:**<br/>
Yes, Using node labels and NodeLabelSelectors.<br/>
Contributor

Again seems pretty vague. Is this a real need today? The example mentioned below also doesn't seem realistic.
Are there use cases outside of GPUs?

Contributor

This user story is mostly motivated by some past discussions on device plugin feature requests, e.g., as @kad mentioned in kubernetes/kubernetes#59109 (comment)
I like the explicit API model: once ResourceClass is in place, the Kubelet can pass the ResourceClass name to a device plugin, and the device plugin can map that ResourceClass name to the special underlying resource metadata requirements.

Contributor Author

Another example could be different types of CPU cores, i.e., isolated cores, NUMA-affined cores, hyperthreaded cores, etc.

Contributor

For kubernetes/kubernetes#59109 (comment) wouldn't Pod annotations suffice?

> Another example could be different types of CPU cores, i.e., isolated cores, NUMA-affined cores, hyperthreaded cores, etc.

I don't think we are decided yet on what level of CPU specifics we want to expose to users.

Contributor Author

@vikaschoudhary16 Jun 29, 2018

After cpu manager's static policy, IMO, supporting features like isolated cores is a natural progression.
/cc @jeremyeder @kad

Member

i think this feature is orthogonal to cpus and confuses the discussion.

authors:
- "@vikaschoudhary16"
- "@jiayingz"
owning-sig: sig-node
Contributor

I'd add sig-scheduling as well since there is quite some impact to scheduling, quota, etc.

Contributor

Looking at the KEP template and existing ones, I think it expects one owning sig. But agree a big part of this proposal is on scheduler side, and we should add it as participating-sigs.

Contributor Author

done

@ConnorDoyle
Contributor

/cc

@vikaschoudhary16 vikaschoudhary16 force-pushed the kep-resource-api-proposal branch 2 times, most recently from 22a9847 to bbf8f0c Compare June 28, 2018 11:07
@vikaschoudhary16
Contributor Author

@vishh

> Other than per-device quota, I feel most other use cases are ahead of their time.

Workload portability is a feedback driven real use-case.

@jiayingz
Contributor

@k8s-ci-robot k8s-ci-robot requested a review from kad June 29, 2018 06:07
@k8s-ci-robot
Contributor

@jiayingz: GitHub didn't allow me to request PR reviews from the following users: hsaputra, bart0sh, fabiand.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

> /cc @RenaudWasTaken @hsaputra @kad @bart0sh @fabiand

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jiayingz
Contributor

jiayingz commented Jul 30, 2018 via email

scenario as new resource properties are introduced into the system. Therefore we
support this behavior by default. To also provide an easy way for cluster admins
to reserve expensive compute resources and control their access with resource
quota, we propose to include a Priority field in ResourceClass API.
Contributor

Could we clarify the use cases that require non-overlapping resource classes?

  1. It seems this effect is achievable anyway if cluster admins design their resource class specs properly.
  2. With the described priority mechanism, to answer the question "why doesn't resource X on node Y match class Z?" users potentially have to inspect every resource class.
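
For readers following the priority discussion, here is a minimal Python sketch of one possible reading of the mechanism: each device is assigned to the single highest-priority ResourceClass whose selector matches it. The KEP text quoted here does not spell out the exact semantics, so the `priority` ordering, the tie-breaking, and all class and attribute names below are assumptions for illustration only.

```python
def assign_class(device, classes):
    """Return the name of the highest-priority matching class, or None.

    Assumes a larger `priority` value wins; ties go to the class listed first.
    """
    matching = [c for c in classes if c["selector"](device)]
    if not matching:
        return None
    return max(matching, key=lambda c: c["priority"])["name"]


# Hypothetical classes: a catch-all GPU class and a reserved, expensive one.
classes = [
    {"name": "gpu", "priority": 0,
     "selector": lambda d: d.get("type") == "gpu"},
    {"name": "gpu-platinum", "priority": 10,
     "selector": lambda d: d.get("type") == "gpu" and d.get("family") == "volta"},
]

print(assign_class({"type": "gpu", "family": "volta"}, classes))   # gpu-platinum
print(assign_class({"type": "gpu", "family": "kepler"}, classes))  # gpu
print(assign_class({"type": "nic"}, classes))                      # None
```

Under this reading, answering "why doesn't resource X match class Z?" does require looking at every class with a higher priority than Z, which is the debuggability concern raised in point 2.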


Possible fields we may consider to add later include:
- `DeviceUnits resource.Quantity`. This field can be used to support fractional
resource or infinite resource. In a more advanced use case, a device plugin may

Infinite resources might come in handy when a "default" non-countable resource should be assigned to a container. This could be used, for example, to make a DP set a node-specific environment variable in a container.

@RobertKrawitz
Contributor

Another use for infinite resources would be for metrics, if it's desired to know how much of a resource is being used but without intent of imposing a limit. For (non-)random example (not specifically applicable to this), using filesystem quotas to measure storage use by setting an effectively infinite quota.

**How Resource classes can solve this:**<br/>
Vendors can use the DevicePlugin API to propagate new hardware features and provide a best-practice ResourceClass spec for consuming their new hardware or new hardware features on Kubernetes. Vendors don’t need to worry that supporting this new hardware would break existing use cases on old hardware, because the Kubernetes scheduler takes the resource metadata into account during pod scheduling, so only pods that explicitly request this new hardware through the corresponding ResourceClass name will be allocated such resources.<br/>

- I want a mechanism where it is possible to offer a group of devices, which are co-located on a single node and share a common property, as a single resource that can be requested in pod container spec. Example, N GPU units interconnected by NVLink or N cpu cores on same NUMA node.<br/>

This could very well be the frontend to the previously discussed CPU pool concept. It would just be configuration that sets the property (such as "these cpu cores belong to the AVX pool") rather than any physical device property (such as the shared NUMA node). I think we need to keep the primary resources in mind for this proposal too, even if they are not part of the scope yet.

@k8s-ci-robot
Contributor

@vikaschoudhary16: The following test failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-community-verify | 55ecd0a | link | `/test pull-community-verify` |


**Motivation:** Empower enterprise customers to consume and manage non-primary resources easily, similar to how they consume and manage primary resources today.<br/>
**Can this be solved without resource classes:** Without ResourceClass, people would rely on `NodeLabels`, `NodeAffinity`, `Taints`, and `Tolerations` to steer workloads to the appropriate nodes, or build their own [non-upstream solutions](https://github.com/NVIDIA/kubernetes/blob/875873bec8f104dd87eea1ce123e4b81ff9691d7/pkg/apis/core/types.go#L2576) to let users specify their resource-specific metadata requirements. Workloads would have a different experience consuming non-primary compute resources on k8s. As time goes on and more non-upstream solutions are deployed, the user experience becomes fragmented across different environments. Furthermore, `NodeLabels` and `Taints` were designed as node-level properties. They can't support multiple types of compute resources on a single node, and they don't integrate well with resource quota. Even with the recent [Pod Scheduling Policy proposal](https://github.com/kubernetes/community/pull/1937), cluster admins can either allow or deny pods in a namespace to specify a `NodeAffinity` or `Toleration`, but cannot assign different quota to different namespaces.<br/>
**How Resource classes can solve this:** I, operator/admin, create different ResourceClasses for different types of GPUs. User workloads can request different types of GPUs in their `ContainerSpec` resource requests/limits through the corresponding ResourceClass name, in the same way as they request primary resources. Because resource classes are quota controlled, end users will be able to consume the requested GPUs only if they have enough quota.<br/>
**Similar use case for network devices:** A cluster can have different types of high-performance NICs and/or InfiniBand cards, with different performance and cost. E.g., some nodes may have 40 Gig high-performance NICs and some may have 10 Gig high-performance NICs. Some devices may support RDMA and some may not. Different workloads may want to use different types of high-performance network devices depending on their performance and cost tradeoff.<br/>

The 'devicey-ness' of NICs is usually not the only consideration. Nobody cares about a hardware NIC without caring about what it's connected to and the services it's being provided by that connection. Or to put it more clearly: the characteristics of the NIC itself are only a small part of the puzzle for network devices. :)

Contributor Author

Agreed, and you are allowed to use any attributes you want to characterize a NIC :).

@vikaschoudhary16 Yep, I like the attributes approach. I'm still trying to understand how attributes of different 'types' are handled. Perhaps a more realistic example for network devices could be added here?

- key: "speed"
  operator: "Gt"
  values:
  - "40GBPS"

Question to test my own comprehension here. Could one reasonably have:

- key: "networkservice"
  operator: "Eq"
  values:
  - "radio-network"

Contributor Author

Yes, the key can be any attribute name advertised by the device plugin.

@vikaschoudhary16 OK, so what entity has to understand that "radio-network" is of type string, instead of type network bandwidth, or other type?

This way we could separate the physical NIC resource from the logical network "red", correct?

My only concern is that we might have a rather long list of properties for certain virtual/overlay network cases, where there are tens of thousands of networks.

But that might be so special that it's solved differently.

- key: "speed"
  operator: "Gt"
  values:
  - "40GBPS"

Also... "40GBPS" is not a simple number, how do you propose handling units wrt comparisons? What entity has to understand the units?

Contributor Author

It will be similar to how millicore units and memory units are already handled in existing code.

@vikaschoudhary16 Right... but some entity has to understand the units. I imagine there are many many kinds of units one might support. What entity has to understand these units? Effectively we are implicitly introducing a 'type' here by adding units... where before in the Device Plugin API we had an int. I'm just curious how new 'types' get added... and how we handle collisions of unit abbreviations.

You are right, the scheduler or constraint solver will need to be aware of the units (any unit used) in order to be able to interpret the matchExpression.
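
To make the unit question concrete, here is a minimal Go sketch of what a unit-aware `Gt` comparison might look like. It is only an illustration under stated assumptions: the suffix table (`KBPS`, `MBPS`, `GBPS`) and the helper names `parseBandwidth` and `greaterThan` are hypothetical and are not defined by the proposal or by any Kubernetes API.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// unitScale maps a hypothetical set of bandwidth suffixes to bits per second.
// The table is an assumption for illustration; the proposal does not define one.
var unitScale = map[string]float64{
	"KBPS": 1e3,
	"MBPS": 1e6,
	"GBPS": 1e9,
}

// parseBandwidth splits a value such as "40GBPS" into a number and a known
// suffix, and returns the value in bits per second.
func parseBandwidth(s string) (float64, error) {
	s = strings.ToUpper(strings.TrimSpace(s))
	for suffix, scale := range unitScale {
		if strings.HasSuffix(s, suffix) {
			n, err := strconv.ParseFloat(strings.TrimSuffix(s, suffix), 64)
			if err != nil {
				return 0, fmt.Errorf("bad number in %q: %w", s, err)
			}
			return n * scale, nil
		}
	}
	return 0, fmt.Errorf("unknown unit in %q", s)
}

// greaterThan evaluates a "Gt" expression on two unit-carrying strings by
// normalizing both sides before comparing them.
func greaterThan(attr, threshold string) (bool, error) {
	a, err := parseBandwidth(attr)
	if err != nil {
		return false, err
	}
	t, err := parseBandwidth(threshold)
	if err != nil {
		return false, err
	}
	return a > t, nil
}

func main() {
	// A plain string comparison would get this wrong ("100..." sorts before "40...").
	ok, err := greaterThan("100GBPS", "40GBPS")
	fmt.Println(ok, err)
}
```

Whichever entity owns a table like `unitScale` is the one that has to understand the units, which is where the questions above about adding new types and about colliding abbreviations would need an answer.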

@booxter

booxter commented Aug 28, 2018

I am trying to understand how the proposal fits scenarios with network resources (NICs), and I have some comments / questions below.

  1. A node (or device) may not be universally connectable to any data center network. Let's say a NIC is able to connect to a "red" network but not a "blue" one. In this case, I would expect the device to be tagged with an attribute that describes its adjacency to the "red" network. The RC object would then refer to the attribute (perhaps of list type, using the In operator) to pick a particular device with the expected adjacency. Is that how it's supposed to be used? If so, there may be a problem with what seems to be a push against overlapping resource classes. AFAIU every NIC device will be tagged with adjacency, and some NIC devices may also have additional characteristics that separate them from the broader "connected-to-red-network" class. (Perhaps it's performance characteristics, whether formulated in terms of bandwidth or as "golden" or "silver" classes.) If overlapping RCs are not supported, then one can't have both a generic "red" RC and a high-performing "speedy-red" RC, which reduces the usefulness of the feature for NIC resource classification. (I imagine that the issue is not as pressing for other classes of devices, which may not have a universal, nearly mandatory attribute to select on.)

  2. For NICs, perhaps the most important quantifiable characteristic will be bandwidth. For devices that are directly backed by a hardware entity (like SR-IOV PF), the current proposal seems to work fine since each entity has a limited and non-shareable bandwidth. But for devices that have no 1-to-1 mapping to a physical entity (let's say it's OVS DP that connects multiple virtual devices to a single physical NIC), bandwidth is a shared resource that describes the total bandwidth of all virtual devices connected through the NIC. One could model their devices by splitting the total bandwidth between a limited number of devices (e.g. if you have a total of 10Gbps for a NIC, you create 10 virtual devices with 1Gbps each), but that won't fit a case where someone needs a single virtual device with 5Gbps allocated to it. (Again, this probably doesn't affect traditional cases like GPUs where there is a clear 1-to-1 mapping between a pod and a device.) Is this scenario being looked into in the scope of this proposal? If not, are there plans to consider it in future work?

Thanks in advance for answers, and thanks for working on the proposal.
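
To illustrate the overlap concern in point 1, here is a rough Go sketch in which one device qualifies for both a generic "red" class and a narrower "speedy-red" class. Everything in it is an assumption made for illustration: the `requirement` and `resourceClass` types, the attribute keys `network` and `tier`, and the choice to handle only the `In` operator are hypothetical and do not reflect the proposal's actual API.

```go
package main

import "fmt"

// requirement is a hypothetical, simplified matchExpression.
type requirement struct {
	key      string
	operator string // only "In" is handled in this sketch
	values   []string
}

// resourceClass is a hypothetical, simplified ResourceClass.
type resourceClass struct {
	name         string
	requirements []requirement
}

// contains reports whether s is present in list.
func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// matches reports whether the device attributes satisfy every requirement.
// Unsupported operators and missing attributes are treated as a non-match.
func (rc resourceClass) matches(attrs map[string]string) bool {
	for _, r := range rc.requirements {
		if r.operator != "In" {
			return false
		}
		got, ok := attrs[r.key]
		if !ok || !contains(r.values, got) {
			return false
		}
	}
	return true
}

// matchingClasses returns the names of all classes a device qualifies for,
// in the order the classes were given.
func matchingClasses(classes []resourceClass, attrs map[string]string) []string {
	var names []string
	for _, rc := range classes {
		if rc.matches(attrs) {
			names = append(names, rc.name)
		}
	}
	return names
}

func main() {
	classes := []resourceClass{
		{name: "red", requirements: []requirement{
			{key: "network", operator: "In", values: []string{"red"}},
		}},
		{name: "speedy-red", requirements: []requirement{
			{key: "network", operator: "In", values: []string{"red"}},
			{key: "tier", operator: "In", values: []string{"gold"}},
		}},
	}
	nic := map[string]string{"network": "red", "tier": "gold"}
	// The same NIC lands in both classes, which is the overlapping case.
	fmt.Println(matchingClasses(classes, nic))
}
```

If overlapping classes are disallowed, a device like `nic` would have to be assigned to exactly one of the two names, which is the limitation described above.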

@fabiand fabiand left a comment

Nice proposal. Just my 2ct added.

spec:
  resourceName: "nvidia.com/gpu"
  resourceSelector:
  - matchExpressions:

Are these the attributes exposed from the device plugin (DP) to the Kubelet?

Possible fields we may consider to add later include:
- `AutoProvisionConfig`. This field can be used to specify resource auto provisioning config in different cloud environments.
- `Scope`. Indicate whether it maps to node level resource or cluster level resource. For cluster level resource, scheduler, Kubelet, and cluster autoscaler can skip the PodFitsResources predicate evaluation. This allows consistent resource predicate evaluation among these components.
- `ResourceRequestParameters`. This field can be used to indicate special resource request parameters that device plugins may need in order to perform special configuration on their devices before they are consumed by workload pods requesting this resource.

This would be nice. Parameters to requests. But IIUIC it was explicitly excluded from this proposal, correct?

- `Scope`. Indicate whether it maps to node level resource or cluster level resource. For cluster level resource, scheduler, Kubelet, and cluster autoscaler can skip the PodFitsResources predicate evaluation. This allows consistent resource predicate evaluation among these components.
- `ResourceRequestParameters`. This field can be used to indicate special resource request parameters that device plugins may need in order to perform special configuration on their devices before they are consumed by workload pods requesting this resource.

Note we intentionally leave these fields out of the initial design to limit the scope

… Ah yes, to answer myself.

- Another option is that Kubelet can evict the pods that are allocated with a non-existing ComputeResource. Although simple, this approach may disturb long-running workloads during device plugin upgrade.
- To support a less disruptive model, upon a resource property change, the Kubelet can keep exporting capacity under the old ComputeResource name for devices used by active pods, and export capacity under the new matching ComputeResource name for devices not in use. A node finishes its transition only when those pods finish running. This approach avoids counting resources multiple times and simplifies the scheduler's resource accounting. One potential downside is that the transition may take quite a long time if long-running pods are using the resource on the node. In that case, cluster admins can still drain the node at a convenient time to speed up the transition. Note that this approach does add some code complexity to the Kubelet DeviceManager component.
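
The less disruptive option can be sketched roughly as follows. This is a hedged illustration, not the Kubelet's DeviceManager code: the `device` type, its fields, the function `exportCapacity`, and the names `gpu-v1` and `gpu-v2` are all hypothetical.

```go
package main

import "fmt"

// device is a hypothetical, simplified view of a device tracked by the Kubelet.
type device struct {
	id      string
	oldName string // ComputeResource name before the property change
	newName string // ComputeResource name matching the new properties
	inUse   bool   // true while an active pod holds the device
}

// exportCapacity counts each device exactly once: under its old name while a
// pod is still using it, and under its new name otherwise. Counting each
// device once is what avoids multiple counting during the transition.
func exportCapacity(devs []device) map[string]int {
	capacity := make(map[string]int)
	for _, d := range devs {
		if d.inUse {
			capacity[d.oldName]++
		} else {
			capacity[d.newName]++
		}
	}
	return capacity
}

func main() {
	devs := []device{
		{id: "gpu0", oldName: "gpu-v1", newName: "gpu-v2", inUse: true},
		{id: "gpu1", oldName: "gpu-v1", newName: "gpu-v2", inUse: false},
		{id: "gpu2", oldName: "gpu-v1", newName: "gpu-v2", inUse: false},
	}
	fmt.Println(exportCapacity(devs))
}
```

In this sketch the node finishes its transition once no device has `inUse` set, at which point nothing is exported under the old name anymore.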

We propose to start with the first option, i.e., device property change requires

+1

```

On the other hand, a cluster admin may want to allow pods requesting nvidia-p100 to use ECC P100 GPUs if they are idle, but rely on scheduler preemption to re-assign those devices to higher-priority pods requesting nvidia-p100-ecc. Such use cases require scheduler support for matching a ComputeResource to multiple qualified ResourceClasses.
We feel this model

There are too many line breaks in the next three lines.

@fabiand

fabiand commented Sep 5, 2018

Is there also a KEP to track gRPC changes to support additional device types better (i.e. NICs)?

@justaugustus
Member

/kind kep

@justaugustus
Member

REMINDER: KEPs are moving to k/enhancements on November 30. Please attempt to merge this KEP before then to signal consensus.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.

@justaugustus
Member

KEPs have moved to k/enhancements.
This PR will be closed and any additional changes to this KEP should be submitted to k/enhancements.
For more details on this change, review this thread.

Any questions regarding this move should be directed to that thread and not asked on GitHub.
/close

@k8s-ci-robot
Contributor

@justaugustus: Closed this PR.

@angao
Contributor

angao commented Jul 25, 2019

I want to know the current status of this proposal; I did not find it in the k/enhancements repo. We have some requirements that are similar to this one. We would like to know whether Kubernetes plans to implement device-based scheduling, instead of having devicemanager randomly select devices for a Pod. For example, we want to schedule based on GPU models.
/reopen @k82cn @vishh

@k8s-ci-robot
Contributor

@angao: You can't reopen an issue/PR unless you authored it or you are a collaborator.

@jiayingz
Contributor

@angao this proposal is currently put on hold. I am not aware of any plan to implement device-based scheduling in k8s.

@angao
Contributor

angao commented Jul 26, 2019

@jiayingz thx. I wonder if we can try to implement device-specific allocation by modifying the devicemanager interface instead of the current random allocation. Other implementations such as API can be implemented via CRD. This way, we can maybe alleviate such problems.

@jiayingz
Contributor

@angao there has been some effort on extending device plugin API for topology aware scheduling, but not sure whether this is something you are looking for.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.