allow specifying arbitrary resource capacity via aws asg tags #597

Closed · wants to merge 2 commits
Conversation

peter-bitfusion

We're starting to work with device plugins and want to be able to scale node groups up from 0 when an unschedulable pod requests custom resources provided by a device plugin.

For example, my pod requests:

resources:
  requests:
    acme.inc/widget: 5

This changeset allows adding AWS ASG tags to specify additional resource capacity, using an autoscaler prefix similar to how node labels are specified. Example tags:

Key: k8s.io/cluster-autoscaler/node-template/resource-capacity/acme.inc/widget
Value: 5
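
A minimal sketch of the tag-parsing side of this approach, assuming the prefix above; the package, function, and constant names here are illustrative, not the exact code in this changeset:

package aws

import (
	"strings"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

const resourceCapacityTagPrefix = "k8s.io/cluster-autoscaler/node-template/resource-capacity/"

// extractResourceCapacityFromTags turns ASG tags such as
// k8s.io/cluster-autoscaler/node-template/resource-capacity/acme.inc/widget = 5
// into extended-resource capacity on the simulated node template used for scale-up from 0.
func extractResourceCapacityFromTags(tags map[string]string) apiv1.ResourceList {
	capacity := apiv1.ResourceList{}
	for key, value := range tags {
		if !strings.HasPrefix(key, resourceCapacityTagPrefix) {
			continue
		}
		resourceName := strings.TrimPrefix(key, resourceCapacityTagPrefix)
		quantity, err := resource.ParseQuantity(value)
		if err != nil {
			continue // ignore tags whose value is not a valid quantity
		}
		capacity[apiv1.ResourceName(resourceName)] = quantity
	}
	return capacity
}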

Is this an acceptable approach?

Making the PR against master. Please advise if master isn't the right branch.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 31, 2018
@mwielgus
Contributor

mwielgus commented Feb 1, 2018

cc: @sethpollack @mumoshu

@MaciekPytel
Contributor

I suspect this may run into the same issue as kubernetes/kubernetes#54959.

The problem is that when a new node is added, CA knows it is 'upcoming' and it knows some pods will go on that node once it boots up, so it won't trigger another scale-up for those pods. So far all is good. Enter device plugins. Unless something changed in the last few weeks, the plugin only starts after the node becomes Ready, and the extra resource required by the pod shows up in the node status even later (once the plugin is done). By that time CA will see that it was wrong and that the pods for which it triggered the scale-up can't actually go to the node it added for them (because it doesn't have the special resource yet and CA doesn't know it will be added later). So CA will go and create more nodes.

I added a hack for handling this specifically for GPUs on GKE (#461). However, this hack relies on CA being able to identify the nodes with GPUs by knowing some labels that GKE adds to those nodes. This is by no means a generic solution. We need to talk with the owners of device plugins and figure out a proper fix before device plugins go beta.
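
For context, a rough sketch of the kind of label-based check that GKE-specific workaround relies on; the label and resource names are the ones GKE and the NVIDIA device plugin commonly use, and the function name is made up, so treat this as an illustration rather than the actual #461 code:

package gke

import apiv1 "k8s.io/api/core/v1"

// nodeHasPendingGpus illustrates the workaround: a node that carries the GKE
// accelerator label but does not yet advertise any nvidia.com/gpu allocatable
// is treated as still not ready, so CA keeps waiting instead of adding more nodes.
func nodeHasPendingGpus(node *apiv1.Node) bool {
	if _, hasLabel := node.Labels["cloud.google.com/gke-accelerator"]; !hasLabel {
		return false
	}
	gpus, found := node.Status.Allocatable["nvidia.com/gpu"]
	return !found || gpus.IsZero()
}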

cc: @jiayingz @vishh

@jiayingz

jiayingz commented Feb 1, 2018

I am not sure how the AWS CA works, but taking a quick look at the PR, it seems you directly update the node status resource capacity to reflect the upcoming availability of the resource to be exported by the device plugin. This should allow the pod to be scheduled on the new node. However, there will be a time gap before your device plugin establishes communication with the Kubelet so that the Kubelet can make the proper device plugin calls during container start. Once the communication is established, the Kubelet will overwrite the resource capacity/allocatable based on the actual number of devices advertised by the device plugin.

How long this time gap is will depend on how your device plugin is deployed on the node and what setup operations it needs to perform to make the "acme.inc/widget" resource consumable on the node. In the GKE GPU case this takes a non-negligible amount of time, which is why @MaciekPytel needed to add some special handling for this gap. I agree with @MaciekPytel that the current solution is not ideal and we should think about a general way to handle this resource-exporting time gap.

@peter-bitfusion
Author

@jiayingz @MaciekPytel I think I understand why extra nodes are scaled up.
It sounds like it would be helpful if nodes could be marked as NotReady until all their device plugins are Ready. However, then you couldn't schedule pods that only need devices from plugin X until plugin Y's devices were also available. It would be helpful if nodes exposed which device plugins are present and their statuses. CA could then use that info to consider a node not ready yet only for pods that require resources provided by device plugins that are present on that node but not ready. Perhaps CA could infer that information by looking at DaemonSets that should be scheduled on a node? It would have to somehow know whether a DaemonSet represents a device plugin, though.

@MaciekPytel
Contributor

Yup, what we need is for the device plugin to provide some information about what extra resources will show up. We can't get that from DaemonSets unless there is some standardized field or annotation we could look at. And even if there were, how would we know how much of a given resource is on the node (e.g. how many GPUs)?

There are many ways this information could be exposed (for example: set capacity but not allocatable until the resource is ready, have a CRD for each custom resource describing how to recognize a node with that resource, or, worst case, some standardized annotations). However, until we come up with something like that, I don't think we can fix this for the generic use case in CA.
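
To illustrate the first of those options (capacity set early, allocatable populated only once the plugin is ready), a hedged sketch of what a generic readiness check could look like; the function name and the "name contains a slash" heuristic for extended resources are assumptions, not existing CA code:

package clusterstate

import (
	"strings"

	apiv1 "k8s.io/api/core/v1"
)

// pendingExtendedResources returns extended resources (fully-qualified names with
// a domain) that the node already lists in capacity but not yet in allocatable,
// i.e. resources CA could still treat as "coming up" for pods that request them.
func pendingExtendedResources(node *apiv1.Node) []apiv1.ResourceName {
	var pending []apiv1.ResourceName
	for name, capQty := range node.Status.Capacity {
		if !strings.Contains(string(name), "/") || capQty.IsZero() {
			continue // skip core resources (cpu, memory, pods, ...) and zero capacity
		}
		if alloc, ok := node.Status.Allocatable[name]; !ok || alloc.IsZero() {
			pending = append(pending, name)
		}
	}
	return pending
}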

@jiayingz

jiayingz commented Feb 2, 2018

/cc @bsalamat @dchen1107 @thockin

Agreed, we need some way for CA to know how it should configure the new node to satisfy the "extended-resource" requests from the pending pod, and also to be able to tell what resources will show up on the newly created node and how long it should wait. As @MaciekPytel mentioned, there could be different ways to expose this information. I cc-ed some folks who may be interested in this discussion. Right now we probably don't have a good solution for this problem, and CA probably needs to use some kind of timer to deal with this time gap.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 3, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 2, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
