allow specifying arbitrary resource capacity via aws asg tags #597

Closed · wants to merge 2 commits
Conversation

peter-bitfusion

We're starting to work with device plugins and want to be able to scale node groups up from 0 when an unschedulable pod requests custom resources provided by a device plugin.

For example, my pod requests:

resources:
  requests:
    acme.inc/widget: 5

This changeset allows adding AWS ASG tags to specify additional resource capacity, using an autoscaler prefix similar to how node labels are specified. Example tags:

Key: k8s.io/cluster-autoscaler/node-template/resource-capacity/acme.inc/widget
Value: 5
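
A minimal sketch of the tag-parsing side of this approach, assuming the prefix above; the package, function, and constant names here are illustrative, not the exact code in this changeset:

package aws

import (
	"strings"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

const resourceCapacityTagPrefix = "k8s.io/cluster-autoscaler/node-template/resource-capacity/"

// extractResourceCapacityFromTags turns ASG tags such as
// k8s.io/cluster-autoscaler/node-template/resource-capacity/acme.inc/widget = 5
// into extended-resource capacity on the simulated node template used for scale-up from 0.
func extractResourceCapacityFromTags(tags map[string]string) apiv1.ResourceList {
	capacity := apiv1.ResourceList{}
	for key, value := range tags {
		if !strings.HasPrefix(key, resourceCapacityTagPrefix) {
			continue
		}
		resourceName := strings.TrimPrefix(key, resourceCapacityTagPrefix)
		quantity, err := resource.ParseQuantity(value)
		if err != nil {
			continue // ignore tags whose value is not a valid quantity
		}
		capacity[apiv1.ResourceName(resourceName)] = quantity
	}
	return capacity
}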

Is this an acceptable approach?

Making the PR against master. Please advise if master isn't the right branch.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 31, 2018
@mwielgus
Contributor

mwielgus commented Feb 1, 2018

cc: @sethpollack @mumoshu

@MaciekPytel
Contributor

I suspect this may run into the same issue as kubernetes/kubernetes#54959.

The problem is that when a new node is added, CA knows it is 'upcoming' and it knows some pods will go on that node once it boots up, so it won't trigger another scale-up for those pods. So far all is good. Enter device plugins. Unless something changed in the last few weeks, the plugin only starts after the node becomes Ready, and the extra resource required by the pod shows up in the node status even later (once the plugin is done). By that time CA will see that it was wrong and that the pods for which it triggered the scale-up can't actually go to the node it added for them (because it doesn't have the special resource yet and CA doesn't know it will be added later). So CA will go and create more nodes.

I added a hack for handling this specifically for GPUs on GKE (#461). However, this hack relies on CA being able to identify the nodes with GPUs by knowing some labels that GKE adds to those nodes. This is by no means a generic solution. We need to talk with the owners of device plugins and figure out a proper fix before device plugins go beta.
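
For context, a rough sketch of the kind of label-based check that GKE-specific workaround relies on; the label and resource names are the ones GKE and the NVIDIA device plugin commonly use, and the function name is made up, so treat this as an illustration rather than the actual #461 code:

package gke

import apiv1 "k8s.io/api/core/v1"

// nodeHasPendingGpus illustrates the workaround: a node that carries the GKE
// accelerator label but does not yet advertise any nvidia.com/gpu allocatable
// is treated as still not ready, so CA keeps waiting instead of adding more nodes.
func nodeHasPendingGpus(node *apiv1.Node) bool {
	if _, hasLabel := node.Labels["cloud.google.com/gke-accelerator"]; !hasLabel {
		return false
	}
	gpus, found := node.Status.Allocatable["nvidia.com/gpu"]
	return !found || gpus.IsZero()
}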

cc: @jiayingz @vishh

@jiayingz

jiayingz commented Feb 1, 2018

I am not sure how the AWS CA works, but taking a quick look at the PR, it seems you directly update the node status resource capacity to reflect the upcoming availability of the resource to be exported by the device plugin. This should allow the pod to be scheduled on the new node. However, there will be a time gap before your device plugin establishes communication with the Kubelet so that the Kubelet can make the proper device plugin calls during container start. Once the communication is established, the Kubelet will overwrite the resource capacity/allocatable based on the actual number of devices advertised by the device plugin.

How long this time gap is will depend on how your device plugin is deployed on the node and what setup operations it needs to perform to make the "acme.inc/widget" resource consumable on the node. In the GKE GPU case this takes a non-negligible amount of time, which is why @MaciekPytel needed to add some special handling for this gap. I agree with @MaciekPytel that the current solution is not ideal and we should think about a general way to handle this resource-exporting time gap.

@peter-bitfusion
Author

@jiayingz @MaciekPytel I think I understand why extra nodes are scaled up.
It sounds like it would be helpful if nodes could be marked as NotReady until all their device plugins are Ready. However, then you couldn't schedule pods that only need devices from plugin X until plugin Y's devices were also available. It would be helpful if nodes exposed which device plugins are present and their statuses. CA could then use that info to consider a node not ready yet only for pods that require resources provided by device plugins that are present on that node but not ready. Perhaps CA could infer that information by looking at DaemonSets that should be scheduled on a node? It would have to somehow know whether a DaemonSet represents a device plugin, though.

@MaciekPytel
Contributor

Yup, what we need is for the device plugin to provide some information about what extra resources will show up. We can't get that from DaemonSets unless there is some standardized field or annotation we could look at. And even if there were, how would we know how much of a given resource is on the node (e.g. how many GPUs)?

There are many ways this information could be exposed (for example: set capacity but not allocatable until the resource is ready, have a CRD for each custom resource describing how to recognize a node with that resource, or, worst case, some standardized annotations). However, until we come up with something like that, I don't think we can fix this for the generic use case in CA.
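
To illustrate the first of those options (capacity set early, allocatable populated only once the plugin is ready), a hedged sketch of what a generic readiness check could look like; the function name and the "name contains a slash" heuristic for extended resources are assumptions, not existing CA code:

package clusterstate

import (
	"strings"

	apiv1 "k8s.io/api/core/v1"
)

// pendingExtendedResources returns extended resources (fully-qualified names with
// a domain) that the node already lists in capacity but not yet in allocatable,
// i.e. resources CA could still treat as "coming up" for pods that request them.
func pendingExtendedResources(node *apiv1.Node) []apiv1.ResourceName {
	var pending []apiv1.ResourceName
	for name, capQty := range node.Status.Capacity {
		if !strings.Contains(string(name), "/") || capQty.IsZero() {
			continue // skip core resources (cpu, memory, pods, ...) and zero capacity
		}
		if alloc, ok := node.Status.Allocatable[name]; !ok || alloc.IsZero() {
			pending = append(pending, name)
		}
	}
	return pending
}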

@jiayingz

jiayingz commented Feb 2, 2018

/cc @bsalamat @dchen1107 @thockin

Agreed, we need some way for CA to know how it should configure the new node to satisfy the "extended-resource" requests from the pending pod, and also to be able to tell what resources will show up on the newly created node and how long it should wait. As @MaciekPytel mentioned, there could be different ways to expose this information. I cc-ed some folks who may be interested in this discussion. Right now we probably don't have a good solution for this problem, and CA probably needs to use some kind of timer to deal with this time gap.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 3, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 2, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
