allow specifying arbitrary resource capacity via aws asg tags #597
Conversation
cc: @sethpollack @mumoshu
I suspect this may run into the same issue as kubernetes/kubernetes#54959. The problem is that when a new node is added, CA knows it is 'upcoming' and assumes some pods will go onto that node once it boots up. So it won't trigger another scale-up for those pods. So far all is good. Enter device plugins. Unless something changed in the last few weeks, the plugin only starts after the node becomes Ready, and the extra resource required by the pod shows up in the node spec even later (once the plugin is done). By that time CA will see that it was wrong and that the pods for which it triggered scale-up can't actually go onto the node it added for them (because it doesn't have the special resource yet and CA doesn't know it will be added later). So CA will go and create more nodes. I added a hack for handling this specifically for GPUs on GKE (#461). However, this hack relies on CA being able to identify the nodes with GPUs by knowing some labels that GKE adds to those nodes. This is by no means a generic solution. We need to talk with the owners of device plugins and figure out a proper fix before device plugins go beta.
I am not sure how the AWS CA works, but taking a quick look at the PR, it seems you directly updated the node status resource capacity to reflect the upcoming availability of the resource to be exported by the device plugin. This should allow the pod to be scheduled on the new node. However, there will be a time gap before your device plugin establishes communication with Kubelet so that Kubelet can make the proper device plugin calls during container start. Once the communication is established, Kubelet will overwrite the resource capacity/allocatable based on the actual number of devices advertised by the device plugin. How long this time gap is will depend on how your device plugin is deployed on the node and what setup operations it needs to perform to make the "acme.inc/widget" resource consumable on the node. In the GKE GPU case this takes some non-negligible time, and that is why @MaciekPytel needed to add special handling for this time gap. I agree with @MaciekPytel that the current solution is not ideal and we may want to think about a general way to handle this resource-exporting time gap.
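For reference, once the device plugin has registered with Kubelet, the extended resource ends up being reported in the node status roughly like this (resource name and quantities are illustrative, reusing the "acme.inc/widget" example from above):

```yaml
# Excerpt of a Node object's status after the device plugin registers.
status:
  capacity:
    cpu: "4"
    memory: 16Gi
    acme.inc/widget: "2"   # written by Kubelet from what the plugin advertises
  allocatable:
    cpu: 3920m
    memory: 15Gi
    acme.inc/widget: "2"
```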
@jiayingz @MaciekPytel I think I understand why extra nodes are scaled up.
Yup, what we need is for the device plugin to provide some information as to what extra resources will show up. We can't get that from daemonsets unless there is some standardized field or annotation or something we could look at. And even if there is, how would we know how much of a given resource is on the node (e.g. how many GPUs)? There are many ways this information could be exposed: for example, set capacity but not allocatable until the resource is ready (as sketched below), have a CRD for each custom resource describing how to recognize nodes with this resource, or, worst case, some standardized annotations. However, until we come up with something like that I don't think we can fix this for the generic use case in CA.
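To illustrate the first option, here is a purely hypothetical intermediate node status (not something Kubelet does today): capacity is advertised up front so CA can plan for the resource, while allocatable stays at zero until the device plugin has actually registered:

```yaml
# Hypothetical intermediate state of a Node's status, sketching the
# "capacity but not allocatable" idea; this is not current Kubelet behaviour.
status:
  capacity:
    acme.inc/widget: "2"   # advertised as soon as the node joins
  allocatable:
    acme.inc/widget: "0"   # stays at 0 until the device plugin is ready
```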
/cc @bsalamat @dchen1107 @thockin Agreed, we need some way for CA to know how it should configure the new node to satisfy the "extended-resource" requests from the pending pod, and also to be able to tell what resources will show up on the newly created node and how long it should wait. As @MaciekPytel mentioned, there could be different ways to expose this information. I cc-ed some folks who may be interested in this discussion. Right now we probably don't have a good solution for this problem, and CA probably needs to use some kind of timer to deal with this time gap.
We're starting to work with device plugins and want to be able to scale up node groups from 0 when an unschedulable pod requests custom resources provided by a device plugin.
For example, my pod requests a custom resource like this:
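(An illustrative pod spec, using the `acme.inc/widget` name from the discussion above as a placeholder for whatever resource the device plugin exports.)

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: widget-consumer
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    resources:
      limits:
        acme.inc/widget: "1"   # extended resource provided by a device plugin
```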
This changeset allows adding AWS ASG tags to specify additional resource capacity, using an autoscaler prefix similar to how node labels are specified. Example tags:
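(The tag key below is illustrative; it mirrors the existing `k8s.io/cluster-autoscaler/node-template/label/...` convention used for node labels, with a `resources` segment instead of `label`.)

```yaml
# Example ASG tag (key: value) declaring that nodes from this group
# will provide 1 unit of the extended resource acme.inc/widget.
k8s.io/cluster-autoscaler/node-template/resources/acme.inc/widget: "1"
```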
Is this an acceptable approach?
Making the PR against master. Please advise if master isn't the right branch.