Support nvidia.com/gpu as a resource managed by kube-… #181

mitake · 2018-03-16T07:03:39Z

…batchd

This commit lets kube-batchd manage GPU resources
(nvidia.com/gpu) and allocate them for pods like other
resources (CPU and memory).

/cc @YujiOshima

k8s-ci-robot · 2018-03-16T07:03:40Z

@mitake: GitHub didn't allow me to request PR reviews from the following users: YujiOshima.

Note that only kubernetes-incubator members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

…batchd

This commit lets kube-batchd manage GPU resources
(alpha.kubernetes.io/nvidia-gpu) and allocate them for pods like other
resources (CPU and memory).

This PR is half-baked. It has not been tested against a working cluster or the kubeflow components (I'll do that next week). Feedback from the maintainers, or from other developers already working on a similar feature, would be much appreciated.

/cc @YujiOshima

YujiOshima · 2018-03-20T02:37:50Z

pkg/batchd/cache/resource_info.go

 }

+const (
+	GPUResourceName = "alpha.kubernetes.io/nvidia-gpu"


The GPU resource name changed in v1.8. ( https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#v18-onwards )
You should use nvidia.com/gpu for v1.8 and v1.9.

I found a useful const in the k8s api package: https://github.com/kubernetes/api/blob/master/core/v1/types.go#L4084
Probably using it would be the most straightforward way for the purpose?

From v1.8 onwards, k8s delegates GPU management to the NVIDIA device plugin,
so the resource name is defined in the plugin itself: https://github.com/NVIDIA/k8s-device-plugin/blob/66a35b71ac4b5cbfb04714678b548bd77e5ba719/server.go#L20

Unfortunately the const seems to be private. We need to have our own local copy in kube-batchd...
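
As a rough illustration of the "local copy" idea discussed in this thread, the sketch below defines the resource name locally and looks it up in a request map. The `resourceList` type and the `gpuRequest` helper are hypothetical stand-ins for this example only (real code would work with the Kubernetes resource list types); only the `nvidia.com/gpu` string comes from the discussion.

```go
package main

import "fmt"

// GPUResourceName is a local copy of the resource name advertised by the
// NVIDIA device plugin. The plugin's own constant is not exported.
const GPUResourceName = "nvidia.com/gpu"

// resourceList is a hypothetical stand-in for a Kubernetes resource list:
// it maps a resource name to a plain integer quantity.
type resourceList map[string]int64

// gpuRequest returns the number of GPUs in the list, or 0 when the
// resource name is absent (a missing map key yields the zero value).
func gpuRequest(rl resourceList) int64 {
	return rl[GPUResourceName]
}

func main() {
	withGPU := resourceList{"cpu": 2, GPUResourceName: 1}
	cpuOnly := resourceList{"cpu": 2}
	fmt.Println(gpuRequest(withGPU), gpuRequest(cpuOnly))
}
```

Keeping the string in one named constant means a later rename of the resource only touches one line.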

sedflix · 2018-03-21T05:11:59Z

pkg/batchd/cache/resource_info.go

 		}
 	}
 	return r
 }

 func (r *Resource) IsEmpty() bool {
-	return r.MilliCPU < minMilliCPU && r.Memory < minMemory
+	return r.MilliCPU < minMilliCPU && r.Memory < minMemory && r.GPU == 0


What if GPU is not present on any node in the cluster?

Then the pods will never be bound.

Is that the desired behavior? Will all pods require a GPU?

Is that the desired behavior?

Probably yes.

Will all pods require a GPU?

It depends on the workload.
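
To make the question in this thread concrete, here is a minimal sketch of the `IsEmpty` check from the diff. The field types and the two threshold values are assumptions for illustration; the PR excerpt does not show them. It suggests that on a cluster with no GPUs at all (`GPU == 0` everywhere) the new condition reduces to the old CPU/memory check, so CPU-only pods should be unaffected.

```go
package main

import "fmt"

// Thresholds below which CPU and memory are treated as exhausted.
// The values are assumptions for this sketch, not taken from the PR.
const (
	minMilliCPU = 10.0
	minMemory   = 10.0 * 1024 * 1024
)

// Resource mirrors the shape used in the diff; field types are assumed.
type Resource struct {
	MilliCPU float64
	Memory   float64
	GPU      float64
}

// IsEmpty follows the condition in the diff: a resource is empty only when
// CPU and memory are below their thresholds and no GPU is left.
func (r *Resource) IsEmpty() bool {
	return r.MilliCPU < minMilliCPU && r.Memory < minMemory && r.GPU == 0
}

func main() {
	cpuOnlyIdle := &Resource{MilliCPU: 4000, Memory: 8 * 1024 * 1024 * 1024}
	drained := &Resource{MilliCPU: 1, Memory: 1024}
	gpuLeft := &Resource{MilliCPU: 1, Memory: 1024, GPU: 1}
	fmt.Println(cpuOnlyIdle.IsEmpty(), drained.IsEmpty(), gpuLeft.IsEmpty())
}
```

A pod that requests a GPU on such a cluster is a separate matter: no node can satisfy the request, so it stays pending, which matches the answer given in the thread.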

sedflix · 2018-03-21T05:18:09Z

If this PR is alright, I could do the same for quota-alloc.

k82cn · 2018-03-21T06:36:09Z

@geekSiddharth, please hold the enhancement to kube-quotalloc; we'd like to make kube-batchd ready first :)

mitake · 2018-03-28T07:14:05Z

@k82cn I tested this PR with the test command of tf-operator (added in this PR: kubeflow/training-operator#509), and it seems to be working well for now. Could you take a look?

jinzhejz · 2018-04-19T02:05:34Z

pkg/batchd/cache/resource_info.go

 }

 func (r *Resource) LessEqual(rr *Resource) bool {
 	return (r.MilliCPU < rr.MilliCPU || math.Abs(rr.MilliCPU-r.MilliCPU) < 0.01) &&
-		(r.Memory < rr.Memory || math.Abs(rr.Memory-r.Memory) < 1)
+		(r.Memory < rr.Memory || math.Abs(rr.Memory-r.Memory) < 1) &&
+		(r.GPU < rr.GPU || rr.GPU == 0)


@mitake I think it should be (r.GPU < rr.GPU || r.GPU == rr.GPU) :)

You mean r.GPU <= rr.GPU?

Yes. We use math.Abs because == is unreliable for floats :)

I see. I'll fix it later. Thanks for pointing it out.
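
The fix agreed on in this thread could look roughly like the sketch below. It assumes `GPU` is a `float64` like the other fields, and the `gpuEpsilon` tolerance is a hypothetical value chosen for this example; neither is confirmed by the diff excerpt. Note that the condition in the diff, `rr.GPU == 0`, would make the GPU clause pass whenever the right-hand side has no GPUs, so a request for one GPU would compare as less-or-equal to a node with none.

```go
package main

import (
	"fmt"
	"math"
)

// Resource mirrors the shape used in the diff; field types are assumed.
type Resource struct {
	MilliCPU float64
	Memory   float64
	GPU      float64
}

// gpuEpsilon is a hypothetical tolerance for comparing GPU counts.
const gpuEpsilon = 0.01

// LessEqual reports whether r fits within rr on every dimension. Each
// clause is "strictly less, or equal within a tolerance", following the
// existing CPU and memory clauses.
func (r *Resource) LessEqual(rr *Resource) bool {
	return (r.MilliCPU < rr.MilliCPU || math.Abs(rr.MilliCPU-r.MilliCPU) < 0.01) &&
		(r.Memory < rr.Memory || math.Abs(rr.Memory-r.Memory) < 1) &&
		(r.GPU < rr.GPU || math.Abs(rr.GPU-r.GPU) < gpuEpsilon)
}

func main() {
	request := &Resource{MilliCPU: 1000, Memory: 1024, GPU: 1}
	gpuNode := &Resource{MilliCPU: 2000, Memory: 2048, GPU: 1}
	cpuNode := &Resource{MilliCPU: 2000, Memory: 2048, GPU: 0}
	fmt.Println(request.LessEqual(gpuNode), request.LessEqual(cpuNode))
}
```

With this form, a one-GPU request fits a node with exactly one idle GPU and is rejected by a node with none.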

k82cn · 2018-04-19T08:02:38Z

@mitake, thanks for your contribution; it would be better to have a unit test (UT) for GPU in drf_test.go.

mitake · 2018-04-20T06:18:15Z

@k82cn ok, I'll add the UT.

k82cn · 2018-05-10T10:24:41Z

Merging this first; please add the UT later :)

mitake · 2018-05-11T06:46:51Z

@k82cn thanks, I'll add the UT later

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 16, 2018

mitake mentioned this pull request Mar 16, 2018

Enable kube-arbitrator as scheduler for tensorflow kubeflow/training-operator#349

Closed

YujiOshima reviewed Mar 20, 2018

sedflix reviewed Mar 21, 2018

mitake force-pushed the gpu branch from 6d8a0e3 to 7770e14 Compare March 28, 2018 07:12

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 28, 2018

mitake changed the title ~~WIP, DO NOT MERGE: Support alpha.kubernetes.io/nvidia-gpu as a resource managed by kube-…~~ Support nvidia.com/gpu as a resource managed by kube-… Mar 28, 2018

mitake force-pushed the gpu branch from 7770e14 to be82566 Compare March 29, 2018 05:49

jinzhejz reviewed Apr 19, 2018

Support nvidia.com/gpu as a resource managed by kube-batchd

4493c6f

This commit lets kube-batchd manage GPU resources (nvidia.com/gpu) and allocate them for pods like other resources (CPU and memory).

mitake force-pushed the gpu branch from d3c8e29 to 4493c6f Compare April 20, 2018 06:25

Merge branch 'master' into gpu

c4e1b46

k82cn merged commit 0593b2d into kubernetes-retired:master May 10, 2018

mitake deleted the gpu branch May 11, 2018 06:46

mitake added a commit to mitake/kube-arbitrator that referenced this pull request May 29, 2018

Add unit tests for GPU allocation

a7a4e51

This is a successor of kubernetes-retired#181

mitake mentioned this pull request May 29, 2018

Add unit tests for GPU allocation #227

Merged

kevin-wangzefeng pushed a commit to kevin-wangzefeng/scheduler that referenced this pull request Jun 28, 2019

Merge pull request kubernetes-retired#181 from mitake/gpu

d86690a

Support nvidia.com/gpu as a resource managed by kube-…

kevin-wangzefeng pushed a commit to kevin-wangzefeng/scheduler that referenced this pull request Jun 28, 2019

Add unit tests for GPU allocation

cc44ca5

This is a successor of kubernetes-retired#181

kevin-wangzefeng pushed a commit to kevin-wangzefeng/scheduler that referenced this pull request Jun 28, 2019

Merge pull request kubernetes-retired#181 from mitake/gpu

21b189f

Support nvidia.com/gpu as a resource managed by kube-…

kevin-wangzefeng pushed a commit to kevin-wangzefeng/scheduler that referenced this pull request Jun 28, 2019

Add unit tests for GPU allocation

7ece90e

This is a successor of kubernetes-retired#181
