-
Notifications
You must be signed in to change notification settings - Fork 263
Support nvidia.com/gpu as a resource managed by kube-… #181
Conversation
@mitake: GitHub didn't allow me to request PR reviews from the following users: YujiOshima. Note that only kubernetes-incubator members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
pkg/batchd/cache/resource_info.go
Outdated
} | ||
|
||
const ( | ||
GPUResourceName = "alpha.kubernetes.io/nvidia-gpu" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GPU resource name has been changed from v1.8 onwards. ( https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#v18-onwards )
You should use nvidia.com/gpu
for v1.8 and v1.9.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found a useful const in the k8s api package: https://github.com/kubernetes/api/blob/master/core/v1/types.go#L4084
Probably using it would be the most straightforward way for the purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After v1.8, k8s delegated the management of GPU to Nvidia device plugin
.
So the resource name is defined in Nvidia device plugin
. https://github.com/NVIDIA/k8s-device-plugin/blob/66a35b71ac4b5cbfb04714678b548bd77e5ba719/server.go#L20
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately the const seems to be private. We need to have our own local copy in kube-batchd...
} | ||
} | ||
return r | ||
} | ||
|
||
func (r *Resource) IsEmpty() bool { | ||
return r.MilliCPU < minMilliCPU && r.Memory < minMemory | ||
return r.MilliCPU < minMilliCPU && r.Memory < minMemory && r.GPU == 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if GPU is not present on any node in the cluster?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pods won't be bond forever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that the desired behavior? Will all pod require GPU?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that the desired behavior?
Probably yes.
Will all pod require GPU?
It depends on workloads.
If this PR is alright, I could do the the same for quota-alloc. |
@geekSiddharth , please hold the enhancement to kube-quotalloc; we'd like to make kube-batchd ready firstly :) |
@k82cn I tested this PR with the test command of tf-operator (added in this PR: kubeflow/training-operator#509), and it seems to be working well for now. Could you take a look? |
pkg/batchd/cache/resource_info.go
Outdated
} | ||
|
||
func (r *Resource) LessEqual(rr *Resource) bool { | ||
return (r.MilliCPU < rr.MilliCPU || math.Abs(rr.MilliCPU-r.MilliCPU) < 0.01) && | ||
(r.Memory < rr.Memory || math.Abs(rr.Memory-r.Memory) < 1) | ||
(r.Memory < rr.Memory || math.Abs(rr.Memory-r.Memory) < 1) && | ||
(r.GPU < rr.GPU || rr.GPU == 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mitake I think it should be (r.GPU < rr.GPU || r.GPU == rr.GPU)
:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean r.GPU <= rr.GPU
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, we use math.Abs
as ==
does not work for float :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I'll fix it later. Thanks for pointing out.
@mitake, thanks for your contribution; it'll be better to have a UT for GPU in |
@k82cn ok, I'll add the UT. |
This commit lets kube-batchd manage GPU resources (nvidia.com/gpu) and allocate them for pods like other resources (CPU and memory).
merge this firstly, please add UT later :) |
@k82cn thanks, I'll add the UT later |
This is a successor of kubernetes-retired#181
Support nvidia.com/gpu as a resource managed by kube-…
This is a successor of kubernetes-retired#181
Support nvidia.com/gpu as a resource managed by kube-…
This is a successor of kubernetes-retired#181
…batchd
This commit lets kube-batchd manage GPU resources
(nvidia.com/gpu) and allocate them for pods like other
resources (CPU and memory).
/cc @YujiOshima