Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rework device sharing in volcano #2643

Merged
merged 5 commits into from
Jan 30, 2023
Merged

rework device sharing in volcano #2643

merged 5 commits into from
Jan 30, 2023

Conversation

archlitchi
Copy link
Contributor

We implement a common interface for shareable devices(GPU,NPU,FPGA,...) called sharedDevicePool, and use it to reimplement current gpu-share mechanism. The goal is to let device-sharing easy to implement, and better organised. If you wish to grant vc-scheduler the ability to share another device, all you need to do is to implement these methods in sharedDevicePool, and place your logic under pkg/scheduler/api/devices. No other hackings are needed.

@volcano-sh-bot volcano-sh-bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jan 9, 2023
Copy link
Member

@hwdef hwdef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make CI happy first.
And squash your code.

@@ -166,47 +166,48 @@ func (gs *GPUDevices) MonitorStatus() string {
}

func (gs *GPUDevices) AllocateToPod(kubeClient kubernetes.Interface, pod *v1.Pod) error {
klog.V(4).Infoln("DeviceSharing:Into AllocateToPod", pod)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether the parameter pod will be used in the log info.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, you are right, now i'm using pod.Name instead

@zishen
Copy link
Contributor

zishen commented Jan 14, 2023

This rework so good, and can you make it a plugin to add a device?

Others map[string]interface{}
GPUDevices map[int]*GPUDevice
Others map[string]interface{}
SharedDevices map[string]SharedDevicePool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Others is used for store custom resource. We can reuse it.

@@ -153,7 +153,7 @@ func NewNodeInfo(node *v1.Node) *NodeInfo {
nodeInfo.Allocatable = NewResource(node.Status.Allocatable).Add(nodeInfo.OversubscriptionResource)
nodeInfo.Capability = NewResource(node.Status.Capacity).Add(nodeInfo.OversubscriptionResource)
}
nodeInfo.setNodeGPUInfo(node)
nodeInfo.SharedDevices[GPUSharingDevice] = gpushare.NewGPUDevices(nodeInfo.Name, node)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest to use setNodeOthersResource to encapsulation the SharedDevices[GPUSharingDevice] = gpushare.NewGPUDevices(nodeInfo.Name, node)

@@ -376,7 +344,7 @@ func (ni *NodeInfo) SetNode(node *v1.Node) {
// setNode sets kubernetes node object to nodeInfo object without assertion
func (ni *NodeInfo) setNode(node *v1.Node) {
ni.setOversubscription(node)
ni.setNodeGPUInfo(node)
ni.SharedDevices[GPUSharingDevice] = gpushare.NewGPUDevices(ni.Name, node)
ni.setRevocableZone(node)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

@@ -395,13 +363,13 @@ func (ni *NodeInfo) setNode(node *v1.Node) {
ni.Idle.sub(ti.Resreq) // sub without assertion
ni.Releasing.Add(ti.Resreq)
ni.Used.Add(ti.Resreq)
ni.AddGPUResource(ti.Pod)
ni.SharedDevices[GPUSharingDevice].AddResource(ti.Pod)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest to retain AddResource and put the SharedDevices[GPUSharingDevice].AddResource(ti.Pod) inside AddResource.

@@ -0,0 +1,25 @@
package api
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add the copyright and change the name of this file to device_interface.go

GPUSharingDevice = "GpuShare"
)

type SharedDevicePool interface {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interface covers the gpu number as well and npu in the future. It's better to change the name to from SharedDevicePool to Devices.


type SharedDevicePool interface {
//following two functions used in node_info
AddResource(pod *v1.Pod)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add function description for each interface.

SubResource(pod *v1.Pod)

//following four functions used in predicate
RequestInPod(pod *v1.Pod) bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest to change RequestInPod to HasDeviceRequest or some other name that is easy to understand.


//following four functions used in predicate
RequestInPod(pod *v1.Pod) bool
FitInPod(pod *v1.Pod) (bool, error)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest to change interface from FitInPod to Filter or FilterNode

//following four functions used in predicate
RequestInPod(pod *v1.Pod) bool
FitInPod(pod *v1.Pod) (bool, error)
AllocateToPod(kubeClient kubernetes.Interface, pod *v1.Pod) error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change AllocateToPod and ReleaseFromPod to Allocate and Release

ReleaseFromPod(kubeClient kubernetes.Interface, pod *v1.Pod) error

//used for debug and monitor
MonitorStatus() string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change MonitorStatus to GetStatus or CheckStatus

@archlitchi
Copy link
Contributor Author

This rework so good, and can you make it a plugin to add a device?

I'd like to, but you still need to hack into scheduler.api even with this PR merged (ie. you must put your device-initialization logic in node_info.go in order for the scheduler to recognize your device). so it's hard for a plugin to do that

@william-wang william-wang added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Jan 30, 2023
@william-wang
Copy link
Member

/lgtm
/approve

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Jan 30, 2023
@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: william-wang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 30, 2023
@volcano-sh-bot volcano-sh-bot merged commit f4e3e58 into volcano-sh:master Jan 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants