Enable kube-arbitrator as scheduler for tensorflow #349
/assign
@k82cn great. IIUC, the current kube-arbitrator doesn't have a gang-scheduling mechanism yet. We have an in-house implementation of the feature (a very limited subset, just for a slack demo, can be found here: https://github.com/mlkube/gang-scheduler/tree/master/cmd/scheduler). If the idea can be accepted by kube-arbitrator, I'd like to make PRs to the project. What do you think?
@mitake , sure, that'll be great :). Please feel free to open the PR :).
@k82cn thanks, I'll create a PR :)
@mitake , sorry for the confusion :( We just created a PR for gang scheduling a few days ago. In kube-arbitrator, we re-use the PDB to define the minimum Pod requirement; the policy tries to meet "min available" first :). Regarding the PDB, it's also a point of discussion here; as you know, kube-arbitrator will also support other frameworks, e.g. Spark, so we will not parse the yaml/object of each framework to get the replicas/desired count. Anyway, it's open for discussion :).
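To illustrate the approach described above, a PodDisruptionBudget expressing a job's minimum could look like the sketch below. The names, the label key, and the minAvailable value are all illustrative assumptions, not taken from the PR:

```yaml
# Hypothetical example: a PDB stating that at least 3 of a TF job's
# pods must be available; a gang scheduler can read minAvailable as
# the "min available" requirement it tries to meet first.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-tfjob-pdb          # illustrative name
spec:
  minAvailable: 3             # minimum pods before the job can make progress
  selector:
    matchLabels:
      tf_job_name: my-tfjob   # illustrative label selecting the job's pods
```

This keeps the scheduler framework-agnostic: it only reads the PDB, never the framework-specific job object.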
/cc @foxish for what we are doing :).
@k82cn thanks. Which PR is for the gang scheduling? I'd like to try it :)
refer to kubernetes-retired/kube-batch#134 for more detail :). And here's the tutorial for it. If you hit any issue, please let me know :).
/cc @jinzhejz
@k82cn thanks, I'll try it
@k82cn Thanks for driving this!
I am glad to investigate kube-arbitrator and try to use kube-batchd to schedule TF distributed jobs on Kubernetes. And if I am free during the summer, I'd be happy to apply for the CNCF idea: https://github.com/cncf/soc#batch-scheduling-and-queueing-for-data-processingml-workloads 😄
@gaocegege , great :).
It would be great to implement the queueing on k8s: https://github.com/cncf/soc#batch-scheduling-and-queueing-for-data-processingml-workloads . @gaocegege , if there is some progress, please let me know; I can test it first.
FYI there are some discussions about kubeflow integration in kubernetes-retired/kube-batch#156
For now, kube-arbitrator supports gang scheduling and 'pod priority within job', so I think it's a good time to try the integration. Is there anyone who can help from the tf-operator side? @jinzhejz , please append your demo video when it's ready.
@k82cn this is great :) I'd like to help with the integration (actually I'm working on it already)
I opened a PR which lets tf-operator create a PDB, which is required by kube-batchd for gang scheduling: #452
@mitake @k82cn , here are two demo videos: gang-scheduler and pod priority within job
I created a PR for kube-arbitrator adding GPU support: kubernetes-retired/kube-batch#181
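For anyone who wants to try GPU scheduling: pods request GPUs through the standard extended resource. A minimal illustrative pod spec (the names and image are assumptions, not from the PR):

```yaml
# Hypothetical pod requesting one GPU and asking to be scheduled by
# kube-batchd instead of the default scheduler.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                 # illustrative name
spec:
  schedulerName: kube-batchd     # route the pod to the custom scheduler
  containers:
  - name: trainer
    image: tensorflow/tensorflow:1.8.0-gpu   # illustrative image
    resources:
      limits:
        nvidia.com/gpu: 1        # GPUs are requested as an extended resource
```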
@mitake @k82cn , I updated the two videos to show the difference between the default k8s scheduler and kube-batchd when running tensorflow jobs. BTW: in the videos, I used
Now, to fix it temporarily, I added a new option. For example, in kubeflow I found that all pods under the same tfjob have the same label. For a long-term solution,
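As a sketch of the workaround described above: pods belonging to one tfjob share a label, which a group-label style option could use to identify the members of the same job. The label key and job name below are assumptions for illustration:

```yaml
# Two pods of the same tfjob carrying an identical label; a scheduler
# option pointing at this label key can group them as one gang.
apiVersion: v1
kind: Pod
metadata:
  name: my-tfjob-worker-0    # illustrative name
  labels:
    tf_job_name: my-tfjob    # shared job label (assumed key)
---
apiVersion: v1
kind: Pod
metadata:
  name: my-tfjob-ps-0        # illustrative name
  labels:
    tf_job_name: my-tfjob    # same value across all of the job's pods
```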
Thanks for your awesome work! The group label works well in tfjob v1alpha2 and v1alpha1. In v1alpha2 we use the tfjob key instead of the tfjob name, but it can also work via the option. I am wondering if batchd could schedule all tasks in one job onto one node. As you all know, TensorFlow distributed jobs require GPUs and a good network connection; it is better to place all tasks on one node for better network conditions. /cc @rc-zhang . RC Zhang is interested in the GSoC idea https://github.com/cncf/soc#batch-scheduling-and-queueing-for-data-processingml-workloads
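For comparison, with the default scheduler this kind of co-location can be expressed with pod affinity (whether kube-batchd honors affinity rules is a separate question). A sketch, assuming the job's pods share a tf_job_name label as above:

```yaml
# Fragment of a pod spec: require this pod to land on the same node
# as other pods of the same (assumed) tfjob label.
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            tf_job_name: my-tfjob          # illustrative label
        topologyKey: kubernetes.io/hostname  # "same node" topology
```

Note that a hard requirement like this can leave pods unschedulable if no single node has enough GPUs for the whole job.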
Is this true? I thought pods belonging to the same tfjob share a common owner reference: https://github.com/kubeflow/tf-operator/blob/master/pkg/trainer/replicas.go#L165
I referred to the kubeflow user guide and the tf-operator readme to run the kubeflow sample. I used the steps below to install the kubeflow package; I'm not sure whether it is the latest.
Yes, the group label is not necessary if the pods belonging to the same tfjob share the same owner reference.
Sorry that the tf-operator readme is outdated. Now we create pods for the PS and workers, and the pods are all owned by the TFJob. Then it's awesome that we don't need to introduce the hack with labels :)
If you want to use the latest version of tf-operator, you could have a look at https://github.com/kubeflow/tf-operator/blob/master/developer_guide.md . I suggest building the operator and running it locally to serve.
Thanks for the new guide, I will have a try :)
If someone who has a GPU cluster could try this PR, it would be really helpful :) kubernetes-retired/kube-batch#181
Really appreciate your work! Will it be merged into master?
@gaocegege I think so. But of course I need reviews from @k82cn and @jinzhejz . If other developers can test it, it would be helpful for the maintainers of kube-arbitrator :)
I can test this next week.
@pineking how is the testing going?
Hi, I have tested it with my Kubernetes cluster, which has one master and two workers with 8 P100 GPUs.
@ChanYiLin thanks for your testing. Could you provide the command lines you used in your test?
@mitake
The following yaml file shows how I launched a tfjob.
The job successfully created the PDB and the pods, as below.
The following PDB was created when the job was launched.
So I took a look at one pod and found that the pod jack-tfjob-resnet50-worker-nt7n-4-acc4e had the following events.
I am still trying to find the problem in kube-batchd or tf-operator. Thank you!
@ChanYiLin thanks for sharing the information. Could you also provide the command-line options and config files of kube-batchd and tf-operator?
@mitake
For tf-operator,
@ChanYiLin thanks, could you also provide the content of
@mitake I found that the default configmap, which is created by ksonnet following the instructions from the Kubeflow README, does not contain the SchedulerName. Also, the &v1alpha1.ControllerConfig{} does not store the value of SchedulerName. In pkg/apis/tensorflow/v1alpha1/types.go:
The tf-operator stores the scheduler name in the spec directly
and assigns it to the pods when creating them.
In my pod, the scheduler name is set correctly.
So I don't think the problem comes from my configuration.
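For reference, the relevant part of a pod created with a non-default scheduler, as described in the debugging above, would look like this (names and image are illustrative, not from the actual cluster):

```yaml
# Hypothetical worker pod as the tf-operator would create it:
# schedulerName must be set, otherwise the default scheduler
# binds the pod and kube-batchd never sees it.
apiVersion: v1
kind: Pod
metadata:
  name: jack-tfjob-worker-0        # illustrative name
spec:
  schedulerName: kube-batchd       # the non-default scheduler
  containers:
  - name: tensorflow
    image: tensorflow/tensorflow:1.8.0   # illustrative image
```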
@mitake
to
My test scenario works like a charm now!
@ChanYiLin great! Thanks for testing again :) I'll make the PR ready to merge.
The PR which adds GPU support to kube-arbitrator has already been merged: kubernetes-retired/kube-batch#181
I think we can close this one. Upstream, we decided to use kube-arbitrator for batch-related workloads; here's the design: kubernetes/community#2337 . For tf-operator, kube-arbitrator keeps backward compatibility to support the PDB; but if we want to support more features, e.g. nodeSelector, it's better to replace the PDB with SchedulingSpec when the design is finalized. Will open other PRs if necessary :)
Per the discussion at #165 , we'd like to take kube-arbitrator as the scheduler, so I opened this issue to track all related sub-tasks.
/cc @jlewi , @ScorpioCPH