
Enable kube-arbitrator as scheduler for tensorflow #349

Closed
k82cn opened this issue Jan 26, 2018 · 46 comments

@k82cn
Collaborator

k82cn commented Jan 26, 2018

Per the discussion at #165, we'd like to use kube-arbitrator as the scheduler, so I'm opening this issue to track all related sub-tasks.

/cc @jlewi , @ScorpioCPH

@k82cn
Collaborator Author

k82cn commented Jan 26, 2018

/assign

@mitake
Contributor

mitake commented Jan 26, 2018

@k82cn great. IIUC, the current kube-arbitrator doesn't have a gang-scheduling mechanism yet. We have an in-house implementation of the feature (a very limited subset, just for a slack demo, can be found here: https://github.com/mlkube/gang-scheduler/tree/master/cmd/scheduler). If the idea is acceptable to kube-arbitrator, I'd like to make PRs to the project. What do you think?

@k82cn
Collaborator Author

k82cn commented Jan 26, 2018

@mitake , sure, that'll be great :). Please feel free to open the PR :).

@mitake
Contributor

mitake commented Jan 26, 2018

@k82cn thanks, I'll create a PR :)

@k82cn
Collaborator Author

k82cn commented Jan 26, 2018

@mitake , sorry for the confusion :( We created a PR for gang scheduling a few days ago. In kube-arbitrator, we re-use the PodDisruptionBudget (PDB) to define the minimum Pod requirement; the policy tries to meet "min available" first :).

Regarding the PDB, that is also a point to discuss here; as you know, kube-arbitrator will also support other frameworks, e.g. Spark, so we will not parse each framework's yaml/object to get the desired replicas. Anyway, it's open for discussion :).
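For illustration, here is a minimal Go sketch, with assumed names and labels (it is not the actual operator or kube-arbitrator code), of the kind of PDB object a job controller could create so the scheduler knows the gang's "min available" count:

package sketch

import (
    policyv1beta1 "k8s.io/api/policy/v1beta1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/intstr"
)

// newGangPDB builds a PDB whose minAvailable tells the scheduler how many
// pods of the job must be schedulable together; the object name prefix and
// the tf_job_name selector label are assumptions for this sketch.
func newGangPDB(jobName string, minAvailable int) *policyv1beta1.PodDisruptionBudget {
    min := intstr.FromInt(minAvailable)
    return &policyv1beta1.PodDisruptionBudget{
        ObjectMeta: metav1.ObjectMeta{Name: "tf-job-pdb-" + jobName},
        Spec: policyv1beta1.PodDisruptionBudgetSpec{
            MinAvailable: &min,
            Selector: &metav1.LabelSelector{
                MatchLabels: map[string]string{"tf_job_name": jobName},
            },
        },
    }
}

A job with 1 PS and 7 workers would pass minAvailable = 8, which matches the PDB shown later in this thread.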

@k82cn
Collaborator Author

k82cn commented Jan 26, 2018

/cc @foxish for visibility into what we are doing :).

@mitake
Contributor

mitake commented Jan 26, 2018

@k82cn thanks. Which is the PR for the gang scheduling? I'd like to try :)

@k82cn
Collaborator Author

k82cn commented Jan 26, 2018

Refer to kubernetes-retired/kube-batch#134 for more detail :). And here's the tutorial for it. If you hit any issues, please let me know :).

@k82cn
Collaborator Author

k82cn commented Jan 26, 2018

/cc @jinzhejz

@mitake
Contributor

mitake commented Jan 26, 2018

@k82cn thanks, I'll try it

@jlewi
Contributor

jlewi commented Jan 26, 2018

@k82cn Thanks for driving this!

@gaocegege
Member

I'd be glad to investigate kube-arbitrator and try using kube-batchd to schedule distributed TF jobs on Kubernetes. And if I am free during the summer, I'd be happy to apply for the CNCF idea: https://github.com/cncf/soc#batch-scheduling-and-queueing-for-data-processingml-workloads 😄

@k82cn
Collaborator Author

k82cn commented Feb 5, 2018

@gaocegege , great :).

@pineking
Member

pineking commented Feb 5, 2018

It's great to implement queueing on k8s: https://github.com/cncf/soc#batch-scheduling-and-queueing-for-data-processingml-workloads. @gaocegege , if there is any progress, please let me know; I can test it first.

@gaocegege
Member

FYI, there are some discussions about the kubeflow integration in kubernetes-retired/kube-batch#156

@k82cn
Collaborator Author

k82cn commented Mar 8, 2018

For now, kube-arbitrator supports gang scheduling and "pod priority within job", so I think it's a good time to try the integration. Is there anyone who can help from the tf-operator side?

@jinzhejz , please append your demo video when it's ready.

@mitake
Contributor

mitake commented Mar 8, 2018

@k82cn this is great :) I'd like to help with the integration (actually, I'm already working on it)

@mitake
Contributor

mitake commented Mar 9, 2018

I opened a PR which lets tf-operator create the PDB required by kube-batchd for gang scheduling: #452
The PR isn't tested yet, and I'm still not fully sure whether its usage of kube-arbitrator is correct. It would be great if I could get comments.

@jinzhejz

@mitake @k82cn , here are two demo videos: gang-scheduler and pod priority within job

@mitake
Contributor

mitake commented Mar 16, 2018

I created a PR for kube-arbitrator to support GPUs: kubernetes-retired/kube-batch#181
I'll share the testing results next week on whether it works well with the combination of kubeflow and a TF training task that uses GPUs.

@jinzhejz

@mitake @k82cn , I updated two videos to show the difference between the default k8s scheduler and kube-batchd when running tensorflow jobs.

BTW: in the videos, I used kubectl create -f tfjob.yaml to create a tensorflow job

@jinzhejz

@mitake @gaocegege

Currently, kube-batchd uses the owner references of pods to group them into a PodSet. In a tfjob, kubeflow creates Master/PS/Worker as separate k8s Jobs, and each Job contains one pod; as a result, pods under the same tfjob have different owner references, and kube-batchd groups them into different PodSets. An issue has been logged to track grouping pods into one PodSet when they belong to the same deployment/tfjob.

As a temporary fix, a new option --group-label was added to kube-batchd.

For example, with kube-batchd --group-label=job_name ..., kube-batchd groups pods that have the same job_name label into one PodSet even if they have different owner references. If a pod has no job_name label, kube-batchd falls back to grouping by owner references as before (see the sketch below).

In kubeflow, I found that all pods under the same tfjob have the same tf_job_name label, so we can use kube-batchd --group-label tf_job_name ... for tfjob scheduling. Kubeflow should keep this label in the future.

As a long-term solution, kube-batchd needs to resolve a pod's owner references recursively to group them. However, that is not supported by the current k8s client API.
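A minimal Go sketch of the grouping rule described above (an assumed helper for illustration, not kube-batchd's actual code):

package sketch

import (
    v1 "k8s.io/api/core/v1"
)

// podSetKey returns the key used to group a pod into a PodSet: pods sharing
// the value of the --group-label label form one PodSet; pods without that
// label fall back to their (first) owner reference, as before.
func podSetKey(pod *v1.Pod, groupLabel string) string {
    if groupLabel != "" {
        if v, ok := pod.Labels[groupLabel]; ok {
            return "label/" + v
        }
    }
    if refs := pod.OwnerReferences; len(refs) > 0 {
        return "owner/" + string(refs[0].UID)
    }
    return "pod/" + pod.Name
}

With --group-label tf_job_name, all pods of one tfjob map to the same key even though their owner references differ.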

@gaocegege
Member

@jinzhejz @k82cn @mitake

Thanks for your awesome work! The group-label works well with tfjob v1alpha2 and v1alpha1. In v1alpha2 we use tf_job_key instead of tf_job_name, but it also works via the option --group-label tf_job_key.

I am wondering if kube-batchd could schedule all tasks of one job onto one node. As you all know, distributed TensorFlow jobs require GPUs and a good network connection, so it is better to place all tasks on one node for better network conditions.

/cc @rc-zhang , who is interested in the GSoC idea: https://github.com/cncf/soc#batch-scheduling-and-queueing-for-data-processingml-workloads

@mitake
Contributor

mitake commented Mar 22, 2018

@jinzhejz

In a tfjob, kubeflow creates Master/PS/Worker as separate k8s Jobs, and each Job contains one pod; as a result, pods under the same tfjob have different owner references

Is this true? I thought pods belonging to the same tfjob share a common owner reference: https://github.com/kubeflow/tf-operator/blob/master/pkg/trainer/replicas.go#L165
Probably I'm missing something. Could you point out if I'm making a wrong assumption?

@mitake
Contributor

mitake commented Mar 22, 2018

@jinzhejz BTW, the latest tf-operator creates pods directly instead of creating Jobs, after this commit: 6706903

@gaocegege
Member

gaocegege commented Mar 22, 2018

@jinzhejz

Agreed with @mitake. Now we create pods directly, so maybe we don't need the group-label anymore?

@jinzhejz

@mitake @gaocegege

I referred to the kubeflow user guide and the tf-operator README to run the kubeflow sample. I used the steps below to install the kubeflow packages; I'm not sure whether they are the latest.

ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow
ks pkg install kubeflow/core
ks pkg install kubeflow/tf-serving
ks pkg install kubeflow/tf-job

Yes, the group label is not necessary if the pods belonging to the same tfjob share the same owner reference.

@gaocegege
Member

@jinzhejz

Sorry, the tf-operator README is outdated. Now we create pods for the PS and workers, and the pods are all owned by the TFJob.

Then it's awesome that we don't need to introduce the label hack :)

@gaocegege
Member

gaocegege commented Mar 22, 2018

If you want to use the latest version of tf-operator, you could have a look at https://github.com/kubeflow/tf-operator/blob/master/developer_guide.md

I suggest building the operator and running it locally to serve.

@jinzhejz

Thanks for the new guide, I will give it a try :)

@mitake
Contributor

mitake commented Mar 30, 2018

If someone who has a GPU cluster could try this PR, it would be really helpful :) kubernetes-retired/kube-batch#181

@gaocegege
Member

gaocegege commented Mar 30, 2018

@mitake

Really appreciate your work! Will it be merged into master?

@mitake
Contributor

mitake commented Mar 30, 2018

@gaocegege I think so. But of course I need reviews from @k82cn and @jinzhejz . If other developers can test it, it would be helpful for the kube-arbitrator maintainers :)

@pineking
Member

If someone who has a GPU cluster could try this PR, it would be really helpful :) kubernetes-retired/kube-batch#181

I can test this next week.

@mitake
Contributor

mitake commented Apr 11, 2018

@pineking how is the testing going?

@ChanYiLin
Member

Hi, I have tested it with my Kubernetes cluster, which has one master and two workers with 8 P100 GPUs in total.
I found that the kube-batchd scheduler still scheduled a worker pod requesting 1 GPU onto the master node.

  1. The kube-batchd scheduler can still schedule pods onto the master node.
  2. Even though the master node has no GPU, the kube-batchd scheduler still schedules a worker pod onto it.
  3. The PR doesn't seem to work for me.

@mitake
Contributor

mitake commented Apr 17, 2018

@ChanYiLin thanks for your testing. Could you provide the command lines used in your test?

@ChanYiLin
Member

ChanYiLin commented Apr 18, 2018

@mitake
Below is the information of my cluster.

apollo25:
Taints: node-role.kubernetes.io/master:NoSchedule
Addresses:
  InternalIP:  10.66.66.25
  Hostname:    apollo25 (Master)
Capacity:
 cpu:     24
 memory:  98826404Ki
 pods:    110
Allocatable:
 cpu:     24
 memory:  98724004Ki
 pods:    110

apollo61:
Labels:        beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    gpu=yes
                    kubernetes.io/hostname=apollo61
Addresses:
  InternalIP:  10.66.66.61
  Hostname:    apollo61
Capacity:
 cpu:             28
 memory:          65693288Ki
 nvidia.com/gpu:  4
 pods:            110
Allocatable:
 cpu:             28
 memory:          65590888Ki
 nvidia.com/gpu:  4
 pods:            110

apollo62:
Labels:        beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    gpu=yes
                    kubernetes.io/hostname=apollo62
Addresses:
  InternalIP:  10.66.66.62
  Hostname:    apollo62
Capacity:
 cpu:             28
 memory:          32665208Ki
 nvidia.com/gpu:  4
 pods:            110
Allocatable:
 cpu:             28
 memory:          32562808Ki
 nvidia.com/gpu:  4
 pods:            110

The following YAML file is how I launched a tfjob.
It indicates that my job consists of 7 workers and 1 PS.
Each worker requests 1 GPU.
I also used a nodeSelector to try to force kube-batchd to schedule my pods onto apollo61 and apollo62, which have GPUs.

apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: jack-tfjob-resnet50
  namespace: jack-kubeflow
spec:
  schedulerName: kube-batchd
  replicaSpecs:
  - replicas: 7
    template:
      spec:
        nodeSelector:
          gpu: "yes"
        containers:
        - args:
          - python
          - tf_cnn_benchmarks.py
          - --batch_size=8
          - --model=resnet50
          - --variable_update=parameter_server
          - --flush_stdout=true
          - --num_gpus=1
          image: gcr.io/kubeflow/tf-benchmarks-gpu:v20171202-bdab599-dirty-284af3
          name: tensorflow
          resources:
            requests:
              memory: "4096Mi"
              cpu: "2"
              nvidia.com/gpu: 1
            limits:
              memory: "4096Mi"
              cpu: "2"
              nvidia.com/gpu: 1
          workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
        restartPolicy: OnFailure
    tfReplicaType: WORKER
  - replicas: 1
    template:
      spec:
        nodeSelector:
          gpu: "yes"
        containers:
        - args:
          - python
          - tf_cnn_benchmarks.py
          - --batch_size=32
          - --model=resnet50
          - --variable_update=parameter_server
          - --flush_stdout=true
          - --num_gpus=1
          image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
          name: tensorflow
          workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
        restartPolicy: OnFailure
    tfReplicaType: PS
  terminationPolicy:
    chief:
      replicaName: WORKER
      replicaIndex: 0
  tfImage: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3

The job successfully created the PDB and the pods as shown below.
However, the number of Running worker pods was 6 and 1 pod was in the Unknown state.

Status:
  Phase:   Running
  Reason:
  Replica Statuses:
    Replicas States:
      Running:            6
      Unknown:            1
    State:                Running
    Tf _ Replica _ Type:  WORKER
    Replicas States:
      Running:            1
    State:                Running
    Tf _ Replica _ Type:  PS
  State:                  Running
Events:
  Type    Reason            Age              From      Message
  ----    ------            ----             ----      -------
  Normal  SuccessfulCreate  10s              kubeflow  Created PDB: tf-job-pdb-jack-tfjob-resnet50
  Normal  SuccessfulCreate  10s              kubeflow  Created pod: jack-tfjob-resnet50-worker-nt7n-0-24bg3
  Normal  SuccessfulCreate  10s              kubeflow  Created pod: jack-tfjob-resnet50-worker-nt7n-1-avwka
  Normal  SuccessfulCreate  10s              kubeflow  Created pod: jack-tfjob-resnet50-worker-nt7n-2-2s1qq
  Normal  SuccessfulCreate  10s              kubeflow  Created pod: jack-tfjob-resnet50-worker-nt7n-3-6fxj4
  Normal  SuccessfulCreate  9s               kubeflow  Created pod: jack-tfjob-resnet50-worker-nt7n-4-acc4e
  Normal  SuccessfulCreate  9s               kubeflow  Created pod: jack-tfjob-resnet50-worker-nt7n-5-5bpl4
  Normal  SuccessfulCreate  8s               kubeflow  Created pod: jack-tfjob-resnet50-worker-nt7n-6-v61b3
  Normal  SuccessfulCreate  7s               kubeflow  Created pod: jack-tfjob-resnet50-ps-nt7n-0-2sel8
  Normal  SuccessfulCreate  3s (x8 over 7s)  kubeflow  (combined from similar events): Created Service: jack-tfjob-resnet50-ps-nt7n-0

The following PDB was created when the job was launched.
The current number of pods was 7 (1 PS + 6 workers), matching the tfjob status, and the total was 11, indicating there was more than 1 pod in the Unknown state.

Name:           tf-job-pdb-jack-tfjob-resnet50
Namespace:      jack-kubeflow
Min available:  8
Selector:       runtime_id=nt7n,tf_job_name=jack-tfjob-resnet50
Status:
    Allowed disruptions:  0
    Current:              7
    Desired:              8 (1PS + 7Worker)
    Total:                11
Events:
  Type    Reason  Age              From               Message
  ----    ------  ----             ----               -------
  Normal  NoPods  1m (x2 over 1m)  controllermanager  No matching pods found

So I took a look at the pods and found that a pod such as jack-tfjob-resnet50-worker-nt7n-4-acc4e had the following events.
It was scheduled onto apollo25, which has no GPU.

Name:           jack-tfjob-resnet50-worker-nt7n-4-acc4e
Namespace:      jack-kubeflow
Node:           apollo25/
Start Time:     Wed, 18 Apr 2018 22:47:35 +0800
Labels:         job_type=WORKER
                kubeflow.org=
                runtime_id=nt7n
                task_index=4
                tf_job_name=jack-tfjob-resnet50
Annotations:    <none>
Status:         Failed
Reason:         OutOfnvidia.com/gpu
Message:        Pod Node didn't have enough resource: nvidia.com/gpu, requested: 1, used: 0, capacity: 0
IP:
Controlled By:  TFJob/jack-tfjob-resnet50
Containers:
  tensorflow:
    Image:  gcr.io/kubeflow/tf-benchmarks-gpu:v20171202-bdab599-dirty-284af3
    Port:   <none>
    Args:
      python
      tf_cnn_benchmarks.py
      --batch_size=8
      --model=resnet50
      --variable_update=parameter_server
      --flush_stdout=true
      --num_gpus=1
    Limits:
      cpu:             2
      memory:          4Gi
      nvidia.com/gpu:  1
    Requests:
      cpu:             2
      memory:          4Gi
      nvidia.com/gpu:  1
    Environment:
      TF_CONFIG:  {"cluster":{"ps":["jack-tfjob-resnet50-ps-nt7n-0:2222"],"worker":["jack-tfjob-resnet50-worker-nt7n-0:2222","jack-tfjob-resnet50-worker-nt7n-1:2222","jack-tfjob-resnet50-worker-nt7n-2:2222","jack-tfjob-resnet50-worker-nt7n-3:2222","jack-tfjob-resnet50-worker-nt7n-4:2222","jack-tfjob-resnet50-worker-nt7n-5:2222","jack-tfjob-resnet50-worker-nt7n-6:2222"]},"task":{"type":"worker","index":4},"environment":"cloud"}
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-k5v6t (ro)
Volumes:
  default-token-k5v6t:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-k5v6t
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  gpu=yes
Tolerations:     node.alpha.kubernetes.io/notReady:NoExecute for 300s
                 node.alpha.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason               Age   From               Message
  ----     ------               ----  ----               -------
  Warning  OutOfnvidia.com/gpu  8m    kubelet, apollo25  Node didn't have enough resource: nvidia.com/gpu, requested: 1, used: 0, capacity: 0

I am still trying to find the problem in kube-batchd or tf-operator.

Thank you!

@mitake
Contributor

mitake commented Apr 19, 2018

@ChanYiLin thanks for sharing the information. Could you also provide the command line options and config files of kube-batchd and tf-operator?

@ChanYiLin
Member

@mitake
For kube-batchd,

        - ./opt/kube-batchd
        - --kubeconfig=/tmp/kubernetes/conf/admin.conf
        - --scheduler-name=kube-batchd

For tf-operator,

        - /opt/mlkube/tf-operator
        - --controller-config-file=/etc/config/controller_config_file.yaml
        - --alsologtostderr
        - -v=1
        - --enable-gang-scheduling=true

@mitake
Contributor

mitake commented Apr 23, 2018

@ChanYiLin thanks, could you also provide the content of /etc/config/controller_config_file.yaml? Did you specify SchedulerName in the file? I think you specified it correctly; I just want to make sure.

@ChanYiLin
Member

ChanYiLin commented Apr 23, 2018

@mitake
The content of /etc/config/controller_config_file.yaml comes from the configmap.

And I found that the default configmap, which is created by ksonnet following the instructions in the Kubeflow README, does not contain the SchedulerName.

Also, &v1alpha1.ControllerConfig{} does not store the value of SchedulerName.

In pkg/apis/tensorflow/v1alpha1/types.go

// ControllerConfig is a structure for storing the controller configuration
type ControllerConfig struct {
	// Accelerators is a map from the name of the accelerator to the config for that accelerator.
	// This should match the value specified as a container limit.
	// e.g. alpha.kubernetes.io/nvidia-gpu
	Accelerators map[string]AcceleratorConfig

	// Path to the file containing the grpc server source
	GrpcServerFilePath string
}

The tf-operator stores the scheduler name in the spec directly

// TFJobSpec structure for storing the TFJob specifications
type TFJobSpec struct {
	// TODO(jlewi): Can we we get rid of this and use some value from Kubernetes or a random ide.
	RuntimeId string

	// ReplicaSpecs specifies the TF replicas to run.
	ReplicaSpecs []*TFReplicaSpec `json:"replicaSpecs"`

	// TFImage defines the tensorflow docker image that should be used for default parameter server
	TFImage string `json:"tfImage,omitempty"`

	// TerminationPolicy specifies the condition that the tfjob should be considered finished.
	TerminationPolicy *TerminationPolicySpec `json:"terminationPolicy,omitempty"`

	// SchedulerName specifies the name of scheduler which should handle the TFJob
	SchedulerName string `json:"schedulerName,omitempty"`
}

and assigns it to the pods when creating them.

// CreatePodWithIndex will create a new pod with specify index
func (s *TFReplicaSet) CreatePodWithIndex(index int32) (*v1.Pod, error) {
	taskLabels := s.LabelsByIndex(index)

	pod := &v1.Pod{
		ObjectMeta: meta_v1.ObjectMeta{
			Name:   s.genPodName(index),
			Labels: taskLabels,
			OwnerReferences: []meta_v1.OwnerReference{
				helper.AsOwner(s.Job.job),
			},
		},
		Spec: *s.Spec.Template.Spec.DeepCopy(),
	}

	pod.Spec.SchedulerName = s.Job.SchedulerName()
        ...

In my pod, the scheduler name is set correctly.

...
    image: gcr.io/kubeflow/tf-benchmarks-gpu:v20171202-bdab599-dirty-284af3
    imagePullPolicy: IfNotPresent
    name: tensorflow
    resources:
      limits:
        cpu: "2"
        memory: 4Gi
        nvidia.com/gpu: "1"
      requests:
        cpu: "2"
        memory: 4Gi
        nvidia.com/gpu: "1"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-k5v6t
      readOnly: true
    workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
  dnsPolicy: ClusterFirst
  nodeName: apollo61
  nodeSelector:
    gpu: "yes"
  restartPolicy: OnFailure
  schedulerName: kube-batchd
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
...

So I don't think the problem comes from my configuration.

@ChanYiLin
Member

ChanYiLin commented Apr 25, 2018

@mitake
Hi, I found that you have made some changes to the PR.
Following the review at kubernetes-retired/kube-batch#181 (review), I changed

(r.GPU < rr.GPU || rr.GPU == 0)

to

r.GPU <= rr.GPU

My test scenario works like a charm now!
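For context, a minimal Go sketch of the corrected fit check (the Resource type and field names here are assumed, not kube-arbitrator's actual definitions):

package sketch

// Resource is a simplified stand-in for the scheduler's internal resource
// type; only the dimensions relevant to this discussion are shown.
type Resource struct {
    MilliCPU float64
    Memory   float64
    GPU      int64
}

// lessEqual reports whether request r fits into the remaining resources rr.
// With the earlier "(r.GPU < rr.GPU || rr.GPU == 0)" form, a pod requesting
// a GPU could be accepted by a node reporting zero GPUs; requiring <= in
// every dimension avoids that.
func lessEqual(r, rr *Resource) bool {
    return r.MilliCPU <= rr.MilliCPU &&
        r.Memory <= rr.Memory &&
        r.GPU <= rr.GPU
}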

@mitake
Contributor

mitake commented Apr 25, 2018

@ChanYiLin great! Thanks for testing again :) I'll make the PR ready to merge.

@mitake
Contributor

mitake commented May 29, 2018

The PR which adds GPU support to kube-arbitrator is already merged: kubernetes-retired/kube-batch#181
Probably we can close this issue? @jlewi @k82cn

@k82cn
Collaborator Author

k82cn commented Jul 3, 2018

The PR which adds GPU support to kube-arbitrator is already merged: kubernetes-retired/kube-batch#181

I think we can close this one. Upstream, we have decided to use kube-arbitrator for batch-related workloads; here's the design: kubernetes/community#2337 . For tf-operator, kube-arbitrator keeps backward compatibility by supporting the PDB, but if we want to support more features, e.g. nodeSelector, it's better to replace the PDB with SchedulingSpec once the design is finalized. I will open other PRs if necessary :)
