---
kep-number: 24
title: Coscheduling
authors:
- "@k82cn"
owning-sig: sig-scheduling, machine-learning WG
reviewers:
- "@bsalamat"
- "@vishh"
approvers:
- "@bsalamat"
- "@vishh"
editor: TBD
creation-date: 2018-07-03
last-updated: 2018-10-12
status: provisional
---

# Coscheduling

## Table of Contents

* [Table of Contents](#table-of-contents)
* [Motivation](#motivation)
* [Function Detail](#function-detail)
  * [API Definition](#api-definition)
  * [Lifecycle Management](#lifecycle-management)
  * [Scheduling](#scheduling)
  * [Customized Controller](#customized-controller)
* [Feature Interaction](#feature-interaction)
  * [Multi-scheduler](#multi-scheduler)
  * [Priority/Preemption](#prioritypreemption)
  * [Pod RestartPolicy](#pod-restartpolicy)
  * [Admission Controller](#admission-controller)
  * [Kubectl](#kubectl)
* [Roadmap](#roadmap)
* [References](#references)

## Motivation

Kubernetes has become a popular solution for orchestrating containerized workloads; it has been largely successful with serving and storage workloads, and now also has native support for Spark. Meanwhile, the community is also running Machine Learning (ML) workloads on Kubernetes, e.g. [kubeflow/tf-operator](https://github.com/kubeflow/tf-operator). When running a TensorFlow/MPI job, all tasks of the job must start together; otherwise, none of them should start. If there are enough resources to run all tasks, everything is fine; but that is not the case most of the time, especially in on-prem environments. In the worst case, all jobs end up pending because of a deadlock: every job starts only part of its tasks and waits for the remaining tasks to start. The situation is even worse in the cross-domain federation case, which is out of scope for this document.

After the discussion at the [Coscheduling/Gang-scheduling](https://docs.google.com/document/d/1AUwcvTtULNvow5M9e428FnlvINO1uQ7ojRoTGuTp4DA/edit#heading=h.ckn8nv2jj0xv) proposal, we decided to implement Coscheduling in [kube-batch](https://github.com/kubernetes-sigs/kube-batch) using CRDs. kube-batch focuses on "batch" workloads in Kubernetes, and will share the same [scheduling framework](https://github.com/kubernetes/community/pull/2281) when it is ready. This document provides the definition of the API object and the scheduler behaviour of Coscheduling.

## Function Detail

### API Definition

The following requirements were identified during the discussion of this feature:

1. Existing workloads can use this feature without (or with only a few) configuration changes.
2. Pods of a group/gang may have different `PodSpec`s (and/or belong to different collections).
3. Existing controllers that are responsible for managing the lifecycle of collections work well with this feature.

To meet the requirements above, the following **Kind** is introduced as a CRD under the `incubator.scheduling.k8s.io/v1alpha1` **Group**/**Version**.

```go
// PodGroup defines the scheduling requirement of a pod group.
type PodGroup struct {
	metav1.TypeMeta
	metav1.ObjectMeta

	// Spec defines the behavior of a pod group.
	// +optional
	Spec PodGroupSpec

	// Status represents the current information about a pod group.
	// This data may not be up to date.
	// +optional
	Status PodGroupStatus
}

// PodGroupSpec represents the template of a pod group.
type PodGroupSpec struct {
	// MinMembers defines the minimum number of members/tasks required to run
	// the pod group; if there are not enough resources to start all of them,
	// the scheduler will not start any.
	MinMembers int

	// TotalResources defines the total resources that the PodGroup requests
	// to run its Pods.
	TotalResources v1.ResourceList
}

// PodGroupStatus represents the current state of a pod group.
type PodGroupStatus struct {
	// The number of admitted pods.
	// +optional
	Admitted int32

	// The number of actively running pods.
	// +optional
	Running int32

	// The number of pods which reached phase Succeeded.
	// +optional
	Succeeded int32

	// The number of pods which reached phase Failed.
	// +optional
	Failed int32
}
```

The `PodGroup`, which is a namespaced object, specifies the attributes and status of a pod group, e.g. the number of pods in the group. To define which pods are members of a `PodGroup`, the following annotation key is introduced for `Pod`; the annotation key is used for this alpha feature, and it will be changed to a more permanent form, such as a field, when `PodGroup` moves to core.

```go
scheduling.k8s.io/group-name
```

The `scheduling.k8s.io/group-name` annotation specifies the `PodGroup` that the pod belongs to; a pod can only belong to a `PodGroup` in the same namespace. Pods controlled by different collections can also belong to the same `PodGroup`. Because of performance concerns, a `LabelSelector` is not used to build the relationship between `PodGroup` and `Pod`.
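
For illustration only, the sketch below shows how a controller or the scheduler could read and write this annotation; the package name and helper functions (`SetGroupName`, `GroupName`) are assumptions for this example, not part of the proposal.

```go
package coscheduling

import (
	v1 "k8s.io/api/core/v1"
)

// GroupNameAnnotationKey links a Pod to the PodGroup it belongs to.
const GroupNameAnnotationKey = "scheduling.k8s.io/group-name"

// SetGroupName marks a Pod as a member of the named PodGroup in the
// same namespace.
func SetGroupName(pod *v1.Pod, groupName string) {
	if pod.Annotations == nil {
		pod.Annotations = map[string]string{}
	}
	pod.Annotations[GroupNameAnnotationKey] = groupName
}

// GroupName returns the name of the PodGroup a Pod belongs to, if any.
func GroupName(pod *v1.Pod) (string, bool) {
	name, ok := pod.Annotations[GroupNameAnnotationKey]
	return name, ok
}
```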

### Lifecycle Management

As the lifecycle of Pods in a `PodGroup` may differ from one controller to another, the lifecycle of the members is not managed by the coscheduling feature itself. Each collection controller may implement, or may already have, the means to manage the lifecycle of its members. The scheduler records related events, e.g. `Unschedulable`, for the controller to use when managing its pods. A controller for `PodGroup` will be introduced later for lifecycle management, e.g. restarting the whole `PodGroup`, according to its configuration, when the number of running pods drops below `spec.MinMembers` at runtime; a minimal sketch of that check follows below.

Updates to `PodGroup` are not supported for now, and deleting a `PodGroup` does not impact the status of its Pods.
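
A minimal sketch of that restart check, assuming the `PodGroup` types defined above; `reconcilePodGroup` and the `restartGroup` callback are illustrative names, not part of kube-batch:

```go
// reconcilePodGroup sketches the restart behaviour described above; PodGroup
// is the type from the API section, and restartGroup stands in for whatever
// restart policy the controller implements (e.g. delete and recreate all
// member pods).
func reconcilePodGroup(pg *PodGroup, restartGroup func(*PodGroup) error) error {
	// Count members that still contribute to the gang.
	alive := pg.Status.Running + pg.Status.Succeeded
	if pg.Status.Admitted >= int32(pg.Spec.MinMembers) && alive < int32(pg.Spec.MinMembers) {
		// The group was admitted but has dropped below MinMembers at runtime;
		// restart the whole PodGroup according to its configuration.
		return restartGroup(pg)
	}
	return nil
}
```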

### Scheduling

The scheduler only watches `PodGroup` and `Pod`. It reconstructs the 'Job' from the Pod annotation and the `PodGroup`; the `Pod`s are treated as 'Tasks' of the 'Job'. If the annotation is empty, the scheduler records an unschedulable event for the pod, asking the user/controller to resubmit it. The scheduler does not schedule pods until their `PodGroup` is created.

As the batch scheduler and the default scheduler may run in parallel, the batch scheduler follows the multi-scheduler feature and only handles the `Pod`s that are submitted to it. The batch scheduler schedules as follows:

1. Reconstruct the 'Job' from the annotations of its `Pod`s and the `PodGroup`.
2. If there are fewer Pods than `minMembers` of the `PodGroup`, the 'Job' will not be scheduled, and an unschedulable event for the `Pod` will be recorded.
3. In the `allocate` phase, the scheduler will:
    * record an `Unschedulable` event for the `PodGroup` if some pods are running but `succeeded + pending + running < minMembers`, so that the controller can take action according to its configuration
    * allocate (but not bind) resources to Pods according to each Pod's spec, e.g. `NodeAffinity`
    * bind all Pods to hosts once the job is ready: if `minMembers <= allocated Pods + pending Pods`, the job is ready when `minMembers <= allocated Pods`; otherwise, it is ready when `minMembers <= allocated Pods + succeeded Pods` (a sketch of this readiness check appears below)
4. If the scheduler cannot allocate enough resources to the job, its pods stay pending, and the allocated resources are not given to another job.

That may leave resources (less than the job's total resource request) idle for a while, e.g. for a huge job. A solution, e.g. backfilling other smaller jobs to improve resource utilization, will be proposed in a coming release. In the `allocate` phase, only a pod's `NodeAffinity` takes effect; the other predicates/priorities will be added on demand in a coming release.
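
A minimal sketch of the readiness check in step 3, assuming simple integer counters; the function and parameter names are illustrative and do not mirror kube-batch's internals:

```go
// jobReady sketches the bind condition from step 3 above: no pod of the group
// is bound until the group as a whole can reach minMembers.
func jobReady(minMembers, allocated, pending, succeeded int) bool {
	if minMembers <= allocated+pending {
		// Enough members are still schedulable; bind only once minMembers of
		// them actually hold an allocation.
		return minMembers <= allocated
	}
	// Some members have already finished; count Succeeded pods toward the gang.
	return minMembers <= allocated+succeeded
}
```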

### Customized Controller

A typical example of a customized controller is [kubeflow/tf-operator](https://github.com/kubeflow/tf-operator), which manages the Pods for TensorFlow on Kubernetes and has requested `gang-scheduling` upstream. Here is an example of a customized controller that demonstrates the usage of `gang-scheduling` with `kube-batch`.

Usually, the CRD ([CustomResourceDefinitions](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)) feature is used to introduce a customized **Kind**, named `CRDJob` in this example. The customized controller, named `CRDJobController`, watches it and manages its lifecycle:

1. For each `CRDJob`, `CRDJobController` creates a `PodGroup` (one `CRDJob` with one `PodGroup` in this example). The attributes of the `PodGroup` should be set accordingly, e.g. `minMembers`; it is up to the customized controller how to manage the relationship between `PodGroup` and `CRDJob`, e.g. via `metadata.name`.
2. When `CRDJobController` creates Pods, their annotations should be set accordingly. `kube-batch` follows gang-scheduling logic to schedule those pods in a batch (see the sketch below).
3. When pods fail, are deleted, or are unschedulable, it is up to `CRDJobController` how to manage the `CRDJob`'s lifecycle. For example, if `CRDJobController` manages the lifecycle itself, it sets `.spec.Policy` of the `PodGroup` to nil; otherwise, `PodGroupController` will manage the lifecycle as described above.
4. If a `CRDJob` is deleted, the `PodGroup` must be deleted accordingly.
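
For illustration, a minimal sketch of steps 1 and 2; the `CRDJob` type and the helper names below are assumptions for this example rather than tf-operator's actual code:

```go
package example

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// CRDJob is a stand-in for a customized workload Kind (e.g. a TFJob); its
// fields are illustrative only.
type CRDJob struct {
	Name      string
	Namespace string
	Replicas  int
}

// podGroupFor builds the PodGroup that gangs all replicas of the job (step 1).
// PodGroup/PodGroupSpec are the types from the API section above.
func podGroupFor(job *CRDJob) *PodGroup {
	return &PodGroup{
		ObjectMeta: metav1.ObjectMeta{Name: job.Name, Namespace: job.Namespace},
		Spec:       PodGroupSpec{MinMembers: job.Replicas},
	}
}

// memberPodFor builds the i-th member Pod and links it to its PodGroup via the
// scheduling.k8s.io/group-name annotation (step 2). The PodSpec is omitted;
// each replica may use a different template.
func memberPodFor(job *CRDJob, i int) *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      fmt.Sprintf("%s-%d", job.Name, i),
			Namespace: job.Namespace,
			Annotations: map[string]string{
				"scheduling.k8s.io/group-name": job.Name,
			},
		},
	}
}
```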

## Feature Interaction

### Multi-scheduler

Since multiple schedulers work in parallel, there may be decision conflicts between different schedulers, and the kubelet will reject one of the conflicting pods (which then fails). The controller will handle rejected pods based on its lifecycle policy for failed pods. Users and cluster admins may reduce the probability of such conflicts by partitioning the cluster logically, for example, by setting node affinity to distinct sets of nodes on different groups of pods.

### Priority/Preemption

A rejected or preempted batch/run-to-completion pod may trigger a restart of the whole `PodGroup`, which can have a negative impact on performance. A solution for handling such conflicts better will be proposed in a coming release.

The default scheduler should also consider `PodGroup` when preempting pods, similar to `PodDisruptionBudgets`.

### Pod RestartPolicy

A Pod's `RestartPolicy` still works as before. But for batch/run-to-completion workloads, it is better to set `RestartPolicy` to `Never` to avoid endless restart loops.

### Admission Controller

If quota runs out in the middle of creating a group of pods, a few members of a `PodGroup` may be created while the rest are denied by the `ResourceQuota` admission controller. `.spec.TotalResources` is added to `PodGroup` to address this problem: when a `PodGroup` is created with `.spec.TotalResources`, that much quota is reserved for the group if quota is available, and the Pods of the group consume the already reserved quota. By setting `.spec.TotalResources` properly, one can ensure that the Pods of a `PodGroup` have enough quota at creation time. The design of the `Quota` enhancement supporting `.spec.TotalResources` will be proposed later for review.
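
For illustration, a minimal sketch of how a controller could compute `.spec.TotalResources` by summing the requests of all member pod templates; the helper name is an assumption, and the quota-reservation mechanism itself is deferred to the future `Quota` proposal:

```go
// totalResources sums the resource requests of all member pod specs so the
// result can be used as PodGroup .spec.TotalResources (init containers and
// pod overhead are ignored for brevity).
func totalResources(podSpecs []v1.PodSpec) v1.ResourceList {
	total := v1.ResourceList{}
	for _, spec := range podSpecs {
		for _, c := range spec.Containers {
			for name, qty := range c.Resources.Requests {
				sum := total[name]
				sum.Add(qty)
				total[name] = sum
			}
		}
	}
	return total
}
```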

### Kubectl

kubectl is enhanced to support `PodGroup`, including its status, via [kubectl plugins](https://kubernetes.io/docs/tasks/extend-kubectl/kubectl-plugins/).

## Roadmap

1. `PodGroup` CRD and Coscheduling (by kube-batch) (1.13)
1. `PodGroup` controller (by kube-batch) (1.13)
1. Admission controller for `.spec.TotalResources` (1.13, 1.14)

## References

* [Coscheduling in Kubernetes](https://docs.google.com/document/d/1AUwcvTtULNvow5M9e428FnlvINO1uQ7ojRoTGuTp4DA/edit#heading=h.ckn8nv2jj0xv)
* [Indexed Job](https://github.com/kubernetes/kubernetes/issues/14188)
* [Schedule a group of pods all at once](https://github.com/kubernetes/kubernetes/issues/16845)
* [kubeflow/tf-operator: Prevent scheduling deadlocks](https://github.com/kubeflow/tf-operator/issues/165)