Merge pull request #350 from k82cn/add-design
Add design
volcano-sh-bot authored Jul 18, 2019
2 parents 7b3f0f0 + 5c9e9c9 commit a4e3fcf
Showing 14 changed files with 646 additions and 0 deletions.
116 changes: 116 additions & 0 deletions docs/design/delay-pod-creation.md
@@ -0,0 +1,116 @@
# Delay Pod Creation

@k82cn; Jan 7, 2019

## Table of Contents

* [Delay Pod Creation](#delay-pod-creation)
* [Table of Contents](#table-of-contents)
* [Motivation](#motivation)
* [Function Detail](#function-detail)
* [State](#state)
* [Action](#action)
* [Admission Webhook](#admission-webhook)
* [Feature interaction](#feature-interaction)
* [Queue](#queue)
* [Quota](#quota)
* [Operator/Controller](#operatorcontroller)
* [Others](#others)
* [Compatibility](#compatibility)
* [Roadmap](#roadmap)
* [Reference](#reference)

Created by [gh-md-toc](https://github.com/ekalinin/github-markdown-toc)

## Motivation

In a batch system, there are always pending jobs because of limited resources and throughput.
Unlike other Kubernetes workload types, e.g. Deployment and DaemonSet, it is better for batch workloads to delay
pod creation, to reduce apiserver pressure and speed up scheduling (e.g. fewer pending pods to consider).
In this document, several enhancements are introduced to delay pod creation.

## Function Detail

### State

A new state, named `InQueue`, will be introduced to denote the phase in which a job is ready to be allocated.
After introducing `InQueue`, the state transition map is updated as follows.

| From | To | Reason |
|---------------|----------------|---------|
| Pending | InQueue | When it is ready to allocate resources to the job |
| InQueue | Pending | When there are no longer enough resources |
| InQueue | Running | When all pods of `spec.minMember` are running |

`InQueue` is a new state between `Pending` and `Running`; it lets operators/controllers start to
create pods. If errors occur, e.g. the job is unschedulable, the job rolls back to `Pending` instead of staying
`InQueue`, to avoid a retry loop.
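
For illustration only, here is a minimal Go sketch of how the transition table above could be encoded. The type and constant names (`PodGroupPhase`, `Pending`, `InQueue`, `Running`) mirror the table and are assumptions, not the actual kube-batch types:

```go
package main

import "fmt"

// PodGroupPhase mirrors the phases in the table above (illustrative only).
type PodGroupPhase string

const (
	Pending PodGroupPhase = "Pending"
	InQueue PodGroupPhase = "InQueue"
	Running PodGroupPhase = "Running"
)

// validTransitions encodes the From -> To rows of the transition table.
var validTransitions = map[PodGroupPhase][]PodGroupPhase{
	Pending: {InQueue},          // ready to allocate resources to the job
	InQueue: {Pending, Running}, // roll back on resource shortage, or run once minMember pods are up
}

// canTransition reports whether moving from one phase to another is allowed.
func canTransition(from, to PodGroupPhase) bool {
	for _, t := range validTransitions[from] {
		if t == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(InQueue, Running)) // true
	fmt.Println(canTransition(Pending, Running)) // false: must go through InQueue
}
```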

### Action

Currently, `kube-batch` supports several actions, e.g. `allocate` and `preempt`, but all of those actions are executed
based on pending pods. To support the `InQueue` state, a new action, named `enqueue`, will be introduced.

By default, the `enqueue` action will handle `PodGroup`s with an FCFS policy; `enqueue` will go through all PodGroups
(ordered by creation timestamp) and update a PodGroup's phase to `InQueue` if:

* there are enough idle resources for the `spec.minResources` of the `PodGroup`
* there is enough quota for the `spec.minResources` of the `PodGroup`

As `kube-batch` handles a `PodGroup` by its `spec.minResources`, the operator/controller may create more `Pod`s than
`spec.minResources`; in that case, the `preempt` action will be enhanced to evict the overused `PodGroup` to release
resources.
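
A rough Go sketch of the FCFS `enqueue` loop described above; the `PodGroup` and `Resource` types and their helper methods are simplified stand-ins for illustration, not the real kube-batch API:

```go
package main

import (
	"sort"
	"time"
)

// Resource is a simplified stand-in for the scheduler's resource type.
type Resource struct {
	MilliCPU, Memory float64
}

// LessEqual reports whether every dimension of r fits into o.
func (r Resource) LessEqual(o Resource) bool {
	return r.MilliCPU <= o.MilliCPU && r.Memory <= o.Memory
}

// PodGroup is a simplified stand-in for the PodGroup API object.
type PodGroup struct {
	Name              string
	CreationTimestamp time.Time
	MinResources      Resource
	Phase             string
}

// enqueue walks PodGroups in FCFS order (by creation timestamp) and moves a
// group to InQueue when both idle resources and remaining quota can cover
// its spec.minResources.
func enqueue(groups []*PodGroup, idle, quota Resource) {
	sort.Slice(groups, func(i, j int) bool {
		return groups[i].CreationTimestamp.Before(groups[j].CreationTimestamp)
	})
	for _, pg := range groups {
		if pg.Phase != "Pending" {
			continue
		}
		if pg.MinResources.LessEqual(idle) && pg.MinResources.LessEqual(quota) {
			pg.Phase = "InQueue"
			// Reserve the resources locally so later PodGroups see what is left.
			idle.MilliCPU -= pg.MinResources.MilliCPU
			idle.Memory -= pg.MinResources.Memory
			quota.MilliCPU -= pg.MinResources.MilliCPU
			quota.Memory -= pg.MinResources.Memory
		}
	}
}

func main() {}
```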

### Admission Webhook

To guarantee the transactionality of `spec.minResources`, a new `MutatingAdmissionWebhook`, named `PodGroupMinResources`,
is introduced. `PodGroupMinResources` makes sure that:

* the sum of all PodGroups' `spec.minResources` in a namespace is not more than the `Quota`
* if resources are reserved by `spec.minResources`, they can not be used by others

Generally, it is better to let the total `Quota` be larger than the available resources in the cluster, as some pods may be
unschedulable because of the scheduler's algorithm, e.g. predicates.
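
For illustration, a sketch of how such a webhook could be registered with the API server; the service name, namespace, path, and API group/version are assumptions, not values defined by this design:

```yaml
apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
metadata:
  name: podgroup-minresources
webhooks:
- name: podgroupminresources.example.com   # illustrative name
  clientConfig:
    service:
      name: kube-batch-webhook             # assumed service name
      namespace: kube-system
      path: /podgroups/minresources        # assumed path
    caBundle: <base64-encoded-CA>
  rules:
  - apiGroups: ["scheduling.incubator.k8s.io"]   # assumed PodGroup API group
    apiVersions: ["v1alpha1"]
    operations: ["CREATE", "UPDATE"]
    resources: ["podgroups"]
  failurePolicy: Fail
```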

## Feature interaction

### Queue

The resources will be shared between `Queue`s by an algorithm, e.g. `proportion` by default. If the resources can not be
fully used because of fragmentation, the `backfill` action will help with that. If a `Queue` uses more resources than it
deserves, the `reclaim` action will help to balance the resources. A Pod can not currently be evicted if the eviction would
break `spec.minMember`; this will be enhanced to support job-level eviction.

### Quota

To delay pod creation, both `kube-batch` and `PodGroupMinResources` will watch `ResourceQuota` to decide which
`PodGroup` should be enqueued first. The decision may be invalidated by race conditions, e.g. other
controllers creating Pods. In that case, `PodGroupMinResources` will reject the creation request, and the `PodGroup` will keep the
`InQueue` state until `kube-batch` transforms it back to `Pending`. To avoid such race conditions, it is better to let `kube-batch`
manage the `Pod` number and resources (e.g. CPU, memory) instead of `Quota`.

### Operator/Controller

The operator/controller should follow the above "protocol" to work together with the scheduler. A new component,
named `PodGroupController`, will be introduced later to enforce this protocol if necessary.

## Others

### Compatibility

To support this new feature, a new state and a new action are introduced; when the new `enqueue` action is
disabled in the configuration, `kube-batch` keeps the same behaviour as before.
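
For illustration, using the scheduler configuration format introduced in `plugin-conf.md` (part of this same commit), the feature could be toggled by adding or omitting `enqueue` in the actions list; the exact action lists below are assumptions:

```yaml
# enqueue enabled: jobs pass through InQueue before pods are created
actions: "enqueue, allocate, backfill"

# enqueue disabled: previous behaviour, pods are created while jobs are Pending
# actions: "allocate, backfill"
```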

## Roadmap

* `InQueue` phase and `enqueue` action (v0.5+)
* Admission Controller (v0.6+)

## Reference

* [Coscheduling](https://github.com/kubernetes/enhancements/pull/639)
* [Delay Pod creation](https://github.com/kubernetes-sigs/kube-batch/issues/539)
* [PodGroup Status](https://github.com/kubernetes-sigs/kube-batch/blob/master/doc/design/podgroup-status.md)
* [Support 'spec.TotalResources' in PodGroup](https://github.com/kubernetes-sigs/kube-batch/issues/401)
* [Dynamic Admission Control](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#write-an-admission-webhook-server)
* [Add support for podGroup number limits for one queue](https://github.com/kubernetes-sigs/kube-batch/issues/452)
33 changes: 33 additions & 0 deletions docs/design/drf.md
@@ -0,0 +1,33 @@
## Dominant Resource Fairness (DRF)

## Introduction
Dominant Resource Fairness (DRF) is a generalization of max-min fairness to multiple resource types; it is a resource allocation policy for clusters with multiple resource types.

The dominant resource of a job is the resource type (cpu, memory, gpu) that the job demands most among the resources it needs. This demand is measured as a share of the total cluster resources of the same type.

DRF computes the share of the dominant resource allocated to a job (the dominant share) and tries to maximize the smallest dominant share in the system.
It schedules the tasks of the job with the smallest dominant share first.


## Kube-Batch Implementation
DRF calculates a share for each job. The share is the highest ratio of allocated resource to total cluster resource across the three resource types CPU, memory and GPU.
This share value is used for job ordering and task preemption.
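
A minimal Go sketch of this share computation and of the comparisons it drives in the job-ordering and task-preemption rules described in the sections below; the types and function names are illustrative, and the preemption check is one simple reading of the rule (the victim task's resources are assumed to be transferred to the preemptor), not the actual kube-batch implementation:

```go
package main

import "fmt"

// Resource holds per-type amounts (illustrative, not the real scheduler type).
type Resource struct {
	MilliCPU, Memory, GPU float64
}

// dominantShare returns the highest ratio of allocated to total cluster
// resources across CPU, memory and GPU.
func dominantShare(allocated, total Resource) float64 {
	share := 0.0
	pairs := [][2]float64{
		{allocated.MilliCPU, total.MilliCPU},
		{allocated.Memory, total.Memory},
		{allocated.GPU, total.GPU},
	}
	for _, p := range pairs {
		if p[1] > 0 && p[0]/p[1] > share {
			share = p[0] / p[1]
		}
	}
	return share
}

// jobOrderLess implements the job-ordering rule: the job with the lower
// dominant share comes first (i.e. has the higher priority).
func jobOrderLess(allocA, allocB, total Resource) bool {
	return dominantShare(allocA, total) < dominantShare(allocB, total)
}

// canPreempt implements the task-preemption rule: the preemptor may evict a
// task of the preemptee only if, after moving the task's resources from the
// preemptee to the preemptor, the preemptor's share is still smaller.
func canPreempt(preemptorAlloc, preempteeAlloc, taskReq, total Resource) bool {
	newPreemptor := Resource{
		MilliCPU: preemptorAlloc.MilliCPU + taskReq.MilliCPU,
		Memory:   preemptorAlloc.Memory + taskReq.Memory,
		GPU:      preemptorAlloc.GPU + taskReq.GPU,
	}
	newPreemptee := Resource{
		MilliCPU: preempteeAlloc.MilliCPU - taskReq.MilliCPU,
		Memory:   preempteeAlloc.Memory - taskReq.Memory,
		GPU:      preempteeAlloc.GPU - taskReq.GPU,
	}
	return dominantShare(newPreemptor, total) < dominantShare(newPreemptee, total)
}

func main() {
	total := Resource{MilliCPU: 8000, Memory: 32e9, GPU: 4}
	job1 := Resource{MilliCPU: 2000, Memory: 4e9}  // dominant share 0.25 (CPU)
	job2 := Resource{MilliCPU: 1000, Memory: 16e9} // dominant share 0.50 (memory)
	fmt.Println(jobOrderLess(job1, job2, total))   // true: job1 is scheduled first
}
```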

#### 1. Job Ordering:
The job with the lowest share has the higher priority.
In the example below, tasks task1 and task2 of job1, and task3 and task4 of job2, are already allocated to the cluster.
![drfjobordering](./images/drfjobordering.png)


##### 1.1 Gang Scheduling with DRF in job ordering ( Gang -> DRF)
Gang scheduling sorts jobs based on whether a job already has at least **minAvailable** tasks (allocated + successfully completed + pipelined) or not.
Jobs that have not met the minAvailable criteria have higher priority than jobs that have met it.

Jobs that have met the minAvailable criteria are sorted according to DRF.

![gangwithdrf](./images/gangwithdrf.png)

#### 2. Task Preemption:

The preemptor can preempt other tasks only if the share of the preemptor is less than the share of the preemptee after recalculating the resource allocation of the preemptor and preemptee.
48 changes: 48 additions & 0 deletions docs/design/execution-flow.md
@@ -0,0 +1,48 @@
## Execution of the Scheduler for Allocating the Workloads to Node

The allocation of workloads to nodes in the scheduler happens in each session; the workflow of a session is illustrated in the diagram below, and a simplified Go-style sketch of the loop follows the diagram.

1. A Session opens every 1 second.
2. In every session, local copies of the Queues, JobsMap, PendingTasks and Node list are created.
3. For each Job in the Session:
    1. If the Queue ID referenced by the Job exists in the Session, then:
        1. Add the Queue to the local copy of Queues.
        2. If the Queue ID exists in the local copy of JobsMap, push the Job into that JobsMap entry.
        3. If not, add the Queue ID as a key in the local JobsMap and add the Job under it.
    2. If not, give a warning and continue with the next Job in Step 3.
4. For each item in the local Queues:
    1. Pop a Queue from the Queues.
        1. Check whether the Queue is overused.
            1. If yes, continue with the next Queue in Step 4.
            2. If not, get the JobsList from the JobsMap for that particular Queue.
                1. If the list is empty, continue with the next Queue in Step 4.
                2. Otherwise, pop a Job from the JobsList.
                    1. Check whether the Job exists in the local PendingTasks.
                        1. If not:
                            1. Create a local task list.
                                1. Get the list of tasks in the pending state for that Job.
                                    1. If the resource request of a task is empty, skip it and go back to the previous step.
                                    2. If not, add the task to the local task list.
                            2. Add the local task list to the PendingTasks for that Job.
                        2. If yes:
                            1. For each task in the pending task list for the Job:
                                1. Pop the task.
                                2. Get the list of predicate nodes for the task.
                                3. Score the predicate nodes and sort them.
                                4. For each node in the sorted, predicated and scored nodes:
                                    1. If the resource required by the task is less than the idle resource of the node, allocate the task to that node.
                                    2. Otherwise, if the resource required by the task is less than the releasing resource of the node, add the task to the pipeline.
                3. Check whether the Job is ready to be allocated.
                    1. If yes, push the Job.
                    2. If not, add the Queue back to the list.
                    3. Continue until all the Jobs are ready.
    2. Continue until every Queue is processed.






![Execution flow graph](../../images/AllocateDesign.png)
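
A simplified Go-style sketch of the allocation loop above; the types (`Queue`, `Job`, `Task`, `Node`) and helpers are stand-ins for illustration, not the actual kube-batch code, and the predicate/scoring steps are only indicated by comments:

```go
package main

// Simplified stand-in types; the real scheduler types are richer.
type Resource struct{ MilliCPU, Memory float64 }

func (r Resource) LessEqual(o Resource) bool {
	return r.MilliCPU <= o.MilliCPU && r.Memory <= o.Memory
}

type Task struct{ Req Resource }

type Node struct{ Idle, Releasing Resource }

type Job struct {
	QueueID      string
	PendingTasks []*Task
	allocated    int
	minAvailable int
}

func (j *Job) Ready() bool { return j.allocated >= j.minAvailable }

type Queue struct {
	ID       string
	overused bool
}

// allocate mirrors the session loop described above: walk queues, pick jobs,
// and try to place each pending task on a predicated, scored node.
func allocate(queues []*Queue, jobs []*Job, nodes []*Node) {
	// Build local copies: queue ID -> jobs in that queue.
	jobsMap := map[string][]*Job{}
	for _, job := range jobs {
		jobsMap[job.QueueID] = append(jobsMap[job.QueueID], job)
	}

	for _, q := range queues {
		if q.overused {
			continue // skip overused queues
		}
		for _, job := range jobsMap[q.ID] {
			for _, task := range job.PendingTasks {
				if task.Req == (Resource{}) {
					continue // empty resource request: nothing to place
				}
				// Predicate and scoring would filter and order the nodes here;
				// this sketch simply walks them in the given order.
				for _, node := range nodes {
					if task.Req.LessEqual(node.Idle) {
						// Allocate: bind the task and consume idle resources.
						node.Idle.MilliCPU -= task.Req.MilliCPU
						node.Idle.Memory -= task.Req.Memory
						job.allocated++
						break
					}
					if task.Req.LessEqual(node.Releasing) {
						// Pipeline: wait for resources being released on this node.
						job.allocated++
						break
					}
				}
			}
			_ = job.Ready() // a ready job would be pushed for binding here
		}
	}
}

func main() {}
```
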
File renamed without changes.
Binary file added docs/design/images/drfjobordering.png
Binary file added docs/design/images/gangwithdrf.png
39 changes: 39 additions & 0 deletions docs/design/metrics.md
@@ -0,0 +1,39 @@
## Scheduler Monitoring

## Introduction
Currently users can leverage controller logs and job events to monitor the scheduler. While useful for debugging, none of these options is particularly practical for monitoring kube-batch behaviour over time. There is also a requirement to monitor kube-batch in one view so that critical performance issues can be resolved in time [#427](https://github.com/kubernetes-sigs/kube-batch/issues/427).

This document describes metrics we want to add into kube-batch to better monitor performance.

## Metrics
In order to support metrics, kube-batch needs to expose a metrics endpoint which can provide golang process metrics like the number of goroutines, GC duration, CPU and memory usage, etc., as well as kube-batch custom metrics related to the time taken by plugins or actions.

All the metrics are prefixed with `kube_batch_`.

### kube-batch execution
These metrics track the execution of the plugins and actions of the kube-batch loop.

| Metric name | Metric type | Labels | Description |
| ----------- | ----------- | ------ | ----------- |
| e2e_scheduling_latency | histogram | | E2e scheduling latency in seconds |
| plugin_latency | histogram | `plugin`=<plugin_name> | Schedule latency for plugin |
| action_latency | histogram | `action`=<action_name> | Schedule latency for action |
| task_latency | histogram | `job`=<job_id> `task`=<task_id> | Schedule latency for each task |
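
A minimal Go sketch, using `prometheus/client_golang`, of how the `plugin_latency` metric above could be registered and exposed; the port and the code structure are illustrative assumptions, not the actual kube-batch implementation:

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// pluginLatency mirrors the plugin_latency metric in the table above;
// the "kube_batch" subsystem gives every metric the kube_batch_ prefix.
var pluginLatency = promauto.NewHistogramVec(prometheus.HistogramOpts{
	Subsystem: "kube_batch",
	Name:      "plugin_latency",
	Help:      "Schedule latency for plugin",
}, []string{"plugin"})

// observePlugin records how long one plugin invocation took.
func observePlugin(plugin string, start time.Time) {
	pluginLatency.WithLabelValues(plugin).Observe(time.Since(start).Seconds())
}

func main() {
	// Example observation; in the scheduler this would wrap real plugin calls.
	start := time.Now()
	time.Sleep(10 * time.Millisecond)
	observePlugin("drf", start)

	// Expose the metrics endpoint that Prometheus scrapes.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```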


### kube-batch operations
These metrics describe the internal state of kube-batch.

| Metric name | Metric type | Labels | Description |
| ----------- | ----------- | ------ | ----------- |
| pod_schedule_errors | Counter | | The number of pods that kube-batch failed to schedule due to an error |
| pod_schedule_successes | Counter | | The number of pods that kube-batch successfully scheduled |
| pod_preemption_victims | Counter | | Number of selected preemption victims |
| total_preemption_attempts | Counter | | Total preemption attempts in the cluster so far |
| unschedule_task_count | Counter | `job`=<job_id> | The number of tasks that failed to schedule |
| unschedule_job_counts | Counter | | The number of jobs that failed to schedule in each iteration |
| job_retry_counts | Counter | `job`=<job_id> | The number of retries of one job |


### kube-batch Liveness
A health check based on the time of the last kube-batch activity and a timeout.
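
A minimal Go sketch of such a liveness check; the endpoint path, port and timeout value are illustrative assumptions:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

// lastActivity holds the Unix timestamp of the last scheduler activity.
var lastActivity int64

// markActive would be called at the end of every scheduling session.
func markActive() { atomic.StoreInt64(&lastActivity, time.Now().Unix()) }

// healthz fails when no activity has been observed within the timeout.
func healthz(timeout time.Duration) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		last := time.Unix(atomic.LoadInt64(&lastActivity), 0)
		if time.Since(last) > timeout {
			http.Error(w, fmt.Sprintf("no activity since %v", last), http.StatusInternalServerError)
			return
		}
		w.Write([]byte("ok"))
	}
}

func main() {
	markActive()
	http.Handle("/healthz", healthz(5*time.Minute)) // timeout value is illustrative
	log.Fatal(http.ListenAndServe(":11251", nil))   // port is illustrative
}
```
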
18 changes: 18 additions & 0 deletions docs/design/node-priority.md
@@ -0,0 +1,18 @@
## Node Priority in Kube-Batch

This feature allows `kube-batch` to schedule workloads based on the priority of the Nodes. Workloads will be scheduled on Nodes with higher priority, and these priorities will be calculated based on different parameters like `ImageLocality`, `Most/Least Requested Nodes`, etc.
A basic flow for the Node priority functions is depicted below.

![Node Priority Flow](../images/Node-Priority.png)

Currently in kube-batch a `Session` is opened every 1 second. The workloads in the Queue go through `Predicate` functions to find a suitable set of Nodes where the workloads can be scheduled, then through the `Allocate` function to assign the Pods to the Nodes, and then through `Preempt` if applicable.

Node priority can be introduced into the current flow for the `Allocate` and `Preempt` functions. Once we have the set of Nodes where we can schedule the workloads, the flow will go through a `Prioritize` function which does the following:

- Run all the priority functions, in parallel go-routines, on the list of Nodes returned by the `Predicate` function.
- Score each Node based on whether the `Priority Rule` satisfies the workload's scheduling criteria.
- Once the scores are returned from all the `PriorityFn`s, aggregate the scores and identify the Node with the highest score.
- Delegate the Node selected in the last step to `AllocateFn` to bind the workload to the Node.

Currently there are multiple `PriorityFn`s available in the default Kubernetes scheduler. Going forward, with each release we will implement these priority functions in kube-batch based on their importance to batch scheduling.
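
A small Go sketch of how such priority functions could be run and aggregated; the `PriorityFn` signature, the score values and the example function are assumptions for illustration, not the actual kube-batch interfaces:

```go
package main

import (
	"fmt"
	"sync"
)

// Node is a minimal stand-in for a cluster node.
type Node struct{ Name string }

// PriorityFn scores one node; higher is better
// (an assumed signature, not the real kube-batch one).
type PriorityFn func(node Node) float64

// prioritize runs every priority function on every predicated node in
// parallel go-routines, sums the scores per node, and returns the best node.
func prioritize(nodes []Node, fns []PriorityFn) Node {
	scores := make([]float64, len(nodes))
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i, node := range nodes {
		for _, fn := range fns {
			wg.Add(1)
			go func(i int, node Node, fn PriorityFn) {
				defer wg.Done()
				s := fn(node)
				mu.Lock()
				scores[i] += s // aggregate scores per node
				mu.Unlock()
			}(i, node, fn)
		}
	}
	wg.Wait()

	best := 0
	for i := range nodes {
		if scores[i] > scores[best] {
			best = i
		}
	}
	return nodes[best] // this node would be handed to AllocateFn for binding
}

func main() {
	nodes := []Node{{"node-1"}, {"node-2"}}
	imageLocality := func(n Node) float64 { // illustrative priority function
		if n.Name == "node-2" {
			return 10
		}
		return 0
	}
	fmt.Println(prioritize(nodes, []PriorityFn{imageLocality}).Name) // node-2
}
```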

79 changes: 79 additions & 0 deletions docs/design/plugin-conf.md
@@ -0,0 +1,79 @@
# Dynamic Plugins Configuration

## Table of Contents

* [Dynamic Plugins Configuration](#dynamic-plugins-configuration)
* [Table of Contents](#table-of-contents)
* [Motivation](#motivation)
* [Function Detail](#function-detail)
* [Feature Interaction](#feature-interaction)
* [ConfigMap](#configmap)
* [Reference](#reference)

Created by [gh-md-toc](https://github.com/ekalinin/github-markdown-toc)

## Motivation

There are several plugins and actions in `kube-batch` right now; users may want to enable only a subset of those plugins and actions. This document introduces dynamic plugins configuration, so that users can configure `kube-batch` for their
scenario on the fly.

## Function Detail

The following YAML format will be introduced for dynamic plugin configuration:

```yaml
actions: "list_of_action_in_order"
tiers:
- plugins:
  - name: "plugin_1"
    disableJobOrder: true
  - name: "plugin_2"
- plugins:
  - name: "plugin_3"
    disableJobOrder: true
```
The `actions` is a list of actions that will be executed by `kube-batch` in order, separated
by commas. Refer to the [tutorial](https://github.com/kubernetes-sigs/kube-batch/issues/434) for
the list of actions supported by `kube-batch`. Those actions will be executed in the given order, although
the "order" may be incorrect; `kube-batch` does not enforce that.

The `tiers` is a list of plugins that will be used by related actions, e.g. `allocate`. It includes
several tiers of plugin lists, each defined by `plugins`; if a candidate fits the plugins in a higher-priority tier,
the action will not go through the plugins in the lower-priority tiers. Within each tier, a candidate is considered
passed only if it fits all the plugins listed in `plugins.names`.

The options define the detailed behaviour of each plugin, e.g. whether preemption is enabled. If not
specified, `true` is the default value. For now, `preemptable`, `jobOrder` and `taskOrder` are supported.

Take the following example as a demonstration:

1. The actions `"reclaim, allocate, backfill, preempt"` will be executed in order by `kube-batch`
1. `"priority"` has a higher priority than `"gang, drf, predicates, proportion"`; a job with higher priority
will preempt other jobs, even though "enough" resources have already been allocated according to `"drf"`
1. `"tiers.plugins.drf.disableTaskOrder"` is `true`, so `drf` will not impact the task-order phase/action

```yaml
actions: "reclaim, allocate, backfill, preempt"
tiers:
- plugins:
  - name: "priority"
  - name: "gang"
- plugins:
  - name: "drf"
    disableTaskOrder: true
  - name: "predicates"
  - name: "proportion"
```

## Feature Interaction

### ConfigMap

`kube-batch` will read the plugin configuration from the command line argument `--scheduler-conf`; users can
mount a `ConfigMap` as a volume into the `kube-batch` pod during deployment to provide this file.
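
For illustration, a sketch of a `ConfigMap` holding the configuration and the relevant parts of the pod template that mounts it; the ConfigMap name, file name, namespace and mount path are assumptions:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-batch-scheduler-conf   # illustrative name
  namespace: kube-system
data:
  scheduler.conf: |
    actions: "reclaim, allocate, backfill, preempt"
    tiers:
    - plugins:
      - name: "priority"
      - name: "gang"
---
# Relevant parts of the kube-batch Deployment's pod template
spec:
  containers:
  - name: kube-batch
    args:
    - --scheduler-conf=/kube-batch/conf/scheduler.conf   # path is illustrative
    volumeMounts:
    - name: scheduler-conf
      mountPath: /kube-batch/conf
  volumes:
  - name: scheduler-conf
    configMap:
      name: kube-batch-scheduler-conf
```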

## Reference

* [Add preemption by Job priority](https://github.com/kubernetes-sigs/kube-batch/issues/261)
* [Support multiple tiers for Plugins](https://github.com/kubernetes-sigs/kube-batch/issues/484)
