# TiDB Stable Scheduling

This document presents a design to schedule the new pod of a TiDB member to
its previous node in certain circumstances.

## Table of Contents

- [Glossary](#glossary)
- [Motivation](#motivation)
- [Goals](#goals)
* [Non-Goals](#non-goals)
- [Use Cases](#use-cases)
* [No need to update IP addresses of TiDB in load balancer outside of the Kubernetes cluster](#no-need-to-update-ip-addresses-of-tidb-in-load-balancer-outside-of-the-kubernetes-cluster)
- [Proposal](#proposal)
- [Implementation](#implementation)
- [Alternatives](#alternatives)
* [Deploy TiDB members on all nodes](#deploy-tidb-members-on-all-nodes)
* [Leverage load balancer health check mechanism](#leverage-load-balancer-health-check-mechanism)
* [Use local PV to bind TiDB member to a node](#use-local-pv-to-bind-tidb-member-to-a-node)
- [Limitations](#limitations)
* [No guarantee if there is another scheduler to schedule pods to TiDB nodes](#no-guarantee-if-there-is-another-scheduler-to-schedule-pods-to-tidb-nodes)
+ [Workarounds](#workarounds)
* [Cannot schedule new pod of TiDB member back to its node if the node does not meet new requirements](#cannot-schedule-new-pod-of-tidb-member-back-to-its-node-if-the-node-does-not-meet-new-requirements)

## Glossary

- Pod: A deployable object in Kubernetes.
- Node: A worker machine that runs pods in Kubernetes.
- TiDB cluster: A database cluster composed of the TiDB server, PD server and
  TiKV server. Each TiDB/PD/TiKV server is itself a cluster which consists of
  one or more members.
- TiDB server: One of the key components of a TiDB cluster. It is the access
  point of the TiDB cluster.
- TiDB member: A member of the TiDB server which holds a unique network identifier.
- TiDB pod: For each TiDB member, there is at most one running pod. It may
  crash or be replaced by a new pod.
- Load balancer: A component which proxies traffic for applications, e.g. LVS,
HAProxy, F5, etc.

## Motivation

There are use cases where users need to access the TiDB server from outside the
Kubernetes cluster. But in some environments (e.g. on bare-metal machines), we
may lack a load balancer solution or need to configure the IP addresses of TiDB
services in an existing load balancer.

In this scenario, we need to use a `NodePort` service. By default, NodePort
services are available cluster-wide on all nodes, but they cannot propagate the
client's IP address to the end pods and may cause a second hop.

To propagate the client's IP address and get better performance, we can
set `externalTrafficPolicy` of the service to `Local`. A side effect is that the
TiDB service will be accessible only on the nodes which have a running TiDB pod.
To avoid manual intervention to update IP addresses in the load balancer when
performing a rolling update, we prefer to schedule the new pod of a TiDB member
to its previous node.
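
For reference, a minimal sketch in Go of such a service, using the
`k8s.io/api/core/v1` types; the service name, namespace, selector labels and
function name are placeholders for illustration, not what tidb-operator
actually generates:

```go
import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// newTiDBNodePortService builds a NodePort service whose traffic policy is
// Local, so the client IP is preserved and no second hop happens. The
// trade-off is that the node port only answers on nodes hosting a TiDB pod.
func newTiDBNodePortService() *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "demo-tidb", Namespace: "tidb"},
		Spec: corev1.ServiceSpec{
			Type:                  corev1.ServiceTypeNodePort,
			ExternalTrafficPolicy: corev1.ServiceExternalTrafficPolicyTypeLocal,
			Selector:              map[string]string{"app.kubernetes.io/component": "tidb"},
			Ports: []corev1.ServicePort{{
				Name:       "mysql-client",
				Port:       4000, // TiDB's MySQL protocol port
				TargetPort: intstr.FromInt(4000),
			}},
		},
	}
}
```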

## Goals

- Able to schedule the new pod of a TiDB member to its previous node

### Non-Goals

- Stable scheduling for TiKV/PD pods
- Guarantee that the new pod of a TiDB member will be scheduled to its node if
  there is another scheduler in the cluster which may schedule pods to that node
- Guarantee that the new pod of a TiDB member will be scheduled to its node if
  some scheduling requirements have changed

## Use Cases

### No need to update IP addresses of TiDB in load balancer outside of the Kubernetes cluster

When a TiDB cluster is running in a dedicated Kubernetes cluster, or when nodes
are reserved for it, new pods of TiDB members will be scheduled to their
previous nodes after a rolling update of the TiDB cluster is done. The user
does not need to update IP addresses in the load balancer.

## Proposal

Currently, we have tidb-scheduler to schedule all pods of a TiDB cluster. We can
write a new predicate function for pods of the TiDB server. In this new predicate
function, we can choose the previous node of a TiDB member if it exists among the
candidate nodes.

Note that it is not possible for tidb-scheduler to schedule the new pod of a
TiDB member back to its node if the node does not meet the new scheduling
requirements (e.g. CPU/Memory, taints).

## Implementation

First, we track the assigned node of each TiDB member in the status of TiDBCluster.

```go
type TiDBMember struct {
	...
	// Node hosting pod of this TiDB member.
	NodeName string `json:"node,omitempty"`
}
```
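
One possible way to populate this field (a sketch only; the actual status-sync
code in tidb-operator may differ, and the function name is illustrative) is to
copy the pod's `Spec.NodeName` into the member status once the scheduler has
bound the pod:

```go
import corev1 "k8s.io/api/core/v1"

// recordTiDBMemberNode remembers the node a TiDB member's pod was bound to.
// Spec.NodeName is filled in by the scheduler when the pod is assigned, so an
// empty value simply means the pod has not been scheduled yet.
func recordTiDBMemberNode(pod *corev1.Pod, member *TiDBMember) {
	if pod.Spec.NodeName != "" {
		member.NodeName = pod.Spec.NodeName
	}
}
```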

In the new predicate `StableScheduling`, we filter out other nodes for a TiDB pod
if the previous node of this TiDB member exists among the candidate nodes.
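
A minimal sketch of this filtering logic follows; the function name and
signature are illustrative, not the exact predicate interface of tidb-scheduler:

```go
import corev1 "k8s.io/api/core/v1"

// stableSchedulingFilter keeps only the member's previous node when that node
// is still among the candidates returned by the default scheduler predicates.
// If the member has no recorded node, or the node no longer qualifies, the
// candidate list is returned unchanged and normal scheduling takes over.
func stableSchedulingFilter(previousNode string, candidates []corev1.Node) []corev1.Node {
	if previousNode == "" {
		return candidates
	}
	for _, node := range candidates {
		if node.Name == previousNode {
			return []corev1.Node{node}
		}
	}
	return candidates
}
```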

## Alternatives

### Deploy TiDB members on all nodes

If TiDB members are running on all nodes, clients can access TiDB server with
any node IP address.

We can achieve this by using a DaemonSet or a Deployment with pod anti-affinity.
But if the Kubernetes cluster is large, deploying TiDB members of each TiDB
cluster on every node is very inefficient and will consume too many resources.
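
For illustration, a pod anti-affinity term of the kind this alternative relies
on might look like the sketch below; the label keys and values are assumptions,
not the labels tidb-operator actually sets:

```go
import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// tidbAntiAffinity forbids two TiDB pods of the same cluster from landing on
// the same node, spreading members across nodes as replicas grow.
func tidbAntiAffinity() *corev1.Affinity {
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
				TopologyKey: "kubernetes.io/hostname",
				LabelSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{
						"app.kubernetes.io/component": "tidb",
						"app.kubernetes.io/instance":  "demo",
					},
				},
			}},
		},
	}
}
```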

### Leverage load balancer health check mechanism

Load balancers often have health checks on their backends and will remove invalid
backends automatically. We can add all nodes into the load balancer. Drawbacks of
this solution are:

- may be too heavy for the LB if the Kubernetes cluster is large
- `NumberOfTiDBClusters` x `NumberOfNodes` ports should be health checked
- need to add every new node into the backend of LB
- hard to monitor LB (not all failed backends must be fixed)

Some of them can be alleviated by restricting TiDB members to a fixed set of
nodes (by using NodeSelector/NodeAffinity/Taints & Tolerations). This requires
the user to pre-select the nodes to run TiDB pods.

### Use local PV to bind TiDB member to a node

Local PVs are local resources of a node. If a TiDB member is using a local PV
on a node, this node is the only available node for it.

This solution is like using a fixed set of nodes to run TiDB pods and does not
require the user to pre-select nodes. But every pod is bound to one node, and its
new pod will be pending forever if the resources of the node are consumed by
other pods.

Another drawback is that it requires binding a dummy PV to each TiDB pod, which
is complex to manage and will confuse the user.

## Limitations

### No guarantee if there is another scheduler to schedule pods to TiDB nodes

This is because when the old pod of a TiDB member is terminated, the resources of
its node are released and other schedulers may schedule other pods onto this node.
When `tidb-scheduler` schedules the new pod of the TiDB member, its previous node
may no longer fit.

#### Workarounds

- Reserve nodes for TiDB members (e.g. by taints; see the sketch below)
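
A sketch of such a reservation, expressed with the `k8s.io/api/core/v1` types
(the taint key and value are arbitrary examples): nodes reserved for TiDB carry
the taint, and only TiDB pods declare the matching toleration, so other
schedulers cannot place unrelated pods there.

```go
import corev1 "k8s.io/api/core/v1"

// Taint applied to the nodes reserved for TiDB members, e.g. via
// `kubectl taint nodes <node> dedicated=tidb:NoSchedule`.
var tidbNodeTaint = corev1.Taint{
	Key:    "dedicated",
	Value:  "tidb",
	Effect: corev1.TaintEffectNoSchedule,
}

// Matching toleration added to the TiDB pod spec so that only TiDB pods can
// be scheduled onto the reserved nodes.
var tidbToleration = corev1.Toleration{
	Key:      "dedicated",
	Operator: corev1.TolerationOpEqual,
	Value:    "tidb",
	Effect:   corev1.TaintEffectNoSchedule,
}
```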

### Cannot schedule new pod of TiDB member back to its node if the node does not meet new requirements

If we upgrade TiDB pods to request more resources, it is possible that a member's
node may not have enough resources for the new pod.

The same applies if some other scheduling requirements are changed, e.g.

- NodeSelector
- Tolerations
