# TiDB Stable Scheduling

This document presents a design to schedule the new pod of a TiDB member to
its previous node in certain circumstances.

## Table of Contents

- [Glossary](#glossary)
- [Motivation](#motivation)
- [Goals](#goals)
* [Non-Goals](#non-goals)
- [Use Cases](#use-cases)
* [No need to update IP addresses of TiDB in load balancer outside of the Kubernetes cluster](#no-need-to-update-ip-addresses-of-tidb-in-load-balancer-outside-of-the-kubernetes-cluster)
- [Proposal](#proposal)
- [Implementation](#implementation)
- [Alternatives](#alternatives)
* [Deploy TiDB members on all nodes](#deploy-tidb-members-on-all-nodes)
* [Leverage load balancer health check mechanism](#leverage-load-balancer-health-check-mechanism)
* [Use local PV to bind TiDB member to a node](#use-local-pv-to-bind-tidb-member-to-a-node)
- [Limitations](#limitations)
* [No guarantee if there is another scheduler to schedule pods to TiDB nodes](#no-guarantee-if-there-is-another-scheduler-to-schedule-pods-to-tidb-nodes)
+ [Workarounds](#workarounds)
* [Cannot schedule new pod of TiDB member back to its node if the node does not meet new requirements](#cannot-schedule-new-pod-of-tidb-member-back-to-its-node-if-the-node-does-not-meet-new-requirements)

## Glossary

- Pod: A deployable object in Kubernetes.
- Node: A worker machine that runs pods in Kubernetes.
- TiDB cluster: A database cluster composed of the TiDB server, PD server and
  TiKV server. Each TiDB/PD/TiKV server is itself a cluster which consists of
  one or more members.
- TiDB server: One of the key components of a TiDB cluster. It is the access
  point of the TiDB cluster.
- TiDB member: A member of the TiDB server which holds a unique network identifier.
- TiDB pod: For each TiDB member, there is at most one running pod. It may
  crash or be replaced by a new pod.
- Load balancer: A component which proxies traffic for applications, e.g. LVS,
HAProxy, F5, etc.

## Motivation

There are use cases where users need to access the TiDB server from outside the
Kubernetes cluster. But in some environments (e.g. on bare-metal machines), we
may lack a load balancer solution or need to configure the IP addresses of TiDB
services in an existing load balancer.

In this scenario, we need to use a `NodePort` service. By default, NodePort
services are available cluster-wide on all nodes, but they cannot propagate the
client's IP address to the end pods and may cause a second hop.

To propagate the client's IP address and get better performance, we can
set `externalTrafficPolicy` of the service to `Local`. A side effect is that the
TiDB service will be accessible only on the nodes which have a running TiDB pod.
To avoid manual intervention to update IP addresses in the load balancer when
performing a rolling update, we prefer to schedule the new pod of a TiDB member
to its previous node.
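
For reference, a minimal sketch in Go of such a service, using the
`k8s.io/api/core/v1` types; the service name, namespace, selector labels and
function name are placeholders for illustration, not what tidb-operator
actually generates:

```go
import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// newTiDBNodePortService builds a NodePort service whose traffic policy is
// Local, so the client IP is preserved and no second hop happens. The
// trade-off is that the node port only answers on nodes hosting a TiDB pod.
func newTiDBNodePortService() *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "demo-tidb", Namespace: "tidb"},
		Spec: corev1.ServiceSpec{
			Type:                  corev1.ServiceTypeNodePort,
			ExternalTrafficPolicy: corev1.ServiceExternalTrafficPolicyTypeLocal,
			Selector:              map[string]string{"app.kubernetes.io/component": "tidb"},
			Ports: []corev1.ServicePort{{
				Name:       "mysql-client",
				Port:       4000, // TiDB's MySQL protocol port
				TargetPort: intstr.FromInt(4000),
			}},
		},
	}
}
```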

## Goals

- Able to schedule the new pod of a TiDB member to its previous node

### Non-Goals

- Stable scheduling for TiKV/PD pods
- Guarantee that the new pod of a TiDB member will be scheduled to its node if
  there is another scheduler in the cluster which may schedule pods to that node
- Guarantee that the new pod of a TiDB member will be scheduled to its node if
  some scheduling requirements have changed

## Use Cases

### No need to update IP addresses of TiDB in load balancer outside of the Kubernetes cluster

When a TiDB cluster is running in a dedicated Kubernetes cluster, or when nodes
are reserved for it, new pods of TiDB members will be scheduled to their
previous nodes after a rolling update of the TiDB cluster is done. The user
does not need to update IP addresses in the load balancer.

## Proposal

Currently, we have tidb-scheduler to schedule all pods of a TiDB cluster. We can
write a new predicate function for pods of the TiDB server. In this new predicate
function, we can choose the previous node of a TiDB member if it exists among the
candidate nodes.

Note that it is not possible for tidb-scheduler to schedule the new pod of a
TiDB member back to its node if the node does not meet the new scheduling
requirements (e.g. CPU/Memory, taints).

## Implementation

First, we track the assigned node of each TiDB member in the status of TiDBCluster.

```go
type TiDBMember struct {
	...
	// Node hosting pod of this TiDB member.
	NodeName string `json:"node,omitempty"`
}
```
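
One possible way to populate this field (a sketch only; the actual status-sync
code in tidb-operator may differ, and the function name is illustrative) is to
copy the pod's `Spec.NodeName` into the member status once the scheduler has
bound the pod:

```go
import corev1 "k8s.io/api/core/v1"

// recordTiDBMemberNode remembers the node a TiDB member's pod was bound to.
// Spec.NodeName is filled in by the scheduler when the pod is assigned, so an
// empty value simply means the pod has not been scheduled yet.
func recordTiDBMemberNode(pod *corev1.Pod, member *TiDBMember) {
	if pod.Spec.NodeName != "" {
		member.NodeName = pod.Spec.NodeName
	}
}
```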

In the new predicate `StableScheduling`, we filter out other nodes for a TiDB pod
if the previous node of this TiDB member exists among the candidate nodes.
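
A minimal sketch of this filtering logic follows; the function name and
signature are illustrative, not the exact predicate interface of tidb-scheduler:

```go
import corev1 "k8s.io/api/core/v1"

// stableSchedulingFilter keeps only the member's previous node when that node
// is still among the candidates returned by the default scheduler predicates.
// If the member has no recorded node, or the node no longer qualifies, the
// candidate list is returned unchanged and normal scheduling takes over.
func stableSchedulingFilter(previousNode string, candidates []corev1.Node) []corev1.Node {
	if previousNode == "" {
		return candidates
	}
	for _, node := range candidates {
		if node.Name == previousNode {
			return []corev1.Node{node}
		}
	}
	return candidates
}
```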

## Alternatives

### Deploy TiDB members on all nodes

If TiDB members are running on all nodes, clients can access TiDB server with
any node IP address.

We can achieve this by using a DaemonSet or a Deployment with pod anti-affinity.
But if the Kubernetes cluster is large, deploying TiDB members of each TiDB
cluster on every node is very inefficient and will consume too many resources.
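
For illustration, a pod anti-affinity term of the kind this alternative relies
on might look like the sketch below; the label keys and values are assumptions,
not the labels tidb-operator actually sets:

```go
import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// tidbAntiAffinity forbids two TiDB pods of the same cluster from landing on
// the same node, spreading members across nodes as replicas grow.
func tidbAntiAffinity() *corev1.Affinity {
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
				TopologyKey: "kubernetes.io/hostname",
				LabelSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{
						"app.kubernetes.io/component": "tidb",
						"app.kubernetes.io/instance":  "demo",
					},
				},
			}},
		},
	}
}
```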

### Leverage load balancer health check mechanism

Load balancers often have health checks on their backends and will remove invalid
backends automatically. We can add all nodes into the load balancer. Drawbacks of
this solution are:

- may be too heavy for the LB if the Kubernetes cluster is large
- `NumberOfTiDBClusters` x `NumberOfNodes` ports should be health checked
- need to add every new node into the backend of LB
- hard to monitor LB (not all failed backends must be fixed)

Some of them can be alleviated by restricting TiDB members to a fixed set of
nodes (by using NodeSelector/NodeAffinity/Taints & Tolerations). This requires
the user to pre-select the nodes to run TiDB pods.

### Use local PV to bind TiDB member to a node

Local PVs are local resources of a node. If a TiDB member is using a local PV
on a node, this node is the only available node for it.

This solution is like using a fixed set of nodes to run TiDB pods and does not
require the user to pre-select nodes. But every pod is bound to one node, and its
new pod will be pending forever if the resources of the node are consumed by
other pods.

Another drawback is that it requires binding a dummy PV to each TiDB pod, which
is complex to manage and will confuse the user.

## Limitations

### No guarantee if there is another scheduler to schedule pods to TiDB nodes

This is because when the old pod of a TiDB member is terminated, the resources of
its node are released and other schedulers may schedule other pods onto this node.
When `tidb-scheduler` schedules the new pod of the TiDB member, its previous node
may no longer fit.

#### Workarounds

- Reserve nodes for TiDB members (e.g. by taints; see the sketch below)
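
A sketch of such a reservation, expressed with the `k8s.io/api/core/v1` types
(the taint key and value are arbitrary examples): nodes reserved for TiDB carry
the taint, and only TiDB pods declare the matching toleration, so other
schedulers cannot place unrelated pods there.

```go
import corev1 "k8s.io/api/core/v1"

// Taint applied to the nodes reserved for TiDB members, e.g. via
// `kubectl taint nodes <node> dedicated=tidb:NoSchedule`.
var tidbNodeTaint = corev1.Taint{
	Key:    "dedicated",
	Value:  "tidb",
	Effect: corev1.TaintEffectNoSchedule,
}

// Matching toleration added to the TiDB pod spec so that only TiDB pods can
// be scheduled onto the reserved nodes.
var tidbToleration = corev1.Toleration{
	Key:      "dedicated",
	Operator: corev1.TolerationOpEqual,
	Value:    "tidb",
	Effect:   corev1.TaintEffectNoSchedule,
}
```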

### Cannot schedule new pod of TiDB member back to its node if the node does not meet new requirements

If we upgrade TiDB pods to request more resources, it is possible that a member's
node may not have enough resources for the new pod.

The same applies if some other scheduling requirements are changed, e.g.

- NodeSelector
- Tolerations
