Local SSD on GKE #577

Merged 4 commits on Jun 14, 2019
7 changes: 6 additions & 1 deletion deploy/gcp/README.md
@@ -8,7 +8,7 @@ First of all, make sure the following items are installed on your machine:

* [Google Cloud SDK](https://cloud.google.com/sdk/install)
* [terraform](https://www.terraform.io/downloads.html) >= 0.12
-* [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/#install-kubectl) >= 1.11
+* [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/#install-kubectl) >= 1.14
* [helm](https://github.com/helm/helm/blob/master/docs/install.md#installing-the-helm-client) >= 2.9.0 and < 3.0.0
* [jq](https://stedolan.github.io/jq/download/)
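The kubectl requirement moves from 1.11 to 1.14 because this PR switches the setup script to `kubectl apply -k`, and built-in kustomize support first shipped in kubectl 1.14. A quick client-side check (flags as they existed in that kubectl era):

```sh
# Confirm the client is new enough for `kubectl apply -k`.
kubectl version --client --short
```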

@@ -236,3 +236,8 @@ terraform destroy
If you do not need the data anymore, you have to delete the disks manually after running `terraform destroy`, either in the Google Cloud Console or with `gcloud`.
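A sketch of that cleanup with `gcloud`; the cluster-name filter, `DISK_NAME`, and `ZONE` are placeholders:

```sh
# Find disks left behind by the cluster, then delete the ones you no longer need.
gcloud compute disks list --filter="name~my-cluster"
gcloud compute disks delete DISK_NAME --zone=ZONE
```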

> *Note*: When `terraform destroy` is running, an error with the following message might occur: `Error reading Container Cluster "my-cluster": Cluster "my-cluster" has status "RECONCILING" with message ""`. This happens when GCP is upgrading the Kubernetes master nodes, which it does automatically from time to time. While the upgrade is in progress the cluster cannot be deleted; when it is done, run `terraform destroy` again.


## More information

Please view our [operation guide](../../docs/operation-guide.md).
10 changes: 5 additions & 5 deletions deploy/gcp/main.tf
@@ -120,8 +120,7 @@ resource "google_container_node_pool" "pd_pool" {

  node_config {
    machine_type    = var.pd_instance_type
-   image_type      = "UBUNTU"
-   local_ssd_count = 1
+   local_ssd_count = 0

    taint {
      effect = "NO_SCHEDULE"
@@ -150,6 +149,8 @@ resource "google_container_node_pool" "tikv_pool" {
  node_config {
    machine_type = var.tikv_instance_type
    image_type   = "UBUNTU"
+   // This value cannot be changed (instead a new node pool is needed)
+   // 1 SSD is 375 GiB
    local_ssd_count = 1

    taint {
@@ -316,9 +317,8 @@ resource "null_resource" "setup-env" {
kubectl create clusterrolebinding cluster-admin-binding --clusterrole cluster-admin --user $(gcloud config get-value account)
kubectl create serviceaccount --namespace kube-system tiller
kubectl apply -f manifests/crd.yaml
-kubectl apply -f manifests/startup-script.yaml
-kubectl apply -f manifests/local-volume-provisioner.yaml
-kubectl apply -f manifests/gke-storage.yml
+kubectl apply -k manifests/local-ssd
+kubectl apply -f manifests/gke/persistent-disk.yaml
kubectl apply -f manifests/tiller-rbac.yaml
helm init --service-account tiller --upgrade --wait
until helm ls; do
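The new `kubectl apply -k` line builds the kustomization added below before applying it, which is what drives the kubectl >= 1.14 requirement in the README. A way to preview the rendered objects (run from `deploy/gcp`, without touching the cluster):

```sh
# Build the kustomization locally; nothing is applied.
kubectl kustomize manifests/local-ssd
```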
1 change: 1 addition & 0 deletions deploy/gcp/manifests/gke
1 change: 0 additions & 1 deletion deploy/gcp/manifests/gke-storage.yml

This file was deleted.

8 changes: 8 additions & 0 deletions deploy/gcp/manifests/local-ssd/kustomization.yaml
@@ -0,0 +1,8 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

bases:
- ../../../../manifests/gke/local-ssd-provision

patches:
- overlays/terraform-local-ssd-provision.yaml
12 changes: 12 additions & 0 deletions deploy/gcp/manifests/local-ssd/overlays/terraform-local-ssd-provision.yaml
@@ -0,0 +1,12 @@
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: local-volume-provisioner
spec:
  template:
    spec:
      tolerations:
      - operator: Exists
        effect: "NoSchedule"
      - operator: Exists
        effect: "NoSchedule"
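Kustomize matches this strategic merge patch to the base DaemonSet by kind and `metadata.name`, then merges the fields given here into the base pod spec. A sanity check under the paths introduced in this PR (run from the repository root, kubectl >= 1.14):

```sh
# Compare the patched rendering against the base it builds on.
diff <(kubectl kustomize manifests/gke/local-ssd-provision) \
     <(kubectl kustomize deploy/gcp/manifests/local-ssd)
```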
135 changes: 0 additions & 135 deletions deploy/gcp/manifests/local-volume-provisioner.yaml

This file was deleted.

55 changes: 0 additions & 55 deletions deploy/gcp/manifests/startup-script.yaml

This file was deleted.

2 changes: 1 addition & 1 deletion deploy/gcp/templates/tidb-cluster-values.yaml.tpl
@@ -48,7 +48,7 @@ pd:
# different classes might map to quality-of-service levels, or to backup policies,
# or to arbitrary policies determined by the cluster administrators.
# refer to https://kubernetes.io/docs/concepts/storage/storage-classes
-  storageClassName: local-storage
+  storageClassName: pd-ssd

# Image pull policy.
imagePullPolicy: IfNotPresent
8 changes: 7 additions & 1 deletion docs/google-kubernetes-tutorial.md
@@ -89,7 +89,7 @@ When you see `Running`, it's time to hit <kbd>Ctrl</kbd>+<kbd>C</kbd> and proceed
The first TiDB component we are going to install is the TiDB Operator, using a Helm Chart. TiDB Operator is the management system that works with Kubernetes to bootstrap your TiDB cluster and keep it running. This step assumes you are in the `tidb-operator` working directory:

    kubectl apply -f ./manifests/crd.yaml &&
-   kubectl apply -f ./manifests/gke-storage.yml &&
+   kubectl apply -f ./manifests/gke/persistent-disk.yml &&
    helm install ./charts/tidb-operator -n tidb-admin --namespace=tidb-admin

We can watch the operator come up with:
@@ -177,3 +177,9 @@ The above commands only delete the running pods; the data is persistent. If you
Once you have finished experimenting, you can delete the Kubernetes cluster with:

    gcloud container clusters delete tidb


## More information

For production deployments, see our [operation guide](./operation-guide.md), in particular the GKE section.
We also have a simple [Terraform-based deployment](../deploy/gcp/README.md).
17 changes: 17 additions & 0 deletions docs/operation-guide.md
@@ -66,6 +66,23 @@ TiDB Operator uses `values.yaml` as the TiDB cluster configuration file. It provides

For other settings, the variables in `values.yaml` are self-explanatory with comments. You can modify them according to your needs before installing the charts.


## GKE

On GKE, local SSD volumes are limited to 375 GiB each and, by default, perform worse than persistent disks.

For proper performance, you must:

* install the Linux guest environment, which can only be done on the Ubuntu image, not the COS image
* make sure the SSD is mounted with the `nobarrier` option (see the sketch below)
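A minimal sketch of the remount, assuming the local SSD is mounted at `/mnt/disks/ssd0` (the DaemonSets below automate the same fix):

```sh
# Check which ssd mounts are missing nobarrier, then remount with it.
mount | grep ssd
sudo mount -o remount,nobarrier /mnt/disks/ssd0
```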

We have a [DaemonSet that applies the performance fixes above](../manifests/gke/local-ssd-optimize.yaml).
We also have a [DaemonSet that applies the fixes and additionally combines all local SSD disks into a single volume with LVM](../manifests/gke/local-ssd-provision.yaml).
The Terraform deployment installs the latter automatically.
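To spot-check the result on a node, something along these lines should work; the `/mnt/disks` mount point is an assumption based on the local-volume-provisioner convention:

```sh
# Inspect the LVM volumes built from the local SSDs and the combined filesystem.
sudo pvs && sudo lvs
df -h /mnt/disks
```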

> **Note**: The setup that combines local SSDs assumes you are running only one process that needs local SSD per VM.


## Deploy TiDB cluster

After TiDB Operator and Helm are deployed correctly and the configuration is complete, the TiDB cluster can be deployed using the following command:
57 changes: 57 additions & 0 deletions manifests/gke/local-ssd-optimize.yaml
@@ -0,0 +1,57 @@
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: local-ssd-startup
  namespace: kube-system
  labels:
    app: local-ssd-startup
spec:
  template:
    metadata:
      labels:
        app: local-ssd-startup
    spec:
      hostPID: true
      nodeSelector:
        cloud.google.com/gke-os-distribution: ubuntu
        cloud.google.com/gke-local-ssd: "true"
      containers:
      - name: local-ssd-startup
        image: gcr.io/google-containers/startup-script:v1
        securityContext:
          privileged: true
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
          limits:
            cpu: 100m
            memory: 100Mi
        env:
        - name: STARTUP_SCRIPT
          value: |
            #!/usr/bin/env bash
            set -euo pipefail
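            # Install the Linux guest environment packages (available on Ubuntu, not COS).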
            apt-get update
            apt-get install -y software-properties-common
            apt-add-repository universe
            apt-get update
            declare -a PKG_LIST=(python-google-compute-engine \
              python3-google-compute-engine \
              google-compute-engine-oslogin \
              gce-compute-image-packages)
            for pkg in ${PKG_LIST[@]}; do
              apt-get install -y $pkg || echo "Not available: $pkg"
            done
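            # Remount any mounted ssd filesystem that is missing the nobarrier option.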
            mount | grep -v nobarrier | awk '/ssd/{print $1}' | xargs -i mount {} -o remount,nobarrier
        volumeMounts:
        - mountPath: /mnt/disks
          name: local-ssd
          mountPropagation: Bidirectional
      tolerations:
      - effect: NoSchedule
        operator: Exists
      volumes:
      - name: local-ssd
        hostPath:
          path: /mnt/disks
5 changes: 5 additions & 0 deletions manifests/gke/local-ssd-provision/kustomization.yaml
@@ -0,0 +1,5 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
- local-ssd-provision.yaml