add terraform script to auto deploy TiDB cluster on AWS #401

Merged
merged 7 commits into from
May 2, 2019
Changes from 6 commits
5 changes: 5 additions & 0 deletions deploy/aws/.gitignore
@@ -0,0 +1,5 @@
.terraform/
credentials/
terraform.tfstate
terraform.tfstate.backup
.terraform.tfstate.lock.info
84 changes: 84 additions & 0 deletions deploy/aws/README.md
@@ -0,0 +1,84 @@
# Deploy TiDB Operator and TiDB cluster on AWS EKS

## Requirements
* [terraform](https://www.terraform.io/downloads.html)
* [awscli](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html) >= 1.16.73
* [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/#install-kubectl) >= 1.11
* [helm](https://github.com/helm/helm/blob/master/docs/install.md#installing-the-helm-client) >= 2.9.0
* [jq](https://stedolan.github.io/jq/download/)
* [aws-iam-authenticator](https://github.com/kubernetes-sigs/aws-iam-authenticator#4-set-up-kubectl-to-use-authentication-tokens-provided-by-aws-iam-authenticator-for-kubernetes)

## Configure awscli

https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html

## Setup

The default setup creates a new VPC, a t2.micro instance as a bastion machine, and an EKS cluster with the following EC2 instances as worker nodes:

* 3 m5d.xlarge instances for PD
* 3 i3.2xlarge instances for TiKV
* 2 c4.4xlarge instances for TiDB
* 1 c5.xlarge instance for monitor


``` shell
$ git clone https://github.com/pingcap/tidb-operator
$ cd tidb-operator/deploy/aws
$ terraform init
$ terraform apply
```

After `terraform apply` completes successfully, you can access the `monitor_endpoint` using your web browser.

To access the TiDB cluster, first SSH into the bastion machine, and then connect to TiDB with the MySQL client:

``` shell
ssh -i credentials/k8s-prod-my-cluster.pem ec2-user@<bastion_ip>
mysql -h <tidb_dns> -P <tidb_port> -u root
```

If the DNS name is not resolvable yet, wait a few minutes and try again.

You can interact with the EKS cluster using `kubectl` and `helm` with the kubeconfig file `credentials/kubeconfig_<cluster_name>`. The default `cluster_name` is `my-cluster`; you can change it in variables.tf.

``` shell
# By specifying --kubeconfig argument
kubectl --kubeconfig credentials/kubeconfig_<cluster_name> get po -n tidb
helm --kubeconfig credentials/kubeconfig_<cluster_name> ls

# Or setting KUBECONFIG environment variable
export KUBECONFIG=$PWD/credentials/kubeconfig_<cluster_name>
kubectl get po -n tidb
helm ls
```
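
As an alternative to editing variables.tf, the cluster name could be overridden in a `terraform.tfvars` file, which Terraform loads automatically from the working directory. The fragment below is a sketch of that approach, not part of this PR; `tidb-test` is a hypothetical name. Because main.tf derives both the kubeconfig path and the SSH key name from `cluster_name`, the file names under `credentials/` would be expected to change with it.

```hcl
# terraform.tfvars (hypothetical override file, not part of this PR)
# "tidb-test" is an illustrative name; pick your own.
cluster_name = "tidb-test"

# With this value, the generated files would be expected to be:
#   credentials/kubeconfig_tidb-test
#   credentials/k8s-prod-tidb-test.pem
```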

> **NOTE:** You have to manually delete the EBS volumes after running `terraform destroy` if you don't need the data on the volumes any more.

## Upgrade TiDB cluster

To upgrade the TiDB cluster, set the `tidb_version` variable in variables.tf to a higher version and run `terraform apply`.

> **NOTE:** The upgrade does not finish immediately. You can watch its progress with `watch kubectl --kubeconfig credentials/kubeconfig_<cluster_name> get po -n tidb`
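
A minimal sketch of the same upgrade expressed as a `terraform.tfvars` override rather than an edit to variables.tf. The file is an assumption about usage (Terraform loads it automatically), and the version string is only illustrative; substitute the release you actually want.

```hcl
# terraform.tfvars (hypothetical override file)
# Illustrative value only: use a TiDB release newer than the one currently deployed.
tidb_version = "v2.1.8"
```

Running `terraform apply` afterwards picks up the new value in the same way as a change to the default in variables.tf.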

## Scale TiDB cluster

To scale the TiDB cluster, set `tikv_count` or `tidb_count` to the desired count, and then run `terraform apply`.

> **NOTE:** Currently, scaling in is not supported because we cannot determine which node to remove. Scaling out takes a few minutes to complete; you can watch its progress with `watch kubectl --kubeconfig credentials/kubeconfig_<cluster_name> get po -n tidb`
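
For illustration, a scale-out from the default 3 TiKV and 2 TiDB nodes could be written as the following `terraform.tfvars` fragment. The file itself is an assumption (the README only mentions editing variables.tf), and the counts are examples. In main.tf these variables feed both the worker group `asg_desired_capacity` and the replica counts rendered into the chart values.

```hcl
# terraform.tfvars (hypothetical override file)
# Illustrative counts: keep each value at or above the current one,
# because scaling in is not supported.
tikv_count = 5
tidb_count = 3
```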

## Customize

By default, the Terraform script creates a new VPC. You can use an existing VPC by setting `create_vpc` to `false` and specifying your existing VPC ID and subnet IDs in the `vpc_id` and `subnets` variables.
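
A sketch of what reusing an existing VPC might look like in a `terraform.tfvars` file. All IDs are placeholders, and the override file is an assumption rather than something this PR ships. Note that main.tf tags the subnets it creates with `kubernetes.io/cluster/<cluster_name> = shared` and comments that these tags are required for ELB, so existing subnets presumably need the same tag.

```hcl
# terraform.tfvars (hypothetical override file)
create_vpc = false

# Placeholder IDs: replace with your own VPC and subnet IDs.
vpc_id = "vpc-0123456789abcdef0"

subnets = [
  "subnet-0123456789abcdef0",
  "subnet-0123456789abcdef1",
  "subnet-0123456789abcdef2",
]
```

When `create_vpc` is `false`, main.tf passes the same `subnets` list to both the bastion instance and the EKS worker groups.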

An EC2 instance is also created by default as a bastion machine for connecting to the TiDB cluster, because the TiDB service is exposed through an [Internal Elastic Load Balancer](https://aws.amazon.com/blogs/aws/internal-elastic-load-balancers/). The instance has MySQL and Sysbench pre-installed, so you can SSH into it and connect to TiDB using the ELB endpoint. If you already have an EC2 instance in the VPC, you can disable the bastion instance creation by setting `create_bastion` to `false`.

The TiDB version and component counts are also configurable in variables.tf; customize these variables to suit your needs.

Currently, the instance types of the TiDB cluster components are not configurable, because PD and TiKV rely on [NVMe SSD instance store](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-instance-store.html) and different instance types have different disks.
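
The bastion-related variables could be set as in the sketch below. `ingress_cidr` is the variable main.tf uses for the SSH security group's `cidr_blocks`; the `terraform.tfvars` file and the CIDR value are assumptions for illustration only.

```hcl
# terraform.tfvars (hypothetical override file)

# Option 1: keep the bastion but restrict SSH to a known range.
# 203.0.113.0/24 is a documentation-only range; replace it with your own.
ingress_cidr = ["203.0.113.0/24"]

# Option 2: skip the bastion if another instance in the VPC can reach TiDB.
# create_bastion = false
```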

## TODO

- [ ] Use [cluster autoscaler](https://github.com/kubernetes/autoscaler)
- [ ] Allow creating a minimal TiDB cluster for testing
- [ ] Make resource creation synchronous to follow Terraform convention
- [ ] Make more parameters customizable
6 changes: 6 additions & 0 deletions deploy/aws/bastion-userdata
@@ -0,0 +1,6 @@
#cloud-config
packages:
- mysql
runcmd:
- curl -s https://packagecloud.io/install/repositories/akopytov/sysbench/script.rpm.sh | bash
- yum -y install sysbench
1 change: 1 addition & 0 deletions deploy/aws/charts/tidb-cluster
1 change: 1 addition & 0 deletions deploy/aws/charts/tidb-operator
52 changes: 52 additions & 0 deletions deploy/aws/data.tf
@@ -0,0 +1,52 @@
data "aws_availability_zones" "available" {}

data "aws_ami" "amazon-linux-2" {
  most_recent = true

  owners = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}

data "template_file" "tidb_cluster_values" {
  template = "${file("${path.module}/templates/tidb-cluster-values.yaml.tpl")}"
  vars {
    cluster_version = "${var.tidb_version}"
    pd_replicas     = "${var.pd_count}"
    tikv_replicas   = "${var.tikv_count}"
    tidb_replicas   = "${var.tidb_count}"
  }
}

# kubernetes provider can't use computed config_path right now, see issue:
# https://github.com/terraform-providers/terraform-provider-kubernetes/issues/142
# so we don't use kubernetes provider to retrieve tidb and monitor connection info,
# instead we use external data source.
# data "kubernetes_service" "tidb" {
#   depends_on = ["helm_release.tidb-cluster"]
#   metadata {
#     name = "tidb-cluster-tidb"
#     namespace = "tidb"
#   }
# }

# data "kubernetes_service" "monitor" {
#   depends_on = ["helm_release.tidb-cluster"]
#   metadata {
#     name = "tidb-cluster-grafana"
#     namespace = "tidb"
#   }
# }

data "external" "tidb_service" {
  depends_on = ["null_resource.wait-tidb-ready"]
  program    = ["bash", "-c", "kubectl --kubeconfig credentials/kubeconfig_${var.cluster_name} get svc -n tidb tidb-cluster-tidb -ojson | jq '.status.loadBalancer.ingress[0]'"]
}

data "external" "monitor_service" {
  depends_on = ["null_resource.wait-tidb-ready"]
  program    = ["bash", "-c", "kubectl --kubeconfig credentials/kubeconfig_${var.cluster_name} get svc -n tidb tidb-cluster-grafana -ojson | jq '.status.loadBalancer.ingress[0]'"]
}
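
The two `external` data sources above return the first load balancer ingress entry as a flat JSON object, so its fields can be read from `result`. The outputs file is not part of this excerpt, so the following is only a sketch of how such values might be exposed. The output names are hypothetical, and the `hostname` key assumes the ingress entry carries a hostname, as AWS load balancers normally do.

```hcl
# Hypothetical outputs; the names are illustrative and not taken from this PR.
output "tidb_dns" {
  value = "${lookup(data.external.tidb_service.result, "hostname")}"
}

output "monitor_hostname" {
  value = "${lookup(data.external.monitor_service.result, "hostname")}"
}
```

The `external` data source expects the program to print a JSON object whose values are all strings. If the ingress entry is not populated yet, `jq` prints `null` instead, which is one reason both data sources depend on `null_resource.wait-tidb-ready`.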
242 changes: 242 additions & 0 deletions deploy/aws/main.tf
@@ -0,0 +1,242 @@
provider "aws" {
  region = "${var.region}"
}

module "key-pair" {
  source  = "cloudposse/key-pair/aws"
  version = "0.3.2"

  name                  = "${var.cluster_name}"
  namespace             = "k8s"
  stage                 = "prod"
  ssh_public_key_path   = "${path.module}/credentials/"
  generate_ssh_key      = "true"
  private_key_extension = ".pem"
  chmod_command         = "chmod 600 %v"
}

resource "aws_security_group" "ssh" {
  name        = "${var.cluster_name}"
  description = "Allow SSH access for bastion instance"
  vpc_id      = "${var.create_vpc ? module.vpc.vpc_id : var.vpc_id}"
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = "${var.ingress_cidr}"
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

module "vpc" {
  source             = "terraform-aws-modules/vpc/aws"
  version            = "1.60.0"
  name               = "${var.cluster_name}"
  cidr               = "${var.vpc_cidr}"
  create_vpc         = "${var.create_vpc}"
  azs                = ["${data.aws_availability_zones.available.names[0]}", "${data.aws_availability_zones.available.names[1]}", "${data.aws_availability_zones.available.names[2]}"]
  private_subnets    = "${var.private_subnets}"
  public_subnets     = "${var.public_subnets}"
  enable_nat_gateway = true
  single_nat_gateway = true

  # The following tags are required for ELB
  private_subnet_tags = {
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
  }
  public_subnet_tags = {
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
  }
  vpc_tags = {
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
  }
}

module "ec2" {
  source                      = "terraform-aws-modules/ec2-instance/aws"
  version                     = "1.21.0"
  name                        = "${var.cluster_name}-bastion"
  instance_count              = "${var.create_bastion ? 1 : 0}"
  ami                         = "${data.aws_ami.amazon-linux-2.id}"
  instance_type               = "${var.bastion_instance_type}"
  key_name                    = "${module.key-pair.key_name}"
  associate_public_ip_address = true
  monitoring                  = false
  # Resolve the file relative to this module rather than the current working directory
  user_data                   = "${file("${path.module}/bastion-userdata")}"
  vpc_security_group_ids      = ["${aws_security_group.ssh.id}"]
  subnet_ids                  = "${split(",", var.create_vpc ? join(",", module.vpc.public_subnets) : join(",", var.subnets))}"

  tags = {
    app = "tidb"
  }
}

module "eks" {
  # source = "terraform-aws-modules/eks/aws"
  # version = "2.3.1"
  # We cannot use the cluster autoscaler for pods with local PVs due to the limitations listed here:
  # https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#i-have-a-couple-of-pending-pods-but-there-was-no-scale-up
  # so we scale out by updating the auto-scaling-group desired_capacity directly via a patched version of the AWS EKS module
  source             = "github.com/tennix/terraform-aws-eks?ref=v2.3.1-patch"
  cluster_name       = "${var.cluster_name}"
  cluster_version    = "${var.k8s_version}"
  config_output_path = "credentials/"
  subnets            = "${split(",", var.create_vpc ? join(",", module.vpc.private_subnets) : join(",", var.subnets))}"
  vpc_id             = "${var.create_vpc ? module.vpc.vpc_id : var.vpc_id}"

  # instance types: https://aws.amazon.com/ec2/instance-types/
  # instance prices: https://aws.amazon.com/ec2/pricing/on-demand/

  worker_groups = [
    {
      # pd
      name                 = "pd_worker_group"
      key_name             = "${module.key-pair.key_name}"
      # WARNING: if you change instance type, you must also modify the corresponding disk mounting in pd-userdata.sh script
      # instance_type = "c5d.xlarge" # 4c, 8G, 100G NVMe SSD
      instance_type        = "m5d.xlarge" # 4c, 16G, 150G NVMe SSD
      root_volume_size     = "50" # the NVMe disk is reserved for PD data
      public_ip            = false
      kubelet_extra_args   = "--register-with-taints=dedicated=pd:NoSchedule --node-labels=dedicated=pd"
      asg_desired_capacity = "${var.pd_count}"
      asg_max_size         = "${var.pd_count + 2}"
      additional_userdata  = "${file("${path.module}/pd-userdata.sh")}"
    },
    {
      # tikv
      name                 = "tikv_worker_group"
      key_name             = "${module.key-pair.key_name}"
      # WARNING: if you change instance type, you must also modify the corresponding disk mounting in tikv-userdata.sh script
      instance_type        = "i3.2xlarge" # 8c, 61G, 1.9T NVMe SSD
      root_volume_type     = "gp2"
      root_volume_size     = "100"
      public_ip            = false
      kubelet_extra_args   = "--register-with-taints=dedicated=tikv:NoSchedule --node-labels=dedicated=tikv"
      asg_desired_capacity = "${var.tikv_count}"
      asg_max_size         = "${var.tikv_count + 2}"
      additional_userdata  = "${file("${path.module}/tikv-userdata.sh")}"
    },
    {
      # tidb
      name                 = "tidb_worker_group"
      key_name             = "${module.key-pair.key_name}"
      instance_type        = "c4.4xlarge" # 16c, 30G
      root_volume_type     = "gp2"
      root_volume_size     = "100"
      public_ip            = false
      kubelet_extra_args   = "--register-with-taints=dedicated=tidb:NoSchedule --node-labels=dedicated=tidb"
      asg_desired_capacity = "${var.tidb_count}"
      asg_max_size         = "${var.tidb_count + 2}"
    },
    {
      # monitor
      name                 = "monitor_worker_group"
      key_name             = "${module.key-pair.key_name}"
      instance_type        = "c5.xlarge" # 4c, 8G
      root_volume_type     = "gp2"
      root_volume_size     = "100"
      public_ip            = false
      asg_desired_capacity = 1
      asg_max_size         = 3
    }
  ]

  worker_group_count = "4"

  tags = {
    app = "tidb"
  }
}

# The kubernetes and helm providers rely on EKS, but Terraform providers do not support depends_on.
# Following https://github.com/hashicorp/terraform/issues/2430#issuecomment-370685911,
# we use the hack below.
resource "local_file" "kubeconfig" {
  # HACK: depends_on for the helm and kubernetes provider
  # Passing provider configuration value via a local_file
  depends_on        = ["module.eks"]
  sensitive_content = "${module.eks.kubeconfig}"
  filename          = "${path.module}/credentials/kubeconfig_${var.cluster_name}"
}

# kubernetes provider can't use computed config_path right now, see issue:
# https://github.com/terraform-providers/terraform-provider-kubernetes/issues/142
# so we don't use kubernetes provider to retrieve tidb and monitor connection info,
# instead we use external data source.
# provider "kubernetes" {
#   config_path = "${local_file.kubeconfig.filename}"
# }

provider "helm" {
  insecure = true
  # service_account = "tiller"
  # install_tiller = true # currently this doesn't work, so we install tiller in the local-exec provisioner. See https://github.com/terraform-providers/terraform-provider-helm/issues/148
  kubernetes {
    config_path = "${local_file.kubeconfig.filename}"
  }
}

resource "null_resource" "setup-env" {
  depends_on = ["module.eks"]

  provisioner "local-exec" {
    working_dir = "${path.module}"
    command = <<EOS
kubectl apply -f manifests/crd.yaml
kubectl apply -f manifests/local-volume-provisioner.yaml
kubectl apply -f manifests/gp2-storageclass.yaml
kubectl apply -f manifests/tiller-rbac.yaml
helm init --service-account tiller --upgrade --wait
# Poll until tiller responds; sleep between attempts to avoid a busy loop
until helm ls; do
  echo "Wait tiller ready"
  sleep 5
done
helm version
EOS
    environment = {
      KUBECONFIG = "${local_file.kubeconfig.filename}"
    }
  }
}

resource "helm_release" "tidb-operator" {
  depends_on = ["null_resource.setup-env"]
  name       = "tidb-operator"
  namespace  = "tidb-admin"
  chart      = "${path.module}/charts/tidb-operator"
}

resource "helm_release" "tidb-cluster" {
  depends_on = ["helm_release.tidb-operator"]
  name       = "tidb-cluster"
  namespace  = "tidb"
  chart      = "${path.module}/charts/tidb-cluster"
  values = [
    "${data.template_file.tidb_cluster_values.rendered}"
  ]
}

resource "null_resource" "wait-tidb-ready" {
  depends_on = ["helm_release.tidb-cluster"]

  provisioner "local-exec" {
    command = <<EOS
until kubectl get po -n tidb -lapp.kubernetes.io/component=tidb | grep Running; do
  echo "Wait TiDB pod running"
  sleep 5
done
until kubectl get svc -n tidb tidb-cluster-tidb | grep elb; do
  echo "Wait TiDB service ready"
  sleep 5
done
until kubectl get svc -n tidb tidb-cluster-grafana | grep elb; do
  echo "Wait monitor service ready"
  sleep 5
done
EOS
    environment = {
      KUBECONFIG = "${local_file.kubeconfig.filename}"
    }
  }
}
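
The wait loops above retry indefinitely, so a cluster that never becomes ready would block `terraform apply` until it is interrupted. One possible variant, sketched here as an assumption rather than as part of this PR, bounds each wait and fails the provisioner on timeout. The resource name, the helper functions, and the limit of 120 attempts are all hypothetical.

```hcl
# Hypothetical variant of wait-tidb-ready with an upper bound on waiting.
resource "null_resource" "wait-tidb-ready-bounded" {
  depends_on = ["helm_release.tidb-cluster"]

  provisioner "local-exec" {
    command = <<EOS
# wait_for DESCRIPTION COMMAND...: retry COMMAND every 5 seconds,
# giving up after 120 attempts (roughly 10 minutes).
wait_for() {
  desc="$1"; shift
  i=0
  until "$@"; do
    i=$((i + 1))
    if [ "$i" -ge 120 ]; then
      echo "Timed out: $desc" >&2
      return 1
    fi
    echo "Wait $desc ($i/120)"
    sleep 5
  done
}
tidb_pod_running() { kubectl get po -n tidb -lapp.kubernetes.io/component=tidb | grep -q Running; }
svc_has_elb() { kubectl get svc -n tidb "$1" | grep -q elb; }

wait_for "TiDB pod running" tidb_pod_running || exit 1
wait_for "TiDB service ready" svc_has_elb tidb-cluster-tidb || exit 1
wait_for "monitor service ready" svc_has_elb tidb-cluster-grafana || exit 1
EOS
    environment = {
      KUBECONFIG = "${local_file.kubeconfig.filename}"
    }
  }
}
```

The shell inside the heredoc avoids the `${...}` form so that Terraform does not try to interpolate it. A non-zero exit makes the `local-exec` provisioner fail, which surfaces the timeout as an apply error instead of a hang.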