diff --git a/docs/how-to-guides/upgrade-etcd.md b/docs/how-to-guides/upgrade-etcd.md
new file mode 100644
index 000000000..68b04abc4
--- /dev/null
+++ b/docs/how-to-guides/upgrade-etcd.md
@@ -0,0 +1,110 @@
+# Upgrading etcd
+
+## Contents
+
+- [Introduction](#introduction)
+- [Steps](#steps)
+  - [Step 1: Find out the IP and SSH](#step-1-find-out-the-ip-and-ssh)
+  - [Step 2: Create necessary directories with correct permissions](#step-2-create-necessary-directories-with-correct-permissions)
+  - [Step 3: Upgrade etcd](#step-3-upgrade-etcd)
+  - [Step 4: Verify upgrade](#step-4-verify-upgrade)
+  - [Step 5: Verify using `etcdctl`](#step-5-verify-using-etcdctl)
+
+## Introduction
+
+[Etcd](https://etcd.io/) is the most crucial component of a Kubernetes cluster because it stores the cluster state.
+
+This document provides a step-by-step guide to upgrading etcd in Lokomotive.
+
+## Steps
+
+Repeat the following steps on every controller node, one node at a time.
+
+### Step 1: Find out the IP and SSH
+
+Find the IP address of the controller node in the cloud provider dashboard and SSH into it.
+
+```bash
+ssh core@
+```
+
+### Step 2: Create necessary directories with correct permissions
+
+The latest etcd (`v3.4.10`) requires the data directory permissions to be `0700`. Change the permissions accordingly, then verify that they are now `rwx------`.
+
+```bash
+sudo chmod 0700 /var/lib/etcd/
+sudo ls -ld /var/lib/etcd/
+```
+
+If the node reboots, the right settings must be in place so that the `systemd-tmpfiles` service does not alter the permissions of the data directory. To make the change above persistent, run the following command:
+
+```bash
+echo "d /var/lib/etcd 0700 etcd etcd - -" | sudo tee /etc/tmpfiles.d/etcd-wrapper.conf
+```
+
+### Step 3: Upgrade etcd
+
+Run the following commands:
+
+> **NOTE**: Before running the other commands, set the `etcd_version` variable to the latest etcd version.
+
+```bash
+export etcd_version=
+
+sudo sed -i "s,ETCD_IMAGE_TAG=.*,ETCD_IMAGE_TAG=${etcd_version}," \
+  /etc/systemd/system/etcd-member.service.d/40-etcd-cluster.conf
+sudo systemctl daemon-reload
+sudo systemctl restart etcd-member
+```
+
+### Step 4: Verify upgrade
+
+Verify that the etcd service is in the `active (running)` state:
+
+```bash
+sudo systemctl status --no-pager etcd-member
+```
+
+Run the following command to see the logs of the process since the last restart:
+
+```bash
+sudo journalctl _SYSTEMD_INVOCATION_ID=$(sudo systemctl \
+  show -p InvocationID --value etcd-member.service)
+```
+
+The following log line indicates that the etcd daemon has come up without errors:
+
+```log
+etcdserver: starting server... [version: 3.4.10, cluster version: to_be_decided]
+```
+
+The following log line indicates that etcd has rejoined the cluster without issues:
+
+```log
+embed: serving client requests on 10.88.81.1:2379
+```
+
+### Step 5: Verify using `etcdctl`
+
+We can use the `etcdctl` client to verify the state of the etcd cluster.
+
+> **NOTE**: Before running the other commands, set the `no_of_controller_nodes` variable to the number of controller nodes in the cluster.
+
+```bash
+export no_of_controller_nodes=
+
+# Find the endpoint of this node's etcd:
+export endpoint=$(grep ETCD_ADVERTISE_CLIENT_URLS /etc/systemd/system/etcd-member.service.d/40-etcd-cluster.conf | cut -d"=" -f3 | tr -d '"')
+
+export flags="--cacert=/etc/ssl/etcd/etcd-client-ca.crt \
+  --cert=/etc/ssl/etcd/etcd-client.crt \
+  --key=/etc/ssl/etcd/etcd-client.key \
+  --endpoints=${endpoint}"
+
+# Verify:
+sudo ETCDCTL_API=3 etcdctl member list $flags
+sudo ETCDCTL_API=3 etcdctl endpoint health $flags
+```
+
+The last command should report each node as healthy.