FPO-143: Adds playbook for restoring RDS from a snapshot #5

Merged 4 commits on Apr 10, 2024
Changes from 2 commits
165 changes: 51 additions & 114 deletions source/manual/how-to-backup-and-restore-in-aws-rds.html.md
@@ -1,5 +1,5 @@
---
owner_slack: "#govuk-2ndline-tech"
owner_slack: "#trade-tariff-infrastructure"
title: Backup and restore databases in AWS RDS
section: Backups
layout: manual_layout
@@ -8,15 +8,10 @@ parent: "/manual.html"

This playbook describes how to restore a database instance using Amazon's [RDS Backups](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_WorkingWithAutomatedBackups.html) feature.

We use RDS Backups to give us fully nightly backups and point-in-time recovery (PITR) (also known as continuous data protection or CDP).

> This playbook does not cover restoring from [govuk_env_sync backups](govuk-env-sync.html).
We use RDS Backups to give us full nightly backups and point-in-time recovery (PITR) (also known as continuous data protection or CDP).

<!-- Force markdown to separate these quotes -->

> We only run RDS Backup in the production environment. To run a test restore in staging or integration, you must first take a manual snapshot from the AWS console.
>
> Make sure the snapshot's name contains the name of the app (e.g. `local-links-manager`), and remember to delete it afterwards.
> We also generate automated backups using a [backups lambda][lambda-backups]. These are stored in one S3 bucket per environment but do not include a snapshot history.

## Restore an RDS instance via the AWS CLI

@@ -29,159 +24,101 @@ Before you get started you need to know:

For more information, read the [AWS documentation on Restoring from a DB Snapshot](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_RestoreFromSnapshot.html).

### 1. Retrieve a list of all snapshot ARNs for your application
### 1. Retrieve the relevant database information

In this example we are using `local-links-manager`:
In this example, we're using `describe-db-instances` to identify the instances we want a snapshot for.

```sh
environment=production
gds-cli aws govuk-${environment?}-admin \
aws rds describe-db-snapshots \
--query 'DBSnapshots[].DBSnapshotArn' \
| grep local-links-manager
# Find the database you want to find snapshots for (easily identified by its name)
aws rds describe-db-instances | jq '.DBInstances | .[] | {DBInstanceIdentifier, DBName}'
DATABASE_ID="<replace_with_previous_output>"
```

Choose the ARN of the most recent backup, and store it in an environment variable:
Find and export the relevant VPC and security group configuration for your RDS restore:

```sh
snapshot_arn="<e.g. arn:aws:rds:eu-west-1:210287912431:snapshot:rds:local-links-manager-postgres-2022-07-05-01-09>"
aws rds describe-db-instances \
--db-instance-identifier "$DATABASE_ID" \
--query 'DBInstances[].[VpcSecurityGroups[].VpcSecurityGroupId,DBParameterGroups[].DBParameterGroupName,DBSubnetGroup.DBSubnetGroupName]'
```

### 2. Find which database the snapshot was generated by

For example:
Example of the output:

```sh
gds-cli aws govuk-${environment?}-admin \
aws rds describe-db-snapshots \
--db-snapshot-identifier ${snapshot_arn} \
--query 'DBSnapshots[].DBInstanceIdentifier'
```
* vpc-security-group-id = sg-XXXXXXXX
* db-parameter-group-name = local-links-manager-postgres-XXXXXXXXXX
* db-subnet-group-name = blue-govuk-rds-subnet

Store the `DBInstanceIdentifier` as a variable:
Now export the result:

```sh
db_instance_identifier="<e.g. local-links-manager-postgres>"
DB_SUBNET_GROUP_NAME="<replace_with_previous_output>"
VPC_SECURITY_GROUP_ID="<replace_with_previous_output>" # A comma-separated list of sg ids
DB_PARAMETER_GROUP_NAME="<replace_with_previous_output>"
```
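Before moving on, it can help to confirm that every variable the restore relies on is actually set, since an empty value would otherwise surface only as a confusing AWS CLI error. The following is a minimal sketch in POSIX shell; `require_vars` is a hypothetical helper name, not part of the playbook or the AWS CLI.

```sh
# Hypothetical helper: return non-zero if any named variable is unset or empty.
require_vars() {
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing required variable: $name" >&2
      return 1
    fi
  done
}
```

For example, `require_vars DATABASE_ID DB_SUBNET_GROUP_NAME VPC_SECURITY_GROUP_ID DB_PARAMETER_GROUP_NAME` prints the first missing name and returns a non-zero status.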

### 3. Ensure the restored database has the same security groups

The restored database must have the same security groups and be in the same VPC (that's the "subnet group name" parameter) as the original one, otherwise, apps won't be able to connect to it. Therefore the database needs to be restored in the same VPC and with the same security groups as the original instance the snapshot came from.

After running the command below, you now have all the parameters you need (snapshot-arn, db-instance-identifier, security-group-id, db-parameter-group-name, and db-subnet-group-name) to restore the database and change the restored database's security groups to match the original's.
### 2. Retrieve a list of all snapshot ARNs for your database name

```sh
gds-cli aws govuk-${environment?}-admin \
aws rds describe-db-instances \
--db-instance-identifier ${db_instance_identifier?} \
--query 'DBInstances[].[VpcSecurityGroups[].VpcSecurityGroupId,DBParameterGroups[].DBParameterGroupName,DBSubnetGroup.DBSubnetGroupName]'
```

Example of the output:

* vpc-security-group-id = sg-XXXXXXXX
* db-parameter-group-name = local-links-manager-postgres-XXXXXXXXXX
* db-subnet-group-name = blue-govuk-rds-subnet

Store the output as environment variables:
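# Pass the shell variable into jq with --arg, and compare with == (a single = is assignment in jq)
aws rds describe-db-snapshots | jq --arg id "$DATABASE_ID" '.DBSnapshots | .[] | select(.DBInstanceIdentifier == $id) | {DBInstanceIdentifier, DBSnapshotIdentifier}'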

```sh
vpc_security_group_id="<replace_with_previous_output>"
db_parameter_group_name="<replace_with_previous_output>"
db_subnet_group_name="<replace_with_previous_output>"
# Decide which snapshot you want and set its identifier below
SNAPSHOT_IDENTIFIER="<replace_with_previous_output>"
```
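If you simply want the most recent snapshot for the instance, the selection can be scripted. This is a sketch that assumes `jq` is installed and that the input has the shape returned by `aws rds describe-db-snapshots` (a `DBSnapshots` array whose entries carry `DBInstanceIdentifier`, `DBSnapshotIdentifier` and `SnapshotCreateTime`); `latest_snapshot_id` is a hypothetical helper name.

```sh
# Hypothetical helper: read describe-db-snapshots JSON on stdin and print the
# identifier of the newest snapshot belonging to the instance given as $1.
latest_snapshot_id() {
  jq -r --arg id "$1" '
    [.DBSnapshots[] | select(.DBInstanceIdentifier == $id)]
    | sort_by(.SnapshotCreateTime)
    | last
    | .DBSnapshotIdentifier'
}

# Usage (needs AWS credentials, so it is left commented out here):
# SNAPSHOT_IDENTIFIER="$(aws rds describe-db-snapshots | latest_snapshot_id "$DATABASE_ID")"
```

The timestamps are ISO 8601 strings, so sorting them as text also sorts them chronologically. If no snapshot matches, the helper prints `null`, so check the value before using it.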

### 4. Restore the database instance from a snapshot
### 3. Restore the database instance from a snapshot

> The restored database must have the same security groups and be in the same VPC (set by the "subnet group name" parameter) as the original instance the snapshot came from. Otherwise, apps won't be able to connect to it.

Using the stored variables from the previous steps:

```sh
gds-cli aws govuk-${environment?}-admin \
aws rds restore-db-instance-from-db-snapshot \
--db-subnet-group-name ${db_subnet_group_name?} \
--db-instance-identifier restored-${db_instance_identifier?} \
--db-snapshot-identifier ${snapshot_arn?}
aws rds restore-db-instance-from-db-snapshot \
--db-subnet-group-name $DB_SUBNET_GROUP_NAME \
--db-instance-identifier restored-$DATABASE_ID \
--db-snapshot-identifier $SNAPSHOT_IDENTIFIER \
--vpc-security-group-ids $VPC_SECURITY_GROUP_ID
```

To see the newly created database instance, log into AWS Console > RDS > Databases > filter for your database name. You should see the original and newly created one.

### 5. Test the database has been fully restored
### 4. Test the database has been fully restored

Before moving on to the next step we need to ensure that the database has been fully restored and is ready to be used:

```sh
gds-cli aws govuk-${environment?}-admin \
aws rds wait db-instance-available \
--db-instance-identifier restored-${db_instance_identifier?}
aws rds wait db-instance-available --db-instance-identifier restored-${DATABASE_ID}
```

This command will wait until the database is ready, and then exit without any output.
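AWS CLI waiters stop polling after a fixed number of attempts and then exit with a non-zero status, so a large restore may outlast a single `wait` call (check the AWS CLI reference for the exact limits). A minimal sketch of a retry wrapper follows; `retry` is a hypothetical helper, not an AWS CLI feature.

```sh
# Hypothetical helper: run a command up to $1 times until it succeeds.
retry() {
  attempts="$1"
  shift
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "Gave up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
  done
}

# Usage (needs AWS credentials, so it is left commented out here):
# retry 3 aws rds wait db-instance-available --db-instance-identifier "restored-${DATABASE_ID}"
```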

### 6. Get the new database's hostname
### 5. Get the new database's hostname

Once the database is ready, fetch its hostname:
Make a note of the new endpoint address:

```sh
gds-cli aws govuk-${environment?}-admin \
aws rds describe-db-instances \
--db-instance-identifier "restored-${db_instance_identifier?}" \
--query 'DBInstances[].Endpoint.Address'
aws rds describe-db-instances \
--db-instance-identifier "restored-${DATABASE_ID}" \
--query 'DBInstances[].Endpoint.Address'
```

Make a note of this.

### 7. Connect to the restored backup database
### 6. Update the existing Secrets Manager secret value

This requires updating the `govuk/local-links-manager/postgres` secret in AWS Secrets Manager.
This requires updating the existing Secrets Manager secret for the database you've just restored.

1. Log in to AWS in the correct environment: `gds aws govuk-${environment?}-admin -l`
1. In AWS Secrets Manager, search for and click on `govuk/local-links-manager/postgres`.
1. Log in to AWS in the correct environment (`development`, `staging` or `production`).
1. In AWS Secrets Manager, search for and click on the relevant secret.
1. Under the "Overview" tab, in the "Secret Value" section, select "Retrieve Secret Value".
1. Make a note of the existing value, in case you need to revert the changes (for example, if performing a drill).
1. Click "Edit", and replace the value of the "host" and "dbInstanceIdentifier" fields with the URL and identifier of the new database instance. Click "Save".

> Some of our apps currently refer to their database directly (e.g. `app-name.hex-string.eu-west-1.rds.amazonaws.com`), some of them refer to their database indirectly via a `CNAME` record (e.g. `app-name.blue.staging.govuk-internal.digital`). In either case, you can replace this with the URL of the new database instance.

1. Log into Argo CD in the correct environment ([integration](https://argo.eks.integration.govuk.digital/),
[staging](https://argo.eks.staging.govuk.digital/), [production](https://argo.eks.production.govuk.digital/)).
1. Navigate to the `external-secrets` app, locate the `local-links-manager-postgres` external secret, select the "..." menu, and select "Refresh".
1. After refreshing this secret, the app's pods should automatically be restarted pointing at the correct database instance. To confirm that this happened, navigate to the `local-links-manager` app, locate the `local-links-manager` deployment, and check the uptime of the pods.

> If the pods were not automatically restarted, select the "..." menu next to the deployment, and select "Restart".

### 8. Check that the app is now using the restored database

Open a Rails console on the target app:

```sh
kubectl exec -n apps -it deploy/local-links-manager -- bundle exec rails c
```

Check which database ActiveRecord is connected to, and ensure it matches the hostname of the restored database:

```ruby
ActiveRecord::Base.connection_db_config.host
```
1. Click "Edit", and replace the value of the `host` and `dbInstanceIdentifier` fields with the URL and identifier of the new database instance. Click "Save".
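The console steps above can also be approximated from the command line. This is a sketch only: it assumes `jq` is installed, that the secret value is a JSON object with `host` and `dbInstanceIdentifier` keys (the fields named above), and that `SECRET_ID` and `NEW_HOST` are hypothetical variables you set yourself. `update_secret_json` is a hypothetical helper name.

```sh
# Hypothetical helper: read the current secret JSON on stdin and print a copy
# with host ($1) and dbInstanceIdentifier ($2) replaced; other keys are kept.
update_secret_json() {
  jq -c --arg host "$1" --arg id "$2" \
    '.host = $host | .dbInstanceIdentifier = $id'
}

# Usage (needs AWS credentials, so it is left commented out here):
# old_value="$(aws secretsmanager get-secret-value --secret-id "$SECRET_ID" \
#   --query SecretString --output text)"   # keep this in case you need to revert
# new_value="$(printf '%s' "$old_value" | update_secret_json "$NEW_HOST" "restored-$DATABASE_ID")"
# aws secretsmanager put-secret-value --secret-id "$SECRET_ID" --secret-string "$new_value"
```

Keeping the old value in a variable mirrors the "make a note of the existing value" step, which matters if you are running a drill and need to revert.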

## Delete an obsolete database
### 7. Redeploy the affected ECS applications

> PLEASE BE CAREFUL WHEN EXECUTING THIS COMMAND AS IT CANNOT BE UNDONE
> The execution role in ECS passes secret values to a new revision of the application. This process is triggered by a standard deployment using Terraform with a new Docker image.

For reference, here is the [AWS documentation for deleting a database instance](https://docs.aws.amazon.com/cli/latest/reference/rds/delete-db-instance.html#delete-db-instance).
1. Open an empty PR for the application you want to cut a release for.
1. Seek approval to merge this PR (for `staging` and `production` releases).
1. Manually gated production releases will need to be approved after the staging workflow has completed.

It is likely that the restored database is missing data since the snapshot was taken and you
will want to have a copy of the original database for comparison before deleting it.

The command below will create a DB snapshot before the DB instance is deleted. If you don't want this, omit the `--final-db-snapshot-identifier` parameter.

```sh
environment=production
db_instance_identifier=<e.g. local-links-manager-postgres>
snapshot_name=<e.g. local-links-manager-postgres-final-snapshot>
gds-cli aws govuk-${environment?}-admin \
aws rds delete-db-instance \
--db-instance-identifier ${db_instance_identifier?} \
--final-db-snapshot-identifier ${snapshot_name?}
```
You'll want to keep an eye on the `#tariff-alerts` channel and validate the application is still running using your usual process.

You can check the snapshot is available by navigating to RDS > Snapshots in the AWS console.
[lambda-backups]: https://github.com/trade-tariff/trade-tariff-lambdas-database-backups
2 changes: 1 addition & 1 deletion spec/helpers/commit_helpers_spec.rb
@@ -4,7 +4,7 @@
RSpec.describe CommitHelpers do
let(:helper) { Class.new { extend CommitHelpers } }
let(:uncommitted_file) do
path_to_file = "config.rb"
path_to_file = "tmp/.gitignore"
File.write(path_to_file, "new uncommitted file")
path_to_file
end
1 change: 1 addition & 0 deletions tmp/.gitkeep
@@ -0,0 +1 @@