WAL-Archiving + PITR #231

Merged

davissp14 merged 51 commits into master from object-storage-integration on Jul 5, 2024

Conversation

davissp14
Contributor

@davissp14 davissp14 commented Jun 25, 2024

This PR introduces a WAL archiving solution that integrates with any S3-compatible object storage. It also provides various remote recovery options for restoring into a new app. In-place recovery is possible with the provided functionality, but the commands must be issued manually; the current automated recovery process will fail if there's an existing postgresql directory.

Docs for in-place restores will be coming soon.

Enabling WAL Archiving

To enable, set the following secret:

S3_ARCHIVE_CONFIG=https://<AWS_ACCESS_KEY_ID>:<AWS_SECRET_ACCESS_KEY>@<AWS_ENDPOINT_URL_S3>/<BUCKET_NAME>/<PG_APP_NAME>

The most straightforward way to get started is to create a new Tigris bucket via:

flyctl ext tigris create

AWS_ACCESS_KEY_ID: tid_key_id
AWS_ENDPOINT_URL_S3: https://fly.storage.tigris.dev
AWS_SECRET_ACCESS_KEY: tsec_secret_key
BUCKET_NAME: fragrant-glade-2283

Then set the secret with fly secrets set S3_ARCHIVE_CONFIG=... --app <pg-app-name>
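
For example, plugging in the illustrative Tigris values above (placeholder credentials, not real ones), the assembled secret would look something like:

fly secrets set S3_ARCHIVE_CONFIG="https://tid_key_id:tsec_secret_key@fly.storage.tigris.dev/fragrant-glade-2283/<PG_APP_NAME>" --app <pg-app-name>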

Viewing/Updating WAL Archiving Configuration

Definitions

recovery_window (default: 7d)
Used as a retention policy. Backups older than the specified value will be pruned.
The shortest allowed recovery window is 1d.

Units available:

  • d - Days
  • w - Weeks
  • y - Years

minimum_redundancy (default: 3)
The minimum number of backups that should always be available. Must be >= 0.

full_backup_frequency (default: 24h)
The frequency at which full backups are taken. Must be >= 1h.

archive_timeout (default: 60s)
A limit on how old unarchived data can be. Archiving normally happens once a WAL segment is full; however, on less active databases a segment may take a long time to fill. If the WAL segment doesn't fill within the specified timeout, a WAL switch is forced to trigger an archive event.

WARNING - Setting the archive timeout too low can have performance implications and bloat object storage.

View your current configuration

First, you can view the existing settings by running the following curl command:

curl http://<app-name>.internal:5500/commands/admin/settings/view/barman -s | python3 -m json.tool
{
    "result": {
        "archive_timeout": "60s",
        "full_backup_frequency": "24h",
        "minimum_redundancy": "3",
        "recovery_window": "7d"
    }
}

Update your configuration

To update a configuration option, you can run:

curl http://shaun-barman-00.internal:5500/commands/admin/settings/update/barman -d '{"recovery_window": "1d"}'
{"result":{"message":"Updated","restart_required":true}}

Re-running the view command shows that the recovery window has been updated:

curl http://shaun-barman-00.internal:5500/commands/admin/settings/view/barman -s | python3 -m json.tool
{
    "result": {
        "archive_timeout": "60s",
        "full_backup_frequency": "24h",
        "minimum_redundancy": "3",
        "recovery_window": "1d"
    }
}
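
The other settings listed above can be updated through the same endpoint. For example, a purely illustrative change to the full backup frequency:

curl http://shaun-barman-00.internal:5500/commands/admin/settings/update/barman -d '{"full_backup_frequency": "12h"}'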

Note - This is a manual process for now, but the idea is to make it straightforward to surface within a client and/or UI. These settings currently require a cluster restart, but in the future it should be possible to configure them at runtime.
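
Since the settings only take effect after a cluster restart, something like the following should apply the change (assuming the standard flyctl Postgres tooling):

fly postgres restart --app <pg-app-name>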

Remote Restore

Provision a new Postgres with the following secret:

S3_ARCHIVE_REMOTE_RESTORE_CONFIG=https://<access_key_id>:<access_secret>@<s3-endpoint>/<source-bucket-name>/<source-bucket-directory>

Note: If you don't specify a target, the restore will default to the most recent point in time.

Target options

target
Specifies that recovery should end as soon as a consistent state is reached. The only allowed value here is "immediate".

targetTime
This parameter specifies the timestamp up to which recovery will proceed. The timestamp should follow RFC3339 formatting.

targetName
This parameter specifies the named restore point.

targetLSN (Not implemented)
This parameter specifies the LSN of the write-ahead log location up to which recovery will proceed.

targetXID (Not implemented)
This parameter specifies the transaction ID up to which recovery will proceed.

Additional target options

targetInclusive (default: true)
Specifies whether to stop just after the specified recovery target (true), or just before the recovery target (false).

targetAction (default: promote)
Specifies what action the server should take once the recovery target is reached.

targetTimeline (default: current-timeline)
Specifies recovering into a particular timeline. The default is to recover along the same timeline that was current when the base backup was taken.

Example usage

S3_ARCHIVE_REMOTE_RESTORE_CONFIG=https://<access_key_id>:<access_secret>@<s3-endpoint>/<source-bucket-name>/<source-bucket-directory>?targetTime=2024-07-03T20:23:46Z
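
The other targets listed above follow the same query-parameter pattern. For instance, a hypothetical restore that stops as soon as a consistent state is reached:

S3_ARCHIVE_REMOTE_RESTORE_CONFIG=https://<access_key_id>:<access_secret>@<s3-endpoint>/<source-bucket-name>/<source-bucket-directory>?target=immediate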

For more information on these settings, see: https://www.postgresql.org/docs/current/runtime-config-wal.html#RUNTIME-CONFIG-WAL-RECOVERY-TARGET

@davissp14 davissp14 force-pushed the object-storage-integration branch 2 times, most recently from fefe347 to 3e1a877 on June 25, 2024 19:01
@benwaffle
Contributor

can we change the env vars to match the default fly tigris create ones?

internal/flypg/pg.go (review thread, outdated, resolved)
@davissp14 davissp14 marked this pull request as ready for review July 2, 2024 18:24
@davissp14 davissp14 changed the title Object storage integration WAL-Archiving + PITR Jul 2, 2024
internal/flypg/node.go (review thread, outdated, resolved)
@benwaffle
Contributor

I'm seeing an error listing backups:

2024-07-03T06:07:43Z app[6833e41a77d908] ewr [info]panic: failed to restore base backup: failed to resolve backup target by time: failed to parse backup end time: parsing time "" as "Mon Jan 2 15:04:05 2006": cannot parse "" as "Mon"
2024-07-03T06:07:43Z app[6833e41a77d908] ewr [info]goroutine 1 [running]:
2024-07-03T06:07:43Z app[6833e41a77d908] ewr [info]main.panicHandler({0x9ba280?, 0xc000106340})
2024-07-03T06:07:43Z app[6833e41a77d908] ewr [info]	/go/src/github.com/fly-apps/fly-postgres/cmd/start/main.go:198 +0x55
2024-07-03T06:07:43Z app[6833e41a77d908] ewr [info]main.main()
2024-07-03T06:07:43Z app[6833e41a77d908] ewr [info]	/go/src/github.com/fly-apps/fly-postgres/cmd/start/main.go:75 +0xf50

here's the json:

postgres@6833e41a77d908:~$ barman-cloud-backup-list --cloud-provider aws-s3 --endpoint-url https://fly.storage.tigris.dev --profile restore --format json s3://dbdbdbbucket dbdbdb | python3 -m json.tool
{
    "backups_list": [
        {
            "backup_label": null,
            "begin_offset": 216,
            "begin_time": "Tue Jul  2 21:05:44 2024",
            "begin_wal": "00000005000000000000002A",
            "begin_xlog": "0/2A0000D8",
            "compression": null,
            "config_file": "/data/postgresql/postgresql.conf",
            "copy_stats": null,
            "deduplicated_size": null,
            "end_offset": null,
            "end_time": null,
            "end_wal": null,
            "end_xlog": null,
            "error": "failure uploading data (connection already closed)",
            "hba_file": "/data/postgresql/pg_hba.conf",
            "ident_file": "/data/postgresql/pg_ident.conf",
            "included_files": [
                "/data/postgresql/postgresql.auto.conf",
                "/data/postgresql/postgresql.internal.conf"
            ],
            "mode": null,
            "pgdata": "/data/postgresql",
            "server_name": "cloud",
            "size": null,
            "status": "FAILED",
            "systemid": "7332222271544570189",
            "tablespaces": null,
            "timeline": 5,
            "version": 150006,
            "xlog_segment_size": 16777216,
            "backup_id": "20240702T210544"
        },
        {
            "backup_label": "'START WAL LOCATION: 0/2E000028 (file 00000005000000000000002E)\\nCHECKPOINT LOCATION: 0/2E000060\\nBACKUP METHOD: streamed\\nBACKUP FROM: primary\\nSTART TIME: 2024-07-02 21:07:03 UTC\\nLABEL: Barman backup cloud 20240702T210703\\nSTART TIMELINE: 5\\n'",
            "begin_offset": 40,
            "begin_time": "Tue Jul  2 21:07:03 2024",
            "begin_wal": "00000005000000000000002E",
            "begin_xlog": "0/2E000028",
            "compression": null,
            "config_file": "/data/postgresql/postgresql.conf",
            "copy_stats": {
                "total_time": 5.192484,
                "number_of_workers": 2,
                "analysis_time": 0,
                "analysis_time_per_item": {
                    "data": 0
                },
                "copy_time_per_item": {
                    "data": 4.423057
                },
                "serialized_copy_time_per_item": {
                    "data": 2.484593
                },
                "copy_time": 4.423057,
                "serialized_copy_time": 2.484593
            },
            "deduplicated_size": null,
            "end_offset": 80,
            "end_time": "Tue Jul  2 21:07:05 2024",
            "end_wal": "00000005000000000000002F",
            "end_xlog": "0/2F000050",
            "error": null,
            "hba_file": "/data/postgresql/pg_hba.conf",
            "ident_file": "/data/postgresql/pg_ident.conf",
            "included_files": [
                "/data/postgresql/postgresql.auto.conf",
                "/data/postgresql/postgresql.internal.conf"
            ],
            "mode": null,
            "pgdata": "/data/postgresql",
            "server_name": "cloud",
            "size": null,
            "status": "DONE",
            "systemid": "7332222271544570189",
            "tablespaces": null,
            "timeline": 5,
            "version": 150006,
            "xlog_segment_size": 16777216,
            "backup_id": "20240702T210703"
        }
    ]
}

@davissp14
Contributor Author

davissp14 commented Jul 3, 2024

What do the query params look like for the S3_ARCHIVE_REMOTE_RESTORE_CONFIG secret?

EDIT
Oh, I see. The backup failed, so it doesn't have an end_time. Good find.

UPDATE
Issue should be fixed.

@davissp14 davissp14 merged commit 135085d into master Jul 5, 2024
5 of 6 checks passed
@davissp14 davissp14 deleted the object-storage-integration branch July 5, 2024 22:59