Apply suggestions from code review
Co-authored-by: Paul Banks <pbanks@hashicorp.com>
im2nguyen and banks authored Feb 24, 2023
1 parent 22a4aad commit c3e53c3
Showing 3 changed files with 14 additions and 15 deletions.
5 changes: 2 additions & 3 deletions website/content/docs/agent/config/config-files.mdx
@@ -1650,14 +1650,13 @@ Valid time units are 'ns', 'us' (or 'µs'), 'ms', 's', 'm', 'h'."
configure the `interval` and set [`enabled`](#raft_logstore_verification_enabled)
to `true` to correctly enable intervals. We recommend using an interval
between `30s` and `5m`. The performance overhead is insignificant if the
interval is set to `5m` or less. We recommend setting an interval to
control how frequently the report logs appear for human observation.
interval is set to `5m` or less.
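Putting the parameters above together, a server configuration that enables periodic verification might look like the following sketch (the `60s` interval is illustrative, chosen from the recommended `30s` to `5m` range):

```hcl
raft_logstore {
  verification {
    enabled  = true
    interval = "60s"
  }
}
```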

- `boltdb` ((#raft_logstore_boltdb)) - Object that configures options for
Raft's `boltdb` backend. It has no effect if the `backend` is not `boltdb`.

- `no_freelist_sync` ((#raft_logstore_boltdb_no_freelist_sync)) - Set to
`true` to disable storing BoltDB freelist to disk within the
`true` to disable storing BoltDB's freelist to disk within the
`raft.db` file. Disabling freelist syncs reduces the disk IO required
for write operations, but could potentially increase startup time
because Consul must scan the database to find free space
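As a sketch, disabling freelist syncs for the BoltDB backend would look like the following (this setting only has an effect when the `backend` is `boltdb`):

```hcl
raft_logstore {
  backend = "boltdb"
  boltdb {
    no_freelist_sync = true
  }
}
```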
20 changes: 10 additions & 10 deletions website/content/docs/agent/wal-logstore/enable.mdx
@@ -36,8 +36,8 @@ The likelihood of the following potential risks is low to very low:

- If WAL corrupts data on a Consul server agent, the server data cannot be recovered. Restart the server with an empty data directory and reload its state from the leader to resolve the issue.
- WAL may corrupt data or contain a defect that causes the server to panic and crash. The server may not restart if the defect recurs when WAL reads from the logs on startup. Restart the server with an empty data directory and reload its state from the leader to resolve the issue.
- Clients may read corrupted data from the Consul server, such as invalid IP addresses or unmatching tokens, if WAL corrupts data. This is unlikely even if a recuring defect cause WAL to corrupt data because replication uses objects cached in memory rather than reads from disk. Restore the server to resolve the issue.
- If you enable a Consul OSS server to use WAL or enable WAL on a voting server with Consul Enterprise, WAL may corrupt the server's state, become the leader, and replicate the corrupted state to all other servers. In this scenario only a restore from backup would recover a completely un-corrupt state. Test WAL on a non-voting server in Enterprise to preven this scenario.
- Clients may read corrupted data from the Consul server, such as invalid IP addresses or mismatched tokens, if WAL corrupts data. This is unlikely even if a recurring defect causes WAL to corrupt data because replication uses objects cached in memory rather than reads from disk. Restart the server with an empty data directory and reload its state from the leader to resolve the issue.
- If you enable a Consul OSS server to use WAL or enable WAL on a voting server with Consul Enterprise, WAL may corrupt the server's state, and the server may then become the leader and replicate the corrupted state to all other servers. In this case, only a restore from backup would recover a completely uncorrupted state. Test WAL on a non-voting server in Enterprise to prevent this. You can add a new non-voting server to the cluster to test with if there are no existing ones.


## Enable log verification
@@ -66,7 +66,7 @@ When complete, log entries for the servers should resemble the following status:

## Select target server to enable WAL

If you are using Consul OSS or Consul Enterprise without non-voting servers, select a follower server to enable WAL. As noted in [Risks](#risks), Consul Enterprise users with non-voting servers should first select a non-voting server.
If you are using Consul OSS or Consul Enterprise without non-voting servers, select a follower server to enable WAL. As noted in [Risks](#risks), Consul Enterprise users with non-voting servers should first select a non-voting server, or consider adding another server as a non-voter to test on.

Retrieve the current state of the servers by running the following command:

@@ -76,7 +76,7 @@ $ consul operator raft list-peers

## Stop target server

Stop the target server gracefully. For example, if you are using `systemcmd`,
Stop the target server gracefully. For example, if you are using `systemd`,
run the following command:

```shell-session
$ systemctl stop consul
```

@@ -89,10 +89,10 @@ If your environment uses configuration management automation that might interfere

Temporarily moving the data directory to a different location is less destructive than deleting it. We recommend moving it in case you are unable to enable WAL successfully. Do not use the old data directory (`/data-dir/raft.bak`) for recovery after restarting the server. We recommend eventually deleting the old directory.

The following example moves the data atfrom `/data-dir` in the configuration file to `/temp/data-dir`.
The following example assumes the `data_dir` in the server's configuration is `/data-dir` and renames the raft directory within it to `/data-dir/raft.bak`.

```shell-session
$ mv /data-dir/raft /temp/data-dir/raft.bak
$ mv /data-dir/raft /data-dir/raft.bak
```

When switching backends, you must always remove the _whole raft directory_, not just the `raft.db` file or `wal` directory. This is because the log must always be consistent with the snapshots to avoid undefined behavior or data loss.
@@ -113,7 +113,7 @@ raft_logstore {
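The collapsed hunk above edits the `raft_logstore` block in the server configuration. As a sketch, switching the backend to WAL is a one-line change:

```hcl
raft_logstore {
  backend = "wal"
}
```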

## Start target server

Start the target server. For example, if you are using `systemcmd`, run the following command:
Start the target server. For example, if you are using `systemd`, run the following command:

```shell-session
$ systemctl start consul
```

@@ -129,13 +129,13 @@ $ consul operator raft list-peers

Refer to [Monitor Raft metrics and logs for WAL](/consul/docs/agent/wal-logstore/monitoring) for details.

We recommend leaving the cluster in the test configuration for several days or weeks. If you do not record negative metrics or verification errors in logs, then you should have more confidence that WAL operates corerctly under varying workloads and during routine server restarts.
We recommend leaving the cluster in the test configuration for several days or weeks, assuming no errors are observed. An extended test provides more confidence that WAL operates correctly under varying workloads and during routine server restarts. If you observe any errors, end the test immediately and report them.

If you disabled configuration management automation, consider reenabling it during the testing phase. Monitor the automation so that you can verify that it does not fix the Consul configuration file and remove the different backend.
If you disabled configuration management automation, consider reenabling it during the testing phase to pick up other updates for the host. You must ensure that it does not revert the Consul configuration file and remove the different backend configuration. One way to do this is to add the `raft_logstore` block to a separate file that is not managed by your automation. This file can either be placed in the directory if [`-config-dir`](/consul/docs/agent/config/cli-flags#_config_dir) is used or passed as an additional [`-config-file`](/consul/docs/agent/config/cli-flags#_config_file) argument.
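For example, a sketch of keeping the backend setting in its own unmanaged file (the `/etc/consul.d` path is hypothetical and depends on your `-config-dir`):

```shell-session
$ cat > /etc/consul.d/raft-logstore.hcl <<'EOF'
raft_logstore {
  backend = "wal"
}
EOF
```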

## Next steps

- If you see any verification errors, performance anomalies or other suspicious behavior from the target server during the test, you should follow [the procedure to revert back to BoltDB](/consul/docs/agent/wal-logstore/revert-to-boltdb).
- If you see any verification errors, performance anomalies, or other suspicious behavior from the target server during the test, you should immediately follow [the procedure to revert back to BoltDB](/consul/docs/agent/wal-logstore/revert-to-boltdb). Please report failures via GitHub.

- If you do not see errors and would like to expand the test further, you can repeat the procedure above on another target server. We suggest waiting after each expansion and rolling WAL out slowly to other parts of your environment. Once the majority of your servers use WAL, any bugs not yet discovered may result in cluster unavailability.

4 changes: 2 additions & 2 deletions website/content/docs/agent/wal-logstore/index.mdx
@@ -13,11 +13,11 @@ This topic provides an overview of the experimental WAL (write-ahead log) LogStore

## WAL versus BoltDB

WAL implements a traditional log with rotating, append-only log files. WAL resolves many issues with the existing `LogStore` provided by the BoltDB backend. The BoltDB `LogStore` is a copy-on-write BTree, which is not optimized for append-only workloads.
WAL implements a traditional log with rotating, append-only log files. WAL resolves many issues with the existing `LogStore` provided by the BoltDB backend. The BoltDB `LogStore` is a copy-on-write BTree, which is not optimized for append-only, write-heavy workloads.

### BoltDB storage scalability issues

The existing BoltDB log store inefficiently stores append-only logs to disk because it was designed as a full key-value database. It is a single file that only ever grows. Deleting the oldest logs, which Consul does regularly when it makes new snapshots of the state, leaves free space in the file. The free space must be tracked so that Consul can reuse it on future writes. By contrast, a simple segmented log can delete the oldest log files from disk.
The existing BoltDB log store inefficiently stores append-only logs to disk because it was designed as a full key-value database. It is a single file that only ever grows. Deleting the oldest logs, which Consul does regularly when it makes new snapshots of the state, leaves free space in the file. The free space must be tracked in a `freelist` so that BoltDB can reuse it on future writes. By contrast, a simple segmented log can delete the oldest log files from disk.

A burst of writes at double or triple the normal volume can suddenly cause the log file to grow to several times its steady-state size. After Consul takes the next snapshot and truncates the oldest logs, the resulting file is mostly empty space.
