Add instructions for 3.5->3.4 downgrade. #847

Merged · 1 commit · Jun 4, 2024
5 changes: 5 additions & 0 deletions content/en/docs/v3.5/downgrades/_index.md
@@ -0,0 +1,5 @@
---
title: Downgrading
weight: 6000
description: Downgrading etcd clusters and applications
---
276 changes: 276 additions & 0 deletions content/en/docs/v3.5/downgrades/downgrade_3_5.md
@@ -0,0 +1,276 @@
---
title: Downgrade etcd from 3.5 to 3.4
weight: 6650
description: Processes, checklists, and notes on downgrading etcd from 3.5 to 3.4
---

In the general case, downgrading from etcd 3.5 to 3.4 can be a zero-downtime, rolling downgrade:
- one by one, stop the etcd 3.5 processes and replace them with etcd 3.4 processes
- once any 3.4 process has been started, new features in 3.5 are no longer available to the cluster

Before [starting a downgrade](#downgrade-procedure), read through the rest of this guide to prepare.

### Downgrade checklists


**NOTE:** If your cluster has auth enabled, a rolling downgrade from 3.5 isn't supported because 3.5 [changed the format of WAL entries related to auth](https://github.com/etcd-io/etcd/pull/11943). Follow the [authentication instructions](../../op-guide/authentication/rbac/) to disable auth and delete all users first, as sketched below.
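
A minimal sketch of this pre-downgrade cleanup, assuming auth is currently enabled; the root password and the `example-user` name are placeholders:

```bash
# Turn authentication off first; root credentials are placeholders
etcdctl --user root:<root-password> auth disable

# Then list and delete the remaining users
etcdctl user list
etcdctl user delete example-user
```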

Highlighted breaking changes from 3.5 to 3.4:

#### Difference in flags

If you are using any of the following flags in your 3.5 configuration, make sure to remove them, rename them, or adjust their values as needed when downgrading to 3.4.

**NOTE:** The diff below is based on v3.5.14 and v3.4.33. The actual differences depend on your patch versions, so check first with a command like the one sketched below.
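
Assuming the two releases are unpacked under `etcd-3.5/` and `etcd-3.4/` (the paths are placeholders), one way to generate the comparison:

```bash
# Compare the flags advertised by the two binaries' help output
diff <(etcd-3.5/bin/etcd -h | grep -F -- --) <(etcd-3.4/bin/etcd -h | grep -F -- --)
```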

```diff
# flags not available in 3.4
-etcd --socket-reuse-port
-etcd --socket-reuse-address
-etcd --raft-read-timeout
-etcd --raft-write-timeout
-etcd --v2-deprecation
-etcd --client-cert-file
-etcd --client-key-file
-etcd --peer-client-cert-file
-etcd --peer-client-key-file
-etcd --self-signed-cert-validity
-etcd --enable-log-rotation --log-rotation-config-json=some.json
-etcd --experimental-enable-distributed-tracing --experimental-distributed-tracing-address='localhost:4317' --experimental-distributed-tracing-service-name='etcd' --experimental-distributed-tracing-instance-id='' --experimental-distributed-tracing-sampling-rate='0'
-etcd --experimental-compact-hash-check-enabled --experimental-compact-hash-check-time='1m'
-etcd --experimental-downgrade-check-time
-etcd --experimental-memory-mlock
-etcd --experimental-txn-mode-write-with-shared-buffer
-etcd --experimental-bootstrap-defrag-threshold-megabytes
-etcd --experimental-stop-grpc-service-on-defrag

# same flags with different names
-etcd --backend-bbolt-freelist-type=map
+etcd --experimental-backend-bbolt-freelist-type=array

# same flags with different defaults
-etcd --pre-vote=true
+etcd --pre-vote=false

-etcd --logger=zap
+etcd --logger=capnslog
```

#### `etcd --logger zap`

3.4 defaults to `--logger=capnslog` while 3.5 defaults to `--logger=zap`.

If you want to keep using `zap`, it must be specified explicitly:

```diff
+etcd --logger=zap --log-outputs=stderr

+# to write logs to stderr and a.log file at the same time
+etcd --logger=zap --log-outputs=stderr,a.log
```

#### Difference in Prometheus metrics

```diff
# metrics not available in 3.4
-etcd_debugging_mvcc_db_compaction_last
```
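
If your monitoring depends on metrics that only exist in 3.5, update dashboards and alerts before downgrading. A quick, illustrative check, assuming a local member serving metrics on `localhost:2379` without TLS:

```bash
# Returns nothing once the member runs 3.4, since the metric no longer exists
curl -s http://localhost:2379/metrics | grep etcd_debugging_mvcc_db_compaction_last
```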

### Server downgrade checklists

#### Downgrade requirements

To ensure a smooth rolling downgrade, the running cluster must be healthy. Check the health of the cluster by using the `etcdctl endpoint health` command before proceeding.

The 3.4 version to downgrade to must be >= 3.4.32.

#### Preparation

Always test the services relying on etcd in a staging environment before deploying the downgrade to the production environment.

Before beginning, [download the snapshot backup](../../op-guide/maintenance/#snapshot-backup). Should something go wrong with the downgrade, it is possible to use this backup to [roll back](#rollback) to the existing etcd version. Please note that the `snapshot` command only backs up the v3 data. For v2 data, see [backing up v2 datastore](/docs/v2.3/admin_guide#backing-up-the-datastore).

Before beginning, download the latest release of etcd 3.4, and make sure its version is >= 3.4.32.
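
A quick sanity check of the downloaded binaries (the unpack path is a placeholder):

```bash
etcd-3.4/bin/etcd --version
etcd-3.4/bin/etcdctl version
```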

#### Mixed versions

While downgrading, an etcd cluster supports mixed versions of etcd members, and operates with the protocol of the lowest common version. The cluster is considered downgraded once any of its members is downgraded to version 3.4. Internally, etcd members negotiate with each other to determine the overall cluster version, which controls the reported version and the supported features.
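
For example, while a member still running 3.5 coexists with downgraded 3.4 members, its reported server version and the negotiated cluster version can differ (the output below is illustrative):

```bash
curl http://localhost:2379/version
<<COMMENT
{"etcdserver":"3.5.13","etcdcluster":"3.4.0"}
COMMENT
```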

#### Limitations

Note: If the cluster only has v3 data and no v2 data, it is not subject to this limitation.

If the cluster is serving a v2 data set larger than 50MB, each newly downgraded member may take up to two minutes to catch up with the existing cluster. Check the size of a recent snapshot to estimate the total data size. In other words, it is safest to wait for 2 minutes between downgrading each member.
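
One way to estimate the total data size is to inspect a recent snapshot file (`backup.db` is a placeholder filename):

```bash
etcdctl snapshot status backup.db -w table
```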

For a much larger total data size, 100MB or more, this one-time process might take even more time. Administrators of very large etcd clusters of this magnitude can feel free to contact the [etcd team][etcd-contact] before downgrading, and we'll be happy to provide advice on the procedure.

#### Rollback

If any member has been downgraded to 3.4, the cluster version will be downgraded to 3.4, and operations will be "3.4" compatible. To roll back, follow the [Upgrade etcd from 3.4 to 3.5](../../upgrades/upgrade_3_5/) instructions.

Please [download the snapshot backup](../../op-guide/maintenance/#snapshot-backup) beforehand so that the cluster can be restored even after it has been completely downgraded.

### Downgrade procedure

This example shows how to downgrade a 3-member 3.5 etcd cluster running on a local machine.

#### Step 1: check downgrade requirements

Is the cluster healthy and running 3.5.x?

```bash
etcdctl --endpoints=localhost:2379,localhost:22379,localhost:32379 endpoint health
<<COMMENT
localhost:2379 is healthy: successfully committed proposal: took = 2.118638ms
localhost:22379 is healthy: successfully committed proposal: took = 3.631388ms
localhost:32379 is healthy: successfully committed proposal: took = 2.157051ms
COMMENT

curl http://localhost:2379/version
<<COMMENT
{"etcdserver":"3.5.0","etcdcluster":"3.5.0"}
COMMENT

curl http://localhost:22379/version
<<COMMENT
{"etcdserver":"3.5.0","etcdcluster":"3.5.0"}
COMMENT

curl http://localhost:32379/version
<<COMMENT
{"etcdserver":"3.5.0","etcdcluster":"3.5.0"}
COMMENT
```

#### Step 2: download snapshot backup from leader

[Download the snapshot backup](../../op-guide/maintenance/#snapshot-backup) to provide a recovery path should any problems occur.
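
For example, taking a snapshot from the current leader (the endpoint and file name are placeholders):

```bash
etcdctl --endpoints=localhost:2379 snapshot save backup.db
```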

#### Step 3: stop one existing etcd server

Before stopping the server, check whether it is the leader:

```bash
etcdctl --endpoints=localhost:2379,localhost:22379,localhost:32379 endpoint status -w=table
<<COMMENT
+-----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| localhost:2379 | 8211f1d0f64f3269 | 3.5.13 | 20 kB | true | false | 2 | 9 | 9 | |
| localhost:22379 | 91bc3c398fb3c146 | 3.5.13 | 20 kB | false | false | 2 | 9 | 9 | |
| localhost:32379 | fd422379fda50e48 | 3.5.13 | 20 kB | false | false | 2 | 9 | 9 | |
+-----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
COMMENT
```
If the server to be stopped is the leader, you can avoid some downtime by transferring leadership to another member with `move-leader` before stopping it:
```bash
etcdctl --endpoints=localhost:2379,localhost:22379,localhost:32379 move-leader 91bc3c398fb3c146

etcdctl --endpoints=localhost:2379,localhost:22379,localhost:32379 endpoint status -w=table
<<COMMENT
+-----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| localhost:2379 | 8211f1d0f64f3269 | 3.5.13 | 20 kB | false | false | 3 | 11 | 11 | |
| localhost:22379 | 91bc3c398fb3c146 | 3.5.13 | 20 kB | true | false | 3 | 11 | 11 | |
| localhost:32379 | fd422379fda50e48 | 3.5.13 | 20 kB | false | false | 3 | 11 | 11 | |
+-----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
COMMENT
```

When each etcd process is stopped, expected errors will be logged by other cluster members. This is normal since a cluster member connection has been (temporarily) broken:

```
{"level":"info","ts":"2024-05-14T20:25:47.051124Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"91bc3c398fb3c146 became leader at term 3"}
{"level":"info","ts":"2024-05-14T20:25:47.051139Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"raft.node: 91bc3c398fb3c146 elected leader 91bc3c398fb3c146 at term 3"}

^C{"level":"warn","ts":"2024-05-14T20:27:09.094119Z","caller":"rafthttp/stream.go:421","msg":"lost TCP streaming connection with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"91bc3c398fb3c146","remote-peer-id":"8211f1d0f64f3269","error":"EOF"}
{"level":"warn","ts":"2024-05-14T20:27:09.09427Z","caller":"rafthttp/stream.go:421","msg":"lost TCP streaming connection with remote peer","stream-reader-type":"stream Message","local-member-id":"91bc3c398fb3c146","remote-peer-id":"8211f1d0f64f3269","error":"EOF"}
{"level":"warn","ts":"2024-05-14T20:27:09.095535Z","caller":"rafthttp/peer_status.go:66","msg":"peer became inactive (message send to peer failed)","peer-id":"8211f1d0f64f3269","error":"failed to dial 8211f1d0f64f3269 on stream MsgApp v2 (peer 8211f1d0f64f3269 failed to find local node 91bc3c398fb3c146)"}
{"level":"warn","ts":"2024-05-14T20:27:09.43915Z","caller":"rafthttp/stream.go:223","msg":"lost TCP streaming connection with remote peer","stream-writer-type":"stream Message","local-member-id":"91bc3c398fb3c146","remote-peer-id":"8211f1d0f64f3269"}
{"level":"warn","ts":"2024-05-14T20:27:11.085646Z","caller":"etcdserver/cluster_util.go:294","msg":"failed to reach the peer URL","address":"http://127.0.0.1:12380/version","remote-member-id":"8211f1d0f64f3269","error":"Get \"http://127.0.0.1:12380/version\": dial tcp 127.0.0.1:12380: connect: connection refused"}
{"level":"warn","ts":"2024-05-14T20:27:11.085718Z","caller":"etcdserver/cluster_util.go:158","msg":"failed to get version","remote-member-id":"8211f1d0f64f3269","error":"Get \"http://127.0.0.1:12380/version\": dial tcp 127.0.0.1:12380: connect: connection refused"}
{"level":"warn","ts":"2024-05-14T20:27:13.557385Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"8211f1d0f64f3269","rtt":"416.079µs","error":"dial tcp 127.0.0.1:12380: connect: connection refused"}
```

#### Step 4: restart the etcd server with the same configuration plus `--next-cluster-version-compatible`

Restart the etcd server with the same configuration, but using the 3.4 binary and the additional `--next-cluster-version-compatible` flag.

```diff
-etcd-old --name s1 \
+etcd-new --name s1 \
--data-dir /tmp/etcd/s1 \
--listen-client-urls http://localhost:2379 \
--advertise-client-urls http://localhost:2379 \
--listen-peer-urls http://localhost:2380 \
--initial-advertise-peer-urls http://localhost:2380 \
--initial-cluster s1=http://localhost:2380,s2=http://localhost:22380,s3=http://localhost:32380 \
--initial-cluster-token tkn \
--initial-cluster-state existing \
--next-cluster-version-compatible
```

The new 3.4 etcd will publish its information to the cluster. At this point, the cluster will start to operate with the 3.4 protocol, which is the lowest common version.

> `{"level":"info","ts":"2024-05-13T21:05:43.981445Z","caller":"membership/cluster.go:561","msg":"set initial cluster version","cluster-id":"ef37ad9dc622a7c4","local-member-id":"8211f1d0f64f3269","cluster-version":"3.0"}`

> `{"level":"info","ts":"2024-05-13T21:05:43.982188Z","caller":"api/capability.go:77","msg":"enabled capabilities for version","cluster-version":"3.0"}`

> `{"level":"info","ts":"2024-05-13T21:05:43.982312Z","caller":"membership/cluster.go:549","msg":"updated cluster version","cluster-id":"ef37ad9dc622a7c4","local-member-id":"8211f1d0f64f3269","from":"3.0","from":"3.5"}`

> `{"level":"info","ts":"2024-05-13T21:05:43.982376Z","caller":"api/capability.go:77","msg":"enabled capabilities for version","cluster-version":"3.5"}`

> `{"level":"info","ts":"2024-05-13T21:05:44.000672Z","caller":"etcdserver/server.go:2152","msg":"published local member to cluster through raft","local-member-id":"8211f1d0f64f3269","local-member-attributes":"{Name:infra1 ClientURLs:[http://127.0.0.1:2379]}","request-path":"/0/members/8211f1d0f64f3269/attributes","cluster-id":"ef37ad9dc622a7c4","publish-timeout":"7s"}`

> `{"level":"info","ts":"2024-05-13T21:05:46.452631Z","caller":"membership/cluster.go:549","msg":"updated cluster version","cluster-id":"ef37ad9dc622a7c4","local-member-id":"8211f1d0f64f3269","from":"3.5","from":"3.4"}`

Verify that each member, and then the entire cluster, becomes healthy with the new 3.4 etcd binary:

```bash
etcdctl endpoint health --endpoints=localhost:2379,localhost:22379,localhost:32379
<<COMMENT
localhost:32379 is healthy: successfully committed proposal: took = 2.337471ms
localhost:22379 is healthy: successfully committed proposal: took = 1.130717ms
localhost:2379 is healthy: successfully committed proposal: took = 2.124843ms
COMMENT
```

Members that have not yet been downgraded will log info like the following:

```
{"level":"info","ts":"2024-05-13T21:05:46.450764Z","caller":"etcdserver/server.go:2633","msg":"updating cluster version using v2 API","from":"3.5","to":"3.4"}
{"level":"info","ts":"2024-05-13T21:05:46.452419Z","caller":"membership/cluster.go:576","msg":"updated cluster version","cluster-id":"ef37ad9dc622a7c4","local-member-id":"91bc3c398fb3c146","from":"3.5","to":"3.4"}
{"level":"info","ts":"2024-05-13T21:05:46.452547Z","caller":"etcdserver/server.go:2652","msg":"cluster version is updated","cluster-version":"3.4"}
```

#### Step 5: repeat *step 3* and *step 4* for the rest of the members

When all members are downgraded, check the health status and version of the cluster:

```bash
etcdctl endpoint health --endpoints=localhost:2379,localhost:22379,localhost:32379
<<COMMENT
localhost:2379 is healthy: successfully committed proposal: took = 492.834µs
localhost:22379 is healthy: successfully committed proposal: took = 1.015025ms
localhost:32379 is healthy: successfully committed proposal: took = 1.853077ms
COMMENT

curl http://localhost:2379/version
<<COMMENT
{"etcdserver":"3.4.32","etcdcluster":"3.4.0"}
COMMENT

curl http://localhost:22379/version
<<COMMENT
{"etcdserver":"3.4.32","etcdcluster":"3.4.0"}
COMMENT

curl http://localhost:32379/version
<<COMMENT
{"etcdserver":"3.4.32","etcdcluster":"3.4.0"}
COMMENT
```

[etcd-contact]: https://groups.google.com/g/etcd-dev
12 changes: 12 additions & 0 deletions content/en/docs/v3.5/downgrades/downgrading-etcd.md
@@ -0,0 +1,12 @@
---
title: Downgrading etcd clusters and applications
weight: 6500
description: Documentation list for downgrading etcd clusters and applications
---

This section contains documents specific to downgrading etcd clusters and applications.

## Downgrading an etcd v3.x cluster
* [Downgrade etcd from 3.5 to 3.4](../downgrade_3_5/)

1 change: 1 addition & 0 deletions content/en/docs/v3.5/upgrades/upgrading-etcd.md
@@ -11,6 +11,7 @@ This section contains documents specific to upgrading etcd clusters and applications.
* [Upgrade etcd from 3.1 to 3.2](../upgrade_3_2/)
* [Upgrade etcd from 3.2 to 3.3](../upgrade_3_3/)
* [Upgrade etcd from 3.3 to 3.4](../upgrade_3_4/)
* [Upgrade etcd from 3.4 to 3.5](../upgrade_3_5/)

## Upgrading from etcd v2.3
* [Upgrade a v2.3 cluster to v3.0](../upgrade_3_0/)
1 change: 1 addition & 0 deletions content/en/docs/v3.6/upgrades/upgrading-etcd.md
@@ -11,6 +11,7 @@ This section contains documents specific to upgrading etcd clusters and applications.
* [Upgrade etcd from 3.1 to 3.2](../upgrade_3_2/)
* [Upgrade etcd from 3.2 to 3.3](../upgrade_3_3/)
* [Upgrade etcd from 3.3 to 3.4](../upgrade_3_4/)
* [Upgrade etcd from 3.4 to 3.5](../upgrade_3_5/)

## Upgrading from etcd v2.3
* [Upgrade a v2.3 cluster to v3.0](../upgrade_3_0/)