Allow cluster to start after power off when ETCD_DISABLE_PRESTOP is set #38514

Merged
merged 4 commits into from
Aug 28, 2023

Conversation

bklei
Contributor

@bklei bklei commented Jun 22, 2023

Description of the change

When etcd is configured with a static member list (ETCD_DISABLE_PRESTOP set), scaling the cluster to zero and back up, or powering the cluster off and on, leaves every pod in CrashLoopBackOff (CLBO). This change exits the setup code early in that case and lets etcd attempt to rejoin the cluster once the other members start.
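
The scale-to-zero scenario described above could be reproduced roughly as follows. This is only a sketch: the namespace `services` and the StatefulSet name `cray-bos-bitnami-etcd` are assumptions inferred from the pod hostnames in the logs below, and `run`, `bounce_etcd`, and `DRY_RUN` are hypothetical names introduced for illustration. By default the script only prints the commands it would run.

```shell
#!/bin/bash
# Assumed names, inferred from the pod hostnames in the logs; adjust for your cluster.
NAMESPACE="${NAMESPACE:-services}"
STATEFULSET="${STATEFULSET:-cray-bos-bitnami-etcd}"
REPLICAS="${REPLICAS:-3}"

# Hypothetical wrapper: print each command by default; set DRY_RUN=no to execute it.
run() {
    if [ "${DRY_RUN:-yes}" = "yes" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Scale the etcd StatefulSet to zero (forcing loss of quorum), then back up.
bounce_etcd() {
    run kubectl -n "$NAMESPACE" scale statefulset "$STATEFULSET" --replicas=0
    # When running for real, confirm the pods are gone before scaling back up.
    run kubectl -n "$NAMESPACE" get pods
    run kubectl -n "$NAMESPACE" scale statefulset "$STATEFULSET" --replicas="$REPLICAS"
    run kubectl -n "$NAMESPACE" get pods
}
```

Calling `bounce_etcd` prints the four `kubectl` commands; `DRY_RUN=no bounce_etcd` executes them against the current context.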

Benefits

Better resiliency.

Possible drawbacks

Can't think of any.

Applicable issues

N/A

Additional information

Tested scaling down and up, forcing loss of quorum:

etcd 18:49:49.25 Subscribe to project updates by watching https://github.com/bitnami/containers
etcd 18:49:49.26 Submit issues and feature requests at https://github.com/bitnami/containers/issues
etcd 18:49:49.26
etcd 18:49:49.27 INFO  ==> ** Starting etcd setup **
etcd 18:49:49.31 INFO  ==> Validating settings in ETCD_* env vars..
etcd 18:49:49.32 WARN  ==> You set the environment variable ALLOW_NONE_AUTHENTICATION=yes. For safety reasons, do not use this flag in a production environment.
etcd 18:49:49.34 INFO  ==> Initializing etcd
etcd 18:49:49.34 INFO  ==> Generating etcd config file using env variables
etcd 18:49:49.48 INFO  ==> Detected data from previous deployments
etcd 18:49:49.50 INFO  ==> The member will try to join the cluster by it's own
etcd 18:49:49.56 INFO  ==> ** etcd setup finished! **

etcd 18:49:49.61 INFO  ==> ** Starting etcd **
{"level":"warn","ts":"2023-06-22T18:49:49.653919Z","caller":"embed/config.go:673","msg":"Running http and grpc server on single port. This is not recommended for production."}
{"level":"info","ts":"2023-06-22T18:49:49.654191Z","caller":"etcdmain/config.go:350","msg":"loaded server configuration, other configuration command line flags and environment variables will be ignored if provided","path":"/opt/bitnami/etcd/conf/etcd.yaml"}
{"level":"info","ts":"2023-06-22T18:49:49.654213Z","caller":"etcdmain/etcd.go:73","msg":"Running: ","args":["etcd","--config-file","/opt/bitnami/etcd/conf/etcd.yaml"]}
{"level":"info","ts":"2023-06-22T18:49:49.654298Z","caller":"etcdmain/etcd.go:116","msg":"server has been already initialized","data-dir":"/bitnami/etcd/data","dir-type":"member"}
{"level":"warn","ts":"2023-06-22T18:49:49.65435Z","caller":"embed/config.go:673","msg":"Running http and grpc server on single port. This is not recommended for production."}
{"level":"info","ts":"2023-06-22T18:49:49.654364Z","caller":"embed/etcd.go:127","msg":"configuring peer listeners","listen-peer-urls":["http://0.0.0.0:2380"]}
{"level":"info","ts":"2023-06-22T18:49:49.654576Z","caller":"embed/etcd.go:135","msg":"configuring client listeners","listen-client-urls":["http://0.0.0.0:2379"]}
{"level":"info","ts":"2023-06-22T18:49:49.654686Z","caller":"embed/etcd.go:309","msg":"starting an etcd server","etcd-version":"3.5.9","git-sha":"bdbbde998","go-version":"go1.19.9","go-os":"linux","go-arch":"amd64","max-cpu-set":24,"max-cpu-available":24,"member-initialized":true,"name":"cray-bos-bitnami-etcd-1","data-dir":"/bitnami/etcd/data","wal-dir":"","wal-dir-dedicated":"","member-dir":"/bitnami/etcd/data/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":100000,"max-wals":5,"max-snapshots":5,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://cray-bos-bitnami-etcd-1.cray-bos-bitnami-etcd-headless.services.svc.cluster.local:2380"],"listen-peer-urls":["http://0.0.0.0:2380"],"advertise-client-urls":["http://cray-bos-bitnami-etcd-1.cray-bos-bitnami-etcd-headless.services.svc.cluster.local:2379"],"listen-client-urls":["http://0.0.0.0:2379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-backend-bytes":2147483648,"max-request-bytes":1572864,"max-concurrent-streams":4294967295,"pre-vote":true,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","compact-check-time-enabled":false,"compact-check-time-interval":"1m0s","auto-compaction-mode":"","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
ncn-m002:~ # /opt/cray/platform-utils/etcd/etcd-util.sh endpoint_status cray-bos
### cray-bos-bitnami-etcd-0 Endpoint Status: ###
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | 75c23b83965f55de |   3.5.9 |  1.2 MB |      true |      false |        14 |      83862 |              83862 |        |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
### cray-bos-bitnami-etcd-1 Endpoint Status: ###
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | 6a95fcc74f1f8616 |   3.5.9 |  1.3 MB |     false |      false |        14 |      83862 |              83862 |        |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
### cray-bos-bitnami-etcd-2 Endpoint Status: ###
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | 2b4984dc5c79bd55 |   3.5.9 |  1.2 MB |     false |      false |        14 |      83862 |              83862 |        |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
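
A quick way to check tables like the ones above is to count the leaders and confirm that every member reports the same raft index. The sketch below assumes the column layout shown in this output (IS LEADER in the fifth column, RAFT INDEX in the eighth); `count_leaders` and `distinct_raft_indexes` are hypothetical helper names, not part of `etcd-util.sh`.

```shell
#!/bin/bash
# Count rows whose IS LEADER column is "true" in endpoint-status tables read from stdin.
# With -F'|' the first field is empty, so IS LEADER is field 6.
count_leaders() {
    awk -F'|' '$6 ~ /true/ { n++ } END { print n + 0 }'
}

# Count distinct RAFT INDEX values (field 9) across all data rows read from stdin.
# Border lines and "###" headings have no "|" and are skipped by the NF check.
distinct_raft_indexes() {
    awk -F'|' '
        NF > 10 && $2 !~ /ENDPOINT/ { gsub(/ /, "", $9); seen[$9] = 1 }
        END { n = 0; for (k in seen) n++; print n }
    '
}
```

For a healthy cluster that has caught up, one would expect `count_leaders` to print 1 and `distinct_raft_indexes` to print 1 when fed the combined output for all members.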

@github-actions github-actions bot added the etcd label Jun 22, 2023
@github-actions github-actions bot added the triage Triage is needed label Jun 22, 2023
@github-actions github-actions bot added in-progress and removed triage Triage is needed labels Jun 22, 2023
@bitnami-bot bitnami-bot removed the request for review from carrodher June 22, 2023 22:23
@bitnami-bot bitnami-bot requested a review from mdhont June 22, 2023 22:23
@mdhont
Contributor

mdhont commented Jun 23, 2023

We are going to review this logic internally as we want to further investigate the behaviour of the field that you propose to modify with a variable. We will notify you in this PR when there is any news.
Thank you very much for the contribution!

@bitnami-bot bitnami-bot added the verify Execute verification workflow for these changes label Jun 23, 2023
@github-actions

github-actions bot commented Jul 9, 2023

This Pull Request has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thank you for your contribution.

@github-actions github-actions bot added the stale 15 days without activity label Jul 9, 2023
@bklei
Contributor Author

bklei commented Jul 10, 2023

> We are going to review this logic internally as we want to further investigate the behaviour of the field that you propose to modify with a variable. We will notify you in this PR when there is any news. Thank you very much for the contribution!

@mdhont any updates here?

@github-actions github-actions bot added the triage Triage is needed label Jul 10, 2023
@carrodher carrodher removed stale 15 days without activity triage Triage is needed bitnami labels Jul 10, 2023
@carrodher carrodher requested a review from mdhont July 10, 2023 13:53
@carrodher carrodher assigned mdhont and unassigned Mauraza Jul 10, 2023
@@ -664,6 +664,13 @@ etcd_initialize() {
if is_boolean_yes "$ETCD_DISABLE_PRESTOP"; then
info "The member will try to join the cluster by it's own"
export ETCD_INITIAL_CLUSTER_STATE=existing
#
Contributor

Could you change the comment:

# If ETCD_DISABLE_PRESTOP is set, we won't dynamically adjust membership. In this case
# we'll return and allow etcd to start and join the statically configured cluster.

Contributor Author

you bet -- changing..

Contributor Author

changed!

Contributor Author

@mdhont any additional changes we need here?

@github-actions

This Pull Request has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thank you for your contribution.

@github-actions github-actions bot added stale 15 days without activity and removed stale 15 days without activity labels Jul 27, 2023
Signed-off-by: Brad Klein <bklein@cray.com>
@@ -664,6 +664,8 @@ etcd_initialize() {
if is_boolean_yes "$ETCD_DISABLE_PRESTOP"; then
Contributor

I think there is a misunderstanding on where the code needs to be placed. I meant this:

Suggested change
if is_boolean_yes "$ETCD_DISABLE_PRESTOP"; then
if is_boolean_yes "$ETCD_DISABLE_PRESTOP"; then
info "The member will try to join the cluster by it's own"
export ETCD_INITIAL_CLUSTER_STATE=existing
elif ! is_healthy_etcd_cluster; then
warn "Cluster not responding!"
if is_boolean_yes "$ETCD_DISASTER_RECOVERY"; then
latest_snapshot_file="$(find /snapshots/ -maxdepth 1 -type f -name 'db-*' | sort | tail -n 1)"

So removing this duplicate code:

            elif ! is_healthy_etcd_cluster; then
                warn "Cluster not responding!"
            fi
            member_id="$(get_member_id)"
            if ! is_healthy_etcd_cluster; then
                warn "Cluster not responding!"
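
The single `if`/`elif` chain suggested above can be sketched in isolation as follows. This is a stubbed illustration, not the Bitnami script: `is_boolean_yes`, `info`, `warn`, and `is_healthy_etcd_cluster` are simplified stand-ins for the real helpers, and `rejoin_or_recover` and `HEALTH_CHECKS` are hypothetical names added to show that the health probe is skipped when `ETCD_DISABLE_PRESTOP` is set and runs only once otherwise.

```shell
#!/bin/bash
HEALTH_CHECKS=0

# Simplified stand-ins for the helpers used inside etcd_initialize().
is_boolean_yes() {
    case "$1" in
        1|[Yy][Ee][Ss]|[Tt][Rr][Uu][Ee]) return 0 ;;
        *) return 1 ;;
    esac
}
info() { echo "INFO  ==> $*"; }
warn() { echo "WARN  ==> $*"; }

# Stub: pretend the cluster is down and count how often it is probed.
is_healthy_etcd_cluster() {
    HEALTH_CHECKS=$((HEALTH_CHECKS + 1))
    return 1
}

# Hypothetical reduction of the branch under review.
rejoin_or_recover() {
    if is_boolean_yes "${ETCD_DISABLE_PRESTOP:-no}"; then
        info "The member will try to join the cluster by it's own"
        export ETCD_INITIAL_CLUSTER_STATE=existing
    elif ! is_healthy_etcd_cluster; then
        warn "Cluster not responding!"
        # Disaster recovery (snapshot restore) would follow here in the real script.
    fi
}
```

With `ETCD_DISABLE_PRESTOP=yes` the function never calls the health check, so a member restarted before its peers is not treated as a failed cluster.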

Contributor Author

sincere apologies for the churn :(. I've pushed another commit after testing with your suggested change, worked great! let me know if this looks ok now?

@@ -666,10 +666,6 @@ etcd_initialize() {
export ETCD_INITIAL_CLUSTER_STATE=existing
elif ! is_healthy_etcd_cluster; then
warn "Cluster not responding!"
fi
member_id="$(get_member_id)"
Contributor

Don't remove this. Instead, you can add it below line 663:

    if [[ ${#initial_members[@]} -gt 1 ]]; then
        member_id="$(get_member_id)" 
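
The placement requested above keeps the member ID lookup inside the multi-member branch. A minimal bash sketch of that guard follows; `get_member_id` is stubbed with a fixed value, `resolve_member_id` is a hypothetical wrapper, and splitting a comma-separated member list into `initial_members` is an assumption about how the array is populated.

```shell
#!/bin/bash
# Stub standing in for the real get_member_id helper.
get_member_id() { echo "75c23b83965f55de"; }

# Hypothetical wrapper: look up the member ID only when more than one
# initial member is configured. $1 is a comma-separated member list.
resolve_member_id() {
    local -a initial_members=()
    local member_id=""
    IFS=',' read -r -a initial_members <<< "${1:-}"
    if [[ ${#initial_members[@]} -gt 1 ]]; then
        member_id="$(get_member_id)"
    fi
    echo "${member_id:-none}"
}
```

A single-member list prints `none`, while a list of two or more members prints the stubbed ID.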

Contributor Author

got it, added back

Contributor

@mdhont mdhont left a comment

Lgtm! Thanks for your contribution.

Brad Klein added 2 commits August 17, 2023 15:47
Signed-off-by: Brad Klein <bklein@cray.com>
Signed-off-by: Brad Klein <bklein@cray.com>
@mdhont mdhont merged commit a276caa into bitnami:main Aug 28, 2023
7 checks passed
Labels
etcd solved verify Execute verification workflow for these changes
6 participants