Add disaster recovery documentation. #584

diegs · 2017-06-15T01:07:17Z

Fixes #432, also one way to address issues raised in #112.

diegs · 2017-06-15T17:30:59Z

xiang90 · 2017-06-16T05:25:02Z

Documentation/disaster-recovery.md

+```
+bootkube recover --asset-dir=recovered --etcd-backup-file=backup --kubeconfig=/etc/kubernetes/kubeconfig
+```
+


We tried this with our field engineers yesterday.

There are a few things we need to ensure before running this command:

- kubelet is running on the machine
- no related containers are running (old etcd, old API server, etc.; I believe this also applies to other recovery cases)
- docker state is clean (`docker ps -a` does not contain old state for relevant containers); kubelet has a bug where it may incorrectly believe a static pod is dead when old state exists
- the /var/etcd dir is clean on ALL master nodes

Cool, do you want me to add this directly to the documentation?

Also this is not really true of the other recovery situations. This makes it sound like you should basically destroy and recreate all your master nodes before using this recovery approach.

@diegs just fyi. we can address them later.
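
The preconditions listed in this thread could be checked before running `bootkube recover`. The following is a minimal sketch, not part of bootkube: the function names, the keyword list used to spot control-plane containers, and the default `/var/etcd` path argument are assumptions for illustration, and it does not cover the "kubelet is running" check.

```python
import os

# Assumption: substrings that identify control-plane containers in
# `docker ps -a` output. Adjust to match the images actually in use.
CONTROL_PLANE_KEYWORDS = ("etcd", "apiserver")


def stale_containers(docker_ps_output, keywords=CONTROL_PLANE_KEYWORDS):
    """Return lines of `docker ps -a` output that mention a control-plane container."""
    return [
        line
        for line in docker_ps_output.splitlines()
        if any(keyword in line for keyword in keywords)
    ]


def dir_is_clean(path):
    """Return True if path does not exist or is an empty directory."""
    return not os.path.exists(path) or (os.path.isdir(path) and not os.listdir(path))


def preflight_problems(docker_ps_output, etcd_dir="/var/etcd"):
    """Collect human-readable reasons not to run the recovery yet."""
    problems = [
        f"stale container state: {line}"
        for line in stale_containers(docker_ps_output)
    ]
    if not dir_is_clean(etcd_dir):
        problems.append(f"{etcd_dir} is not clean")
    return problems
```

On each master node, the text passed as `docker_ps_output` would come from running `docker ps -a`; an empty list from `preflight_problems` means neither of the two checked conditions was violated on that node.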

xiang90 · 2017-06-16T05:26:44Z

Documentation/disaster-recovery.md

+control plane can be extracted directly from the api-server:
+
+```
+bootkube recover --asset-dir=recovered --kubeconfig=/etc/kubernetes/kubeconfig


One relevant issue: the field engineers suggest renaming --asset-dir to output-asset-dir. When they first tried this without our help, they passed the old asset dir here.

Thanks for the feedback! Created #589

xiang90 · 2017-06-16T05:27:11Z

Awesome start!

xiang90 · 2017-06-16T22:04:50Z

LGTM

aaronlevy

Two minor notes that I wouldn't block on - more open-ended. LGTM

aaronlevy · 2017-06-19T18:42:05Z

Documentation/disaster-recovery.md

+
+To minimize the likelihood of any of these scenarios, production
+self-hosted clusters should always run in a [high-availability
+configuration](https://kubernetes.io/docs/admin/high-availability/).


I'm on the fence about linking to those docs -- as they're pretty different to how self-hosted HA works (which we need docs for: #311). It does touch on some important topics like leader-election, but even then we already have that and all we care about is scaling replica counts (for example).

Ok, added a TODO linking to the issue instead for now.

aaronlevy · 2017-06-19T18:46:34Z

Documentation/disaster-recovery.md

+For more information, see the [Pod Checkpointer
+README](https://github.com/kubernetes-incubator/bootkube/blob/master/cmd/checkpoint/README.md).
+
+## Bootkube Recover


We may want to have some kind of versioning convention. I'm assuming right now it's: you should always use the latest bootkube release when running recover. This may not be a confusion point, but I wonder if users will try and use the same bootkube release that they installed with (which is probably fine in most cases, unless there are new bug fixes they should have).

Added note recommending to always use the latest version.

coresolve · 2017-06-19T19:57:47Z

We need to test this against a Tectonic cluster, not just the bootkube-rendered cluster. Looks really good though.

diegs · 2017-06-19T20:22:25Z

@coresolve sgtm, we should go through that (and especially with self-hosted etcd) next.

Add disaster recovery documentation.

23c8b40

diegs requested review from aaronlevy and xiang90 June 15, 2017 01:07

k8s-ci-robot added the cncf-cla: yes label Jun 15, 2017

add asciicast.

1deaa2f

diegs self-assigned this Jun 15, 2017

xiang90 reviewed Jun 16, 2017

diegs added 2 commits June 16, 2017 15:03

Merge branch 'master' into recovery-docs

873c4b4

Reference new --recovery-dir flag.

e8a8520

aaronlevy approved these changes Jun 19, 2017

Address aaronlevy feedback.

1090923

diegs merged commit 8b303d7 into kubernetes-retired:master Jun 19, 2017

diegs deleted the recovery-docs branch June 19, 2017 20:22
