-
Notifications
You must be signed in to change notification settings - Fork 219
*: add backup file option to recover subcommand #528
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sweet!
cmd/bootkube/recover.go
Outdated
@@ -46,8 +48,10 @@ func init() { | |||
cmdRecover.Flags().StringVar(&recoverOpts.etcdCertificatePath, "etcd-certificate-path", "", "Path to an existing certificate that will be used for TLS-enabled communication between the apiserver and etcd. Must be used in conjunction with --etcd-ca-path and --etcd-private-key-path, and must have etcd configured to use TLS with matching secrets.") | |||
cmdRecover.Flags().StringVar(&recoverOpts.etcdPrivateKeyPath, "etcd-private-key-path", "", "Path to an existing private key that will be used for TLS-enabled communication between the apiserver and etcd. Must be used in conjunction with --etcd-ca-path and --etcd-certificate-path, and must have etcd configured to use TLS with matching secrets.") | |||
cmdRecover.Flags().StringVar(&recoverOpts.etcdServers, "etcd-servers", "", "List of etcd server URLs including host:port, comma separated.") | |||
cmdRecover.Flags().StringVar(&recoverOpts.etcdServers, "etcd-backup-file", "", "Path to the etcd backup file.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
&recoverOpts.etcdBackupFile
?
@@ -66,6 +70,22 @@ func runCmdRecover(cmd *cobra.Command, args []string) error { | |||
return err | |||
} | |||
backend = recovery.NewEtcdBackend(etcdClient, recoverOpts.etcdPrefix) | |||
case recoverOpts.etcdBackupFile != "": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just move this above the switch, set the etcdServer, and then let it fall througH?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thought about it. but the logging will be duplicated. it will look like
Attempting recovery using etcd backup file at ...
Attempting recovery using etcd cluster at ...
but, yea, i will try to clean up code duplication here in another way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not a big deal.
pkg/recovery/etcd.go
Outdated
}{ | ||
// TODO: this already exists in bootkube/cmd. | ||
// do not duplicate this! | ||
Image: "quay.io/coreos/etcd:v3.1.6", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be ok with factoring out an images
package.
cmd/bootkube/recover.go
Outdated
if err != nil { | ||
return err | ||
} | ||
defer recovery.CleanRecoveryEtcd(recoverOpts.podManifestPath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: ignoring error returned by function.
pkg/recovery/etcd.go
Outdated
|
||
const assetPathRecoveryEtcd = "recovery-etcd.yaml" | ||
|
||
func StartRecoveryEtcd(p, backupPath string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: would put "Backup" somewhere in the name to be clearer.
pkg/recovery/etcd.go
Outdated
@@ -31,6 +37,16 @@ func NewEtcdBackend(client *clientv3.Client, pathPrefix string) Backend { | |||
} | |||
} | |||
|
|||
// NewSelfHostedEtcdBackend constructs a new etcdBackend for the given client and pathPrefix, and backup file. | |||
func NewSelfHostedEtcdBackend(client *clientv3.Client, pathPrefix, backupPath string) Backend { | |||
return &etcdBackend{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this violates the SRP.
Why not create a new type, etcdBackupBackend, which inherits etcdBackend, calls etcdBackend.Read(), and then adds the bootEtcd asset to the results?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do.
@@ -66,6 +70,22 @@ func runCmdRecover(cmd *cobra.Command, args []string) error { | |||
return err | |||
} | |||
backend = recovery.NewEtcdBackend(etcdClient, recoverOpts.etcdPrefix) | |||
case recoverOpts.etcdBackupFile != "": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not a big deal.
cmd/bootkube/recover.go
Outdated
}() | ||
|
||
bootkube.UserOutput("Waiting for etcd server to start...\n") | ||
// TODO: better way to detect the startup... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a poll loop in pkg/util/etcdutil/migrate.go#createBootstrapEtcdService that we can factor out
pkg/recovery/etcd_template.go
Outdated
--data-dir=/var/etcd/data && \ | ||
etcdctl \ | ||
--endpoints=http://localhost:32379 \ | ||
del /registry/ThirdPartyResourceData/etcd.coreos.com/clusters/kube-system/kube-etcd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is slick, but we won't know if it fails. Might have to use the client instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is no where in the code we can do this correctly from my understanding.
we have to delete the TPR after we recovered the boot etcd, but before API server starts to use it.
Maybe you can try to find a better way to do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My original understanding was that you would:
(1) Start temporary etcd from backup, creating a datadir and extracting the manifests
(2) The bootstrap etcd would talk to the datadir you created in (1)
Thus, I thought you'd be able to delete the TPR from the running etcd in (1). However, it looks like (2) also starts up directly from the backup? In that case then I agree that there's no place in code to remove the TPR, so I suppose this should work. I'd say let's go ahead and try it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, it looks like (2) also starts up directly from the backup?
Yes. It was not possible to reuse the same datadir since (2) needs Pod.IP, which is hard to know when we create datadir for (1). But as we start to use a well-known service IP for boot-etcd. It is possible now.
I still choose to use the backup file approach since we can save one flag for bootkube recovery. Or user needs to specify another etcd data dir which will persist after bootkube recover for the bootkube start command.
But it is up to you to decide which we should go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are recovering to a known asset-dir, it would be possible to put the etcd data dir in there (and use the service IP).
If the approach you are using appears to work right now, then I'd say let's stick with it to get something in and add a TODO/open an issue to explore the slightly cleaner approach.
@diegs This works now. Can you take a look? I am trying to extend if you want to give this a try, you can do:
|
@diegs I added a test script. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Just some minor nits.
cmdRecover.Flags().StringVar(&recoverOpts.etcdPrefix, "etcd-prefix", "/registry", "Path prefix to Kubernetes cluster data in etcd.") | ||
cmdRecover.Flags().StringVar(&recoverOpts.kubeConfigPath, "kubeconfig", "", "Path to kubeconfig for communicating with the cluster.") | ||
cmdRecover.Flags().StringVar(&recoverOpts.podManifestPath, "pod-manifest-path", "/etc/kubernetes/manifests", "The location where the kubelet is configured to look for static pod manifests. (Only need to be set when recovering from a etcd backup file)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add validation that this flag is non-empty iff etcd-backup-file is non-empty (yes, even though it has a default value).
cmd/bootkube/recover.go
Outdated
case recoverOpts.etcdBackupFile != "": | ||
bootkube.UserOutput("Attempting recovery using etcd backup file at %q...\n", recoverOpts.etcdBackupFile) | ||
|
||
err := recovery.StartRecoveryEtcdForBackup(recoverOpts.podManifestPath, recoverOpts.etcdBackupFile) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: scope errs inside if statements (here and 2 places below)
@diegs Thanks for the review. All fixed. |
@xiang90 awesome! One last thing: can you please update the README? Then I think we're good to merge. |
Sure. I will do it late today or tomorrow (need to attend coreos fest). |
@diegs readme updated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM but please address my README comment before merging.
@@ -78,7 +78,13 @@ Recover from a running apiserver (i.e. if the scheduler pods are all down): | |||
bootkube recover --asset-dir=recovered --kubeconfig=/etc/kubernetes/kubeconfig | |||
``` | |||
|
|||
For a complete recovery example please see the [hack/multi-node/bootkube-test-recovery](hack/multi-node/bootkube-test-recovery) script. | |||
Recover from an etcd backup when self hosted etcd is enabled: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update line 61 to add the new supported case, and add a section just like lines 72/78 with a command line example. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think the command line example is already added two line below, right? or you want something different? i will update line 61.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, sorry totally misread the diff. That SG, thanks!
CI failed with:
I do not think this is relevant with my PR. So I am going to merge this PR. |
rktbot run tests |
/cc @diegs
This is still WIP. But if you want to take a early view. Do it :P. I will keep this updated with my progress, and ping you again when it is finished and tested.