
Finding a solution for etcd #277

Closed · jamiehannaford opened this issue May 24, 2017 · 12 comments
Labels: area/HA, priority/important-soon

@jamiehannaford
Contributor

jamiehannaford commented May 24, 2017

A few weeks ago some folks on Slack brought up the idea of defining requirements for highly available etcd on kubeadm-provisioned clusters. The idea was to define requirements before the implementation of any solution. This discussion was continued in the sig-cluster-lifecycle meeting on May 16th 2017, where we came up with some initial criteria:

  1. High availability
    a. Recovers from member failure
    b. Recovers from quorum loss
    c. Recovers from full cluster failure (i.e. power-off)
    d. Recovers from partial / failed / interrupted upgrades
  2. Handles discovery of etcd peers
  3. Secure by default
    a. TLS encryption
    b. Certificate rotation
  4. Support multiple form factors
    a. Non-self hosted
    b. Self-hosted (optional)
  5. Ability to restore from a backup (possibly not backup itself)
  6. Upgrades
    a. Rolling upgrades
    b. Downgrades (but tricky because of etcd)
  7. Resize/scale cluster from 1 -> 3 -> 5 members
  8. Ease of installation/teardown

Are there any I've missed?

The next stage is proposing solutions that meet the above criteria and can be verified in a fork.

cc/ @timothysc @justinsb @philips @xiang90 @aaronlevy

@jamiehannaford
Contributor Author

My proposed solution is the etcd-operator, since it meets nearly all of the criteria out of the box:

  • High availability
    • Recovers from member failure (see here)
    • Recovers from quorum loss (see here)
    • Recovers from full cluster failure, i.e. power-off (I think this can be accomplished using snapshotting and the operator's built-in restart tolerance)
    • Recovers from partial / failed / interrupted upgrades
  • Handles discovery of etcd peers
  • Secure by default
    • TLS encryption (see here)
    • Certificate rotation (no, but my personal preference is to use an external tool, like a cert-rotator operator to handle TLS cert rotation across the cluster)
  • Support multiple form factors
    • Non-self hosted
    • Self-hosted
  • Ability to restore from a backup (see here)
  • Upgrades
    • Rolling upgrades (see here)
    • Downgrades (but tricky because of etcd)
  • Resize/scale cluster from 1 -> 3 -> 5 members (see here)
  • Ease of installation/teardown

I've already integrated kubeadm and etcd-operator successfully in this PR, and here is the fork.
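
To make the shape of that integration concrete, here is a minimal sketch of the kind of object kubeadm could construct for the operator. The type and field names loosely mirror the etcd-operator's EtcdCluster custom resource but are redefined and simplified here for illustration, since kubeadm can't vendor the operator directly; treat names like `NewSelfHostedEtcdCluster` as hypothetical.

```go
// Hypothetical sketch only: a trimmed-down stand-in for the etcd-operator's
// EtcdCluster custom resource, redefined locally because kubeadm cannot
// vendor github.com/coreos/etcd-operator directly.
package spec

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// EtcdCluster mirrors the overall shape of the operator's custom resource.
type EtcdCluster struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              ClusterSpec `json:"spec"`
}

// ClusterSpec carries only the fields relevant to the criteria above.
type ClusterSpec struct {
	Size    int    `json:"size"`    // resizing 1 -> 3 -> 5 is just a spec update (criterion 7)
	Version string `json:"version"` // bumping this drives a rolling upgrade (criterion 6a)
}

// NewSelfHostedEtcdCluster returns a three-member cluster object for the
// operator to reconcile in the kube-system namespace.
func NewSelfHostedEtcdCluster() *EtcdCluster {
	return &EtcdCluster{
		TypeMeta: metav1.TypeMeta{
			APIVersion: "etcd.database.coreos.com/v1beta2",
			Kind:       "EtcdCluster",
		},
		ObjectMeta: metav1.ObjectMeta{Name: "kubeadm-etcd", Namespace: "kube-system"},
		Spec:       ClusterSpec{Size: 3, Version: "3.1.10"},
	}
}
```

Resizing from 1 to 3 to 5 members or bumping the etcd version would then just be an update to this spec, which the operator reconciles.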

I think it's probably worthwhile to come up with a more granular disaster recovery requirement list, and also think about to what degree an etcd solution should cover all the bases. We already have several issues where this is tracked, so we should take into account all of the solutions suggested there too.

@xiang90

xiang90 commented May 24, 2017

Recovers from partial / failed / interrupted upgrades

see kubernetes-retired/bootkube#528

@timothysc
Member

@jamiehannaford if there is a branch you would like reviewed I'd be happy to go through it now.

@jamiehannaford
Contributor Author

@xiang90 Awesome. So if a failed upgrade occurs, the user can manually restore from a backup file. Is there a way that etcd can automatically check specific locations (like the local FS or S3) for backups without the user needing to specify one manually?

For example, assume that the etcd-operator has been backing things up to an S3 bucket. When it initialises, it checks the same bucket and boots from there (this assumes the user hasn't changed any backup options).

@jamiehannaford
Contributor Author

jamiehannaford commented May 25, 2017

@timothysc Thanks! The only branch I have is the one I submitted in my PR. I think you've already gone through this though. Unless you meant something else?

There are a bunch of comments on that PR which I can start to address as a next step forward. I think I'll also add TLS secrets to the PR too. Should I go ahead and do that?

@timothysc timothysc added this to the v1.8 milestone May 25, 2017
@timothysc timothysc added the priority/important-longterm label May 25, 2017
@luxas
Member

luxas commented May 29, 2017

There are a bunch of comments on that PR which I can start to address as a next step forward. I think I'll also add TLS secrets to the PR too. Should I go ahead and do that?

@jamiehannaford Feel free to. I'm gonna try to look at the TLS Secrets PR this week, so it might still change (@andrewrynhard), but I don't expect it to be part of v1.7; that gives us a little more time to think about it before v1.8.

@timothysc timothysc added priority/important-soon and removed priority/important-longterm labels Jun 6, 2017
@anguslees
Member

For example, assume that the etcd-operator has been backing things up to an S3 bucket. When it initialises, it checks the same bucket and boots from there

I'd just like to highlight that doing something like this automatically is a terrible idea and will give you multiple sources of truth if etcd is internally partitioned.

I suspect recovery will need to be manually triggered, because by definition it is required when the etcd cluster is incapable of making robust automatic decisions.
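
As a rough sketch of that point (hypothetical types and names, not the etcd-operator's actual API): restore logic only runs when an administrator has explicitly created a restore request, so a partitioned or freshly restarted cluster never adopts an old backup on its own.

```go
// Hypothetical sketch: only act on a backup when a human has explicitly
// requested a restore, rather than booting from S3 automatically.
package restore

import (
	"errors"
	"fmt"
)

// RestoreSpec is an illustrative request that an administrator creates
// deliberately; its mere presence is the manual trigger.
type RestoreSpec struct {
	ClusterName string
	BackupPath  string // e.g. "s3://my-bucket/kubeadm-etcd/latest"
}

// MaybeRestore does nothing unless a RestoreSpec exists, so a partitioned or
// freshly restarted cluster never silently adopts an old source of truth.
func MaybeRestore(spec *RestoreSpec, restore func(path string) error) error {
	if spec == nil {
		// No explicit request: leave the existing members alone.
		return nil
	}
	if spec.BackupPath == "" {
		return errors.New("restore requested without a backup path")
	}
	fmt.Printf("restoring cluster %q from %s (manually triggered)\n", spec.ClusterName, spec.BackupPath)
	return restore(spec.BackupPath)
}
```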

@xiang90

xiang90 commented Aug 15, 2017

I'd just like to highlight that doing something like this automatically is a terrible idea and will give you multiple sources of truth if etcd is internally partitioned.

Totally agree. We designed this to be a manual operation, at least on the etcd-operator side.

@luxas
Member

luxas commented Aug 19, 2017

Moving milestone to v1.9. In v1.8, we're gonna stick with a local etcd instance listening on localhost.

@luxas luxas modified the milestones: v1.9, v1.8 Aug 19, 2017
k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this issue Oct 26, 2017
Automatic merge from submit-queue (batch tested with PRs 54593, 54607, 54539, 54105). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Add HA feature gate and minVersion validation

**What this PR does / why we need it**:

As we add more feature gates, there might be occasions where a feature is only available on newer releases of K8s. If a user makes a mistake, we should notify them as soon as possible in the init procedure and not let them go down the path of hard-to-debug component issues.

Specifically with HA, we ideally need the new `TaintNodesByCondition` (added in v1.8.0 but working in v1.9.0).

**Which issue this PR fixes:**

kubernetes/kubeadm#261
kubernetes/kubeadm#277

**Release note**:
```release-note
Feature gates now check minimum versions
```

/cc @kubernetes/sig-cluster-lifecycle-pr-reviews @luxas @timothysc
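
For reference, a minimal sketch of the kind of minimum-version check this PR describes; the gate names, map contents and function signature below are illustrative assumptions rather than the exact kubeadm implementation.

```go
// Hypothetical sketch: fail fast at init time if a requested feature gate
// needs a newer Kubernetes release than the one being deployed.
package features

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/version"
)

// minVersions maps an illustrative gate name to the first release it works in.
var minVersions = map[string]*version.Version{
	"SelfHosting":      version.MustParseSemantic("v1.8.0"),
	"HighAvailability": version.MustParseSemantic("v1.9.0"), // needs TaintNodesByCondition
}

// ValidateVersion returns an error if any enabled gate is not supported by
// the target control-plane version, so users are told before components break.
func ValidateVersion(requested map[string]bool, k8sVersion string) error {
	v, err := version.ParseSemantic(k8sVersion)
	if err != nil {
		return err
	}
	for name, enabled := range requested {
		minVer, known := minVersions[name]
		if !enabled || !known {
			continue
		}
		if v.LessThan(minVer) {
			return fmt.Errorf("feature gate %s requires Kubernetes %s or newer", name, minVer)
		}
	}
	return nil
}
```
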
k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this issue Nov 1, 2017
Automatic merge from submit-queue (batch tested with PRs 49840, 54937, 54543). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Add self-hosted etcd API to kubeadm

**What this PR does / why we need it**:

This PR is part of a larger set that implements self-hosted etcd. This PR takes a first step by adding:

1. new API types in `cmd/kubeadm/app/apis` for configuring self-hosted etcd 
2. new Go types in `cmd/kubeadm/app/phases/etcd/spec` used for constructing EtcdCluster CRDs for the etcd-operator. The reason we define these in trunk is that kubeadm cannot import `github.com/coreos/etcd-operator` as a dependency until it's in its own repo. Until then, we need to redefine the structs in our codebase.

**Which issue this PR fixes**:

kubernetes/kubeadm#261
kubernetes/kubeadm#277

**Special notes for your reviewer**:

This is the first PR in the set, in order to save reviewers from a goliath PR.

**Release note**:
```release-note
NONE
```
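
As a rough illustration of point 1 above, the configuration side could look something like the sketch below. The field and type names are assumptions for illustration only, not the exact API types this PR adds.

```go
// Hypothetical sketch of a kubeadm configuration surface for self-hosted
// etcd; field names are illustrative, not the exact types added by this PR.
package kubeadmapi

// Etcd configures either an external etcd cluster or a self-hosted one.
type Etcd struct {
	// Endpoints, CAFile, CertFile and KeyFile describe an external cluster.
	Endpoints []string
	CAFile    string
	CertFile  string
	KeyFile   string

	// SelfHosted, when set, asks kubeadm to run etcd on top of the cluster
	// via the etcd-operator instead of as a static pod on the master.
	SelfHosted *SelfHostedEtcd
}

// SelfHostedEtcd carries the knobs the operator-based deployment needs.
type SelfHostedEtcd struct {
	// ClusterName names the EtcdCluster resource kubeadm creates.
	ClusterName string
	// EtcdVersion selects the etcd image version to run.
	EtcdVersion string
	// CertificatesDir is where peer/server/client certificates are written.
	CertificatesDir string
}
```
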
@luxas
Member

luxas commented Nov 20, 2017

Moving the milestone for this to v1.10, as we depend on changes being made to the operator before we can use it, and code freeze is coming up.

@timothysc timothysc modified the milestones: v1.10, v1.11 Jan 24, 2018
@timothysc
Member

Given all the history here and recent feedback, we need to go with the non-operator option.

@timothysc
Member

So I'm going to close this issue and open a new one to outline the doc on using the existing commands to lay down etcd. We will likely have to wait until some of the other phases work is done as well.

/cc @chuckha @fabriziopandini @stealthybox
