Control plane failure modes for high-availability documentation #43849

royalsflush · 2023-11-07T17:08:25Z

We likely need some brief documentation on what customers can expect in terms of control plane reliability. We discussed the "majority" vs. "less than majority" buckets of problems; it would be great to have documentation we can point to in order to justify our reliability stance.

k8s-ci-robot · 2023-11-07T17:08:34Z

There are no sig labels on this issue. Please add an appropriate label by using one of the following commands:

/sig <group-name>
/wg <group-name>
/committee <group-name>

Please see the group list for a listing of the SIGs, working groups, and committees available.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2023-11-07T17:08:35Z

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

neolit123 · 2023-11-07T17:24:59Z

/transfer website

where the k8s documentation is located.

neolit123 · 2023-11-07T17:29:18Z

> We likely need some brief documentation on what customers can expect in terms of control plane reliability. We discussed the "majority" vs. "less than majority" buckets of problems; it would be great to have documentation we can point to in order to justify our reliability stance.

when speaking about "majority", is this about etcd's Raft algorithm? k8s core doesn't have this requirement directly.
also, when / where was this discussed?

neolit123 · 2023-11-07T17:30:35Z

/kind feature
/triage needs-information
/sig docs
(tagging with docs until owner is established, if ever)

sftim · 2023-11-07T18:26:55Z

It'd be good to understand the gaps: what should https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/ cover that it doesn't?

neolit123 · 2023-11-14T10:45:01Z

/close

the ticket has missing information; questions were not answered.
please update and re-open.

k8s-ci-robot · 2023-11-14T10:45:07Z

@neolit123: Closing this issue.

royalsflush · 2023-11-15T09:35:45Z

Hi all, really sorry for the delay on my elaboration of this issue!

The context is that my team is working on Kubernetes reliability (as part of a product) and we want to understand the failure modes of the control plane. I had a chat with Han Kang about this offline, and I wanted to amend the details of this issue with our conversation of what I think is missing, but I wanted to review the links you all sent first to see if I was missing something. @sftim thank you very much for sending it over!

What I wanted most is a description of what to expect, and what restrictions apply, when one or more control plane nodes are down. We're currently working with a setup that treats HA as three control plane nodes, so we were trying to understand the consequences of:

- A single node being down
- The majority of nodes being down
- All of them being down (we assume the cluster is down, but just for completeness)

So what I was asking was "what Kubernetes customers can expect in case of failure of their control plane nodes".
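
To make the three cases concrete, here is a minimal sketch of the majority arithmetic. It assumes the "majority" in question is etcd's Raft quorum (as raised earlier in the thread) and that each of the three control plane nodes runs exactly one etcd member. The function names are hypothetical and only illustrate the counting; they say nothing about how the API server itself behaves, which is what the docs would need to spell out.

```python
def quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with `members` voting members."""
    return members // 2 + 1


def has_quorum(members: int, failed: int) -> bool:
    """True if the surviving members can still form a Raft majority."""
    return members - failed >= quorum(members)


# Assumption: three control plane nodes, one etcd member on each.
for failed in (1, 2, 3):
    state = "quorum kept" if has_quorum(3, failed) else "quorum lost"
    print(f"{failed} of 3 members down: {state}")
```

With three members the quorum is two, so only the single-node failure keeps a majority. Once quorum is lost, etcd cannot commit writes.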

Let me know if this makes sense, and sorry again for the delay

neolit123 · 2023-11-15T09:43:43Z

what you are talking about makes sense, @royalsflush

please include more detail in the OP post:
#43849 (comment)

i don't mind us including more documentation about failures and recovery of the CP, as the documentation is lacking.
let's see what is actionable here.

/reopen

k8s-ci-robot · 2023-11-15T09:43:49Z

@neolit123: Reopened this issue.

sftim · 2023-11-15T12:12:03Z

/sig architecture
/sig api-machinery
/remove-triage needs-information

Please revise (edit) the original issue description @royalsflush to explain what you want added to the documentation. You could write this as a user story or as a definition of done.

logicalhan · 2023-11-15T14:47:41Z

/assign

(I can take this, if y'all don't mind)

sftim · 2023-11-15T14:52:13Z

Thanks @logicalhan. These things are important.

sftim · 2023-11-15T14:52:29Z

/triage accepted
/priority important-longterm

sftim · 2023-11-15T15:02:07Z

I would add that we ideally ought to cover some of the less common situations too. I'll outline some below. What I hope is that someone carefully reading the docs can answer what the expected outcome is, without actually setting up a cluster or reading any source code.
“Answer” means working out whether the expected behavior, as seen by a client, is: API usable; API unavailable / degraded; or undefined behavior.

E.g.:

- three control plane nodes (1 per zone); separate etcd hosts (1 per zone); full failure in exactly one zone; “perfect” client-side load balancing and retries
- three control plane nodes (1 per zone); separate etcd hosts (1 per zone); etcd healthy but full API server failure in exactly one zone; “perfect” client-side load balancing and retries
- even number of control plane nodes, all of which are healthy; separate etcd cluster has odd number of nodes and some (but fewer than half) have failed; “perfect” client-side load balancing and retries
- even number of control plane nodes, only half of which are healthy; separate etcd cluster has odd number of nodes and some (but fewer than half) have failed; “perfect” client-side load balancing and retries
- stacked 3-node control plane; each API server only speaks to local etcd; one etcd fully unavailable; “dumb” round-robin style load balancing without health checks

I'm sure we could think up more; maybe we even have a list already?

We can produce - and publish - docs without meeting this ideal; I've mentioned it so we understand where we'd like to end up.

logicalhan · 2023-11-15T15:04:32Z

Additional scenarios:

- stacked 3-node control plane; each API server only speaks to local etcd; two or more etcd fully unavailable; “dumb” round-robin style load balancing without health checks
- stacked 5-node control plane; each API server only speaks to local etcd; one etcd fully unavailable; “dumb” round-robin style load balancing without health checks
- stacked 5-node control plane; each API server only speaks to local etcd; two etcd fully unavailable; “dumb” round-robin style load balancing without health checks
- stacked 5-node control plane; each API server only speaks to local etcd; three or more etcd fully unavailable; “dumb” round-robin style load balancing without health checks
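
A rough way to reason about the stacked scenarios is sketched below. It rests on assumptions that are not established in this thread: the load balancer spreads requests evenly over all nodes and never removes one, every API server process stays up, and a request routed to an API server whose local etcd is down counts as failed. The helper name `stacked_round_robin` is hypothetical. The quorum part is plain Raft majority arithmetic.

```python
from fractions import Fraction


def stacked_round_robin(nodes: int, etcd_down: int) -> tuple[bool, Fraction]:
    """Return (etcd still has quorum, share of requests sent to a node
    whose local etcd is down) under the assumptions stated above."""
    if not 0 <= etcd_down <= nodes:
        raise ValueError("etcd_down must be between 0 and nodes")
    # Raft majority: more than half of the members must be alive.
    has_quorum = nodes - etcd_down >= nodes // 2 + 1
    # Even round-robin without health checks: each node gets 1/nodes.
    misrouted = Fraction(etcd_down, nodes)
    return has_quorum, misrouted


for nodes, down in [(3, 1), (3, 2), (5, 1), (5, 2), (5, 3)]:
    ok, share = stacked_round_robin(nodes, down)
    print(f"{nodes} nodes, {down} etcd down: quorum={ok}, misrouted share={share}")
```

Under this model a 3-node cluster with one etcd down keeps quorum but still sends a third of requests to the affected node, which is the kind of "degraded" answer the docs would need to confirm or correct.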

logicalhan · 2023-11-15T15:05:38Z

I may group answers by local vs. remote etcd hosts, since the answers likely hinge on that distinction anyway.

sftim · 2023-11-15T15:09:00Z

These questions need not appear in the page; you could think of them as unit tests for the docs. In other words, if a reviewer picks a question, can they - just by reading what's in the page - work out what the answer must be?

(we could even ask a large language AI model to help us check)
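
One way to keep track of that "unit tests for the docs" idea is sketched below. The `Scenario` and `Outcome` names are hypothetical; the three outcome values are the ones listed earlier in this thread. No answers are filled in, since working them out from the page is the point of the exercise.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    API_USABLE = "API usable"
    API_UNAVAILABLE_OR_DEGRADED = "API unavailable / degraded"
    UNDEFINED = "undefined behavior"


@dataclass
class Scenario:
    description: str
    # Filled in by a reviewer using only the published page; None means
    # the page did not let them work out an answer.
    answer_from_docs: Optional[Outcome] = None


def unanswered(scenarios: list[Scenario]) -> list[str]:
    """Descriptions of scenarios the docs could not answer."""
    return [s.description for s in scenarios if s.answer_from_docs is None]


scenarios = [
    Scenario("three control plane nodes (1 per zone); separate etcd hosts "
             "(1 per zone); full failure in exactly one zone"),
    Scenario("stacked 3-node control plane; one etcd fully unavailable; "
             "round-robin load balancing without health checks"),
]
print(unanswered(scenarios))
```

The docs "pass" once `unanswered` returns an empty list for whichever scenarios the reviewers pick.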

logicalhan · 2023-11-15T15:10:01Z

> These questions need not appear in the page; you could think of them as unit tests for the docs. In other words, if a reviewer picks a question, can they - just by reading what's in the page - work out what the answer must be?
>
> (we could even ask a large language AI model to help us check)

I dig the framing.

sftim · 2023-11-15T15:43:59Z

#43903 feels slightly relevant (only slightly, though). I don't know how much we want to also cover upgrades and how they impact failure modes.

neolit123 · 2023-11-15T15:59:52Z

+1 to cover upgrades and rollback.

in KEP PRRs we require "downgradability" of k8s features, but etcd by design does not support downgrade well, yet:
etcd-io/etcd#15878 (comment)

kubeadm as a whole also does not support downgrades. it supports rollback in case of component failure, but that may or may not work, depending on which component failed:

- if it was a k8s component, hopefully all features, skews, etc. properly guarantee downgrade
- if it was etcd, who knows

> #43903 feels slightly relevant (only slightly, though). I don't know how much we want to also cover upgrades and how they impact failure modes.

it's a bug in kubeadm's api-machinery usage, and the etcd upgrade failure will trigger a rollback unless the user works around it.
but since the rollback will restore an etcd with the same version, it will act as a component restart.

kumarankit999 · 2023-11-18T06:56:06Z

+1 @sftim , Can you reshare the docs for gaps

sftim · 2023-11-28T11:26:52Z

> +1 @sftim , Can you reshare the docs for gaps

I don't understand what you'd like me to do here @kumarankit999. How would you know when I'd done what you're asking (can you frame it as a definition of done)?

If you mean #43849 (comment), I was the person who asked the question, and I do not have the answer to it.

k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Nov 7, 2023

k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Nov 7, 2023

k8s-ci-robot transferred this issue from kubernetes/kubernetes Nov 7, 2023

k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. triage/needs-information Indicates an issue needs more information in order to work on it. sig/docs Categorizes an issue or PR as relevant to SIG Docs. labels Nov 7, 2023

k8s-ci-robot closed this as completed Nov 14, 2023

royalsflush changed the title ~~Control plane reliability documentation~~ Control plane failure modes for high-availability documentation Nov 15, 2023

k8s-ci-robot reopened this Nov 15, 2023

k8s-ci-robot added sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed triage/needs-information Indicates an issue needs more information in order to work on it. labels Nov 15, 2023

k8s-ci-robot assigned logicalhan Nov 15, 2023

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Nov 15, 2023

k8s-ci-robot removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Nov 15, 2023
