
Kubelet to checkpoint running pods (including kube-apiserver) #30065

Closed
maciaszczykm opened this issue Aug 4, 2016 · 17 comments
Labels
area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@maciaszczykm
Member

maciaszczykm commented Aug 4, 2016

A few other Fujitsu employees and I are interested in this topic and would like to contribute in this area, so we have prepared this proposal to discuss it with the whole community. Please share your opinions with us.

CC @kubernetes/sig-cluster-lifecycle @floreks @kenan435 @taimir @zreigz @cheld


Checkpointing the current API server configuration

Currently, a self-hosted single-master Kubernetes cluster installation cannot recover after a reboot. The limitation comes from the fact that the API server is self-hosted as well. Since self-hosted components are not static pods, they will not be recreated after the reboot, and in order for the kubelet to restart self-hosted components it needs a functioning API server. The bootkube project currently resolves this with a dedicated “user space checkpointing” container in the self-hosted kubelet pod, which periodically persists the API-server manifest as a static pod in the manifest directory.

Motivation

It would be beneficial to have API-server checkpointing as part of the kubelet, for the following reasons:

  • reduced dependency on external tools, so that there are no compatibility issues in the future,
  • maintained by the Kubernetes community, so that development and maintenance do not fall behind the Kubernetes master branch,
  • avoiding new components leads to easier provisioning and maintenance (fewer .yaml files),
  • checkpointing happens only when the API-server pod has been updated, avoiding unnecessary disk writes.

Proposal

The proposal is to implement a checkpointing (snapshotting) mechanism for the api-server pod definition. If and when the self-hosted api-server fails and is not actively running, this mechanism would be responsible for spinning up a temporary api-server, which would in turn start the self-hosted api-server again.

High level

The general idea for solving this issue is to add API-server checkpointing to the kubelet.

While running, the kubelet will periodically back up the running self-hosted API server and store the backup locally in the form of a static pod definition. When the self-hosted API server is down, the locally stored backup can be activated as a temporary API server. The reason for spinning up the temporary server is to re-establish communication with the etcd server, which still holds the latest pod definitions, so that the re-launch of the self-hosted api-server can be triggered and the cluster healed. The temporary API server will then be used to recreate all self-hosted components, including the missing self-hosted API server.
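As a rough, non-authoritative sketch of the snapshotting side (the file path and the getRunningAPIServerManifest helper below are hypothetical illustrations, not existing kubelet code), the save step could look something like this:

```go
package checkpointer

import (
	"bytes"
	"io/ioutil"
	"log"
	"time"
)

// checkpointPath is a hypothetical location for the saved api-server manifest.
const checkpointPath = "/etc/kubernetes/checkpoints/kube-apiserver.yaml"

// checkpointLoop periodically snapshots the self-hosted kube-apiserver pod as a
// local static pod definition, writing to disk only when the definition changed.
// getRunningAPIServerManifest is a hypothetical helper that would render the
// kube-system api-server pod as a static pod manifest (.yaml or .json).
func checkpointLoop(getRunningAPIServerManifest func() ([]byte, error)) {
	for range time.Tick(30 * time.Second) {
		manifest, err := getRunningAPIServerManifest()
		if err != nil {
			log.Printf("api-server pod not found, keeping last checkpoint: %v", err)
			continue
		}
		// Avoid unnecessary disk writes: only persist when the definition changed.
		if old, err := ioutil.ReadFile(checkpointPath); err == nil && bytes.Equal(old, manifest) {
			continue
		}
		if err := ioutil.WriteFile(checkpointPath, manifest, 0600); err != nil {
			log.Printf("failed to write checkpoint: %v", err)
		}
	}
}
```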

Implementation details

Let’s sketch the rough flow (a code sketch of the activation logic follows the list):

  • During cluster provisioning, the kubelet starts with the --checkpoint=true flag.
  • The checkpointer takes a snapshot of the API server running in the kube-system namespace and saves it locally as a static pod definition, in a .yaml or .json file.
  • If the checkpointer detects that the kube-system API server is not running, it moves the saved static pod into the /etc/kubernetes/manifests directory. The kubelet then automatically creates a temporary API server from the static pod.
  • Assume a system reboot has occurred; at first only the kubelet is running. Once the checkpointer detects that the kube-system API server is missing, it moves the static pod into the manifests directory and the temporary API server is started. Now the kubelet and the temporary API server are running. etcd still holds the state of all cluster pods from before the restart, so the kubelet can talk to etcd via the temporary API server and restore the lost cluster state from before the reboot, including starting the true, self-hosted API server.
  • Once the self-hosted API server appears again, the checkpointer stops the temporary API server by moving the static pod out of the /etc/kubernetes/manifests directory.
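A minimal sketch of this activation/deactivation logic, under the assumptions above (the paths and the selfHostedAPIServerHealthy probe are hypothetical; the probe would report whether the self-hosted kube-system api-server pod is running on this node):

```go
package checkpointer

import (
	"log"
	"os"
	"time"
)

const (
	// Hypothetical paths: where the inactive checkpoint lives, and where the
	// kubelet picks up static pods.
	inactivePath = "/etc/kubernetes/checkpoints/kube-apiserver.yaml"
	activePath   = "/etc/kubernetes/manifests/kube-apiserver.yaml"
)

// recoveryLoop activates the checkpointed api-server when the self-hosted one
// is missing, and retires it again once the self-hosted api-server is back.
func recoveryLoop(selfHostedAPIServerHealthy func() bool) {
	for range time.Tick(10 * time.Second) {
		_, err := os.Stat(activePath)
		tempActive := err == nil
		healthy := selfHostedAPIServerHealthy()

		switch {
		case !healthy && !tempActive:
			// No api-server at all: activate the temporary one from the checkpoint.
			if err := os.Rename(inactivePath, activePath); err != nil {
				log.Printf("failed to activate checkpoint: %v", err)
			}
		case healthy && tempActive:
			// Self-hosted api-server is back: remove the temporary static pod.
			if err := os.Rename(activePath, inactivePath); err != nil {
				log.Printf("failed to deactivate checkpoint: %v", err)
			}
		}
	}
}
```

Moving the manifest into and out of the manifests directory means the kubelet's existing static pod machinery does the actual starting and stopping of the temporary API server.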

Issues and limitations

  • Assumption that there will always be a running (bootstrap) kubelet after a node reboot.
  • Assumption that etcd will restart after the reboot and not be lost, i.e. shouldn’t disappear forever after the master has been rebooted (assuming that it’s located on the master).
  • Where should the checkpointer code live in the kubelet?
  • How often should checkpointing run? Is it possible to perform it only when the API server pod has been modified?
  • How do we designate the kubelet on the master node as the only one running the checkpointing mechanism?
  • The kubelet --checkpoint=true flag name could be changed.
@k8s-github-robot k8s-github-robot added area/nodecontroller sig/node Categorizes an issue or PR as relevant to SIG Node. labels Aug 4, 2016
@floreks
Member

floreks commented Aug 4, 2016

Here is a basic visualization of the checkpointing workflow described in the implementation details:

[diagram: api-server checkpointing flow]

@vishh
Contributor

vishh commented Aug 4, 2016

Currently a single-master cluster installation of Kubernetes that is self-hosted cannot recover after a reboot. The limitation comes from the fact that the API server is self-hosted as well.

How are the self-hosted components being started currently? Why is there no fallback apiserver that is run as a static pod?

Adding checkpointing logic to kubelet that is not specific to kubelet itself doesn't seem like a good idea.

cc @kubernetes/sig-node

@vishh
Contributor

vishh commented Aug 4, 2016

cc @aaronlevy How is this scenario handled today with your self-hosting proposal?
I'm still genuinely surprised that users will want to self-host etcd given that it is extremely critical to the overall cluster state.

@aaronlevy
Contributor

aaronlevy commented Aug 4, 2016

Why is there no fallback apiserver that is run as a static pod?

If this were the case, I think it would mean there were some special "fallback" api-servers that would need to be managed/upgraded differently than the self-hosted daemonset/deployment based api-servers. Ideally, this wouldn't be the case as it somewhat defeats a goal of self-hosting these components (back to: modify files on disk with external tools).

Now to be fair, a single api-server is not exactly ideal, and we're solving for the failure domain of "all api-servers are down". But single-node master deployments are common, and even addressing multiple api-servers isn't super simple (it requires a loadbalancer or external DNS). And in the single-node master case, "all api-servers are down" is a reboot - so a goal is finding a sane way of solving for this.

Adding checkpointing logic to kubelet that is not specific to kubelet itself doesn't seem like a good idea.

Agreed. I think a generic solution would be more ideal, rather than "checkpoint an api-server". The initial implementation of the user-space api-server checkpointing was done this way to scratch the immediate itch, with the longer term goal of starting a discussion around general checkpointing. Which itself is a little vague:

pod manifest checkpoints
configMap checkpoints
secret checkpoints (or if this will even be supported).

How is this scenario handled today with your self-hosting proposal?

Functionally, it's pretty close to the workflow described by @maciaszczykm and @floreks. It is also very similar in function to the old podmaster (move static pod manifests around on disk to activate/deactivate):

  • A side-car container is deployed with the kube-apiserver pod
  • The side-car copies a pod manifest for an "api-server checkpointer" application into the static pod dir (then waits indefinitely).
  • The "api-server checkpointer" static pod functions as described above (make a local inactive copy of api-server manifest & required secrets. Modify the secret volume to be a host volume).
  • If the "api-server checkpointer" sees that no api-server is running locally, it will move the inactive copy of api-server into the static pod dir.
  • If the "api-server checkpointer" sees the self-hosted api-server running, it will move the checkpointed copy out of the static pod dir (deactivate).

I'm still genuinely surprised that users will want to self-host etcd given that it is extremely critical to the overall cluster state.

I may have mis-read (via ctrl-f) but I think the proposal just discusses that etcd is assumed to have survived - that the assumption is that the api-server can always successfully contact etcd.

That being said, self-hosting etcd is something we want to look into, but it's still in the very early stages of discussion - and like other components, it could always be opted out of and run however the deployer decides.

@aaronlevy
Contributor

ref: #489

@lavalamp lavalamp added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed sig/node Categorizes an issue or PR as relevant to SIG Node. area/kubelet sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Aug 4, 2016
@lavalamp
Member

lavalamp commented Aug 4, 2016

Kubelet shouldn't do anything special for apiserver. Kubelet should checkpoint everything it was running.

@lavalamp
Member

lavalamp commented Aug 4, 2016

Anyway, I think the solution for this is more like having a "bootstrap" apiserver in the static manifest folder.

@kenan435
Contributor

kenan435 commented Aug 5, 2016

How are the self-hosted components being started currently?

@vishh we really are just trying to make sure that the self-hosted api-server is up and running at all times, so that the kubelet and etcd can communicate with each other and maintain the cluster state. The api-server, being itself self-hosted, needs a way to restart in case of failure, and the kubelet not being able to reach the etcd server through the api-server prevents this from happening. How the self-hosted components are initially started is not a concern of this proposal, as the "snapshotting" of the api-server starts after the provisioning process has completed.

@kenan435
Contributor

kenan435 commented Aug 5, 2016

@lavalamp agreed, it seems we are not respecting separation of concerns here. Would having a dedicated pod responsible for this be a better solution? The thing is, the api-server cannot checkpoint itself, so something else needs to do it for it.

Anyway, I think the solution for this more like have a "bootstrap" apiserver in the static manifest folder.

When taking a snapshot of the running api-server we are persisting important info, like the etcd server IP address for example. The api-server definition can also change over time, or have custom file paths, etc. We want to keep this info and use it to restart the api-server when it fails. Storing static pod definitions in the manifest folder goes against the community's future plans to have Kubernetes components host themselves, meaning this folder should be kept empty.

@lavalamp
Member

lavalamp commented Aug 8, 2016

@kenan435 I think this issue would be a lot less confusing if you called it "Kubelet checkpointing", since kubelet is the component that needs to do checkpointing. It needs to do checkpointing for everything, not just kube-apiserver, since other services (like etcd) are also vital for the startup sequence. Focusing specifically on apiserver is overly specific.

Having a dedicated pod that is responsible for this would be a better solution?

IMO, kubelet needs this functionality built in. This is a general problem. We can imagine solving it with a sidecar container that keeps the manifests directory up to date, but that would require privileged mode, so it's not great.

Having kubelet checkpoint is a long standing request and is probably captured in another issue already.

@hongchaodeng
Contributor

As @lavalamp discussed with us, we will probably do some checkpointing of cached state in apiserver itself.

/cc @xiang90

@aaronlevy
Contributor

I agree with @lavalamp that we should be thinking about this from the perspective of "Kubelet checkpointing".

There is a general existing issue (#489) which covers 'kubelet checkpointing or something'.

I think at this point it might help to start the discussion about the scope of work to implement some form of this functionality. But even "checkpointing" is a bit of a grey area in my mind.

Regarding solving the problem presented in this issue, the functionality described would be: "periodically save the state of running pods on the node, and be able to recover that state in the absence of an api-server".

From a pretty simplistic standpoint, this would also necessitate checkpointing any api-provided assets (a rough sketch of such a store follows the list):

Checkpoint podSpecs
Checkpoint configMaps
Checkpoint secrets
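As a minimal sketch only, assuming a hypothetical on-disk layout under /var/lib/kubelet/checkpoints (none of these names or paths are existing kubelet APIs), a generic, kind-agnostic checkpoint store might look roughly like this:

```go
package checkpointer

import (
	"encoding/json"
	"io/ioutil"
	"os"
	"path/filepath"
)

// checkpointRoot is a hypothetical location for checkpointed API objects.
const checkpointRoot = "/var/lib/kubelet/checkpoints"

// Save serializes an API object (a pod spec, configMap, or secret) to
// <root>/<kind>/<namespace>_<name>.json so it can be restored without an api-server.
func Save(kind, namespace, name string, obj interface{}) error {
	data, err := json.Marshal(obj)
	if err != nil {
		return err
	}
	dir := filepath.Join(checkpointRoot, kind)
	if err := os.MkdirAll(dir, 0700); err != nil {
		return err
	}
	return ioutil.WriteFile(filepath.Join(dir, namespace+"_"+name+".json"), data, 0600)
}

// Load reads a previously checkpointed object back into obj.
func Load(kind, namespace, name string, obj interface{}) error {
	data, err := ioutil.ReadFile(filepath.Join(checkpointRoot, kind, namespace+"_"+name+".json"))
	if err != nil {
		return err
	}
	return json.Unmarshal(data, obj)
}
```

Checkpointing secrets this way would of course put sensitive material on the node's disk, which ties into the open question above of whether secret checkpoints should be supported at all.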

@vishh I know this has briefly been discussed in the past, but wondering about your thoughts in terms of feasibility / first steps / beginning to scope out this work?

@lavalamp
Member

lavalamp commented Aug 8, 2016

@hongchaodeng I think you have misinterpreted this issue, it's actually got nothing to do with apiserver.

@lavalamp lavalamp changed the title API server checkpointing Kubelet to checkpoint running pods (including kube-apiserver) Aug 8, 2016
@lavalamp
Member

lavalamp commented Aug 8, 2016

@maciaszczykm I have taken the liberty of editing the title and your first heading so as not to mislead people.

@lavalamp
Member

lavalamp commented Aug 8, 2016

I'm very tempted to close this as a dup of #489. I'll let @dchen1107 make that call.

@dchen1107
Member

I agree with many of the folks above that the initial issue should be resolved by a kubelet checkpointing solution. The kubelet shouldn't treat kube-apiserver and other master component pods differently from other pods on the node. Closing the issue as a dup of #489.

@kenan435
Contributor

kenan435 commented Aug 9, 2016

@aaronlevy

"periodically save the state of running pods on the node, and be able to recover that state in the absence of an api-server".

This includes the api-server itself, I presume. But if the kubelet now holds this so-called snapshot and is responsible for restoring the state prior to the failure, what is the point of having this info in the etcd server? Would the checkpointing mechanism hold more detail about the state of the node than etcd does, namely the podSpecs, configMaps and secrets you have outlined above?
