Documentation: Disaster recovery scenarios #432
To expand on some of this a bit, and to include a strawman discussion of building a tool to handle some of the recovery mechanics:

**Partial loss of control plane**

We are assuming we still have a running apiserver + etcd, and we just need to recover other control plane components that are no longer scheduled for some reason. More specific discussion is in #112, but the pseudo process / UX might look something like the hedged sketch below, or maybe a variation of it.
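As one illustration only, a minimal sketch of that UX, assuming a hypothetical `bootkube recover` subcommand; the recovery flag names here are made up for discussion, not an existing CLI:

```sh
# Hypothetical UX: recover control plane components that are no longer scheduled,
# using the still-running apiserver + etcd as the source of truth, and write the
# regenerated manifests/assets back to disk.
bootkube recover --kubeconfig=/etc/kubernetes/kubeconfig --recovery-dir=/home/core/recovered-assets

# The recovered assets could then be fed back through the normal bootstrap path.
bootkube start --asset-dir=/home/core/recovered-assets
```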
**Loss of all api-servers**

We have two main options here. One option would assume that we are also backing up checkpoints of "critical" pods (e.g. the apiserver + anything it relies on, like secrets). This seems relatively reasonable, but it does add a few more moving pieces to the backup/restore story. The other option would be to only expect access to etcd and attempt recovery from there. This would be my initial preference, because technically all of the needed state should exist directly in etcd. The process would essentially be a partial version of the above: read the needed state back out of etcd and use it to bring the control plane components back up.

The pseudo UX could be a single generic recovery command, but that might be trying to overload too much functionality into one place -- we could also make this more explicit rather than generic, or maybe even make it a new option in an existing command. A hedged sketch of these alternatives follows.
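Again purely as a strawman, with hypothetical subcommand and flag names, the alternatives might look like:

```sh
# Option A (generic): one recovery command that reads everything it needs from etcd.
bootkube recover --etcd-servers=https://127.0.0.1:2379 --recovery-dir=/home/core/recovered-assets

# Option B (explicit): name the scenario directly instead of overloading one command.
bootkube recover-control-plane --etcd-servers=https://127.0.0.1:2379 --recovery-dir=/home/core/recovered-assets

# Option C: fold recovery into an existing command as a new option.
bootkube start --asset-dir=/home/core/assets --recover-from-etcd=https://127.0.0.1:2379
```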
**Recovery from etcd backup (external etcd)**

This would be a similar process to the above, but following the normal etcd recovery documentation (e.g. start a new etcd cluster from the backup). The one change might be that we need to modify the apiserver manifest if the network addressability of the etcd cluster has changed (this could essentially just be documented).

**Recovery from etcd backup (self-hosted etcd)**

Pseudo UX (a hedged sketch follows). Or again, this could be a new option in an existing command.
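A rough sketch of that flow, assuming an etcd v3 snapshot backup and reusing the same hypothetical recovery command from above:

```sh
# Restore the snapshot into a fresh data dir (standard etcdctl v3 recovery).
ETCDCTL_API=3 etcdctl snapshot restore /backups/etcd-snapshot.db --data-dir=/var/lib/etcd-restored

# Bring up a temporary single-member etcd from the restored data, then point the
# hypothetical recovery command at it to regenerate the self-hosted control plane.
etcd --data-dir=/var/lib/etcd-restored &
bootkube recover --etcd-servers=http://127.0.0.1:2379 --recovery-dir=/home/core/recovered-assets
bootkube start --asset-dir=/home/core/recovered-assets
```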
More related discussion: #333 (comment)

This is an awesome, awesome idea!
Adding another failure case to consider:
Not sure what the best option is here. If the pod is deleted, you're essentially saying it should no longer run on the node. As I mentioned in the original PR:
What @dghubble described doesn't make any sense. In other words, is there any real scenario that does that? All system pods would be managed by our control plane. Our bottom line is that whatever error happens, we could still recover everything, but we should not optimize for such an extreme, unrealistic scenario.
Just because it hasn't been considered doesn't mean it won't happen. This WILL happen to users somewhere, somehow, and there will need to be an answer for how to recover. The only thing you can count on in real-world scenarios is the unexpected, so let's not dismiss this as extreme. Plenty of resiliency tests simulate a failure by deleting a pod to ensure it recovers; people demo recovery this way as well, and accidents will happen. This was nearly the first thing I tried when assessing the resiliency of self-hosted etcd. Perhaps the better questions here are whether a checkpointed deployment per pod or a StatefulSet could be of help, or how recovery from the data that exists on disk could be done manually or automatically.
Overall I agree, and the recovery tool will support this scenario (point it at an etcd data dir); a hedged sketch of that follows below. I think the open-ended question is whether there is some way we might headlessly recover from this scenario (without a user running a recovery step). One option might be to simply delay the checkpoint garbage collection decision. In this scenario the checkpointer saw that the etcd pod had been removed from the API (so it GC'd the local checkpoint), but the etcd pod was still running at that moment (otherwise the checkpointer wouldn't have been able to contact the API at all). If we delayed the decision to GC, we wouldn't end up removing the etcd checkpoint, because the apiserver would shortly become unavailable. Essentially we're saying "delay the GC until some window of 'still working' has passed".
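For the manual path, a hedged sketch of "point it at an etcd data dir", again with hypothetical flag names:

```sh
# Hypothetical: recover directly from the etcd data dir still present on the node's
# disk, even though the etcd pod and its local checkpoint are gone.
bootkube recover --etcd-data-dir=/var/lib/etcd --recovery-dir=/home/core/recovered-assets
bootkube start --asset-dir=/home/core/recovered-assets
```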
Can you expand on this? I don't follow what you mean.
We should document recovery of: