-
Notifications
You must be signed in to change notification settings - Fork 48
Implement automatic certificate rotation #1201
Conversation
d7f3d90
to
97e674b
Compare
This commit is a WIP version of automated certificate rotation implementation, which right now occurs on every cluster update, but as we discussed, this functionality needs to be moved into separate subcommand. Some highlights: - Added config checksum to kube-apiserver and kube-controller-manager to trigger automatic recreation of pods when secrets are updated. - Added ca-syncer init container to kubelet which ensures that bootstrap kubeconfig and one generated by TLS bootstrapping used by kubelet and client-ca certificate are always up to date, as kubelet seems to be not capable of updating them themselve. - Temporarily force kube-controller-manager to sign certificates with only 45 minutes duration to make sure certificate rotation works as expected and all used certificates are rotated. - When any of certificates used by etcd is changed, trigger copy-controller-secrets null_resource to copy etcd certificates on all controller nodes and then restart etcd to ensure it picks up new certificates, as it can only automatically pick up new client certificates and private key, but not CA certificate. - Temporarily switch certificate validity hours on AWS to 1 hour to enforce running 'lokoctl cluster apply' every hour to keep cluster alive to make sure certificate rotation works as expected. - 'lokoctl cluster apply' now taints all certificate resources on cluster update and after cluster is updated, it waits for all service account token secrets to be updated with the new CA certificate, then triggers a restart on all system deployments to make sure they pick up new CA certificate. This is core of certificate rotation logic. TODO: - Split into multiple commits. - Move rotation code to separate subcommand. - Restart etcd on other platforms. - Figure out if new k8sutil function can be used in e2e tests. Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
97e674b
to
0b360b3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving some thoughts mainly for myself when picking this up.
I have one main concern which is that is the CA rotation. Rotating the CA by replacing it means the cluster will be non-responsive quite soon when kubelets pick up the new CA but the API i the old (and vice versa) that will no longer be able to comminucate.
To solve this we should be able to distribute an historic CA chain (just 1 old one will be enough) as trust bundle. Followed by making it trusted by all worker nodes then restarting the API server.
Etcd certs should follow a similar strategy.
|
||
const ( | ||
// Time to wait between updating DaemonSet/Deployment and start looking if | ||
// workload has converged. This should account for kube-controller-manager election time, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might it be an idea to watch the leader election to ensure this instead of guessing a time
The situation you described only occurs when you replace private key of the CA certificate. We only rotate the certificate itself, so the signatures remains the same, so the communication remains functional. This is a deliberate decision as first step, as changing only the certificates is relatively easy. Changing private keys is indeed harder and remains chaining and further cleanup of old CA certificates etc. |
I over read that we didn't taint the keys so that indeed helps a lot, in the first iteration of this. |
Also sorry I am throwing a lot of things here :) Some might just be better as a follow up PR in a new version ;) |
We already have CA expiration time configurable, so user may choose to extend it, but I think by default we should offer short-lived certificates and, ideally, fully automated rotation. |
I only find one certificate lifetime variable used everywhere, am I missing a variable? |
Ah yes, we only have |
Closing in favor of #1435. |
Draft on top of #1193
Closes #1215
TODO:
cluster apply
to separate sub-command