
Provide full secret rotation #1020

Closed
justinsb opened this issue Nov 30, 2016 · 37 comments
Labels
area/security Feature Request lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@justinsb
Member

We should provide a way to rotate all our secrets: usernames, tokens, CA key, SSH key.

@krisnova
Contributor

@justinsb

Curious do we want 1 command to rotate them all, or do we need individual execution paths for all of our secrets?

Should we also include a way to generate these secrets within kops automatically?

@chrislovecnm
Contributor

chrislovecnm commented Dec 11, 2016

@kris-nova I am thinking that we do a rolling update of the cluster. All components need a restart and the kubeconfig is going to change. I am pretty sure kops already generates the secrets; we just need to be able to roll an update. We also need to be able to plug in to 3rd-party cert sources.

Secrets:

  • TLS Certs
  • Admin ssh key
  • Admin password

^ What am I missing??
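If it helps the discussion, a quick way to see what kops already generates and stores today (a sketch, assuming the cluster name is in $CLUSTER and KOPS_STATE_STORE points at the state store):

# List the keypairs, tokens and other secrets kops currently manages for this cluster.
kops get secrets --name $CLUSTER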

@krisnova
Contributor

I was hoping kops could generate an ssh key for the user if they wanted one ad-hoc - do we do this yet?

@chrislovecnm
Contributor

Nope, we require an SSH key

@justinsb justinsb added this to the 1.5.1 milestone Dec 28, 2016
@lev-kuznetsov

Is this still on a roadmap somewhere?

Is there a workaround I can follow in the meantime? We need a procedure for rotating credentials.

@chrislovecnm
Contributor

At this point, creating a new secret and applying a rolling update is your option.
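A minimal sketch of that workaround for the SSH key case, assuming a cluster named in $CLUSTER and KOPS_STATE_STORE set; exact flags can vary between kops versions:

# Replace the admin SSH public key secret, then roll the cluster so nodes are recreated with it.
kops delete secret --name $CLUSTER sshpublickey admin
kops create secret --name $CLUSTER sshpublickey admin -i ~/.ssh/new_id_rsa.pub
kops update cluster --name $CLUSTER --yes
kops rolling-update cluster --name $CLUSTER --yes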

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 31, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 30, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@Christian-Schmid

Christian-Schmid commented Mar 7, 2019

/remove-lifecycle rotten
/reopen

Hi guys,

I'm opening this issue again, as we currently have the same challenge.
As the issue is already a bit old, I want to ask whether anything has happened in that direction by now.
I found a sketched procedure here: https://github.com/kubernetes/kops/blob/master/docs/rotate-secrets.md but this one does not work without downtime.

Does anyone have ideas or experience on how to solve this?

Thanks a lot!

Chris

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 7, 2019
@Christian-Schmid

/reopen

@k8s-ci-robot
Contributor

@Christian-Schmid: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@philwhln

Does anyone have ideas or experience on how to solve this?

@Christian-Schmid Did you find any non-downtime solutions to this? We're looking at the same thing now.

@mikesplain
Contributor

On behalf of @Christian-Schmid

/reopen

@k8s-ci-robot
Contributor

@mikesplain: Reopened this issue.

In response to this:

On behalf of @Christian-Schmid

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Mar 26, 2019
@Christian-Schmid

Hi @philwhln
We're still looking for a "nice" solution to rotate the secrets.
One reason we want to do the rotation is that we want to control external access to the Kubernetes API.
One option we were evaluating was to put an nginx reverse proxy in front of the kube API.
That way we wouldn't really have to rotate the certificates, and could still control external API access with other means of authentication.

But whether this workaround helps depends on your use case for rotating :-)

@philwhln

Hi @Christian-Schmid ,

In the short-term, we're looking to rotate out the keys we used in early development of our clusters as these were used by people who have left the company. In the longer-term, we see this as good practice. Interesting idea with the reverse proxy. Would this work?

We're not overly confident in https://github.com/kubernetes/kops/blob/master/docs/rotate-secrets.md since it seems to have been written 18 months ago with little review or updates. That said, we're going to dig into it and test it out. @justinsb, I'm interested in your thoughts on this, since you wrote that doc and also opened this ticket :)

@Christian-Schmid

Regarding the rotate-secrets manual: we tested the described steps and it more or less worked as described, with a downtime of about 15 minutes (because we had to force a cluster update twice).
The only step where we had a problem was this line:

kops get secrets | grep ^Keypair | awk '{print $2}' | xargs -I {} kops delete secret keypair {}

which caused an error in the current kops version. But we could delete the pki directly from the S3 bucket with the AWS CLI:
aws s3 rm s3://<your_bucket>.com/pki/issued --recursive
aws s3 rm s3://<your_bucket>.com/pki/private --recursive

Regarding the proxy: we haven't investigated further in that direction yet due to time constraints...
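For anyone following along, a hedged sketch of the reissue steps that roughly pair with the S3 deletion above (assuming $CLUSTER is set and your kops version supports these flags; they can differ between releases):

# After removing pki/issued and pki/private from the state store, have kops regenerate
# the keypairs and push them out with a forced rolling update.
kops update cluster --name $CLUSTER --yes
kops rolling-update cluster --name $CLUSTER --cloudonly --force --yes
# Re-export credentials for kubectl once the masters are back.
kops export kubecfg --name $CLUSTER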

@philwhln

@Christian-Schmid . Thanks for this info!

The only step which we had a problem was the line:

We hit the same problem and decided not to proceed. Good to know that deleting in S3 worked and you were able to complete the process. We had considered this too. Downtime is not great though :)

@tushar00jain

+1

@tushar00jain

@justinsb

It doesn't look like Kubernetes supports using multiple certs in a way that would make a zero-downtime rotation possible. I would appreciate it if you could point me to the SIG responsible for PKI, or advise on whether and how I could start working on this myself. This issue seems pretty critical, and judging by the response to it, waiting ten years for someone to implement a solution doesn't seem like a viable option. If our secrets somehow get leaked, there is also no way to rotate the certificates unless we accept a hefty downtime.

@chrislovecnm

At this point, creating a new secret and applying a rolling update is your option.

Could you explain what you mean by this? Do you mean the current documented method that involves deleting all PKI-related data on S3, or something else?

@tushar00jain

kubernetes/kubeadm#581
kubernetes/kubeadm#1361

Maybe Kubernetes and kubeadm support renewal but lack the docs, certainly for HA master nodes, judging by the comments on the above-mentioned issues. After those are addressed, maybe parts of the code can also be ported over from kubeadm.

@tushar00jain

BTW @philwhln @Christian-Schmid, for your use cases around external access, maybe the best approach is to use OIDC or something like the AWS IAM authenticator instead of x509 certs. There's good support for these approaches already.

And one way to rotate the certificates with minimum downtime would be to spin up a warm standby cluster when the certificates are about to expire and move over the traffic to this other cluster.

@philwhln

philwhln commented May 4, 2019

use OIDC

@tushar00jain We already do use this, with dex (similar to https://thenewstack.io/kubernetes-single-sign-one-less-identity/), but this doesn't remove the need for Kubernetes to have certificates that need rotating.

move over the traffic to this other cluster.

This does seem like the only solution right now, but it has a cost and is something we don't think we should have to do.

@tushar00jain

tushar00jain commented May 4, 2019

but this doesn't remove the need for Kubernetes to have certificates that need rotating.

Yes, the downtime should be seconds at most, if any, but the mentioned approach of deleting the pki folder on S3 is just not feasible.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 2, 2019
@jeremygaither

Would using offline root CAs with long lifetimes, and then using intermediate and subordinate CAs for the cluster CA, help any in a zero-downtime rolling-update? It obviously wouldn’t help if the root CA must be rotated, if any of the CAs must be revoked, or with username or token rotation. I’m curious if it would at least help in non-revocation forward rolling updates. I suppose this would assume that new nodes wouldn’t be able to join the cluster until the new cluster CA certificates were added to the master nodes, and a new initial Kubelet client certificate was in place. SSH on the nodes could trust certificates signed by the root CA chain.

/remove-lifecycle stale
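A hypothetical illustration of the offline-root / intermediate-CA layout described above (not something kops wires up for you today); all names, lifetimes and paths are placeholders:

# Generate a long-lived root CA that can be kept offline.
openssl genrsa -out root-ca.key 4096
openssl req -x509 -new -key root-ca.key -sha256 -days 3650 \
  -subj "/CN=offline-root-ca" -out root-ca.crt
# Generate a shorter-lived intermediate CA for the cluster, signed by the offline root.
openssl genrsa -out cluster-ca.key 4096
openssl req -new -key cluster-ca.key -subj "/CN=cluster-intermediate-ca" -out cluster-ca.csr
openssl x509 -req -in cluster-ca.csr -CA root-ca.crt -CAkey root-ca.key -CAcreateserial \
  -days 365 -sha256 \
  -extfile <(printf "basicConstraints=critical,CA:true\nkeyUsage=critical,keyCertSign,cRLSign") \
  -out cluster-ca.crt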

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 24, 2019
@rpagliuca

I'm very interested in this feature as well. We should be able to rotate secrets without downtime.

@kuzaxak

kuzaxak commented Dec 14, 2019

I'm trying to follow the article https://github.com/kubernetes/kops/blob/master/docs/rotate-secrets.md and ran into an issue with the etcd v3 cluster; it is now reporting the following:

2019-12-14 15:44:51.789763 I | embed: rejected connection from "172.20.51.200:57144" (error "remote error: tls: bad certificate", ServerName "etcd-events-2.internal.<redacted>")
2019-12-14 15:44:51.793287 I | embed: rejected connection from "172.20.51.200:57148" (error "remote error: tls: bad certificate", ServerName "etcd-events-2.internal.<redacted>")

@kuzaxak

kuzaxak commented Dec 14, 2019

We need to add a section to the doc https://github.com/kubernetes/kops/blob/master/docs/rotate-secrets.md:

After changing the etcd v3 CA, you need to trigger issuing new certs on the masters.

Log in to each master via ssh and run:

sudo find /mnt/ -name server.* | xargs -I {} sudo rm {}
sudo find /mnt/ -name me.* | xargs -I {} sudo rm {}

This erases the old peer and client certificates. Then roll the masters again; on startup, etcd will issue valid certificates from the new CA.
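A hedged sketch of rolling just the masters after deleting those files; the instance group names are placeholders for your actual master groups, and the flags may differ by kops version:

# Force-roll each master so etcd-manager restarts and picks up certs issued from the new CA.
for ig in master-us-east-1a master-us-east-1b master-us-east-1c; do
  kops rolling-update cluster --name $CLUSTER --instance-group "$ig" --force --yes
done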

@cmanzi

cmanzi commented Jan 17, 2020

This erases the old peer and client certificates. Then roll the masters again; on startup, etcd will issue valid certificates from the new CA.

@kuzaxak Have you gotten this to work? I've tried doing this manually, and it does not seem to be so simple. In our setup, we are providing our own custom CAs for everything by uploading them to the kops state bucket, and we set them up to be rotated every 6 months. What I did was:

  • Replace the Etcd CAs
  • Allow etcd-manager to mount the volumes (with the old certs)
  • Delete the existing client/server/peer certs on the etcd volumes
  • Restart etcd-manager to force certificate issuance.

The problem is that after doing this, peer authentication still seems to be broken. After a few hours of debugging I was pretty much stumped. I haven't had any more time to debug that since.

Basically this has forced us to use the current strategy of replacing clusters entirely when their CAs expire, which is far from ideal. I was curious if anyone else had encountered this issue.
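In case it helps anyone hitting the same peer-authentication failure, a hypothetical check (file names follow the server.*/me.* layout mentioned earlier; adjust paths for your volumes) to confirm which CA actually issued the certs etcd is serving:

# Print the issuer and expiry of every etcd server/client cert found under /mnt.
sudo find /mnt/ -name 'server.crt' -o -name 'me.crt' | while read -r crt; do
  echo "== $crt"
  sudo openssl x509 -in "$crt" -noout -issuer -enddate
done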

@rifelpet rifelpet removed this from the 1.5.2 milestone Apr 11, 2020
@olemarkus
Member

Updating the docs in #8948

@Cryptophobia
Contributor

I stumbled on this issue and the docs after running into certificate problems with etcd after a cluster upgrade. I just have to say that I'm very thankful I found them and every single line worked as expected.

Big thank you for writing these docs! ❤️

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 23, 2020
@rifelpet
Member

/remove-lifecycle stale
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 23, 2020
@olemarkus
Member

This has now been implemented. See https://kops.sigs.k8s.io/operations/rotate-secrets/
/close
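For reference, a rough sketch of the staged CA rotation that page describes, assuming a recent kops release (1.22+); the linked doc is authoritative and the exact subcommands and flags may differ by version:

# Add a new CA keypair alongside the current one, then roll so everything trusts both.
kops create keypair ca
kops update cluster --yes
kops rolling-update cluster --yes
# Promote the new CA to primary so certs are issued from it, and roll again.
kops promote keypair ca
kops update cluster --yes
kops rolling-update cluster --force --yes
# Finally, distrust the old CA keypair (its ID is a placeholder here; list IDs with `kops get keypairs ca`) and roll once more.
kops distrust keypair ca <old-keypair-id>
kops update cluster --yes
kops rolling-update cluster --force --yes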

@k8s-ci-robot
Contributor

@olemarkus: Closing this issue.

In response to this:

This has now been implemented. See https://kops.sigs.k8s.io/operations/rotate-secrets/
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
