Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create command for rotating cluster CA #10516

Closed
wants to merge 1 commit into from

Conversation

olemarkus
Copy link
Member

@olemarkus olemarkus commented Jan 2, 2021

Fixes #8774
Fixes #10498
Fixes #1020

Pretty much following https://kubernetes.io/docs/tasks/tls/manual-rotation-of-ca-certificates/ but in an order that makes sense for kOps

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 2, 2021
@k8s-ci-robot k8s-ci-robot added area/nodeup size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 2, 2021
@olemarkus
Copy link
Member Author

/milestone v1.21

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from olemarkus after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@olemarkus olemarkus modified the milestones: v1.22, v1.21 Apr 4, 2021
@olemarkus olemarkus changed the title WIP: Create toolbox command for rotating cluster CA Create toolbox command for rotating cluster CA Apr 4, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 4, 2021
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 4, 2021
@olemarkus olemarkus changed the title Create toolbox command for rotating cluster CA Create command for rotating cluster CA Apr 4, 2021
@olemarkus
Copy link
Member Author

/cc @johngmyers @justinsb

@hakman hakman requested review from hakman and removed request for rdrgmnzs and KashifSaadat April 9, 2021 05:51
@johngmyers
Copy link
Member

johngmyers commented Apr 10, 2021

My main concern with this is how the admin is supposed to recover if the command fails halfway through, for example with a power loss on the machine running kops.

I designed the similar WIP PR #9556 with smaller, atomic steps. In that PR, to fully distrust the previous keys one has to run the command three times, waiting enough time for tokens to be renewed between each run.

Like that PR, it might be easier to simplify by always keeping a "next CA" (and possibly "previous CA") in the trust list.

Another idea is to put a hash of the relevant CA set in userdata, so worker and/or control plane nodes will be detected as needing update by the rolling update code. (EDIT: Actually, this probably already happens as a side-effect of the changing config)

@olemarkus
Copy link
Member Author

We rotate the cluster twice in this case, once with a CA bundle where both certs are trusted, and once with only one the new CA after everything trust the new CA.

I have not managed to end up in a state where the cluster is broken. Only manage to lock myself out by explicitly exporting admin credentials signed with the new CA too early, before the api servers trust it.

@olemarkus
Copy link
Member Author

@johngmyers I see your PR is about SA signing token keys, which is much more messy as all api servers and all pods should rotate at exactly the same time. There is kubernetes/kubernetes#20165 but it is pretty much unresolved

@johngmyers
Copy link
Member

@olemarkus my PR is about graceful rotation of SA signing token keys. It is not necessary for all API servers and pods to rotate at the same time.

Since kubernetes/kubernetes#20165 was filed the --service-account-key-file flag started taking multiple keys.

@olemarkus
Copy link
Member Author

Ah wasn't aware of that. Thanks. I need to have a look on how that one works then.

@johngmyers
Copy link
Member

@olemarkus Would you be willing to hold this for a while to give #11204 a chance to land first? I believe that puts down some infrastructure that would improve this PR.

@olemarkus
Copy link
Member Author

Yeah. I like the ability to decide when to move to the next key rather than always using the highest number. I think that may require an additional roll of the control plane though, as on the first roll, the kube-controller-manager would not use the new keys. Unless we tell it to always use the latest cert even if it is not primary.

@k8s-ci-robot
Copy link
Contributor

@olemarkus: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 13, 2021
}

if !model.UseKopsControllerForNodeBootstrap(cluster) {
return fmt.Errorf("only clusters using kops-controller for boostrapping nodes are supported")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also need to disallow for !UseEtcdManager && UseEtcdTLS

}

//Update service accounts to trust old and new CA
err = rotateCAUpdateServiceAccounts(ctx, cluster, caBundle)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be done after staging the new cert to the control plane. Otherwise, between the time this runs and the time the control plane picks up the new cert, controller-manager could write a new service account secret containing only the old CA.

//Update masters. This will issue new certs for k8s using the new CA.
//New nodes, service accounts etc will use new CA
ruo.InstanceGroupRoles = []string{"master", "apiserver"}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the new certs/keys currently take effect without an apply_cluster, this is undesirable and there are plans to move more of this stuff to userdata and/or versioned files in the store. We should ensure that changing either the set of trusted CAs or the active CA causes the control plane nodes to be treated as "NeedsUpdate". We should also allow for a future need to do an apply_cluster.


klog.Info("rotating the control plane")

//Update masters. This will issue new certs for k8s using the new CA.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//Update masters. This will issue new certs for k8s using the new CA.
// Update control plane. This will issue new certs for k8s using the new CA.

return fmt.Errorf("failed to rotate cluster: %v", err)
}

klog.Info("rotating the control plane")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new certificate-authority-data needs to be propagated to all the kubeconfig files that live outside the cluster before proceeding to change the signing CA. This is part of the reason I don't think this should be an "all in one go" type of command.

return fmt.Errorf("only clusters using kops-controller for boostrapping nodes are supported")
}

exportAdmin := clientConfig.Contexts[contextName].AuthInfo == contextName
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to be a bit too heuristic. It should check that the corresponding AuthInfo has nonempty ClientCertificateData

}

if len(pool.Secondary) > 0 {
klog.Info("Secondary CA cert already in the pool. Not issuing a new CA")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the previous run died while removing trust from the old CA?

@johngmyers
Copy link
Member

Another reason I don't like an all-in-one command: We have a workload (rhymes with CrewBeeper) which frequently fails on a node rolling update. For this reason, we only do node rolling updates with a locally written controller. That controller implements local constraints, such as notifying the relevant workload's maintainers and delaying the update of their instance group until they can be on deck to fix the resulting failure of their workload. So we would not want the command that updates the keys to perform the rolling update.

@olemarkus
Copy link
Member Author

/close

Closing in favour of the great work done by John!

@k8s-ci-robot
Copy link
Contributor

@olemarkus: Closed this PR.

In response to this:

/close

Closing in favour of the great work done by John!

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/documentation area/nodeup cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Automate rotation of cluster secrets Certificate renewal Provide full secret rotation
4 participants