RFC about certificate handling #24

Merged · 3 commits · Sep 26, 2023
`rfc/0018-certificate-handling.md` (258 additions, 0 deletions)
| | |
| :----------- | :----------------------------------------------- |
| Feature Name | Certificate Handling |
| Start Date | Aug 11 2023 |
| Category | enhancement, feature |
| RFC PR | [#24](https://github.com/kubewarden/rfc/pull/24) |
| State | **ACCEPTED** |

# Summary

[summary]: #summary

The goal of this RFC is to summarize the usage of certificates inside of the
Kubewarden stack and to propose a way to manage them without
having to resort to external dependencies like CertManager.

# Motivation

The first goal of this RFC is to provide a picture of the current state of
certificate handling inside of Kubewarden: how many certificates are used,
by which components, which CA signed them, and how they have been generated.

The second goal is to come up with an architecture that removes the external
dependency on CertManager. The new solution should also take
certificate rotation into account.

## Examples / User Stories

> As a Kubernetes operator, I don't want to install CertManager
> in order to deploy Kubewarden.

> As a Kubernetes operator,
> I want Kubewarden certificates to be automatically rotated
> before they reach their expiration date.

# Detailed design

## Components that make use of TLS certificates

All the certificates used by the Kubewarden stack are internal to the cluster.
All of them are used to secure the communication between the API Server and different
HTTPS endpoints.

These are the HTTP servers managed by Kubewarden that require TLS termination:

- kubewarden-controller: this is a Validating and Mutating webhook endpoint that
processes the Kubewarden CRDs (like `PolicyServer`, `ClusterAdmissionPolicy`
and `AdmissionPolicy`)
- Policy Server: each Deployment has its own dedicated certificate

> **Note:** all these endpoints are either Validating or Mutating webhook
> endpoints for Kubernetes

Currently (Kubewarden 1.6 at the time of writing), the certificate used by the
kubewarden-controller is generated by CertManager.
The certificates used by the Policy Server instances, on the other hand,
are generated by our controller.

The kubewarden-controller generates a self-signed Certificate Authority (CA) that
is then used to sign all the Policy Server certificates.

## Webhook Configuration and Certificates

Both `ValidatingWebhookConfiguration` and `MutatingWebhookConfiguration`
resources have a
[`clientConfig`](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.27/#webhookclientconfig-v1-admissionregistration-k8s-io)
object inside of their definition.
The `clientConfig.caBundle` field holds the CA bundle required to
verify the TLS termination of the webhook endpoint:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
webhooks:
  - name: my-webhook.example.com
    clientConfig:
      caBundle: <CA_BUNDLE>
      service:
        namespace: my-service-namespace
        name: my-service-name
        path: /my-path
        port: 1234
```

We have one `MutatingWebhookConfiguration` resource used by the kubewarden-controller which
holds a reference to the CA managed by CertManager.

Each `ClusterAdmissionPolicy` and `AdmissionPolicy` has a dedicated `(Validating|Mutating)WebhookConfiguration`
object defined. For all of them, the `clientConfig.caBundle` contains the certificate of
the CA generated by our controller.

## Proposed Design

We would like to get rid of CertManager. To do that, we need to change how certificates are
managed for the `kubewarden-controller`.
Going forward, the controller will also take care of generating the certificate used by the kubewarden-controller webhook.
This certificate is going to be signed by the CA that is already created by the controller.
> **Review thread on lines +92 to +97**
>
> **Member:** Will we remove it completely or make it optional? I would like to double-check that, because the original issue around this topic says to turn it optional.
>
> **Member:** I agree on having it optional, and disabled by default.
>
> **Contributor:** +1
>
> **Member (author):** What would be the advantage of having this certificate generated by CertManager? I mean, we're still generating all the internal ones by ourselves.
>
> I would prefer to drop the CertManager dependency entirely, just to reduce the support matrix and keep the code as simple as possible.
>
> **Member:** @flavio does this comment still stand? I thought we wanted to still support cert-manager, but not as default.
>
> **Member (author):** Last week @jvanz, @fabriziosestito and I agreed to focus on removing the cert-manager dependency. We can make it optional in the future if there's some request coming from the community.

```mermaid
flowchart TD
    A[KW Internal CA]
    A --> B(KW controller)
    A --> C(Policy Server #1)
    A --> D(Policy Server #2)
```

## Possible Scenarios

This section covers the different scenarios. For each one, we start by defining an event
and then outline the actions that have to be done.

### Kubewarden Fresh Install

The Kubewarden stack has just been installed. The following actions have to be done:

- Generate Certificate Authority
- Generate certificate for `kubewarden-controller` and sign it with our CA
- Start the Kubewarden controller HTTPS server using this certificate
- Register the webhook against the Kubernetes API, using the certificate of our CA inside of
`clientConfig.caBundle`
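
The first two steps above are plain x509 operations. Below is a minimal sketch using only Go's standard library; the CA name, key type, and validity periods are illustrative assumptions, not the controller's actual implementation:

```go
package certs

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"math/big"
	"time"
)

// generateCA creates a self-signed root Certificate Authority. The 10 year
// lifetime matches the one mentioned in the "Controller Bootstrap" section.
func generateCA() (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "kubewarden-root-ca"}, // illustrative name
		NotBefore:             time.Now(),
		NotAfter:              time.Now().AddDate(10, 0, 0),
		IsCA:                  true,
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageDigitalSignature,
		BasicConstraintsValid: true,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	ca, err := x509.ParseCertificate(der)
	return ca, key, err
}

// generateServingCert issues a leaf certificate for the given in-cluster DNS
// name (e.g. the kubewarden-controller webhook Service) signed by the CA.
func generateServingCert(ca *x509.Certificate, caKey *ecdsa.PrivateKey, dnsName string) ([]byte, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(time.Now().UnixNano()),
		Subject:      pkix.Name{CommonName: dnsName},
		DNSNames:     []string{dnsName},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().AddDate(1, 0, 0), // illustrative 1 year lifetime
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, ca, &key.PublicKey, caKey)
	return der, key, err
}
```

The resulting DER blobs would typically be PEM-encoded and stored in Secrets, as the controller already does for the Policy Server certificates.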

### New Policy Server Defined

When a new `PolicyServer` is defined, the following actions have to be done:

- Generate a new certificate, sign it with our CA
- Create the Policy Server Deployment, ensuring the certificate is mounted inside of the Pods

This is already done by our current code (Kubewarden 1.6.0).

### New `AdmissionPolicy`/`ClusterAdmissionPolicy` is defined

When a new policy is defined inside of the cluster, the following action has to be done:

- Register the webhook against the Kubernetes API, use our CA inside of `clientConfig.caBundle`

> **Note:** it doesn't matter which Policy Server is going to host the policy. All the Policy Server
> instances use a certificate that is signed by our CA. Hence we always put our CA
> inside of the `clientConfig.caBundle`.

This is already done by our current code (Kubewarden 1.6.0).
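
A hedged sketch of keeping `clientConfig.caBundle` in sync using the controller-runtime client; the Secret name `kubewarden-root-ca`, its namespace, and the `ca.crt` data key are illustrative assumptions:

```go
import (
	"context"

	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconcileCABundle makes sure every webhook of the given
// ValidatingWebhookConfiguration trusts our internal root CA.
func reconcileCABundle(ctx context.Context, c client.Client, webhookName string) error {
	// Fetch the Secret that holds the internal root CA certificate.
	caSecret := &corev1.Secret{}
	key := client.ObjectKey{Namespace: "kubewarden", Name: "kubewarden-root-ca"} // assumed names
	if err := c.Get(ctx, key, caSecret); err != nil {
		return err
	}
	caPEM := caSecret.Data["ca.crt"] // assumed key

	webhookCfg := &admissionregistrationv1.ValidatingWebhookConfiguration{}
	if err := c.Get(ctx, client.ObjectKey{Name: webhookName}, webhookCfg); err != nil {
		return err
	}
	for i := range webhookCfg.Webhooks {
		// CABundle holds raw PEM bytes; the API machinery base64-encodes
		// them when the object is serialized.
		webhookCfg.Webhooks[i].ClientConfig.CABundle = caPEM
	}
	return c.Update(ctx, webhookCfg)
}
```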

### Policy Server Certificate Renewed

From time to time we need to renew the certificate issued to a Policy Server instance. The main reason
to perform this operation is to ensure the Policy Server is not using an expired certificate.

To renew a Policy Server certificate, the following actions have to be done:

- Generate a new certificate, sign it with our CA
- Force a rollout of the Policy Server Deployment. This ensures the HTTPS server uses the new certificate

> **Note:** there's no need to touch the webhook configuration objects related to the policies
> hosted by the Policy Server. The new certificate is still signed by the same CA, hence nothing has
> to be changed.
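
Forcing the rollout can be done the same way `kubectl rollout restart` does it: by changing a pod-template annotation, which makes the Deployment controller create a new ReplicaSet. A sketch, where the annotation key is an illustrative assumption:

```go
import (
	"context"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// restartPolicyServer triggers a rolling restart of a Policy Server Deployment
// so that its Pods pick up the freshly issued certificate.
func restartPolicyServer(ctx context.Context, c client.Client, namespace, name string) error {
	deploy := &appsv1.Deployment{}
	if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, deploy); err != nil {
		return err
	}
	if deploy.Spec.Template.Annotations == nil {
		deploy.Spec.Template.Annotations = map[string]string{}
	}
	// Any change to the pod template triggers a rolling update.
	deploy.Spec.Template.Annotations["kubewarden.io/restartedAt"] = time.Now().Format(time.RFC3339)
	return c.Update(ctx, deploy)
}
```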

### Internal CA changes

The internal CA managed by Kubewarden might be recreated (for example, to avoid its expiration). This is the
most disruptive event that can happen, especially once some policies are deployed.

Once the new CA is generated, the following actions have to be performed:

- Generate a new CA bundle that contains the previous internal CA and the new one
- For each webhook configuration managed by Kubewarden (meaning all the policies deployed, plus the kubewarden-controller):
  - Update the `clientConfig.caBundle`: ensure it contains the CA bundle created during the previous step
- For each `PolicyServer` defined:
  - Generate a new certificate, sign it with the new CA
  - Force a rollout of the PolicyServer Deployment
- Kubewarden Controller:
  - Generate a new certificate, sign it with the new CA
  - Restart the controller to ensure its HTTPS endpoint uses the new certificate

Once the rollout of all the new Policy Server instances is done, and the kubewarden-controller is using the new certificate:

- For each webhook configuration managed by Kubewarden (meaning all the policies deployed, plus the kubewarden-controller):
  - Update the `clientConfig.caBundle`: ensure it contains only the certificate of the new CA

By respecting these steps, we can avoid communication failures between the Kubernetes API server and the webhooks.
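
The transitional bundle built in the first step above is simply the previous and the new CA certificates concatenated in PEM form, for example:

```go
// rotationBundle returns a CA bundle that makes the API server trust webhook
// certificates signed by either the old or the new CA while the rollout is
// still in progress.
func rotationBundle(oldCAPEM, newCAPEM []byte) []byte {
	bundle := make([]byte, 0, len(oldCAPEM)+len(newCAPEM))
	bundle = append(bundle, oldCAPEM...)
	return append(bundle, newCAPEM...)
}
```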

## Controller Bootstrap

At startup, the kubewarden-controller must ensure the root CA has been created. Currently (Kubewarden 1.6.0), the
controller creates the root CA only when the first `PolicyServer` object is defined.
We can keep creating a root CA that expires after 10 years.

The code must also ensure the root CA is not about to expire. When this scenario is detected, the controller must renew
the root CA.
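
The expiry check itself is cheap. A minimal sketch, assuming a configurable renewal window; the 90 day value in the usage example is an illustrative assumption, not a decided default:

```go
import (
	"crypto/x509"
	"time"
)

// caNeedsRenewal reports whether the root CA is inside its renewal window
// and should be re-created before it expires.
func caNeedsRenewal(ca *x509.Certificate, window time.Duration) bool {
	return time.Now().After(ca.NotAfter.Add(-window))
}
```

For example, `caNeedsRenewal(rootCA, 90*24*time.Hour)` would trigger the renewal 90 days before expiration.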

## Reconciliation Loops

This section describes how the kubewarden-controller reconciliation loops should be arranged.

Currently we have these reconciliation loops:

- `PolicyServer`
- `ClusterAdmissionPolicy`
- `AdmissionPolicy`

### Policy Server

Reconciliation loops are triggered whenever an event dealing with a watched resource happens.
In this case it could be the create/update/delete of a `PolicyServer`.
The reconciliation loop is also triggered every X seconds as a way to cope with possible glitches
with the event notification system.
> **Review thread on lines +204 to +205**
>
> **Member:** Where is this time-based trigger set up? Is this done by kube-builder? I cannot see this in the code. =(
>
> **Member (author):** Good question, we're currently using controller-runtime version 0.14.6. When creating the Manager we pass it some configuration Options (see here).
>
> One of the values of the Manager Options is called SyncPeriod. Since we don't specify anything, we will have an automatic sync interval of 10 hours. Meaning, even if there's no event affecting one of our watched resources, we will still have a sync every 10 hours.
>
> This configuration parameter has been moved to another location inside of the latest release of controller-runtime. Now it is one of the configuration values of the cache. The behaviour stays the same: every 10 hours you get a full sync.
>
> We can rely on this 10 hour sync, or we could be more explicit inside of our reconciliation loops by returning a `reconcile.Result{RequeueAfter: t}` instead of a `reconcile.Result{}`.
>
> What do you think?
>
> **Contributor:** I'd probably go with the latter. As the docstring states, we shouldn't change the SyncPeriod as it forces every object in the cache to be reconciled, and we might need a shorter window than 10 hours. Not a strong opinion, though.
>
> **Member (author):** I would either leave things untouched, hence rely on the default sync value, or provide an explicit sync timer via the Result response.
>
> **Contributor (@fabriziosestito, Sep 21, 2023):**
>
> > I'd probably go with the latter.
>
> Sorry, I did not explain myself well: I meant that I have a small preference for the Result solution.
>
> **Member (author):** No worries, that was clear to me. I also share the same feeling about how to move forward.
>
> This change (being explicit about the sync time) should be tracked with a dedicated issue, which would be unrelated to this RFC.

At the beginning of the reconciliation loop, the code should check the expiration date of the
certificate used by the Policy Server. If the certificate is about to expire, a new one should be
created. Check the previous section to see which actions have to be done in this circumstance.

The loop should also monitor changes done to the secret that holds the internal CA.
The reconciliation loop should then make sure that the Policy Server certificate has been signed by
our CA. If not, a new certificate should be issued (like during a renewal).

> **Note:** checking if the Policy Server certificate has been issued by our CA might be computationally intense.
> We have to ensure this is not going to cause a CPU spike. We could resort to something clever using
> the revision number of the Secret that holds the CA.
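
A minimal sketch of that check, assuming both certificates have already been parsed from their Secrets:

```go
import "crypto/x509"

// signedByOurCA reports whether cert was issued by our internal CA.
// CheckSignatureFrom performs the signature verification that the note above
// flags as potentially expensive; caching the result against the CA Secret's
// resourceVersion would avoid repeating it on every reconciliation.
func signedByOurCA(cert, ca *x509.Certificate) bool {
	return cert.CheckSignatureFrom(ca) == nil
}
```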

### ClusterAdmissionPolicy

The reconciliation loop should also watch the Secret object that contains the root CA certificate.

During the reconciliation loop, the code must ensure the
`(Validating|Mutating)WebhookConfiguration` object related to the `ClusterAdmissionPolicy`
has the `clientConfig.caBundle` properly configured. The field must contain the certificate of the internal root CA.

### AdmissionPolicy

The reconciliation loop should also watch the Secret object that contains the root CA certificate.

During the reconciliation loop, the code must ensure the
`(Validating|Mutating)WebhookConfiguration` object related to the `AdmissionPolicy`
has the `clientConfig.caBundle` properly configured. The field must contain the certificate of the internal root CA.

# Drawbacks

[drawbacks]: #drawbacks

No significant drawback is introduced by this RFC. Our codebase already handles the certificate
creation for all the Policy Server instances.

On the other hand, we are not yet handling certificate rotation. Hence the implementation of this
RFC will improve the end user experience and make Kubewarden more robust.

# Alternatives

[alternatives]: #alternatives

There are no real alternatives: given the goals stated above, delegating certificate management
to CertManager is not an option. Even if it were, we would still have to deal with some aspects
of certificate management ourselves, like handling the `(Validating|Mutating)WebhookConfiguration`
resources.

# Unresolved questions

[unresolved]: #unresolved-questions

None