Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reload certificate periodically #280

Merged
merged 1 commit into from
Oct 11, 2022

Conversation

gthao313
Copy link
Member

when cert-manager rotates the certificate brupop-apiserver-certificate (which by default will be after 60 days), brupop apiserver needs to reload certificate to bounce all the agent and APIServer pods.

Issue number:
Closes: #233

Description of changes:

Author: Tianhao Geng <tianhg@amazon.com>
Date:   Mon Oct 10 17:46:26 2022 +0000

    Reload certificate periodically

    when cert-manager rotates the certificate brupop-apiserver-certificate
    (which by default will be after 60 days), brupop apiserver needs to reload
    certificate to bounce all the agent and APIServer pods.

Add a mechanism to monitor if certificate has been renewed. If yes, stop servers and apiserver will be restarted to reload new certificate. Agent doesn't need do anything since the logic in brupop that agent will always use current certificate on volume to talk with apiserver.

Testing done:

  1. Set Certificate to be expired on 7 mins
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: brupop-apiserver-certificate
  namespace: brupop-bottlerocket-aws
spec:
  isCA: true
  commonName: my-selfsigned-ca
  secretName: brupop-tls
  renewBefore: 2159h53m
  1. Run integration test to monitor if all nodes still are updated to latest version.

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

"Certificate has been renewed, restarting server to reload new certificate"
);
server.stop(true).await;
std::process::exit(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like to see a process exit buried so deeply in a program. Ideally you would return a bool (or enum) that would say whether or not an exit is needed. Bubble that all the way up to main if possible. Ideally only main should have process exits IMO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After testing, I think we don't need process exit since server.stop(true).await; will restart apiserver. I'll just remove this std::process::exit(0). thanks

@gthao313
Copy link
Member Author

rebase

Copy link
Contributor

@jpmcb jpmcb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! 🕺🏼

Comment on lines 51 to 63
pub mod mod_error {
use snafu::Snafu;

#[derive(Debug, Snafu)]
#[snafu(visibility(pub))]
pub enum Error {
#[snafu(display(
"IO error occurred while attempting to use APIServerClient: '{}'",
source
))]
IOError { source: Box<dyn std::error::Error> },
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this error be better placed in the node/error.rs file with the rest of this modules errors? (also see comment above where we seem to already have a Result type for this module

use std::io::Read;

/// The module-wide result type.
type Result<T> = std::result::Result<T, mod_error::Error>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this already defined here?

pub type Result<T> = std::result::Result<T, Error>;

@@ -101,6 +104,9 @@ pub async fn run_server<T: 'static + BottlerocketShadowClient>(
k8s_client: kube::Client,
prometheus_exporter: Option<opentelemetry_prometheus::PrometheusExporter>,
) -> Result<()> {
let public_key_path = format!("{}/{}", TLS_KEY_MOUNT_PATH, PUBLIC_KEY_NAME);
let certificate_cache = read_certificate(&public_key_path).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's replace this unwrap with a structured error in the API module (reference: bottlerocket-os/bottlerocket@5a58d4b)

Comment on lines 245 to 248
// Certificate is refreshed periodically (default 60 days). Once certificate is renewed, apiserver
// needs to stop the server for reloading new certificate.
// We cache the certificate initially when brupop starts server, and compare to update-to-date certificate periodically.
// If they aren't match, we recognize it as new certificate is coming out, so the server needs to be restarted.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few nits here:

Suggested change
// Certificate is refreshed periodically (default 60 days). Once certificate is renewed, apiserver
// needs to stop the server for reloading new certificate.
// We cache the certificate initially when brupop starts server, and compare to update-to-date certificate periodically.
// If they aren't match, we recognize it as new certificate is coming out, so the server needs to be restarted.
// The certificate is refreshed periodically (default 60 days). Once the certificate is renewed, the apiserver
// needs to stop in order to reload the new certificate.
// We cache the certificate initially when brupop starts the server, and compare it to the update-to-date certificate periodically.
// If they don't match, we recognize it as a new certificate, so the server needs to be restarted.

@gthao313
Copy link
Member Author

push above address @jpmcb comments

when cert-manager rotates the certificate brupop-apiserver-certificate
(which by default will be after 60 days), brupop apiserver needs to reload
certificate to bounce all the agent and APIServer pods.
Copy link
Contributor

@jpmcb jpmcb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥇

@gthao313 gthao313 merged commit 0ec8a9a into bottlerocket-os:develop Oct 11, 2022
// We cache the certificate initially when brupop starts the server, and compare it to the update-to-date certificate periodically.
// If they don't match, we recognize it as a new certificate, so the server needs to be restarted.
async fn reload_certificate(
server: Server,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to comment on an old PR, but I was catching up on some of the stuff that's changed in brupop!

I wonder if it makes more sense to make this a mutable reference? It seems unusual to clone it. Although, I suppose if it works, Actix must have done some work to make sure that the interior state of the Server is thread-safe.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually find out that using https://github.com/notify-rs/notify to watch the fs change and reload the certificate is what other people like to do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cbgbt - This has actually changed recently: 561e6c0#diff-5cad36d03cb8802a9c5a0a20ae2e841d683e1cd564146c441e2940fb1ecf652eR255

A few APIs changed in Actix so cloning the server and using it directly wasn't a great option. Instead, now, we can use a actix_web::dev::ServerHandle as a "server handler" to call into it and stop it like:

 server_handler.stop(true).await;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reload certificate periodically
5 participants