Key rotator #3136

Merged: 25 commits merged into main from inahga/global-hpke (Jun 20, 2024)
Conversation

@inahga (Contributor) commented May 15, 2024

Supports #2147.

Introduces the key_rotator, which is a utility for managing the lifecycle of global HPKE keys. It will bootstrap keys, and rotate them according to the rotation policy in the config.

Keys are unique by their HPKE ciphersuite. For each ciphersuite, a key is run through the pending->active->expired->deleted lifecycle. It is permissive of some manual operator changes, e.g. if a manual key rotation needs to be executed through janus_cli or the aggregator API.
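
For illustration, the lifecycle might be modeled as a state enum along these lines; this is a hedged sketch with hypothetical names, not necessarily the actual key_rotator types:

// Sketch only: names are illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum HpkeKeyState {
    Pending, // generated, but not yet advertised to clients
    Active,  // advertised, used to decrypt incoming reports
    Expired, // no longer advertised, retained so in-flight reports still decrypt
}
// Expired keys that have aged out are deleted from the datastore entirely.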

In this iteration, it runs as a standalone binary for use in a cronjob. It is suitable for deployment in our environments, including taskprov ones. A future PR will add support for BYOH deployments by letting the aggregator process run the key rotator.

@inahga force-pushed the inahga/global-hpke branch from 20ad856 to ece604c (May 16, 2024)
@inahga force-pushed the inahga/global-hpke branch 2 times, most recently from d1a1d04 to 03f1748 (May 24, 2024)
@inahga (author) commented May 24, 2024

I still need to plumb this through to a CLI command, but my next iteration on the logic in aggregator/src/aggregator/key_rotator.rs is worth some early re-review, if you have some time @divergentdave.

@divergentdave (Collaborator) left a comment:

Looking good!

@inahga (author) commented May 28, 2024

One edge case currently unhandled is when a ciphersuite is removed from the config: the key is then no longer managed by the key rotator at all. (Edit: fixed.)

The use case for this is to retire old/undesired algorithms. I think we ought to add a parameter retire: true which automatically phases out a key, i.e. discontinues renewing it and quickly moves it to the expired state. This logic should be simple enough, and takes out manual toil if we need to retire a ciphersuite.
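
A hypothetical shape for such a flag, as a per-ciphersuite policy entry; field names and types here are illustrative, not the actual Janus config schema:

use serde::Deserialize;

// Sketch of a per-ciphersuite rotation policy entry with the proposed flag.
#[derive(Debug, Deserialize)]
struct CiphersuitePolicy {
    kem_id: u16,
    kdf_id: u16,
    aead_id: u16,
    /// When true, stop creating replacement keys for this ciphersuite and
    /// move its keys promptly toward expiry and deletion.
    #[serde(default)]
    retire: bool,
}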

@inahga marked this pull request as ready for review and requested a review from a team as a code owner (May 30, 2024)
@inahga changed the title from [WIP] Key rotator to Key rotator (May 30, 2024)
@inahga force-pushed the inahga/global-hpke branch from 760985d to d07215b (May 30, 2024)
Comment on lines 197 to 202
for pending_keypair in pending_keypairs {
// Janus replicas should have never advertised this keypair, so it should be safe
// to delete outright.
info!(id = ?pending_keypair.id(), "deleting pending keypair for retired ciphersuite");
tx.delete_global_hpke_keypair(pending_keypair.id()).await?;
}
@inahga (author) commented:

n.b. we could try_join a lot of calls like this, but since this code is not performance sensitive, I prioritized readability and straightforwardness.

@tgeoghegan (Contributor) left a comment:

It makes sense to have a key rotation component in Janus to handle those keys that Janus itself is aware of, like HPKE configurations or maybe aggregator API tokens. With some work, this could be extended to handle database encryption keys, too.

But in Divvi Up, we have several secrets that we want to manage outside of Janus' scope, such as seeds for OHTTP key configurations. Does this key-rotator component fit into a bigger plan for managing all of Divvi Up's secrets? For example, could we write a higher-level key rotator that execs janus_key_rotator alongside other things?

@tgeoghegan requested a review from a team (May 30, 2024)
@inahga (author) commented May 30, 2024

Does this key-rotator component fit into a bigger plan for managing all of Divvi Up's secrets? For example, could we write a higher-level key rotator that execs janus_key_rotator alongside other things?

I don't see why not. Since this process runs one sweep at a time, we have flexibility in what context the process is run, e.g. cronjob, as part of another process, etc.

@branlwyd (Contributor) left a comment:

First pass -- overall this looks good, lots of nitpicky comments. I'm still thinking through whether the "core" rotation logic can be simplified.

Duration::from_seconds(GlobalHpkeKeypairCache::DEFAULT_REFRESH_INTERVAL.as_secs() * 2)
}

/// 4 weeks. This is long enough not to be unnecessary churn, but short enough that misbehaving
Contributor commented:

bikeshedding: 4 weeks is a pretty fast key rotation. Maybe something more like 1 year?

@inahga (author) commented May 31, 2024:

I was hoping that a short rotation would force clients to have a proper key fetching and caching strategy, i.e. the surprises would happen on a 1-month timespan rather than a 1-year timespan, and we would discourage any permanent retention of the key.

Short rotation also validates whether all this stuff works in production, so that we're not shocked 1 year later when we've completely forgotten how this tool works.

Also, a lower key lifespan lowers the blast radius of a leaked key, so if we have the automation to make it work, we might as well keep the lifespan low.

Contributor commented:

TBH, I don't buy the blast radius argument: if a key is leaked, we will act quickly to rotate it (perhaps once we resolve whatever issue allowed it to be leaked in the first place); we can do so because we centrally control which keys can be used. If we aren't aware that the key is leaked, the attacker can plausibly leak the next key in the same way since we will not have remediated the leak, so fast rotation may not help.

I imagine the blast radius argument is ported over from the PKI world, where my counterargument wouldn't apply because the PKI world does not have a good revocation story, and therefore does not have a good rotation story.

Still, if that's not convincing, let's make sure we communicate with all deployed folks before we enable key rotation in their respective environments.

@inahga (author) commented:

That's fair, but I do feel strongly that 1 year is too long, for at least the other reasons. Can split the difference though, 3 months? 6 months?

Still, if that's not convincing, let's make sure we communicate with all deployed folks before we enable key rotation in their respective environments.

👍

pub async fn run(&self) -> Result<(), Error> {
self.datastore
.run_tx("global_hpke_key_rotator", |tx| {
let configs = self.hpke.clone();
Contributor commented:

super-nit: the config is read-only. wrap it in an Arc to make this cloning cheaper.

@inahga (author) commented:

Can you elaborate?

let configs = Arc::new(self.hpke.clone())

doesn't strike me as cheaper.
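
Presumably the intent is to hold the config behind an Arc at construction time, so that each per-transaction clone is a reference-count bump rather than a deep copy. A sketch under that assumption, with hypothetical types:

use std::sync::Arc;

struct HpkeConfigs; // stand-in for the real, read-only config type

struct KeyRotator {
    hpke: Arc<HpkeConfigs>, // wrapped once, at construction
}

impl KeyRotator {
    fn run_sketch(&self) {
        // Each transaction attempt now clones a pointer, not the config itself.
        let configs = Arc::clone(&self.hpke);
        let _ = configs;
    }
}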


.await?;
}
} else {
let to_be_expired_keypairs: Vec<_> = active_keypairs
Contributor commented:

I think I follow the logic here, but it's pretty complicated & procedural. I might just be missing an obvious policy, but it also seems somewhat ad-hoc. It is also challenging to be completely confident in this logic, as the logic switches between working with in-memory representations & actually touching the DB -- the in-memory representations will be out of sync with the state of the DB after we start touching the DB, so it's somewhat challenging to reason about this code.

I would suggest writing an explicit human-language policy for rotation, then implementing that policy. If there's no simple policy that matches the current code, a per-ciphersuite policy that might work would be (cribbing from the prio-server key rotator):

  1. For each pending key, if the pending key is at least age active_min_age, promote it to active.
  2. If the ciphersuite is configured, and there are no pending/active keys or the newest pending/active key is at least age creation_min_age, create a new pending key.
  3. While there are more than maximum_active_keys active keys, expire the oldest active key. [I think maximum_active_keys would typically be set to 1, and would be treated as 0 for unconfigured ciphersuites. Maybe this doesn't need to be a config knob at all & can always be treated as 1 for configured ciphersuites, 0 for unconfigured?]
  4. For each expired key, if the expired key is at least age deletion_min_age, delete it.

(It would be good to write this policy down in comments, near the code which implements each step. This policy is just an example that I think works, but maybe we can come up with a simpler one? Especially the special-casing around unconfigured ciphersuites might permit a simpler policy.)

To make the code clearer/more obviously correct, I suggest structuring it into three phases:

  1. Read the existing DB state into an in-memory structure.
  2. Manipulate the in-memory structure to perform rotation (in-memory). Ideally this would remove the "we have a bunch of in-memory stuff we make decisions on, but which is out of sync with changes we've already written to the DB" property the current code has, as decisions would be made at each step based on the current state of the in-memory structure, which would reflect the operations already performed on the in-memory structure.
  3. Write the in-memory structure back to the DB, replacing the existing contents of the DB. This could be done by keeping track of a diff in the in-memory structure, generating a diff based on the in-memory structure, or even wiping the table & rewriting it with each rotation.

I would also like it if the handling for configured vs unconfigured ciphersuites was unified, as long as that doesn't introduce too much complexity. The suggested policy above does an OK job of this IMO, but maybe we can do better?
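
A minimal in-memory sketch of the suggested policy, under simplified assumptions (ages are tracked per key as seconds since the last state change; real code would operate on datastore rows). All names here are illustrative:

#[derive(Clone, Copy, PartialEq, Eq)]
enum State { Pending, Active, Expired }

struct Key { state: State, age_s: u64 } // age since last state change, simplified

struct Policy { active_min_age_s: u64, creation_min_age_s: u64, deletion_min_age_s: u64 }

fn sweep(keys: &mut Vec<Key>, configured: bool, p: &Policy) {
    // 1. Promote pending keys of at least age active_min_age.
    for k in keys.iter_mut() {
        if k.state == State::Pending && k.age_s >= p.active_min_age_s {
            k.state = State::Active;
        }
    }
    // 2. If configured, and there are no pending/active keys or the newest
    //    one is at least age creation_min_age, create a new pending key.
    let newest_age = keys.iter()
        .filter(|k| k.state != State::Expired)
        .map(|k| k.age_s)
        .min();
    if configured && newest_age.map_or(true, |age| age >= p.creation_min_age_s) {
        keys.push(Key { state: State::Pending, age_s: 0 });
    }
    // 3. Expire all but the newest active key (zero permitted if unconfigured).
    let max_active = usize::from(configured);
    let mut active: Vec<&mut Key> =
        keys.iter_mut().filter(|k| k.state == State::Active).collect();
    active.sort_by_key(|k| k.age_s); // ascending age: newest first
    for k in active.into_iter().skip(max_active) {
        k.state = State::Expired;
    }
    // 4. Delete expired keys of at least age deletion_min_age.
    keys.retain(|k| !(k.state == State::Expired && k.age_s >= p.deletion_min_age_s));
}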

@inahga (author) commented:

I would suggest writing an explicit human-language policy for rotation, then implementing that policy.

👍

Read the existing DB state into an in-memory structure.
...
Write the in-memory structure back to the DB, replacing the existing contents of the DB.

I had considered this early on but chose not to, since I felt the logic was simple enough (though borderline) that we could avoid the extra code.

In any case, I'm amenable to this idea. I was careful to not rely on fields that would have been outdated, but perhaps it's unnecessary mental overhead to have to consider the current table state versus in-memory state for each operation.

I would also like it if the handling for configured vs unconfigured ciphersuites was unified, as long as that doesn't introduce too much complexity.

I disagree that we should handle unconfigured ciphersuites like we do configured ones. IMO if we encounter an unconfigured ciphersuite we should swiftly phase it out. This strikes me as the least surprising behavior.

That is, as an operator, the least surprising result of removing a ciphersuite from the configuration would be that the respective keys are removed from the system. We, as developers, balance that out by retaining the expired key for as long as we have to, but still immediately remove the key from advertisement.

The alternative approaches, i.e. continue rotating the key or let the key run out its lifespan, are surprising results, because now the operator has to go in and make manual changes to actually remove the key from the system.

Or do you mean more mechanically, i.e. if we can sweep unconfigured keys in the same set of logic as configured keys?

@branlwyd commented Jun 3, 2024:

I disagree that we should handle unconfigured ciphersuites like we do configured ones. IMO if we encounter an unconfigured ciphersuite we should swiftly phase it out. This strikes me as the least surprising behavior.

That is, as an operator, the least surprising result of removing a ciphersuite from the configuration would be that the respective keys are removed from the system. We, as developers, balance that out by retaining the expired key for as long as we have to, but still immediately remove the key from advertisement.

OK -- I don't object to expiring keys quickly; I think that can be done without making the policy too complicated. (N.B. the suggested policy would immediately expire any active keys for an unconfigured ciphersuite.)

I do think we want to let expired keys for unconfigured ciphersuites time out before deletion, but I think we agree on that point.

Or do you mean more mechanically, i.e. if we can sweep unconfigured keys in the same set of logic as configured keys?

Yes, this would be nice. The current logic has many special cases; while we do need (effectively) a ciphersuite_configured signal, I think it should basically feed into the same set of steps used by configured ciphersuites. Basically I want to avoid having two different codepaths for configured vs unconfigured ciphersuites.

@inahga force-pushed the inahga/global-hpke branch 2 times, most recently from c770fa6 to 21fbf89 (June 5, 2024)
@inahga (author) commented Jun 5, 2024

I still need to finagle migrations and test this in kind, but it's ready once again for a deep review. @divergentdave this underwent significant changes since your review, so it might be worth a re-review.

@inahga requested review from branlwyd and divergentdave (June 5, 2024)
@inahga force-pushed the inahga/global-hpke branch 4 times, most recently from f3dc669 to e473524 (June 5, 2024)
.keypairs
.iter()
.filter_map(|(&id, keypair)| {
let ciphersuite = self.config.ciphersuites.get(&keypair.ciphersuite());
Contributor commented:

nit: ciphersuite is effectively a boolean, tracking whether the ciphersuite is configured or not -- everywhere that uses ciphersuite uses it in a match call. Instead of matching each place it is used, change this to a contains() call & write the logic to use a boolean ciphersuite_configured directly? Anywhere that uses the ciphersuite extracted from the match can use keypair.ciphersuite(), perhaps stored in a variable named ciphersuite.
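
i.e., something like the following fragment, assuming ciphersuites is a HashSet as elsewhere in the config:

let ciphersuite = keypair.ciphersuite();
let ciphersuite_configured = self.config.ciphersuites.contains(&ciphersuite);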

.filter(|keypair| keypair.ciphersuite() == *ciphersuite)
.collect();

let pending_key_ready = keypairs.iter().any(|keypair| {
Contributor commented:

This is pretty fragile: it needs to match the logic in the Pending case. You got it right (I think), but later changes could easily break this, leading to us expiring all active keys (or, less seriously, advertising two active keys for a given ciphersuite at a time).

In general, the operation for each keypair is computed entirely independently of each other keypair, so "lockstep" changes (such as promoting a pending keypair to active & simultaneously demoting the old active keypair to expired) have to "agree" in the logic of their processing. [1]

I suggest an "imperative" implementation strategy like the following. First, split the stored keypairs by ciphersuite so that you have a list of keypairs per ciphersuite, and process each ciphersuite's keypairs independently. For each ciphersuite's keypairs:

  1. Clone the list of keypairs (or perhaps map it into (id, state, last_state_change_at)).
  2. Apply each policy rule iteratively, implemented by actually updating the in-memory state of the keypairs from the previous step. (An order that would work would be: promote old-enough pending to active -> demote old active to expired -> delete old expired; if implemented properly, I think we'd get the correct behavior for non-configured ciphersuites without needing special steps for that case.) This way, each update would be apparent to the later steps.
  3. Diff the in-memory state against the original state to determine which operations to perform against the DB.
  4. Finally, perform the DB operations.

I think this implementation strategy, other than being more obviously correct, would also be easier to follow & simpler to write down. WDYT? (If you disagree, definitely push back -- you've been staring at this problem longer than I have at this point, so if you think this is a bad idea, maybe it is.)

[1] Well, they could look at the operation computed for the other keypairs, but IMO this is more confusing than the suggested imperative update method; and this would also depend on the order that the keypairs are processed in.

@inahga (author) commented Jun 7, 2024:

It's a more-than-plausible idea, but I'm not sure if there's enough benefit to justify additional work to iterate on this again, since the changes proposed seem rather sweeping. (Edit: perhaps not as sweeping as I first thought, but my question remains; I'm amenable to trying it if you feel strongly that it would be an improvement.)

Is the only problem with this code the duplicated logic between Pending and Active? If so, I think it can be factored out into a small function.

but later changes could easily break this, leading to us expiring all active keys (or, less seriously, advertising two active keys for a given ciphersuite at a time).

Unit tests ought to catch any serious regressions like this--I think they should be covering this arm pretty well. LMK if they don't, though.

(I'd like it if the defensive assertion could validate one active keypair per ciphersuite, but the bootstrap case confounds that, since there could be a pending key but no active key.)

I think we'd get the correct behavior for non-configured ciphersuites without needing special steps for that case.

I don't see how, since the policy for unconfigured ciphersuites is explicitly different than configured ones, i.e. immediate phase out without replacement. If we were treating the two the same, then yes I think this solution would be good for that.

Contributor commented:

Yeah, my concern was around the fragility of duplicating logic with the implementation strategy of handling each keypair independently.

Looking at this again, I do think it would be clearer to handle each ciphersuite independently instead (i.e. handle all keypairs for a given ciphersuite together), but I'm OK leaning on the tests to provide guarantees that things work OK.

@inahga (author) commented:

I've adopted your strategy after all, after @divergentdave noticed a bug related to preconditions not matching. Let me know if this is closer to what you're looking for.

Note that this implementation never writes to the in-memory state; instead, we derive a list of Ops from the current table state and apply them to the database. I think if we can do it this way, we should, since the logic is easier to follow without having to consider intermediate state, i.e. we can think of sweep() as "the operations necessary to get the table to the desired state, given the initial state".

This implementation also turns out to be less indentation-riffic.
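
A sketch of that shape, with stand-in types where the real Janus names aren't shown here:

// Stand-ins for the real datastore types, for illustration only.
#[derive(Debug)]
struct HpkeCiphersuite;
#[derive(Debug)]
struct KeypairId(u8);

// sweep() inspects a snapshot of the table and yields the operations needed
// to reach the desired state; the ops are then applied to the database in
// the same transaction, so no intermediate in-memory mutation is needed.
#[derive(Debug)]
enum Op {
    Create(HpkeCiphersuite), // insert a new pending keypair
    Promote(KeypairId),      // pending -> active
    Expire(KeypairId),       // active -> expired
    Delete(KeypairId),       // remove an aged-out expired keypair
}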

@inahga (author) commented:

Note that this implementation never writes to the in-memory state, instead we derive a list of Ops from the current table state, and apply them to the database.

OK, well, I might have to back out that change if we adopt the property-based testing suggested by @divergentdave. To have tests complete in a reasonable amount of time, we'll have to check in-memory state rather than making obnoxious amounts of database roundtrips.

@inahga force-pushed the inahga/global-hpke branch from eeebd1c to c9c6d30 (June 17, 2024)
Quickcheck was bringing up the global `log`ger, conflicting with
`install_test_trace_subscriber()`.
@@ -12,6 +12,12 @@ pub mod diagnostic;
pub mod metrics;
pub mod trace;

#[cfg(test)]
extern crate quickcheck;
Contributor commented:

extern crate is pretty archaic; can we `use quickcheck` instead?

@inahga (author) commented:

Ah, we totally can. I copypasta'd from the examples without being critical about it.
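
i.e., something like the following; the exact import list is illustrative and depends on what the tests actually use:

#[cfg(test)]
use quickcheck::{Arbitrary, Gen, TestResult};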

# 1 hour.
pending_duration_s: 3600

# The TTL of keys. Defaults to 4 weeks.
Contributor commented:

Suggested change:
- # The TTL of keys. Defaults to 4 weeks.
+ # The TTL of keys. Defaults to 12 weeks.

Contributor commented:

(nit, I believe this matches what is written in code)


key_rotator:
# Rotation policy for global HPKE keys
hpke: {}
Contributor commented:

nit: I think this config is so simple because it is attempting to be as minimal as possible, but I think this would effectively just remove any existing keys.

Maybe specify only ciphersuite, with a common choice of configuration values? (P256HkdfSha256, HkdfSha256, Aes128Gcm, perhaps?)

@inahga (author) commented:

An empty value should default to a configuration with a reasonable ciphersuite:

fn default_hpke_ciphersuites() -> HashSet<HpkeCiphersuite> {
    HashSet::from([HpkeCiphersuite::new(
        HpkeKemId::X25519HkdfSha256,
        HpkeKdfId::HkdfSha256,
        HpkeAeadId::Aes128Gcm,
    )])
}

On another look, I'm not happy with the test coverage on this, though, so I'll at minimum add a test.

.filter(|(_, keypair)| &keypair.ciphersuite() == ciphersuite)
.collect();
if inserted.len() != 1 {
return TestResult::error("more than one key inserted for new ciphersuite");
Collaborator commented:

These should be TestResult::failed() instead (the reason could go in comments or log messages)

@inahga (author) commented:

Can you clarify why? TestResult::error() looks the same as TestResult::failed(), except it sets a friendly error message (https://docs.rs/quickcheck/latest/src/quickcheck/tester.rs.html#216). That way, if a test fails, it points us directly to the failing line rather than our having to run with RUST_LOG.

Collaborator commented:

The main reason is just the semantics of test failure versus runtime errors. It looks like we have only one or two test-failure code paths per quickcheck property. The more important output upon failure is the input, which can be turned around into a reproducer unit test. (That reminds me, we may want to override Arbitrary::shrink to get nicer failing inputs, but I think the implementation is bulletproof enough now that it's not worth doing.)
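
For reference, a shrink override might look roughly like this; RotatorInput is a hypothetical stand-in for the real test input type:

use quickcheck::{Arbitrary, Gen};

#[derive(Clone, Debug)]
struct RotatorInput {
    key_count: usize, // hypothetical stand-in for the real test input
}

impl Arbitrary for RotatorInput {
    fn arbitrary(g: &mut Gen) -> Self {
        RotatorInput { key_count: usize::arbitrary(g) % 16 }
    }

    // Yield progressively smaller inputs so quickcheck reports a minimal
    // reproducer instead of the first (often large) failing case.
    fn shrink(&self) -> Box<dyn Iterator<Item = Self>> {
        Box::new(self.key_count.shrink().map(|key_count| RotatorInput { key_count }))
    }
}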

@inahga (author) commented:

In both the error() and failed() cases we're presented with the input variables; only with error() are we given the additional line Error: expired key was not deleted.

e.g.

# cargo test hpke_key_rotator_expired_keys
   Compiling janus_core v0.7.19 (/home/inahga/Projects/janus/core)
   Compiling janus_aggregator_core v0.7.19 (/home/inahga/Projects/janus/aggregator_core)
   Compiling janus_aggregator_api v0.7.19 (/home/inahga/Projects/janus/aggregator_api)
   Compiling janus_aggregator v0.7.19 (/home/inahga/Projects/janus/aggregator)
    Finished `test` profile [unoptimized + debuginfo] target(s) in 58.37s
     Running unittests src/lib.rs (/home/inahga/Projects/janus/target/debug/deps/janus_aggregator-4c9e31edbd742197)

running 1 test
test aggregator::key_rotator::tests::hpke_key_rotator_expired_keys ... FAILED

failures:

---- aggregator::key_rotator::tests::hpke_key_rotator_expired_keys stdout ----
thread 'aggregator::key_rotator::tests::hpke_key_rotator_expired_keys' panicked at /home/inahga/.cargo/registry/src/index.crates.io-6f17d22bba15001f/quickcheck-1.0.3/src/tester.rs:165:28:
[quickcheck] TEST FAILED (runtime error). Arguments: [the arguments, really long, somewhat justifying fixing `Arbitrary::shrink`.]
Error: expired key was not deleted
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    aggregator::key_rotator::tests::hpke_key_rotator_expired_keys

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 193 filtered out; finished in 0.05s

@inahga merged commit 85172b6 into main on Jun 20, 2024 (7 of 8 checks passed)
@inahga deleted the inahga/global-hpke branch (June 20, 2024)