-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Key rotator #3136
Key rotator #3136
Conversation
20ad856
to
ece604c
Compare
d1a1d04
to
03f1748
Compare
I still need to plumb this through to a CLI command, but my next iteration on the logic in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good!
The use case for this is to retire old/undesired algorithms. I think we ought to add a parameter |
760985d
to
d07215b
Compare
for pending_keypair in pending_keypairs { | ||
// Janus replicas should have never advertised this keypair, so it should be safe | ||
// to delete outright. | ||
info!(id = ?pending_keypair.id(), "deleting pending keypair for retired ciphersuite"); | ||
tx.delete_global_hpke_keypair(pending_keypair.id()).await?; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
n.b. we could try_join
a lot of calls like this, but since this code is not performance sensitive, I prioritized readability and straightforwardness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense to have a key rotation component in Janus to handle those keys that Janus itself is aware of, like HPKE configurations or maybe aggregator API tokens. With some work, this could be extended to handle database encryption keys, too.
But in Divvi Up, we have several secrets that we want to manage outside of Janus' scope, such as seeds for OHTTP key configurations. Does this key-rotator component fit into a bigger plan for managing all of Divvi Up's secrets? For example, could we write a higher-level key rotator that execs janus_key_rotator
alongside other things?
I don't see why not. Since this process runs one sweep at a time, we have flexibility in what context the process is run, e.g. cronjob, as part of another process, etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass -- overall this looks good, lots of nitpicky comments. I'm still thinking through whether the "core" rotation logic can be simplified.
Duration::from_seconds(GlobalHpkeKeypairCache::DEFAULT_REFRESH_INTERVAL.as_secs() * 2) | ||
} | ||
|
||
/// 4 weeks. This is long enough not to be unnecessary churn, but short enough that misbehaving |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bikeshedding: 4 weeks is a pretty fast key rotation. Maybe something more like 1 year?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was hoping that a short rotation would force clients to have proper key fetching and caching strategy, i.e. the surprises would happen in the 1 month timespan rather than the 1 year timespan, and we would discourage any permanent retention of the key.
Short rotation also validates whether all this stuff works in production, so that we're not shocked 1 year later when we've completely forgotten how this tool works.
Also, lower key lifespan lowers the blast radius of a leaked key, so if we have the automation to make it work we might as well keep the lifespan low.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH, I don't buy the blast radius argument: if a key is leaked, we will act quickly to rotate it (perhaps once we resolve whatever issue allowed it to be leaked in the first place); we can do so because we centrally control which keys can be used. If we aren't aware that the key is leaked, the attacker can plausibly leak the next key in the same way since we will not have remediated the leak, so fast rotation may not help.
I imagine the blast radius argument is ported over from the PKI world, where my counterargument wouldn't apply because the PKI world does not have a good revocation story, and therefore does not have a good rotation story.
Still, if that's not convincing, let's make sure we communicate with all deployed folks before we enable key rotation in their respective environments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair, but I do feel strongly that 1 year is too long, for at least the other reasons. Can split the difference though, 3 months? 6 months?
Still, if that's not convincing, let's make sure we communicate with all deployed folks before we enable key rotation in their respective environments.
👍
pub async fn run(&self) -> Result<(), Error> { | ||
self.datastore | ||
.run_tx("global_hpke_key_rotator", |tx| { | ||
let configs = self.hpke.clone(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
super-nit: the config is read-only. wrap it in an Arc
to make this cloning cheaper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate?
let configs = Arc::new(self.hpke.clone())
doesn't strike me as cheaper.
Duration::from_seconds(GlobalHpkeKeypairCache::DEFAULT_REFRESH_INTERVAL.as_secs() * 2) | ||
} | ||
|
||
/// 4 weeks. This is long enough not to be unnecessary churn, but short enough that misbehaving |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH, I don't buy the blast radius argument: if a key is leaked, we will act quickly to rotate it (perhaps once we resolve whatever issue allowed it to be leaked in the first place); we can do so because we centrally control which keys can be used. If we aren't aware that the key is leaked, the attacker can plausibly leak the next key in the same way since we will not have remediated the leak, so fast rotation may not help.
I imagine the blast radius argument is ported over from the PKI world, where my counterargument wouldn't apply because the PKI world does not have a good revocation story, and therefore does not have a good rotation story.
Still, if that's not convincing, let's make sure we communicate with all deployed folks before we enable key rotation in their respective environments.
.await?; | ||
} | ||
} else { | ||
let to_be_expired_keypairs: Vec<_> = active_keypairs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I follow the logic here, but it's pretty complicated & procedural. I might just be missing an obvious policy, but it also seems somewhat ad-hoc. It is also challenging to be completely confident in this logic, as the logic switches between working with in-memory representations & actually touching the DB -- the in-memory representations will be out of sync with the state of the DB after we start touching the DB, so it's somewhat challenging to reason about this code.
I would suggest writing an explicit human-language policy for rotation, then implementing that policy. If there's no simple policy that matches the current code, a per-ciphersuite policy that might work would be (cribbing from the prio-server
key rotator):
- For each pending key, if the pending key is at least age
active_min_age
, promote it to active. - If the ciphersuite is configured, and there are no pending/active keys or the newest pending/active key is at least age
creation_min_age
, create a new pending key. - While there are more than
maximum_active_keys
active keys, expire the oldest active key. [I thinkmaximum_active_keys
would typically be set to 1, and would be treated 0 for unconfigured ciphersuites. Maybe this doesn't need to be a config knob at all & can always be treated as 1 for configured ciphersuites, 0 for unconfigured?] - For each expired key, if the expired key is at least age
deletion_min_age
, delete it.
(It would be good to write this policy down in comments, near the code which implements each step. This policy is just an example that I think works, but maybe we can come up with a simpler one? Especially the special-casing around unconfigured ciphersuites might permit a simpler policy.)
To make the code clearer/more obviously correct, I suggest structuring it into three phases:
- Read the existing DB state into an in-memory structure.
- Manipulate the in-memory structure to perform rotation (in-memory). Ideally this would remove the "we have a bunch of in-memory stuff we make decisions on, but which is out of sync with changes we've already written to the DB" property the current code has, as decisions would be made at each step based on the current state of the in-memory structure, which would reflect the operations already performed on the in-memory structure.
- Write the in-memory structure back to the DB, replacing the existing contents of the DB. This could be done by keeping track of a diff in the in-memory structure, generating a diff based on the in-memory structure, or even wiping the table & rewriting it with each rotation.
I would also like it if the handling for configured vs unconfigured ciphersuites was unified, as long as that doesn't introduce too much complexity. The suggested policy above does an OK job of this IMO, but maybe we can do better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest writing an explicit human-language policy for rotation, then implementing that policy.
👍
Read the existing DB state into an in-memory structure.
...
Write the in-memory structure back to the DB, replacing the existing contents of the DB.
I had considered this early but chose not to since I felt the logic was simple enough, though borderline, that we could avoid the extra code.
In any case, I'm amenable to this idea. I was careful to not rely on fields that would have been outdated, but perhaps it's unnecessary mental overhead to have to consider the current table state versus in-memory state for each operation.
I would also like it if the handling for configured vs unconfigured ciphersuites was unified, as long as that doesn't introduce too much complexity.
I disagree that we should handle unconfigured ciphersuites like we do configured ones. IMO if we encounter an unconfigured ciphersuite we should swiftly phase it out. This strikes me as the least surprising behavior.
That is, as an operator, the least surprising result of removing a ciphersuite from the configuration would be that the respective keys are removed from the system. We, as developers, balance that out by retaining the expired key for as long as we have to, but still immediately remove the key from advertisement.
The alternative approaches, i.e. continue rotating the key or let the key run out its lifespan, are surprising results, because now the operator has to go in and make manual changes to actually remove the key from the system.
Or do you mean more mechanically, i.e. if we can sweep unconfigured keys in the same set of logic as configured keys?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree that we should handle unconfigured ciphersuites like we do configured ones. IMO if we encounter an unconfigured ciphersuite we should swiftly phase it out. This strikes me as the least surprising behavior.
That is, as an operator, the least surprising result of removing a ciphersuite from the configuration would be that the respective keys are removed from the system. We, as developers, balance that out by retaining the expired key for as long as we have to, but still immediately remove the key from advertisement.
OK -- I don't object to expiring keys for quickly, I think that can be done without making the policy too complicated. (N.B. the suggested policy would immediately expire any active keys for an unconfigured ciphersuite.)
I do think we want to let expired keys for unconfigured ciphersuites time out before deletion, but I think we agree on that point.
Or do you mean more mechanically, i.e. if we can sweep unconfigured keys in the same set of logic as configured keys?
Yes, this would be nice. The current logic has many special cases; while we do need (effectively) a ciphersuite_configured
signal, I think it should basically feed into the same set of steps used by configured ciphersuites. Basically I want to avoid having two different codepaths for configured vs unconfigured ciphersuites.
c770fa6
to
21fbf89
Compare
I still need to finagle migrations and test this in kind, but it's ready once again for a deep review. @divergentdave this underwent significant changes since your review, so it might be worth a re-review. |
f3dc669
to
e473524
Compare
.keypairs | ||
.iter() | ||
.filter_map(|(&id, keypair)| { | ||
let ciphersuite = self.config.ciphersuites.get(&keypair.ciphersuite()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: ciphersuite
is effectively a boolean, tracking whether the ciphersuite is configured or not -- everywhere that uses ciphersuite
uses it in a match call. Instead of matching each place it is used, change this to a contains()
call & write the logic to use a boolean ciphersuite_configured
directly? Anywhere that uses the ciphersuite extracted from the match can use keypair.ciphersuite()
, perhaps stored in a variable named ciphersuite
.
.filter(|keypair| keypair.ciphersuite() == *ciphersuite) | ||
.collect(); | ||
|
||
let pending_key_ready = keypairs.iter().any(|keypair| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pretty fragile: it needs to match the logic in the Pending
case. You got it right (I think), but later changes could easily break this, leading to us expiring all active keys (or, less seriously, advertising two active keys for a given ciphersuite at a time).
In general, the operation for each keypair is computed entirely independently of each other keypair, so "lockstep" changes (such as promoting a pending keypair to active & simultaneously demoting the old active keypair to expired) have to "agree" in the logic of their processing. [1]
I suggest an "imperative" implementation strategy like the following. First, split the stored keypairs by ciphersuite so that you have a list of keypairs per ciphersuite, and process each ciphersuite's keypairs indepdendently. For each ciphersuite's keypairs:
- Clone the list of keypairs (or perhaps map it into
(id, state, last_state_change_at)
). - Apply each policy rule iteratively, implemented by actually updating the in-memory state of the keypairs from the previous step. (An order that would work would be: promote old-enough pending to active -> demote old active to expired -> delete old expired; if implemented properly, I think we'd get the correct behavior for non-configured ciphersuites without needing special steps for that case.) This way, each update would be apparent to the later steps.
- Diff the in-memory state against the original state to determine which operations to perform against the DB.
- Finally, perform the DB operations.
I think this implementation strategy, other than being more obviously correct, would also be easier to follow & simpler to write down. WDYT? (If you disagree, definitely push back -- you've been staring at this problem longer than I have at this point, so if you think this is a bad idea, maybe it is.)
[1] Well, they could look at the operation computed for the other keypairs, but IMO this is more confusing than the suggested imperative update method; and this would also depend on the order than the keypairs are processed in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a more-than-plausible idea, but I'm not sure if there's enough benefit to justify additional work to iterate on this again, since the changes proposed seem rather sweeping. (edit: perhaps not as sweeping as I first thought looking at it, but my question remains, I'm amenable to trying it if you feel strongly that it would be an improvement).
Is the only problem with this code the duplicated logic between Pending and Active? If so, I think it can be factored out into a small function.
but later changes could easily break this, leading to us expiring all active keys (or, less seriously, advertising two active keys for a given ciphersuite at a time).
Unit tests ought to catch any serious regressions like this--I think they should be covering this arm pretty well. LMK if they don't, though.
(I'd like if the defensive assertion could validate one active keypair per ciphersuite, but the bootstrap case confounds that since there could be a pending key but no active key.)
I think we'd get the correct behavior for non-configured ciphersuites without needing special steps for that case.
I don't see how, since the policy for unconfigured ciphersuites is explicitly different than configured ones, i.e. immediate phase out without replacement. If we were treating the two the same, then yes I think this solution would be good for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, my concern was around the fragility of duplicating logic with the implementation stategy of handling each keypair independently.
Looking at this again, I do think it would be clearer to handle each ciphersuite independently instead (i.e. handle all keypairs for a given ciphersuite together), but I'm OK leaning on the tests to provide guarantees that things work OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've adopted your strategy after all, following @divergentdave noticing a bug related to preconditions not matching. Let me know if this is closer to what you're looking for.
Note that this implementation never writes to the in-memory state, instead we derive a list of Op
s from the current table state, and apply them to the database. I think if we can do it this way, we should, since I think the logic is easier to follow without having to consider intermediate state, i.e. we can think of sweep()
as the "operations necessary to get the table to the desired state, given the initial state".
This implementation also turns out to be less indentation-riffic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this implementation never writes to the in-memory state, instead we derive a list of Ops from the current table state, and apply them to the database.
OK, well I might have to back out that change if adopting property based testing suggested by @divergentdave. To have tests complete in a reasonable amount of time, we'll have to check in-memory state rather than taking obnoxious amounts of database roundtrips.
.filter(|keypair| keypair.ciphersuite() == *ciphersuite) | ||
.collect(); | ||
|
||
let pending_key_ready = keypairs.iter().any(|keypair| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, my concern was around the fragility of duplicating logic with the implementation stategy of handling each keypair independently.
Looking at this again, I do think it would be clearer to handle each ciphersuite independently instead (i.e. handle all keypairs for a given ciphersuite together), but I'm OK leaning on the tests to provide guarantees that things work OK.
eeebd1c
to
c9c6d30
Compare
Quickcheck was bringing up the global `log`ger, conflicting with `install_test_trace_subscriber()`.
aggregator/src/lib.rs
Outdated
@@ -12,6 +12,12 @@ pub mod diagnostic; | |||
pub mod metrics; | |||
pub mod trace; | |||
|
|||
#[cfg(test)] | |||
extern crate quickcheck; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extern crate
is pretty archaic; can we use quickcheck
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, we totally can. I copypasta'd from the examples without being critical about it.
# 1 hour. | ||
pending_duration_s: 3600 | ||
|
||
# The TTL of keys. Defaults to 4 weeks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# The TTL of keys. Defaults to 4 weeks. | |
# The TTL of keys. Defaults to 12 weeks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit, I believe this matches what is written in code)
|
||
key_rotator: | ||
# Rotation policy for global HPKE keys | ||
hpke: {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think this config is so simple because it is attempting to be as minimal as possible, but I think this would effectively just remove any existing keys.
Maybe specify only ciphersuite
, with a common choice of configuration values? (P256HkdfSha256
, HkdfSha256
, Aes128Gcm
, perhaps?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An empty ciphersuite
value should default to one with a reasonable ciphersuite.
fn default_hpke_ciphersuites() -> HashSet<HpkeCiphersuite> {
HashSet::from([HpkeCiphersuite::new(
HpkeKemId::X25519HkdfSha256,
HpkeKdfId::HkdfSha256,
HpkeAeadId::Aes128Gcm,
)])
}
At another look I'm not happy with the test coverage on this though, so I'll at minimum add a test.
.filter(|(_, keypair)| &keypair.ciphersuite() == ciphersuite) | ||
.collect(); | ||
if inserted.len() != 1 { | ||
return TestResult::error("more than one key inserted for new ciphersuite"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should be TestResult::failed()
instead (the reason could go in comments or log messages)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clarify why? TestResult::error()
looks the same as TestResult::failed()
except it sets a friendly error message https://docs.rs/quickcheck/latest/src/quickcheck/tester.rs.html#216. That way if a test fails it points us directly to the failing line, rather than having to run with RUST_LOG.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main reason is just the semantics of test failure versus runtime errors. It looks like we have only one or two test failure code paths per quickcheck property. The more important output upon failure is the input, which can be turned around into a reproducer unit test. (That reminds me, we may want to override Arbitrary::shrink
to get nicer failing inputs, but I think the implementation is bullet proof enough now that it's not worth doing)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In both error()
and failed()
cases, we're presented with the input variables, only in error()
we're given an additional line Error: expired key was not deleted
.
e.g.
# cargo test hpke_key_rotator_expired_keys
Compiling janus_core v0.7.19 (/home/inahga/Projects/janus/core)
Compiling janus_aggregator_core v0.7.19 (/home/inahga/Projects/janus/aggregator_core)
Compiling janus_aggregator_api v0.7.19 (/home/inahga/Projects/janus/aggregator_api)
Compiling janus_aggregator v0.7.19 (/home/inahga/Projects/janus/aggregator)
Finished `test` profile [unoptimized + debuginfo] target(s) in 58.37s
Running unittests src/lib.rs (/home/inahga/Projects/janus/target/debug/deps/janus_aggregator-4c9e31edbd742197)
running 1 test
test aggregator::key_rotator::tests::hpke_key_rotator_expired_keys ... FAILED
failures:
---- aggregator::key_rotator::tests::hpke_key_rotator_expired_keys stdout ----
thread 'aggregator::key_rotator::tests::hpke_key_rotator_expired_keys' panicked at /home/inahga/.cargo/registry/src/index.crates.io-6f17d22bba15001f/quickcheck-1.0.3/src/tester.rs:165:28:
[quickcheck] TEST FAILED (runtime error). Arguments: [the arguments, really long, somewhat justifying fixing `Arbitrary::shrink`.]
Error: expired key was not deleted
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
failures:
aggregator::key_rotator::tests::hpke_key_rotator_expired_keys
test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 193 filtered out; finished in 0.05s
Supports #2147.
Introduces the
key_rotator
, which is a utility for managing the lifecycle of global HPKE keys. It will bootstrap keys, and rotate them according to the rotation policy in the config.Keys are unique by their HPKE ciphersuite. For each ciphersuite, a key is run through the pending->active->expired->deleted lifecycle. It is permissive of some manual operator changes, e.g. if a manual key rotation needs to be executed through janus_cli or the aggregator API.
In this iteration, it runs as a standalone binary for use in a cronjob. It is suitable for deployment in our environments, including taskprov ones. A future PR will add support for BYOH deployments by letting the
aggregator
process run the key rotator.