-
Notifications
You must be signed in to change notification settings - Fork 486
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce support to save and load the CA journal from the datastore #4690
Conversation
Signed-off-by: Agustín Martínez Fayó <amartinezfayo@gmail.com>
Signed-off-by: Agustín Martínez Fayó <amartinezfayo@gmail.com>
2819ae5
to
0e9eff7
Compare
Signed-off-by: Agustín Martínez Fayó <amartinezfayo@gmail.com>
0e9eff7
to
2bef377
Compare
@MarcosDY Please start the commentary on this one, it is large and will likely take more time than a normal review. In addition, this functionality only prepares for the next step, which includes no longer maintaining / using the on-disk footprint of the CA Journal. This step can only be performed after this PR has been merged. |
pkg/common/fflag/fflag.go
Outdated
@@ -38,6 +38,10 @@ const ( | |||
// reattestable. In those cases the agent will still renew without reattesting. | |||
FlagReattestToRenew Flag = "reattest_to_renew" | |||
|
|||
// FlagCAJournalInDatastore controls whether or not to store the CA journal | |||
// of the server in the datastore. | |||
FlagCAJournalInDatastore = "ca_journal_in_datastore" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may this be deprecated? since we are going to remove this to make it default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we keep it until we are doing all of our rotation with the database only? I agree it will be deprecated, but it seems like it's too early to remove it, considering this will be experimental in 1.9.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MarcosDY are you suggesting that the flag should be true by default? That would be the process of deprecating a feature flag, and we cannot do that in 1.8.x.
Feature flags are meant to be used by developers while features are still under development or cannot be turned on to all users (like this case, because the schema change required was introduced in 1.8.0).
@edwbuck There is no need to have this feature introduced as an experimental feature in 1.9.0. All users will have the schema changes needed when they either update to 1.9.0 or fresh install 1.9.0.
In 1.9.0, this will be the default behavior to all users.
@@ -1865,7 +1865,7 @@ func (m *fakeCAManager) PrepareJWTKey(context.Context) error { | |||
return m.prepareJWTKeyErr | |||
} | |||
|
|||
func (m *fakeCAManager) RotateJWTKey() { | |||
func (m *fakeCAManager) RotateJWTKey(_ context.Context) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: same comment for RotateX509CA
func (m *fakeCAManager) RotateJWTKey(_ context.Context) { | |
func (m *fakeCAManager) RotateJWTKey(context.Context) { |
mu sync.RWMutex | ||
entries *JournalEntries | ||
mu sync.RWMutex | ||
activeX509AuthorityID string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is activeX509 authority id requied to be cached? why is it different to jwt authority?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CA journal record in the database has the active X509 authority ID so we can efficiently compare the obtained X509 authority IDs from the key manager with the record's active X509 authority ID.
When we update an stored X509CA entry to have the ACTIVE status, we keep updated the active X509 authority ID in this struct, so we know what's the active X509 authority ID when we save the record in the database.
pkg/server/ca/manager/journal.go
Outdated
if err != nil { | ||
return nil, fmt.Errorf("failed to load journal from datastore: %w", err) | ||
} | ||
if j != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this is nil, means that it was unable to fetch it from datastore, and we'll be moving into disk approach,
that is fine since we my be migrating here...
but i'm not sure, what happens if we are never able to fetch from ds and we move always to disk?
Since it is a feature that someone enabled intentionally I think we have 2 ways to follow here:
1- add a warning if flag is set but no journal found in DS, in a perferct world this means that it was unable to found and in next iteration warning wil be done.
2- create a migration code that try to load things from disk if not found, and then fails if that is no possible, with this we can be sure that in next iteratrion we'll be sure that we'll load from ds.
what do you think? maybe it is an overkill?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never fetching from the data store is a scenario that needs to be considered in different migration phases onto the database rotation scheme:
- No migration started - Not fetching from the data store is exactly what it wanted.
- Migration in process (as it is now) - Fetch from the data store but conditionally so. If the data store doesn't have information, load the information from disk (and update the datastore).
- Migration complete (which will come in a future MR) - If we cannot fetch from the data store, we will need to create initial certificate information in the data store (much like the before-effort logic, where we bootstrap the configurations where there is no external PKI and no CA journal on disk).
As we are currently in the second step above, and currently in a migration from disk to datastore, we should load the datastore from disk if we can't fetch the datastore. As for "datastore unavailable" scenarios (in contrast to the data unavailable scenarios), this is the Server's Datastore, so it should eventually trigger a server restart with enough datastore failures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it is a feature that someone enabled intentionally I think we have 2 ways to follow here:
1- add a warning if flag is set but no journal found in DS, in a perferct world this means that it was unable to found and in next iteration warning wil be done.
I don't think that we should log a warning. This is the expected behavior for all the systems running for the first time with this fflag enabled in 1.8.x, and will be the expected behavior for all the systems running for the first time in 1.9.0.
2- create a migration code that try to load things from disk if not found, and then fails if that is no possible, with this we can be sure that in next iteratrion we'll be sure that we'll load from ds.
That's what it does. If there is no CA record for this server, it will load it from disk. If there is a failure when trying to load from disk, we fail with an error and it will try the same the next time. Note that having no journal is also a valid scenario. No CA record, no disk file, doesn't cause an error and will start with an empty journal.
The migration logic is pretty straightforward. The mechanism consists in just writing the next update on the database and on disk, when this feature is enabled. This way, the journal is always backed on disk during the 1.9.x timeframe to allow downgrades to 1.8.x.
// TODO: stop trying to load the journal from disk in v1.10 and delete | ||
// the journal file if exists. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mmm not sure about 'delete journal file', I agree we must remove loading code, but removing files looks like an overkill
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moving the journal off the disk is to improve both deployability and security. If we leave the old journal on disk, it might leak the information necessary to perform a rotation to a CA that is very similar to the one in SPIRE.
In this case, after a successful migration of the Journal into the database, removing the Journal from the disk prevents a second attempt at migration, and greatly reduces the chance that an old journal overwrites the current journal (as we expect that the database journal will always be newer than the disk journal after this commit is merged).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed this a little bit during the last maintainer's call and I think that the consensus is that the right thing to do would be to remove the journal from disk in v1.10.
|
||
func deleteCAJournal(tx *gorm.DB, caJournalID uint) error { | ||
model := new(CAJournal) | ||
if err := tx.Find(model, "id = ?", caJournalID).Error; err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you interested in returning error when not found right? if no, you can do a tx.Where().Delete instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly, I think that's good to return an error and log it, since it's unexpected.
|
||
// PruneCAJournals prunes the CA journals that have all of their authorities | ||
// expired. | ||
func (ds *Plugin) PruneCAJournals(ctx context.Context, allAuthoritiesExpireBefore int64) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
allAuthoritiesExpireBefore is confusing to me, it looks like it will remove all journals that expires before now.Add(SOME TIME).Unix() , no all that are really expired
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use "expiresBefore" in PruneBundle, PruneRegistrationEntries and pruneJoinTokens. If we change the naming we should probably do that in all those places. May we revisit that separately?
Note that we subtract the threshold duration (it's not a positive Add operation), we do Clock.Now().Add(-safetyThreshold)
, so this really is "prune the CA journals that have all of the authorities expired since more than 2 weeks", which is the CA journals that have all of the authorities expired before more than two weeks ago.
for _, model := range caJournals { | ||
entries := new(journal.Entries) | ||
if err := proto.Unmarshal(model.Data, entries); err != nil { | ||
return errs.New("unable to unmarshal entries from CA journal record: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
%w?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this should be status.Errorf
, I'll change that.
|
||
for _, x509CA := range entries.X509CAs { | ||
if x509CA.NotAfter > allAuthoritiesExpireBefore { | ||
continue checkAuthorities |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when not expired after that time, will not this retry for ever?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no retry logic here. This iterates through all the journals, and inside each journal, over all X509 CAs and JWT keys.
If x509CA.NotAfter > allAuthoritiesExpireBefore, that means that the authority expires after the expiration threshold, and the whole journal must be skipped. Otherwise, it will take the next x509CA and check the same thing. Once there are no more X509 CAs to check in this journal, it will go over the JWT keys and it will do the same.
continue checkAuthorities | ||
} | ||
} | ||
if err := deleteCAJournal(tx, model.ID); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if is not clear to me... but, if a bundle expires tomorrow, then x509CA.NotAfter > allAuthoritiesExpireBefore will be false,
and marked to remove.
it sounds to me that we must search for expired bundles (from today) and then see for how much time it was expired
in our case we look for entries that are expired 2weeks ago,
and to do that we'll need to provide a duration not a date.
makes sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic is the opposite, if x509CA.NotAfter > allAuthoritiesExpireBefore the CA journal will be skipped. It essentially goes over all the authorities (including JWT keys) and checks if all of them expired more than two weeks ago. If not, the whole journal record will be skipped.
Signed-off-by: Agustín Martínez Fayó <amartinezfayo@gmail.com>
tomorrow := test.clock.Now().Add(time.Hour * 24).Unix() | ||
beforeThreshold := test.clock.Now().Add(-safetyThresholdCAJournals).Add(-time.Minute).Unix() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
tomorrow := now.Add(time.Hour * 24).Unix()
beforeThreshold := now.Add(-safetyThresholdCAJournals).Add(-time.Minute).Unix()
pkg/server/ca/manager/journal.go
Outdated
if err != nil { | ||
return 0, fmt.Errorf("failed to get CA journal from database: %w", err) | ||
} | ||
if caJournal == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this caJournal is never used, since you are using :=
and never setting the ID, if found must not it be setting the id inside j.caJournalID
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we can safely remove this call to findCAJournal since a save operation will always happen after the loadJournalFromDS operation, which is the one that calls to findCAJournal and will update j.caJournalID.
// There is a CA journal record that has an active X509 authority | ||
// ID that matches with one of the public keys of this server. This | ||
// means that this record belongs to this server. | ||
j.config.log.WithFields(logrus.Fields{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is connected to a previous comment, but looks like the response from this function is never used, so caJournalID will not be set and this log message is wrong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we can safely remove the call to findCAJournal, and I'll fix to refer to telemetry.CAJournalID: caJournal.ID here.
// Verify entries is empty | ||
s.AssertProtoEqual(&JournalEntries{}, journal.Entries()) | ||
spiretest.RequireProtoEqual(t, &journal.Entries{}, j.getEntries()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
require.Empty(t, j.getEntries())?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that we can use require.Empty here.
require.Empty asserts that the specified object is empty. I.e. nil, "", false, 0 or either a slice or a channel with len == 0.
We want to assert that the referenced struct is an empty struct (&journal.Entries{}) but not assert that is nil. We are noy asserting a slice.
require.NoError(t, err) | ||
require.NoError(t, j.UpdateX509CAStatus(ctx, now, journal.Status_ACTIVE)) | ||
|
||
spiretest.RequireProtoEqual(t, j.getEntries(), test.loadJournalFromDS(t).getEntries()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
both loads returns pointers, so if this fail for some reason we'll get a panic,
may we verify it is not empty before trying to get entries? (same comment for all load* calls)
require.NoError(t, err) | ||
|
||
if !j.shouldBePruned { | ||
testCase.expectedCAJournals = append(testCase.expectedCAJournals, caJournal) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since expected is not set when defining testCase, and is calculated using shouldBePruned, may you just add a var at function level?
Or just add the expect list hardcoded instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we need to populate the datastore with journal data, that's saved marshalled and each record is assigned an ID, I found easier to just build the expectedCAJournals list with the items returned from the SetCAJournal operation that we know that should not be pruned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved expectedCAJournals to a var.
Signed-off-by: Agustín Martínez Fayó <amartinezfayo@gmail.com>
Signed-off-by: Agustín Martínez Fayó <amartinezfayo@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
…piffe#4690) * Save and load the CA journal from datastore Signed-off-by: Agustín Martínez Fayó <amartinezfayo@gmail.com>
…piffe#4690) * Save and load the CA journal from datastore Signed-off-by: Agustín Martínez Fayó <amartinezfayo@gmail.com>
This PR introduces support to save and load the CA journal from the datastore. Since the database schema changes required for this were done in v1.8.0, this is being introduced through a feature flag. This functionality is only enabled if the
ca_journal_in_datastore
feature flag is enabled.Fixes #4463, #4464.