Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Populate and prune entry event table #4411

Merged
merged 56 commits into from
Sep 13, 2023
Merged

Conversation

faisal-memon
Copy link
Contributor

Pull Request check list

  • Commit conforms to CONTRIBUTING.md?
  • Proper tests/regressions included?
  • Documentation updated?

Affected functionality
SPIRE datastore

Description of change
Second part of datastore enhancement:

  • Creates an event every time an entry is created/updated/deleted.
  • Events are pruned every 12 hours. Pruning interval is configurable.
  • Creates 2 new datastore APIs to list and prune events
  • Listing events can be filtered to be greater than an event id

The cache is not updated with this pr. Node events is not populated yet either. That will be in the next PR.

Which issue this PR fixes
Related to #2182

@edwbuck
Copy link
Contributor

edwbuck commented Aug 9, 2023

@amartinezfayo Can we start a review on this with priority?

@azdagron
Copy link
Member

azdagron commented Aug 9, 2023

I think we're still waiting on an experimental flag that encapsulates the whole feature. We can't assume the tables used by the queries will exist until one minor release after the minor release in which the table migrations landed.

@azdagron
Copy link
Member

We talked about this again today and realized that we probably shouldn't have an experimental flag for this until the whole feature is in place. Until then it probably belongs under a developer feature flag. We shouldn't wire up any of the new components (including populating the events table), unless that flag is on. Once the whole feature is in place, we can remove the developer feature flag in favor of an experimental flag that folks can use to exercise the new cache. And eventually once we have confidence, we can make this the default behavior.

pkg/common/telemetry/server/datastore/event.go Outdated Show resolved Hide resolved
pkg/common/telemetry/server/datastore/event.go Outdated Show resolved Hide resolved
pkg/server/datastore/datastore.go Show resolved Hide resolved

type ListRegistrationEntriesEventsResponse struct {
EntryIDs []string
FirstEventID uint
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What this FirstEventID is?
is it id of the first elment in the response? how is it passed (or affected) on pagination?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes first id of element in responses.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im not sure about pagination. let me test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need pagination for this api?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd put it in. Bulk deletion / addition APIs could (even though the data per event is small) overflow the single page response.

Copy link
Contributor Author

@faisal-memon faisal-memon Aug 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to create unit tests for the pagination but it seems to overlap some with the specific greater than event id filter for this api. This is similar to the Token half of the pagination, but with an unlimited page size. Thats what we need for this functionality, all events that happened since the last time we polled the api.

Given the current implementation fetches all entries with no pagination, im not sure pagination is needed here. @edwbuck what do you think of removing the pagination? This would simplify the implementation.

Copy link
Contributor

@edwbuck edwbuck Aug 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have an issue with non-pagination. I just wanted to mention the bulk updates.

In my mind, if pagination becomes and issue, we can address it then. Honestly, the events have so few data items in them (just references to other items) that I imagine one would need a lot of them to overflow the response.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont have strong opinion with pagination.
We can get into s bulk insert with several entries, or a server that is not consuming updates for some time, in that case we'll be returning several IDs here and fetching all of them together, maybe we can ""paginate" the response in the caller? just to prevent protobuf limits? (in case someone has a lot of entries to update)

But I dont have issues about not supporting pagination. @azdagron do you have something against removing pagination here?

pkg/server/datastore/sqlstore/sqlstore.go Outdated Show resolved Hide resolved
@@ -3693,6 +3693,52 @@ func (s *PluginSuite) TestDeleteBundleDissociateRegistrationEntries() {
s.Require().Empty(entry.FederatesWith)
}

func (s *PluginSuite) TestListRegistrationEntriesEvents() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like tehre is no unit tests for pruning, I think we must validate that path too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added test for pruning

})
resp, err := s.ds.ListRegistrationEntriesEvents(ctx, &datastore.ListRegistrationEntriesEventsRequest{})
s.Require().NoError(err)
s.Require().Equal(1, len(resp.EntryIDs))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may we compare the expected IDs? tehe same comment is for all equal compartions
Using a list of expected

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comparisons

@@ -213,6 +217,7 @@ func TestListenAndServe(t *testing.T) {
Metrics: metrics,
RateLimit: rateLimit,
EntryFetcherCacheRebuildTask: ef.RunRebuildCacheTask,
EntryFetcherPruneEventsTask: ef.PruneEventsTask,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we are adding default at this level, I think we must validate that value using a test

case <-ctx.Done():
a.log.Debug("Stopping event pruner")
return nil
case <-a.clk.After(a.eventsPruneInterval):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, every 12hs it will try to remove all entries that are older than 12hs,
if we have e1, e2, e3 and e1 was created 13hs ago, and e2 11hs ago,
if we run prune, e1 will be removed now, but e2 will be removed in another 12hs (it will have 23hs)

I'm not sure if we must have 2 intervals,
first to know how ofter we try to remove (it can be every 1m or 10m or 1h) and
another one to know the older entries we support.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has been a while, but I believe the cache_reload_interval is now used to see how often to check the cache, and the eventsPruneInterval is the time from "now" into the past that sets the pruning limit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do that way and prune every time we reload the cache. Though there would be more delete operations.

Can also have the pruner run at a faster rate than it prunes. So runs every 6 hours and prunes everything over 12 hours old.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the name of the configurable to PruneEventsOlderThan and have the pruner task trigger at half of that.

entries := make(map[spiffeid.ID][]*types.Entry)
buildCache := func(context.Context) (entrycache.Cache, error) {
return newStaticEntryCache(entries), nil
}

ef, err := NewAuthorizedEntryFetcherWithFullCache(ctx, buildCache, log, clk, defaultCacheReloadInterval)
pruneEventsFn := func(context.Context, time.Duration) error {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any of this functions are actualy using prune,
I think we'll need to add unit tests to validate prune too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test added

pkg/server/datastore/sqlstore/sqlstore.go Outdated Show resolved Hide resolved
pkg/server/datastore/sqlstore/sqlstore.go Outdated Show resolved Hide resolved
@evan2645 evan2645 added this to the 1.8.0 milestone Aug 22, 2023
@faisal-memon
Copy link
Contributor Author

@azdagron @MarcosDY All comments have been addressed. Ready for review.

@@ -645,6 +646,14 @@ func NewServerConfig(c *Config, logOptions []log.Option, allowUnknownConfig bool
sc.CacheReloadInterval = interval
}

if c.Server.Experimental.PruneEventsOlderThan != "" {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add unit tests for this? (if we are going to merge this config)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

pkg/common/telemetry/server/datastore/event.go Outdated Show resolved Hide resolved

type ListRegistrationEntriesEventsResponse struct {
EntryIDs []string
FirstEventID uint
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont have strong opinion with pagination.
We can get into s bulk insert with several entries, or a server that is not consuming updates for some time, in that case we'll be returning several IDs here and fetching all of them together, maybe we can ""paginate" the response in the caller? just to prevent protobuf limits? (in case someone has a lot of entries to update)

But I dont have issues about not supporting pagination. @azdagron do you have something against removing pagination here?

resp, err := s.ds.ListRegistrationEntriesEvents(ctx, &datastore.ListRegistrationEntriesEventsRequest{})
s.Require().NoError(err)
return reflect.DeepEqual(tt.expectedEntryIDs, resp.EntryIDs)
}, 10*time.Second, 50*time.Millisecond, "Failed to prune entries correctly")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you think about adding clock to ds layer (and a mock here), and use that instead of time.Now.
That will allow you to add test cases where you can advance time and have more robust test cases where you have more events and prune partially

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think thats a great idea. Is it ok to that in a follow on pr? I think its a little out of scope for this one.

select {
case err := <-watchErr:
assert.NoError(t, err)
case <-time.After(5 * time.Second):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can relay on ctx done here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you requesting to put case <-ctx.Done():?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since context will finish (with timeout) you can rely ctx.Done()

Copy link
Contributor Author

@faisal-memon faisal-memon Sep 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case we are calling cancel() just a few lines above on line 239. If we use ctx.Done() that will always be true so we never check for the watchErr on line 241.

pkg/server/endpoints/entryfetcher_test.go Outdated Show resolved Hide resolved
clk.Add(defaultPruneEventsOlderThan)
select {
case <-pruneEventsCh:
case <-time.After(5 * time.Second):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can depends on ctx .Done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

faisal-memon and others added 11 commits September 12, 2023 15:32
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Bumps [sigs.k8s.io/controller-runtime](https://github.com/kubernetes-sigs/controller-runtime) from 0.15.0 to 0.15.1.
- [Release notes](https://github.com/kubernetes-sigs/controller-runtime/releases)
- [Changelog](https://github.com/kubernetes-sigs/controller-runtime/blob/main/RELEASE.md)
- [Commits](kubernetes-sigs/controller-runtime@v0.15.0...v0.15.1)

---
updated-dependencies:
- dependency-name: sigs.k8s.io/controller-runtime
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Bumps [golang.org/x/sys](https://github.com/golang/sys) from 0.10.0 to 0.11.0.
- [Commits](golang/sys@v0.10.0...v0.11.0)

---
updated-dependencies:
- dependency-name: golang.org/x/sys
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Zack Train <ztrain@uber.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Bumps [golang.org/x/net](https://github.com/golang/net) from 0.13.0 to 0.14.0.
- [Commits](golang/net@v0.13.0...v0.14.0)

---
updated-dependencies:
- dependency-name: golang.org/x/net
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
…ncy telemetry util (spiffe#4399)

* Add telemetry instrumentation for delegated identity API and add latency telemetry util

Signed-off-by: chiragk25 <chirag.d.kapadia@gmail.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
…spiffe#4416)

Bumps [github.com/aws/aws-sdk-go-v2/service/ec2](https://github.com/aws/aws-sdk-go-v2) from 1.109.1 to 1.110.1.
- [Release notes](https://github.com/aws/aws-sdk-go-v2/releases)
- [Commits](aws/aws-sdk-go-v2@service/ec2/v1.109.1...service/ec2/v1.110.1)

---
updated-dependencies:
- dependency-name: github.com/aws/aws-sdk-go-v2/service/ec2
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Bumps [actions/setup-go](https://github.com/actions/setup-go) from 4.0.1 to 4.1.0.
- [Release notes](https://github.com/actions/setup-go/releases)
- [Commits](actions/setup-go@fac708d...93397be)

---
updated-dependencies:
- dependency-name: actions/setup-go
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Bumps [google.golang.org/api](https://github.com/googleapis/google-api-go-client) from 0.134.0 to 0.136.0.
- [Release notes](https://github.com/googleapis/google-api-go-client/releases)
- [Changelog](https://github.com/googleapis/google-api-go-client/blob/main/CHANGES.md)
- [Commits](googleapis/google-api-go-client@v0.134.0...v0.136.0)

---
updated-dependencies:
- dependency-name: google.golang.org/api
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Bumps [github.com/sigstore/sigstore](https://github.com/sigstore/sigstore) from 1.7.1 to 1.7.2.
- [Release notes](https://github.com/sigstore/sigstore/releases)
- [Commits](sigstore/sigstore@v1.7.1...v1.7.2)

---
updated-dependencies:
- dependency-name: github.com/sigstore/sigstore
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
This project generates releases by just creating a new release branch
without a corresponding semver tag, and changing the major version tag
to point to the release branch, which isn't enough for dependabot to
automatically detect the new versions,
see msys2/setup-msys2#327

Manually update this step for now to the current commit pointed to by
the `v2` tag (`v2.20.0`): https://github.com/msys2/setup-msys2/tree/v2

Signed-off-by: Ryan Turner <turner@uber.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
dependabot bot and others added 16 commits September 12, 2023 15:32
Bumps [k8s.io/kube-aggregator](https://github.com/kubernetes/kube-aggregator) from 0.27.4 to 0.28.0.
- [Commits](kubernetes/kube-aggregator@v0.27.4...v0.28.0)

---
updated-dependencies:
- dependency-name: k8s.io/kube-aggregator
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Quite some time ago we added a scan to first warn and then eventually
delete entries with invalid SPIFFE IDs. This scan is no longer needed,
since entries will have already been removed by previous upgrades and
can be removed.

Signed-off-by: Andrew Harding <azdagron@gmail.com>
Co-authored-by: Marcos Yacob <marcos.yacob@hpe.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>

Co-authored-by: Marcos Yacob <marcos.yacob@hpe.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>

Co-authored-by: Marcos Yacob <marcos.yacob@hpe.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
Signed-off-by: Faisal Memon <fymemon@yahoo.com>
@faisal-memon
Copy link
Contributor Author

DCO fixed and branch up to date with latest.

MarcosDY
MarcosDY previously approved these changes Sep 13, 2023
Comment on lines 3651 to 3652
func listRegistrationEntriesEvents(tx *gorm.DB, req *datastore.ListRegistrationEntriesEventsRequest) (*datastore.ListRegistrationEntriesEventsResponse, error) {
var events []RegisteredEntryEvent
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is experimental it must be called only when feature flag is enabled

Suggested change
func listRegistrationEntriesEvents(tx *gorm.DB, req *datastore.ListRegistrationEntriesEventsRequest) (*datastore.ListRegistrationEntriesEventsResponse, error) {
var events []RegisteredEntryEvent
func listRegistrationEntriesEvents(tx *gorm.DB, req *datastore.ListRegistrationEntriesEventsRequest) (*datastore.ListRegistrationEntriesEventsResponse, error) {
if !fflag.IsSet(fflag.FlagEventsBasedCache) {
return &datastore.ListRegistrationEntriesEventsRespons{}, nil
}
var events []RegisteredEntryEvent

Comment on lines 3668 to 3669
func pruneRegistrationEntriesEvents(tx *gorm.DB, olderThan time.Duration) error {
if err := tx.Where("created_at < ?", time.Now().Add(-olderThan)).Delete(&RegisteredEntryEvent{}).Error; err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is experimental it must be called only when feature flag is enabled

Suggested change
func pruneRegistrationEntriesEvents(tx *gorm.DB, olderThan time.Duration) error {
if err := tx.Where("created_at < ?", time.Now().Add(-olderThan)).Delete(&RegisteredEntryEvent{}).Error; err != nil {
func pruneRegistrationEntriesEvents(tx *gorm.DB, olderThan time.Duration) error {
if !fflag.IsSet(fflag.FlagEventsBasedCache) {
return nil
}
if err := tx.Where("created_at < ?", time.Now().Add(-olderThan)).Delete(&RegisteredEntryEvent{}).Error; err != nil {

Signed-off-by: Marcos Yacob <marcos.yacob@hpe.com>
@rturner3 rturner3 merged commit 7a5a528 into spiffe:main Sep 13, 2023
@azdagron azdagron mentioned this pull request Sep 15, 2023
7 tasks
@faisal-memon faisal-memon deleted the db-api branch September 26, 2023 21:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants