Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use persistable cache for GCS progress tracking #1204

Merged
merged 48 commits into from
Apr 10, 2023
Merged

Conversation

ahrav
Copy link
Collaborator

@ahrav ahrav commented Mar 23, 2023

  • Simplify progress tracking for the GCS source by levering a cache for objects scanned.
    This change no longer supports fine grained enumeration resuming, I think this is a fair tradeoff for the amount of complexity that is removed from the source. On average enumeration takes about ~18s for ~1.3M objects, therefore enumerating on each run is fine.
  • This changes how we save progress data within EncodeResumeInfo from objects being map[string]struct{} to just a string. The string is the object name and is saved in an in-memory cache during the scan job to facilitate resuming scans. Note: there will be a follow up PR that uses the md5 hash of the contents vs the object name in the even the same named object is found in multiple buckets.

Copy link
Collaborator

@rosecodym rosecodym left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this intended to be merged after #1190?

@ahrav
Copy link
Collaborator Author

ahrav commented Mar 27, 2023

is this intended to be merged after #1190?

Yea, I have this rebased off #1190 , i should've left this in Draft until that one was merged.

@ahrav ahrav marked this pull request as draft March 27, 2023 16:11
@ahrav ahrav force-pushed the use-persistable-cache branch 2 times, most recently from 584a7f5 to 5fd2d61 Compare March 27, 2023 18:06
@ahrav ahrav changed the title Use persistable cache Use persistable cache for GCS progress tracking Mar 27, 2023
@ahrav ahrav marked this pull request as ready for review March 27, 2023 18:32
@ahrav ahrav requested review from a team and rosecodym March 27, 2023 18:32
Copy link
Collaborator

@rosecodym rosecodym left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: do you feel it's worth adding a test to validate that persistence doesn't occur before the interval has been hit?

pkg/sources/gcs/gcs_test.go Outdated Show resolved Hide resolved
pkg/sources/gcs/gcs.go Outdated Show resolved Hide resolved
pkg/sources/gcs/gcs_test.go Outdated Show resolved Hide resolved
pkg/sources/gcs/gcs_test.go Outdated Show resolved Hide resolved
@ahrav
Copy link
Collaborator Author

ahrav commented Mar 28, 2023

question: do you feel it's worth adding a test to validate that persistence doesn't occur before the interval has been hit?

Yea, I like that idea. I'll add a test for it. Thanks for the suggestion.

@ahrav ahrav requested review from a team and rosecodym March 28, 2023 23:28
pkg/sources/gcs/gcs.go Outdated Show resolved Hide resolved
@ahrav ahrav requested a review from a team March 29, 2023 18:25
@ahrav ahrav requested review from rosecodym and a team April 7, 2023 16:40
Copy link
Collaborator

@rosecodym rosecodym left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work on this!


cache.Set(objName, objName)
s.Progress.SectionsRemaining = int32(s.stats.numObjects)
s.Progress.PercentComplete = int64(float64(s.SectionsCompleted) / float64(s.stats.numObjects) * 100)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i know this isn't part of this pr so i'm just asking because i'm curious but why does ProgressComplete need 64 bits to represent a percentage?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It likely doesn't. I think that was the type when the struct was first created a long time ago and we haven't thought twice about it. I can create a separate PR to update that type along with the related code.

@ahrav
Copy link
Collaborator Author

ahrav commented Apr 10, 2023

nice work on this!

Thanks for all the feedback :)

@ahrav ahrav merged commit c451f9d into main Apr 10, 2023
@ahrav ahrav deleted the use-persistable-cache branch April 10, 2023 14:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants