
Allow gitlab to resume from encoded resume info #611

Merged: 5 commits into main, Jun 17, 2022

Conversation

trufflesteeeve (Collaborator) opened this pull request:

This also includes a commit to pull out the common functions between gitlab and github into potentially reusable ones for other sources. I put it in resume.go in the sources package, but could definitely see it making more sense elsewhere.

mcastorina (Collaborator) left a review:

Nice work! Just had some optional nits / suggestions.

index := -1
for i, repo := range resumeRepos {
if repoURL == repo {
index = i
mcastorina (Collaborator):

Nit (optional): I know this is simply moving code around, but we can break out of this loop once we've found the repo. Alternatively, since resumeRepos is sorted, we could do a binary search using sort.SearchStrings.

trufflesteeeve (Author):

Ah, that's neat. I do like the break. But looking at sort.SearchStrings, it could return an index we wouldn't want to use, because the slice doesn't necessarily contain the repoURL, so we'd have to check that the element at that index exactly equals the repo. Still very cool to know about.
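For reference, the equality check being described could look like this (indexOf is an illustrative helper, not code from the PR):

```go
package main

import (
	"fmt"
	"sort"
)

// indexOf returns the position of repoURL in an already-sorted slice, or -1.
// sort.SearchStrings alone isn't enough: it returns the insertion point,
// which is only a real match if the element there equals repoURL.
func indexOf(sorted []string, repoURL string) int {
	i := sort.SearchStrings(sorted, repoURL)
	if i < len(sorted) && sorted[i] == repoURL {
		return i
	}
	return -1
}

func main() {
	repos := []string{"a.git", "c.git", "e.git"}
	fmt.Println(indexOf(repos, "c.git")) // 1
	fmt.Println(indexOf(repos, "b.git")) // -1: insertion point is 1, but repos[1] != "b.git"
}
```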

-func (s *Source) getRepos() ([]*url.URL, []error) {
-	var validRepos []*url.URL
+func (s *Source) getRepos() ([]string, []error) {
+	var validRepos []string
 	var errs []error
 	if len(s.repos) > 0 {
mcastorina (Collaborator):

Nit (optional): I prefer structuring code to "fail early" which helps keep nested indentation low.

if len(s.repos) == 0 {
    return nil, nil
}
// rest of logic that was previously inside the `if len(s.repos) > 0` block

trufflesteeeve (Author):

Totally agree.

mcastorina (Collaborator):

This if condition must be inverted.

for i, u := range repos {
// If there is resume information available, limit this scan to only the repos that still need scanning.
reposToScan, progressIndexOffset := sources.FilterReposToResume(s.repos, s.GetProgress().EncodedResumeInfo)
s.repos = reposToScan
mcastorina (Collaborator):

Question: Is it okay to be possibly dropping some information here?

trufflesteeeve (Author):

Are you referring to the dropping of the repositories that have already been scanned?

If so, I think it's fine. We don't actually store that information anywhere, as it's generated at the start of a scan. And the progressIndexOffset will allow us to get the proper count of the original number of repos.

However, if you're referring to the potential to drop repos that were added between the time the scan started and when it was picked back up, I think that is also okay. The goal here is to reduce the time it takes to finish the scan so that the next scan happens sooner, and that scan will pick up the new repository.

mcastorina (Collaborator):

I was referring to dropping of the repositories that have already been scanned, but it's good to know you've thought of other scenarios!
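The filtering under discussion works roughly like this (a simplified sketch: the real sources.FilterReposToResume also decodes the encoded resume info, which here is assumed to be a plain comma-separated list):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// filterReposToResume sketches the idea behind sources.FilterReposToResume:
// given the full repo list and resume info, return only the repos that still
// need scanning, plus the offset (how many repos were already complete).
func filterReposToResume(repos []string, resumeInfo string) (toScan []string, offset int) {
	if resumeInfo == "" {
		return repos, 0
	}
	inProgress := strings.Split(resumeInfo, ",")
	sort.Strings(repos)
	// Everything sorted before the lowest in-progress repo was already scanned.
	i := sort.SearchStrings(repos, inProgress[0])
	return repos[i:], i
}

func main() {
	repos := []string{"a.git", "b.git", "c.git", "d.git"}
	toScan, offset := filterReposToResume(repos, "c.git")
	fmt.Println(toScan, offset) // [c.git d.git] 2
}
```

The offset is what lets the resumed scan report progress against the original repo count, as the reply above explains.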

-	// The repo normalization has already successfully parsed the URL at this point, so we can ignore the error.
-	u, _ := url.ParseRequestURI(repo)
-	validRepos = append(validRepos, u)
+	validRepos = append(validRepos, repo)
mcastorina (Collaborator):

Question: I'm not sure if we can trust the repo URL directly, or if we should do url.ParseRequestURI followed by String(). I would think parsing would help normalize the input, but perhaps it's redundant at this point.

trufflesteeeve (Author):

Ah. The repo URL was already normalized when giturl.NormalizeGitlabRepo(prj) was run on it above. That eventually calls url.Parse here:

parsed, err := url.Parse(repoURL)

Though now that you mention it, maybe giturl should be using url.ParseRequestURI instead of just url.Parse.

mcastorina (Collaborator):

Hm, I'm not sure of the difference between the two. Looking at the docs, ParseRequestURI assumes the URL came from an HTTP request, and mentions:

The string url is assumed not to have a #fragment suffix.

I wasn't entirely sure what that meant so I did a quick test here

trufflesteeeve (Author), Jun 16, 2022:

Hmmm. So I think this is supplied by an individual in configuration, so probably best to stick with url.Parse.

I think assuming it comes from an HTTP request means it's the URL as requested from the server? And anything after the # (the fragment) shouldn't be included, so it looks like it doesn't handle the # normally. Also, cool test!

Comment on lines +415 to +416
s.resumeInfoSlice = append(s.resumeInfoSlice, repoURL)
sort.Strings(s.resumeInfoSlice)
mcastorina (Collaborator):

Nit (optional): We could more efficiently insert repoURL into the slice since it should already be sorted, though I kind of like the stability of sorting every time.

trufflesteeeve (Author):

True. Maybe something to keep an eye on: while I don't expect anyone to have 10k+ repos to sort, nor the concurrency to have that many listed in the resume info, you never know.
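If sorting on every insert ever did show up in profiles, the more efficient insert mentioned above could look like this (insertSorted is an illustrative helper, not code from the PR):

```go
package main

import (
	"fmt"
	"sort"
)

// insertSorted inserts s into an already-sorted slice using a binary search
// plus a single shift, instead of appending and re-sorting the whole slice.
func insertSorted(sorted []string, s string) []string {
	i := sort.SearchStrings(sorted, s) // insertion point, found before growing
	sorted = append(sorted, "")        // grow by one
	copy(sorted[i+1:], sorted[i:])     // shift the tail right
	sorted[i] = s
	return sorted
}

func main() {
	repos := []string{"a.git", "c.git"}
	repos = insertSorted(repos, "b.git")
	fmt.Println(repos) // [a.git b.git c.git]
}
```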

@@ -418,5 +418,6 @@ func (s *Source) setProgressCompleteWithRepo(index int, repoURL string) {
 	// Make the resume info string from the slice.
 	encodedResumeInfo := sources.EncodeResumeInfo(s.resumeInfoSlice)

-	s.SetProgressComplete(index, len(s.repos), fmt.Sprintf("Repo: %s", repoURL), encodedResumeInfo)
+	// Add the offset to both the index and the repos to give the proper place and proper repo count.
mcastorina (Collaborator):

Suggestion: It would be nice to have a test for this to prevent regressions. If adding it would take too much time / effort, let's open an issue so it can be planned.
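A regression test would mostly need to pin down the offset arithmetic; a standalone sketch of that invariant (progressValues is illustrative, not the PR's actual helper):

```go
package main

import "fmt"

// progressValues mirrors the idea in setProgressCompleteWithRepo: the
// current index and the remaining-repo count both get the resume offset
// added back, so progress is reported against the original repo count.
func progressValues(index, offset, remaining int) (done, total int) {
	return index + offset, remaining + offset
}

func main() {
	// A scan of 10 repos resumed after 3 were complete: 7 remain, offset 3.
	// Finishing local index 2 should report as 5 out of 10 overall.
	done, total := progressValues(2, 3, 7)
	fmt.Println(done, total) // 5 10
}
```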

mcastorina (Collaborator) left a review:

Thanks for adding the test! Just one thing needs to be fixed before merging.

trufflesteeeve merged commit 10f4d02 into main on Jun 17, 2022.
trufflesteeeve deleted the allow-gitlab-progress-resuming branch on June 17, 2022 at 15:45.