Allow gitlab to resume from encoded resume info #611

trufflesteeeve · 2022-06-07T15:59:13Z

This also includes a commit to pull out the common functions between gitlab and github into potentially reusable ones for other sources. I put it in resume.go in the sources package, but could definitely see it making more sense elsewhere.


mcastorina

Nice work! Just had some optional nits / suggestions.

mcastorina · 2022-06-16T17:17:27Z

pkg/sources/resume.go

+	index := -1
+	for i, repo := range resumeRepos {
+		if repoURL == repo {
+			index = i


Nit (optional): I know this is simply moving code around, but we can break out of this loop once we've found the repo. Alternatively, since the resumeRepos are sorted, we could do a binary search using sort.SearchStrings.

Ah that's neat. I do like the break. But looking at sort.SearchStrings, it returns the index where repoURL would be inserted, so it can return an index we wouldn't want to use when the slice doesn't actually contain repoURL. We'd have to check that the element at that index exactly equals repoURL. But still very cool to know about.
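To make the two options in this thread concrete, here is a minimal sketch of both lookups. The function names `indexLinear` and `indexSorted` are hypothetical and not part of the PR; only `sort.SearchStrings` is a real standard-library call. The sorted variant assumes `resumeRepos` is sorted in ascending order and includes the equality check discussed above.

```go
package main

import (
	"fmt"
	"sort"
)

// indexLinear (hypothetical name) scans resumeRepos and stops at the first match.
func indexLinear(resumeRepos []string, repoURL string) int {
	index := -1
	for i, repo := range resumeRepos {
		if repoURL == repo {
			index = i
			break // nothing left to learn once the repo is found
		}
	}
	return index
}

// indexSorted (hypothetical name) uses binary search.
// It assumes resumeRepos is sorted in ascending order.
func indexSorted(resumeRepos []string, repoURL string) int {
	// SearchStrings returns the insertion index, which is only a match
	// if the element stored there equals repoURL.
	i := sort.SearchStrings(resumeRepos, repoURL)
	if i < len(resumeRepos) && resumeRepos[i] == repoURL {
		return i
	}
	return -1
}

func main() {
	repos := []string{"https://gitlab.com/org/a.git", "https://gitlab.com/org/c.git"}
	fmt.Println(indexLinear(repos, "https://gitlab.com/org/c.git"))
	fmt.Println(indexSorted(repos, "https://gitlab.com/org/b.git"))
}
```

Without the equality check, the lookup for `b.git` would return index 1, which holds `c.git`.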

mcastorina · 2022-06-16T17:23:31Z

pkg/sources/gitlab/gitlab.go

-func (s *Source) getRepos() ([]*url.URL, []error) {
-	var validRepos []*url.URL
+func (s *Source) getRepos() ([]string, []error) {
+	var validRepos []string
 	var errs []error
 	if len(s.repos) > 0 {


Nit (optional): I prefer structuring code to "fail early" which helps keep nested indentation low.

if len(s.repos) == 0 {
	return nil, nil
}
// rest of logic that was previously inside the `if len(s.repos) > 0` block

Totally agree.

This if condition must be inverted.
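A minimal sketch of the "fail early" shape with the condition the right way round. This is not the PR's `getRepos`: `validRepos` is a hypothetical standalone function, and `url.Parse` stands in for the repo normalization step, which is an assumption made to keep the example self-contained.

```go
package main

import (
	"fmt"
	"net/url"
)

// validRepos (hypothetical) returns the repos that parse cleanly,
// plus one error per repo that does not.
func validRepos(repos []string) ([]string, []error) {
	// Guard clause: return when there is NOTHING to do.
	// Writing `len(repos) > 0` here would skip every configured repo.
	if len(repos) == 0 {
		return nil, nil
	}

	var valid []string
	var errs []error
	for _, r := range repos {
		// Assumption: url.Parse stands in for the real normalization.
		u, err := url.Parse(r)
		if err != nil {
			errs = append(errs, fmt.Errorf("invalid repo %q: %w", r, err))
			continue
		}
		valid = append(valid, u.String())
	}
	return valid, errs
}

func main() {
	valid, errs := validRepos([]string{"https://gitlab.com/org/repo.git", "://bad"})
	fmt.Println(valid, len(errs))
}
```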

mcastorina · 2022-06-16T17:27:55Z

pkg/sources/gitlab/gitlab.go

-	for i, u := range repos {
+	// If there is resume information available, limit this scan to only the repos that still need scanning.
+	reposToScan, progressIndexOffset := sources.FilterReposToResume(s.repos, s.GetProgress().EncodedResumeInfo)
+	s.repos = reposToScan


Question: Is it okay to be possibly dropping some information here?

Are you referring to the dropping of the repositories that have already been scanned?

If so, I think it's fine. We don't actually store that information anywhere, as it's generated at the start of a scan. And the progressIndexOffset will allow us to get the proper count of the original number of repos.

However, if you're referring to the potential to drop repos that may have been added between the time the scan started and when it was picked back up, I think that is okay too. The goal here is to reduce the time it takes to finish the resumed scan, so the next scan can start sooner, and that scan will pick up the new repository.

I was referring to dropping of the repositories that have already been scanned, but it's good to know you've thought of other scenarios!
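The filtering discussed in this thread can be sketched as follows. This is an illustration only, not the code in `sources.FilterReposToResume`: the function name `filterReposToResume` is hypothetical, it takes an already decoded slice rather than the encoded string, and it assumes the resume info lists repos that were still in progress, with both slices sorted in ascending order.

```go
package main

import "fmt"

// filterReposToResume (hypothetical) keeps the repos that still need scanning
// and returns how many repos were dropped, for use as a progress offset.
// Assumptions: inProgress holds repos that had not finished when the scan
// stopped, and both slices are sorted in ascending order.
func filterReposToResume(repos, inProgress []string) ([]string, int) {
	if len(inProgress) == 0 {
		return repos, 0 // no resume info: scan everything
	}

	pending := make(map[string]struct{}, len(inProgress))
	for _, r := range inProgress {
		pending[r] = struct{}{}
	}
	last := inProgress[len(inProgress)-1]

	var toScan []string
	for _, r := range repos {
		// Keep repos that were in progress, plus everything sorted after
		// the last in-progress repo, which had not been started yet.
		if _, ok := pending[r]; ok || r > last {
			toScan = append(toScan, r)
		}
	}
	return toScan, len(repos) - len(toScan)
}

func main() {
	repos := []string{"a", "b", "c", "d", "e"}
	toScan, offset := filterReposToResume(repos, []string{"b", "d"})
	fmt.Println(toScan, offset)
}
```

The already scanned repos are dropped from the slice, but the returned offset preserves how many there were, so the original repo count can still be reported.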

mcastorina · 2022-06-16T17:32:31Z

pkg/sources/gitlab/gitlab.go

-			// The repo normalization has already successfully parsed the URL at this point, so we can ignore the error.
-			u, _ := url.ParseRequestURI(repo)
-			validRepos = append(validRepos, u)
+			validRepos = append(validRepos, repo)


Question: I'm not sure if we can trust the repo url directly or if we should do url.ParseRequestURI followed by String(). I would think parsing would help normalize the input, but perhaps it's redundant at this point.

Ah. The repo URL was already normalized when giturl.NormalizeGitlabRepo(prj) was run on it above. That eventually calls url.Parse here:

trufflehog/pkg/giturl/giturl.go

Line 41 in 4218c39

parsed, err := url.Parse(repoURL)

Though now that you mention it, maybe giturl should be using url.ParseRequestURI instead of just url.Parse.

Hm.. I'm not sure of the difference between the two. Looking at the docs, ParseRequestURI assumes the URL came from an HTTP request and mentions

The string url is assumed not to have a #fragment suffix.

I wasn't entirely sure what that meant so I did a quick test here

Hmmm. So I think this is supplied by an individual in configuration, so probably best to stick with url.Parse.

I think assuming it comes from an HTTP request means that it's the URL that was requested from the server? And anything after the # (the fragment) shouldn't be included, so it looks like it doesn't handle the # normally. Also, cool test!
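For reference, a small sketch of the difference discussed above, using only the standard `net/url` package. The helper names `viaParse` and `viaParseRequestURI` and the example URLs are hypothetical. It shows two behaviors: how each function treats a `#fragment`, and how each treats a relative path such as a user might put in configuration.

```go
package main

import (
	"fmt"
	"net/url"
)

// viaParse (hypothetical helper) reports the path and fragment from url.Parse.
func viaParse(raw string) (path, fragment string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", err
	}
	return u.Path, u.Fragment, nil
}

// viaParseRequestURI (hypothetical helper) does the same with url.ParseRequestURI.
func viaParseRequestURI(raw string) (path, fragment string, err error) {
	u, err := url.ParseRequestURI(raw)
	if err != nil {
		return "", "", err
	}
	return u.Path, u.Fragment, nil
}

func main() {
	raw := "https://gitlab.com/org/repo.git#readme"

	p, f, _ := viaParse(raw)
	fmt.Printf("Parse:           path=%q fragment=%q\n", p, f)

	p, f, _ = viaParseRequestURI(raw)
	fmt.Printf("ParseRequestURI: path=%q fragment=%q\n", p, f)

	// A relative reference parses with url.Parse but is rejected by
	// url.ParseRequestURI, which accepts only absolute URIs or absolute paths.
	_, _, err := viaParseRequestURI("org/repo.git")
	fmt.Println("relative via ParseRequestURI:", err)
}
```

`url.Parse` splits `readme` off into the fragment, while `url.ParseRequestURI` leaves `#readme` as part of the path.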

mcastorina · 2022-06-16T17:38:09Z

pkg/sources/gitlab/gitlab.go

+	s.resumeInfoSlice = append(s.resumeInfoSlice, repoURL)
+	sort.Strings(s.resumeInfoSlice)


Nit (optional): We could more efficiently insert repoURL into the slice since it should already be sorted, though I kind of like the stability of sorting every time.

True. Maybe something to keep an eye on. I don't expect anyone to have 10k+ repos to sort, or enough concurrency to have that many listed in the resume info, but you never know.
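If the repeated sort ever did become a concern, the more efficient insert mentioned in the nit could look something like this. `insertSorted` is a hypothetical helper, not something from the PR, and it assumes the slice is already sorted in ascending order.

```go
package main

import (
	"fmt"
	"sort"
)

// insertSorted (hypothetical) inserts s into an already sorted slice and
// keeps it sorted, using a binary search plus one shift instead of a re-sort.
func insertSorted(sorted []string, s string) []string {
	i := sort.SearchStrings(sorted, s) // index where s belongs
	sorted = append(sorted, "")        // grow by one element
	copy(sorted[i+1:], sorted[i:])     // shift the tail right by one
	sorted[i] = s
	return sorted
}

func main() {
	repos := []string{"https://gitlab.com/org/a.git", "https://gitlab.com/org/c.git"}
	repos = insertSorted(repos, "https://gitlab.com/org/b.git")
	fmt.Println(repos)
}
```

The trade-off is the one raised above: this only stays correct while the slice is sorted on entry, whereas calling `sort.Strings` on every append restores order regardless.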

mcastorina · 2022-06-16T17:41:50Z

pkg/sources/gitlab/gitlab.go

@@ -418,5 +418,6 @@ func (s *Source) setProgressCompleteWithRepo(index int, repoURL string) {
 	// Make the resume info string from the slice.
 	encodedResumeInfo := sources.EncodeResumeInfo(s.resumeInfoSlice)

-	s.SetProgressComplete(index, len(s.repos), fmt.Sprintf("Repo: %s", repoURL), encodedResumeInfo)
+	// Add the offset to both the index and the repos to give the proper place and proper repo count.


Suggestion: It would be nice to have a test for this to prevent regressions. If adding it would take too much time / effort, let's open an issue so it can be planned.
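As a rough idea of what such a regression check could pin down, here is a sketch of the offset arithmetic described in the diff comment. `reportedProgress` is a hypothetical helper, not the PR's `setProgressCompleteWithRepo`, and it assumes the offset is the number of repos dropped from the front of the scan when resuming.

```go
package main

import "fmt"

// reportedProgress (hypothetical) maps a position in the filtered repo list
// back to a position and total in the original, unfiltered list.
// Assumption: offset is the number of repos skipped because of resume info.
func reportedProgress(index, remaining, offset int) (current, total int) {
	return index + offset, remaining + offset
}

func main() {
	// Table-style cases in the spirit of a regression test.
	cases := []struct{ index, remaining, offset int }{
		{0, 5, 0}, // fresh scan: nothing skipped
		{0, 3, 2}, // resumed scan: 2 of 5 repos already done
		{2, 3, 2}, // last repo of the resumed scan
	}
	for _, c := range cases {
		current, total := reportedProgress(c.index, c.remaining, c.offset)
		fmt.Printf("index=%d remaining=%d offset=%d -> %d of %d\n",
			c.index, c.remaining, c.offset, current, total)
	}
}
```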

mcastorina

Thanks for adding the test! Just one thing needs to be fixed before merging.

trufflesteeeve requested review from bill-rich, mcastorina, dustin-decker and ahrav June 7, 2022 15:59

trufflesteeeve force-pushed the allow-gitlab-progress-resuming branch from 5076e91 to 5494b63 Compare June 7, 2022 16:26

trufflesteeeve added 2 commits June 10, 2022 11:55

Separate repo resume functions into reusable functions in the sources package

bf8e54b

Allow gitlab to resume from encoded resume info

da38994

trufflesteeeve force-pushed the allow-gitlab-progress-resuming branch from 5494b63 to da38994 Compare June 10, 2022 15:55

fixup - fix bug where an inaccurate number of repos would be reported

dd37530

mcastorina approved these changes Jun 16, 2022


fixup - add fixes from review, including progress test

cd1d6bd

mcastorina requested changes Jun 16, 2022


fixup - fix logic bug in getRepos

1e4688d

mcastorina approved these changes Jun 17, 2022


trufflesteeeve merged commit 10f4d02 into main Jun 17, 2022

trufflesteeeve deleted the allow-gitlab-progress-resuming branch June 17, 2022 15:45
