-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow gitlab to resume from encoded resume info #611
Conversation
5076e91
to
5494b63
Compare
5494b63
to
da38994
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! Just had some optional nits / suggestions.
index := -1 | ||
for i, repo := range resumeRepos { | ||
if repoURL == repo { | ||
index = i |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit (optional): I know this is simply moving code around, but we can break
out of this loop once we found the repo. Alternatively, since the resumeRepos
are sorted, we could do a binary search using sort.SearchStrings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah that's neat. I do like the break
. But looking at sort.SearchStrings
, it could return an index that we wouldn't want to use, because it doesn't actually contain the repoURL
, and we'd have to check that that index did exactly equal the repo. But still very cool to know about.
pkg/sources/gitlab/gitlab.go
Outdated
func (s *Source) getRepos() ([]*url.URL, []error) { | ||
var validRepos []*url.URL | ||
func (s *Source) getRepos() ([]string, []error) { | ||
var validRepos []string | ||
var errs []error | ||
if len(s.repos) > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit (optional): I prefer structuring code to "fail early" which helps keep nested indentation low.
if len(s.repos) == 0 {
return nil, nil
}
// rest of logic that was previously inside the `if len(s.repos) > 0` block
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally agree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This if
condition must be inverted.
for i, u := range repos { | ||
// If there is resume information available, limit this scan to only the repos that still need scanning. | ||
reposToScan, progressIndexOffset := sources.FilterReposToResume(s.repos, s.GetProgress().EncodedResumeInfo) | ||
s.repos = reposToScan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Is it okay to be possibly dropping some information here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you referring to the dropping of the repositories that have already been scanned?
If so, I think it's fine. We don't actually store that information anywhere, as it's generated at the start of a scan. And the progressIndexOffset will allow us to get the proper count of the original number of repos.
However if you're referring to the potential to drop repos that may have been added between the time the scan started and when it was picked back up, I think that is okay. The goal here is to reduce the amount of time it takes to finish the scan, so the next scan will happen faster, and that scan will pick up the new repository.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was referring to dropping of the repositories that have already been scanned, but it's good to know you've thought of other scenarios!
pkg/sources/gitlab/gitlab.go
Outdated
// The repo normalization has already successfully parsed the URL at this point, so we can ignore the error. | ||
u, _ := url.ParseRequestURI(repo) | ||
validRepos = append(validRepos, u) | ||
validRepos = append(validRepos, repo) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: I'm not sure if we can trust the repo
url directly or if we should do url.ParseRequestURI
followed by String()
. I would think parsing would help normalize the input, but perhaps it's redundant at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah. The repo URL was already normalized when giturl.NormalizeGitlabRepo(prj)
was run on it above. That eventually calls url.Parse
here:
trufflehog/pkg/giturl/giturl.go
Line 41 in 4218c39
parsed, err := url.Parse(repoURL) |
Though now that you mention it, maybe giturl
should be using url.ParseRequestURI
instead of just url.Parse
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm.. I'm not sure the difference between the two. Looking at the docs, ParseRequestURI
assumes the URL came from an HTTP request and mentions
The string url is assumed not to have a #fragment suffix.
I wasn't entirely sure what that meant so I did a quick test here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm. So I think this is supplied by an individual in configuration, so probably best to stick with url.Parse
.
I think assuming it comes from a HTTP request means that it's the URL that was requested from the server? And anything after the #
(the fragment) shouldn't be included, so it looks like it doesn't handle the #
normally. Also, cool test!
s.resumeInfoSlice = append(s.resumeInfoSlice, repoURL) | ||
sort.Strings(s.resumeInfoSlice) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit (optional): We could more efficiently insert repoURL
into the slice since it should already be sorted, though I kind of like the stability of sorting every time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. Maybe something to keep an eye on, as though I don't expect anyone to have like 10k+ repos to sort nor the concurrency to have that many listed in the resume info, you never know.
@@ -418,5 +418,6 @@ func (s *Source) setProgressCompleteWithRepo(index int, repoURL string) { | |||
// Make the resume info string from the slice. | |||
encodedResumeInfo := sources.EncodeResumeInfo(s.resumeInfoSlice) | |||
|
|||
s.SetProgressComplete(index, len(s.repos), fmt.Sprintf("Repo: %s", repoURL), encodedResumeInfo) | |||
// Add the offset to both the index and the repos to give the proper place and proper repo count. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: It would be nice to have a test for this to prevent regressions. If adding it would take too much time / effort, let's open an issue so it can be planned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the test! Just one thing needs to be fixed before merging.
This also includes a commit to pull out the common functions between gitlab and github into potentially reusable ones for other sources. I put it in
resume.go
in the sources package, but could definitely see it making more sense elsewhere.