Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Respect Co-authored-by trailers in the contributors graph #30925

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

kemzeb
Copy link
Contributor

@kemzeb kemzeb commented May 9, 2024

Resolves #29183.

Added an automated test (and refactored the test code around it) and also manually tested the Contributors, Commit Frequency, and Recent Commits pages in a few small repos and the output appears to be fine.

@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label May 9, 2024
@pull-request-size pull-request-size bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 9, 2024
@github-actions github-actions bot added the modifies/go Pull requests that update Go code label May 9, 2024
@kemzeb
Copy link
Contributor Author

kemzeb commented May 9, 2024

Now that I am looking at into it further, I think Pulse may need to account for co-authors too. I think this function is the one that needs updating

}

email = strings.TrimRight(parts[1], ">")
if _, err := mail.ParseAddress(email); err != nil {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since users can add anything between the <> brackets I believed that we could try to throw away lines that contain invalid email addresses.

Though while testing by creating blank repos and merging the commit history of existing projects into these repos I observed that some emails associated to bots will be rejected (e.g. the pre-commit bot on GitHub). I guess this happens because these emails don't follow RFC 5322. Not sure if I should do this validation check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depends on how we parse other trailer lines I suppose. I would make it consistent with those.

@kemzeb
Copy link
Contributor Author

kemzeb commented May 9, 2024

Hmm I'm not sure how common it is, but should I account for duplicate Co-authored-by entries i.e. entries that have the same email in a single commit?

@lunny lunny added this to the 1.23.0 milestone May 10, 2024
@@ -125,8 +127,7 @@ func getExtendedCommitStats(repo *git.Repository, revision string /*, limit int
_ = stdoutWriter.Close()
}()

gitCmd := git.NewCommand(repo.Ctx, "log", "--shortstat", "--no-merges", "--pretty=format:---%n%aN%n%aE%n%as", "--reverse")
// AddOptionFormat("--max-count=%d", limit)
gitCmd := git.NewCommand(repo.Ctx, "log", "--shortstat", "--no-merges", "--pretty=format:---%n%aN%n%aE%n%as%n%(trailers:key=Co-authored-by,valueonly=true)", "--reverse")
Copy link
Member

@silverwind silverwind May 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked compat on this trailers option and it does appear to be present in git 2.27 at least which is the oldest version where git's site has docs for, so it's likely fine to use:

https://git-scm.com/docs/git-log/2.27.0

@kemzeb
Copy link
Contributor Author

kemzeb commented May 11, 2024

Recent change made the co-author parsing logic more robust to inputs where the co-author's name may include a '<'.

I manually tested all the code paths for parseCoAuthorTrailerValue() and they appear fine. Even though it is unexported, should I just include unit tests for it since the other test I added doesn't cover all its paths?

@silverwind
Copy link
Member

Hmm I'm not sure how common it is, but should I account for duplicate Co-authored-by entries i.e. entries that have the same email in a single commit?

I guess duplicates (by email) should count as one.

@silverwind
Copy link
Member

silverwind commented May 13, 2024

Recent change made the co-author parsing logic more robust to inputs where the co-author's name may include a '<'.

I manually tested all the code paths for parseCoAuthorTrailerValue() and they appear fine. Even though it is unexported, should I just include unit tests for it since the other test I added doesn't cover all its paths?

I'm reminded of a recent vulnerability GitHub had in their commit message parser surrounding these meta lines. Hope we don't replicate it, maybe you can double-check and add a test.

@silverwind
Copy link
Member

BTW is it feasible to move all commit message parsing into a shared function? As I see it, parsing these Co-Authored-By lines is not specific to Commit graph and could be useful elsewhere too.

@pull-request-size pull-request-size bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 14, 2024
@kemzeb
Copy link
Contributor Author

kemzeb commented May 14, 2024

Recent changes exports the trailer parsing logic and adds adds parameterized tests for it. We now also handle duplicated Co-authored-by entries (i.e. entries that have the same email).

@kemzeb
Copy link
Contributor Author

kemzeb commented May 14, 2024

I'm reminded of a recent vulnerability GitHub had in their commit message parser surrounding these meta lines. Hope we don't replicate it, maybe you can double-check and add a test.

Nice read! It appears it is caused by their parser's use of regex, where it iterates a commit's metadata lines looking for its first match with a line that starts with author. From what I understand the main problem is that you can author a commit without a name (while working in a Codespaces environment) where the parser would mistakenly not catch; it would then continue iterating over the commit metadata and eventually find a false author line that was added by the attacker.

I have included some tests for my parser and I'm open to include more possible inputs.

}
coAuthorName, coAuthorEmail, err := util.ParseCommitTrailerValueWithAuthor(line)
if err != nil {
continue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to have a log here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how frequent it is for commits to have invalid Co-authored-by trailers but if it won't cause a problem for admins we could have this as maybe an INFO/DEBUG level log

// Note that it only parses the value and does not consider the trailer key i.e. we just
// parse something like the following:
//
// Foo Bar <foobar@example.com>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it's a standard format, why not use mail.ParseAddress directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I originally was worried about using this since I wasn't too sure of what mail.ParseAddress accepted (i.e. wasn't familiar with RFC 5322) as input but looking at it now it seems to be a good solution compared to having us parsing the input manually.

It may still make sense to have a utility function around since it looks like we can accept a trailer with no name e.g.:

addrStr := "<hello@world>"
addr, _ := mail.ParseAddress(addrStr)
fmt.Printf("%s", addr.Address)
// hello@world
// Program exited.

Here we can throw an error if addr.Name is empty.

if err != nil {
continue
}
if _, exists := emailSet[coAuthorEmail]; exists {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a hash map here to account trailers with duplicate emails. Is it safe that this hash map is unbounded? If so, should we just ignore all the co-author data for this commit if they reach some certain bound?

@kemzeb
Copy link
Contributor Author

kemzeb commented Oct 9, 2024

I realized I need to account for the case where when we are iterating over the trailers we encounter co-author email = commit author email. The current implementation would count this as 2 commits for the commit author. Will make the necessary changes for this and address thte PR feedback as soon as I get the time.

@lunny lunny removed this from the 1.23.0 milestone Oct 13, 2024
@lunny lunny added this to the 1.24.0 milestone Oct 13, 2024
@kemzeb
Copy link
Contributor Author

kemzeb commented Oct 27, 2024

Will need to do the following when I get the time:

  • Revert and reimplement changes to user2/repo2 to avoid adding too many git objects (and because I accidentally changed a commit that was already there)
  • Move the authorEmail == coAuthorEmail below the error check

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. modifies/go Pull requests that update Go code size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

contributors graph respect "Co-authored-by"
4 participants