-
-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Respect Co-authored-by trailers in the contributors graph #30925
base: main
Are you sure you want to change the base?
Respect Co-authored-by trailers in the contributors graph #30925
Conversation
Now that I am looking at into it further, I think Pulse may need to account for co-authors too. I think this function is the one that needs updating |
} | ||
|
||
email = strings.TrimRight(parts[1], ">") | ||
if _, err := mail.ParseAddress(email); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since users can add anything between the <> brackets I believed that we could try to throw away lines that contain invalid email addresses.
Though while testing by creating blank repos and merging the commit history of existing projects into these repos I observed that some emails associated to bots will be rejected (e.g. the pre-commit bot on GitHub). I guess this happens because these emails don't follow RFC 5322. Not sure if I should do this validation check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depends on how we parse other trailer lines I suppose. I would make it consistent with those.
Hmm I'm not sure how common it is, but should I account for duplicate |
@@ -125,8 +127,7 @@ func getExtendedCommitStats(repo *git.Repository, revision string /*, limit int | |||
_ = stdoutWriter.Close() | |||
}() | |||
|
|||
gitCmd := git.NewCommand(repo.Ctx, "log", "--shortstat", "--no-merges", "--pretty=format:---%n%aN%n%aE%n%as", "--reverse") | |||
// AddOptionFormat("--max-count=%d", limit) | |||
gitCmd := git.NewCommand(repo.Ctx, "log", "--shortstat", "--no-merges", "--pretty=format:---%n%aN%n%aE%n%as%n%(trailers:key=Co-authored-by,valueonly=true)", "--reverse") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked compat on this trailers
option and it does appear to be present in git 2.27 at least which is the oldest version where git's site has docs for, so it's likely fine to use:
Recent change made the co-author parsing logic more robust to inputs where the co-author's name may include a '<'. I manually tested all the code paths for |
I guess duplicates (by email) should count as one. |
I'm reminded of a recent vulnerability GitHub had in their commit message parser surrounding these meta lines. Hope we don't replicate it, maybe you can double-check and add a test. |
BTW is it feasible to move all commit message parsing into a shared function? As I see it, parsing these Co-Authored-By lines is not specific to Commit graph and could be useful elsewhere too. |
Recent changes exports the trailer parsing logic and adds adds parameterized tests for it. We now also handle duplicated Co-authored-by entries (i.e. entries that have the same email). |
Nice read! It appears it is caused by their parser's use of regex, where it iterates a commit's metadata lines looking for its first match with a line that starts with I have included some tests for my parser and I'm open to include more possible inputs. |
} | ||
coAuthorName, coAuthorEmail, err := util.ParseCommitTrailerValueWithAuthor(line) | ||
if err != nil { | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to have a log here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how frequent it is for commits to have invalid Co-authored-by
trailers but if it won't cause a problem for admins we could have this as maybe an INFO/DEBUG level log
// Note that it only parses the value and does not consider the trailer key i.e. we just | ||
// parse something like the following: | ||
// | ||
// Foo Bar <foobar@example.com> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it's a standard format, why not use mail.ParseAddress
directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I originally was worried about using this since I wasn't too sure of what mail.ParseAddress
accepted (i.e. wasn't familiar with RFC 5322) as input but looking at it now it seems to be a good solution compared to having us parsing the input manually.
It may still make sense to have a utility function around since it looks like we can accept a trailer with no name e.g.:
addrStr := "<hello@world>"
addr, _ := mail.ParseAddress(addrStr)
fmt.Printf("%s", addr.Address)
// hello@world
// Program exited.
Here we can throw an error if addr.Name
is empty.
if err != nil { | ||
continue | ||
} | ||
if _, exists := emailSet[coAuthorEmail]; exists { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a hash map here to account trailers with duplicate emails. Is it safe that this hash map is unbounded? If so, should we just ignore all the co-author data for this commit if they reach some certain bound?
I realized I need to account for the case where when we are iterating over the trailers we encounter co-author email = commit author email. The current implementation would count this as 2 commits for the commit author. Will make the necessary changes for this and address thte PR feedback as soon as I get the time. |
Will need to do the following when I get the time:
|
Resolves #29183.
Added an automated test (and refactored the test code around it) and also manually tested the Contributors, Commit Frequency, and Recent Commits pages in a few small repos and the output appears to be fine.