Skip to content
This repository has been archived by the owner on Apr 12, 2019. It is now read-only.

Faster commit lookup #53

Merged
merged 8 commits into from
May 26, 2017
Merged

Faster commit lookup #53

merged 8 commits into from
May 26, 2017

Conversation

ethantkoenig
Copy link
Member

@ethantkoenig ethantkoenig commented May 23, 2017

A faster implementation of GetCommitsInfo, addresses go-gitea/gitea#491 and go-gitea/gitea#502.

The previous implementation made a call to git log for each entry, each of which required scanning through the commit history. This new implementation instead makes a single call to git log. This is faster, because it involves scanning the commit history only once.

BENCHMARK RESULTS:
Shoutout to @sapk for the benchmark tests he wrote (#54), which I have stolen here.

Old implementation:

BenchmarkEntries_GetCommitsInfo/gitea-4         	     200	  70670229 ns/op
BenchmarkEntries_GetCommitsInfo/manyfiles-4     	       1	35907609610 ns/op
BenchmarkEntries_GetCommitsInfo/moby-4          	      50	 204585469 ns/op
BenchmarkEntries_GetCommitsInfo/go-4            	      20	 640497304 ns/op
BenchmarkEntries_GetCommitsInfo/linux-4         	      20	 829981474 ns/op

New implementation:

BenchmarkEntries_GetCommitsInfo/gitea-4         	     500	  34284443 ns/op
BenchmarkEntries_GetCommitsInfo/manyfiles-4     	      30	 424753714 ns/op
BenchmarkEntries_GetCommitsInfo/moby-4          	     100	 109430547 ns/op
BenchmarkEntries_GetCommitsInfo/go-4            	      30	 521336982 ns/op
BenchmarkEntries_GetCommitsInfo/linux-4         	      20	 813171326 ns/op

IMPLEMENTATION DETAILS:
Gets the 16 latest commits affecting the relevant entries (git log --name-only -16 HEAD -- <treePath>). The output of this command containing which files were affect by each commit. Scan through this list of cmmit, and stop once a commit has been found for each entry.

If you go through the first batch of 16 commits, you get then next 32 commits (git log --name-only -32 <last-commit-from-first-batch>^ -- entry1 entry2 ...); each time you double the number of commits. This ensures that in the common case you'll only have to read in a small number (16) of commits, but you also won't have to make too many calls to git log if you need to go further into the commit history.

Finally, if you are looking for 32 or fewer entries, manually list out each entry in the git log command (git log --name-only -- entry1 entry2...) to support a more targeted search.

@lunny
Copy link
Member

lunny commented May 23, 2017

It seems there are some trace lines.

@ethantkoenig
Copy link
Member Author

Trace lines removed, sorry about that

@lunny lunny removed the status/wip label May 23, 2017
@lunny
Copy link
Member

lunny commented May 23, 2017

Are there some performance tests?

@ethantkoenig
Copy link
Member Author

Other than what I've run manually, no.

@sapk
Copy link
Member

sapk commented May 24, 2017

We could add golang benchmark or more simple benchmark like a script in a folder contrib ?

@sapk sapk mentioned this pull request May 24, 2017
@sapk
Copy link
Member

sapk commented May 24, 2017

Maybe I have made a mistake in my benchmark (#54) but I don't find optimization on some big repo.

Before PR :
before PR

After PR :
after PR

Edit:
For information: those tests were run on a server with a heavy load on fs in parallel maybe that the cause.

@ethantkoenig
Copy link
Member Author

@sapk Thanks for writing the tests, I'll look into it

@ethantkoenig
Copy link
Member Author

If you try a repo with many files (e.g. https://github.com/ethantkoenig/manyfiles), you should see a noticeable speed-up (old: 50 seconds, new: 5 seconds on my laptop).

I'll try to look for how to make my implementation comparable to the old implementation for non-pathological cases.

@ethantkoenig
Copy link
Member Author

ethantkoenig commented May 25, 2017

@sapk I've found a faster implementation that improves all 5 benchmarks test (see PR description). Let me know what you think

panic(err)
}
entries.Sort()
b.Run(benchmark.name, func(b *testing.B) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This need Go1.7. I think we support at least Go1.6 for gitea but I can't find where it is written ^^

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this repo already doesn't support Go1.6, since we use the "context" package in command.go.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was thinking this change wasn't merge because of that.

{url: "https://github.com/torvalds/linux.git", name: "linux"},
}
for _, benchmark := range benchmarks {
var commit *Commit
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should still b.StopTimer() ... b.StartTimer() around init.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

var commit *Commit
var entries Entries
if repoPath, err := setupGitRepo(benchmark.url, benchmark.name); err != nil {
panic(err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have found that it is cleaner to use : b.Fatal(err)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You must have forget to commit ? there were no change maid here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops 🤦‍♂️, see #55

const benchmarkReposDir = "benchmark_repos/"

func setupGitRepo(url string, name string) (string, error) {
repoDir := filepath.Join(benchmarkReposDir, name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to ask for tempdir via ioutil.TempDir but migth not be the choice of everyone.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't want to have to re-clone the repositories each time the tests run; it took me several minutes to clone them all.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so maybe /benchmark/repos and exclude /benchmark, if we include later other benchmark with other resources.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@sapk
Copy link
Member

sapk commented May 25, 2017

Globally, I haven't review implementation (yet), benchmark show some little improvement that would be good to have. I made some comments on the benchmark part. I would have prefer that you cherry-pick my commits/PR and made change after but it could pass for this PR ^^.

@ethantkoenig
Copy link
Member Author

@sapk Rebased to include your commits, sorry about that

@sapk
Copy link
Member

sapk commented May 26, 2017

@ethantkoenig couldn't this have some concurrency like old code in order to speed up again ? I know that more routine is not always the solution and that maybe you already test that.

@ethantkoenig
Copy link
Member Author

@sapk I don't think this implementation is as amenable to concurrency as the old one. As far as I can tell, the main benefit from using concurrency in the old implementation was to make calls to git log ... in parallel. The new implementation makes far fewer calls to git log ..., and these calls must occur in succession (each one depends on the result of the previous one), so I don't think we'd see much gain from introducing concurrency.

@sapk
Copy link
Member

sapk commented May 26, 2017

@ethantkoenig looking at it could have some concurrency but with a different format/strategy. For example a controller starting routines executing git log history and returning by chan the path matching. The controller stop starting new routine when all path are completed.

This could be done later. This PR give an improvement and LGTM. (except panic() that need to be changed)

This PR could have a test to check regression but I haven't a good idea how to do that so maybe another time. ^^

@lunny
Copy link
Member

lunny commented May 26, 2017

LGTM

@lunny lunny merged commit ec4446b into go-gitea:master May 26, 2017
@ethantkoenig ethantkoenig deleted the commit_lookup branch May 26, 2017 15:49
@ethantkoenig ethantkoenig mentioned this pull request Nov 19, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants