Skip to content
This repository has been archived by the owner on Sep 11, 2020. It is now read-only.

revlist is slow for non-local storage layers #909

Open
strib opened this issue Aug 7, 2018 · 5 comments
Open

revlist is slow for non-local storage layers #909

strib opened this issue Aug 7, 2018 · 5 comments

Comments

@strib
Copy link
Contributor

strib commented Aug 7, 2018

Over in keybase/client#12730, we have a user of Keybase Git who noticed it can be very slow (and CPU-intensive) to fetch a single new commit from certain repos. Under the cover this is powered by a go-git Remote.Push() call, which uses the Keybase storage layer as the Remote.Storer instance, and uses the local git checkout as the remote side of the connection. (So the "fetch" becomes a push from the Keybase repo to the local checkout.)

The push is implemented by first getting the advertised reference heads from the local git checkout, and passing those commits as the ignore list to revlist.Objects(), along with the commit hashes being pushed out of Keybase repo. The performance problem seems to happen here, which walks the entire commit history and the entire directory tree for each of the commits, in order to build up a more complete ignore list for the real revlist computation a few lines below.

This works acceptably fast when the storage layer is a local git repo, but when it's a distributed storage layer like KBFS, this can be very slow. And it seems unnecessary -- most of the ignored object IDs likely won't be used, since the tree-walking at the real computation will bail out at the first seen object it sees while walking the tree, which for most single-commit fetches will be most of the entries of the root directory.

I think there is a better algorithm for this. I started playing around with one here, but it's pretty stupid and while it does definitely improves things for Keybase, it's not by as much as I think it should be able to.

I haven't worked out a complete solution yet, as I am hoping for dev feedback first. But I think the outline could be something like this:

  1. In revlist.Objects, for each object in objs, find the most-recent parent commit ID in ignore. Call the set of these commits parents.
  2. For each commit in parents, compute a git diff-tree-like algorithm against the corresponding commit in objs. (I'm not sure if this algorithm exists yet in go-git anywhere.) This would traverse and compare the directory entries between the two commits, and find the exact places they differ (traversing only as deeply as needed in each tree to figure out the differences).
  3. The list of similarities between the two are added to the ignore list, along with all the parents.
  4. When constructing the final set of objects for the revlist using this new ignore list, reachableObjects should return early as soon as it reaches a commit in the ignore list, and the pending list is empty.

I could be wrong, but I think this is a more efficient algorithm in all cases. Of course, it requires a bit more code complexity.

What do the devs think? cc: @smola @erizocosmico @mcuadros

@strib
Copy link
Contributor Author

strib commented Sep 10, 2018

@smola @erizocosmico @mcuadros or anyone else, any thoughts on this issue? I'd like to improve the performance here, but I'm hesitant to put a lot of effort into something that won't be accepted upstream, or which might be introducing bugs due to my minimal lack of git protocol knowledge. Thanks!

@smola
Copy link
Collaborator

smola commented Sep 14, 2018

@strib I think we would favor adopting the same or similar algorithm to what git and jgit have. We initially took this shortcut of doing two full object traversals, but both git and jgit seem to be quite smarter. You can search their code base for the terms mark, interesting and uninteresting. I did not look in-depth at neither of them yet though.

An alternative/complementary approach is implementing support to generate pack bitmap indexes and use them when calculating reachable objects. Here is the format spec:

@strib
Copy link
Contributor Author

strib commented Sep 14, 2018

@smola thanks for the response and the suggestions. The bitmap index thing is interesting, though it might be a challenge for Keybase specifically since there's no real git "server" for us, just a bunch of distributed clients coordinating with one shared storage layer. Certainly it would be possible to use locks in a smart way to elect a single client to generate the map (which is basically what we do already for updating a reference), but it might be a bit hacky to drop that into the go-git code.

When I (or someone else on our team) have time to work on this, we'll look closely at what the existing implementations do and try to adapt them to go-git. Thanks!

@strib
Copy link
Contributor Author

strib commented Jan 23, 2019

@smola I did some research into what git and jgit do.

git:

For each reference commit on the remote, it marks that commit and its 5 immediate parents as "uninteresting":
https://github.com/git/git/blob/16a465bc018d09e9d7bbbdc5f40a7fb99c21f8ef/revision.c#L937-L938
https://github.com/git/git/blob/16a465bc018d09e9d7bbbdc5f40a7fb99c21f8ef/revision.c#L1099-L1105

Then, for each of those uninteresting commits, it marks the entire tree and all blobs "uninteresting" (skipping those already marked of course):
https://github.com/git/git/blob/16a465bc018d09e9d7bbbdc5f40a7fb99c21f8ef/list-objects.c#L237-L239

So overall this is similar to go-git except that it limits it to only 5 commits, rather than the full commit history.

jgit:

I'm less sure about my read here, but I think it's similar to the above, except that instead of 5 commits, it uses 500:
https://github.com/eclipse/jgit/blob/f40b39345cd9b54473ee871bff401fe3d394ffe3/org.eclipse.jgit/src/org/eclipse/jgit/revwalk/RevCommit.java#L73
https://github.com/eclipse/jgit/blob/f40b39345cd9b54473ee871bff401fe3d394ffe3/org.eclipse.jgit/src/org/eclipse/jgit/revwalk/RevCommit.java#L238-L245
https://github.com/eclipse/jgit/blob/f40b39345cd9b54473ee871bff401fe3d394ffe3/org.eclipse.jgit/src/org/eclipse/jgit/revwalk/ObjectWalk.java#L312-L315

Experiment:

I tried a quick implementation that limited the commit depth to 5 in go-git, and ran it against a known bad test case with our distributed storage layer. The results were somewhat underwhelming -- a 10-30% reduction in time to "push" one new commit, but the overall time spent was still way larger than it we want. So this change alone is probably not good enough for us. But I can clean it up and submit it as a PR here if you think it would be useful.

Other thoughts:

In our case, it would be way faster to traverse the commit trees of the "remote refs" via the "remote" repo itself (which for our purposes, as explained above, is really the local git repo). So I think I'm going to explore some hacky way to allow the remote repo to get the ignore list that way instead. For example, we can make the revlist to take an optional second storage layer to use to find the ignore list, and if Remote.PushContext() sees the remote is backed by a local file system, create and pass in a storage layer backed by that local repo. Not sure if go-git would accept a change like that, but we can maintain it in our fork if needed.

Thanks!

strib added a commit to keybase/go-git that referenced this issue Feb 11, 2019
`ObjectsWithStorageForIgnores` is the same as `Objects`, but a
secondary storage layer can be provided, to be used to finding the
full set of objects to be ignored while finding the reachable objects.
This is useful when the main `s` storage layer is slow and/or remote,
while the ignore list is available somewhere local.

Issue: src-d#909
Signed-off-by: Jeremy Stribling <strib@alum.mit.edu>
strib added a commit to keybase/go-git that referenced this issue Feb 11, 2019
This factors out some URL-parsing code from the transport layer so it
can be used by config as well.

Issue: src-d#909
Signed-off-by: Jeremy Stribling <strib@alum.mit.edu>
strib added a commit to keybase/go-git that referenced this issue Feb 11, 2019
Issue: src-d#909
Signed-off-by: Jeremy Stribling <strib@alum.mit.edu>
@strib
Copy link
Contributor Author

strib commented Feb 11, 2019

I just put up #1066 for review, which resulted in a huge performance win for fetches/pull in Keybase git. I hope you'll consider merging it!

strib added a commit to keybase/go-git that referenced this issue Feb 11, 2019
`ObjectsWithStorageForIgnores` is the same as `Objects`, but a
secondary storage layer can be provided, to be used to finding the
full set of objects to be ignored while finding the reachable objects.
This is useful when the main `s` storage layer is slow and/or remote,
while the ignore list is available somewhere local.

Issue: src-d#909
Signed-off-by: Jeremy Stribling <strib@alum.mit.edu>
strib added a commit to keybase/go-git that referenced this issue Feb 11, 2019
This factors out some URL-parsing code from the transport layer so it
can be used by config as well.

Issue: src-d#909
Signed-off-by: Jeremy Stribling <strib@alum.mit.edu>
strib added a commit to keybase/go-git that referenced this issue Feb 11, 2019
Issue: src-d#909
Signed-off-by: Jeremy Stribling <strib@alum.mit.edu>
strib added a commit to keybase/go-git that referenced this issue Feb 11, 2019
`ObjectsWithStorageForIgnores` is the same as `Objects`, but a
secondary storage layer can be provided, to be used to finding the
full set of objects to be ignored while finding the reachable objects.
This is useful when the main `s` storage layer is slow and/or remote,
while the ignore list is available somewhere local.

Issue: src-d#909
Signed-off-by: Jeremy Stribling <strib@alum.mit.edu>
strib added a commit to keybase/go-git that referenced this issue Feb 11, 2019
This factors out some URL-parsing code from the transport layer so it
can be used by config as well.

Issue: src-d#909
Signed-off-by: Jeremy Stribling <strib@alum.mit.edu>
strib added a commit to keybase/go-git that referenced this issue Feb 11, 2019
Issue: src-d#909
Signed-off-by: Jeremy Stribling <strib@alum.mit.edu>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants