# Minimal implementation of partial clones & promisor objects #1046
Thanks a lot for putting this feature on the radar permanently, and for starting the discussion about it.
There are various …

**How object access works across crate boundaries and object access**
Even though this represents the typical case of 'fetch one object at a time', which can be retrofitted quite easily, I'd recommend making this a shared trait that represents an object database. That way, multi-fetching facilities can be provided later. A great candidate for using these would be …

Of course, to get the index one has to build it from a tree, which now may have promisor trees on the way to the blobs/leaves, so building an index from a tree would have to be adjusted first. All of the above implicitly assumes that fetches with filters work in the first place.

**Efficiency**

What matters most to me is that it's easy to determine whether promisor objects are present at all, so that any kind of 'project-gathering-dry-run' can be bypassed unless it's known to be required. In other words, everything should stay as fast as it is without promisor objects. I think this shouldn't be a problem though, as it's easy to determine whether promisor objects are present in the ODB or not.
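As a concrete illustration of that shared-trait idea, here is a minimal sketch (all type and method names are hypothetical, not existing gix APIs) of an object-database trait whose default multi-fetch falls back to one-at-a-time lookups, leaving room for promisor-aware implementations to batch requests later:

```rust
use std::collections::HashMap;

/// Hypothetical stand-ins for `gix_hash::ObjectId` and a real error enum.
type ObjectId = [u8; 20];
#[derive(Debug)]
struct OdbError;

/// Sketch of a shared trait representing an object database, so callers on
/// either side of a crate boundary aren't tied to one concrete store.
trait ObjectDatabase {
    /// The typical, easily retrofitted 'fetch one object at a time' case.
    fn find_object(&self, id: &ObjectId) -> Result<Vec<u8>, OdbError>;

    /// Multi-fetching facility; promisor-aware implementations can override
    /// this to batch all ids into a single transaction with the remote.
    fn find_objects(&self, ids: &[ObjectId]) -> Result<HashMap<ObjectId, Vec<u8>>, OdbError> {
        ids.iter()
            .map(|id| self.find_object(id).map(|data| (*id, data)))
            .collect()
    }
}
```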
Thanks @Byron. Things are slowly starting to get pieced together in my head.
That sounds like a good starting point to me. After playing around with the various …
I see three possible levels of connectivity checks, ranging from the simplest/fastest to the most complex/slowest (I'm ignoring some less-common ones for now):
How about I start with just 1 & 2 above? Those seem relatively straightforward, and I believe they'd satisfy the immediate goal of making it easier to implement & test promisor objects. The third one can come later. If that sounds reasonable, then my next question would be: where should this implementation go? Wherever it ends up, it will need to utilize functionality from several existing crates (via the …).
That's clever - I forgot …
Do I understand correctly that the reason the connectivity check is required is that one has to create the …
Maybe it's just me misunderstanding what you meant to say, but in general, …

When I thought about connectivity checks (independently of promisor packs or indices) I always thought that this should be integrated tightly with the current pack resolution machinery. It sees all objects in the packs and learns about their hashes (these have to go into the index after all), and it's perfectly multi-threaded. Ideally, connectivity checks could be implemented with minimal performance loss. Since the …

I know, I know, all that sounds like trouble, and a separate walk-based implementation would be easier to do, but it would also be so much slower. Maybe with some thinking, connectivity checks can be built into the pack resolution, but if not, doing it in a separate step seems alright as well.
A connectivity check isn't required by any means, at least not for implementing promisor packs. I had thought you were recommending doing that as a starting point (which I do agree is a good idea) based on this comment:
I think this would definitely be helpful to have, as it would let me verify that the expected promisor objects are fetched/resolved during different operations on a partially cloned repository. Even before I have the fetch with filters implemented, a connectivity check would let me verify that …
Nope - this was totally my own misunderstanding of what you had said in your last comment about …
That sounds perfectly reasonable. I'll start digging into how this would best integrate into the current pack indexing & lookup architecture.
I thought that was already implemented here? But yes - …

Even so, the third connectivity check mode I mentioned in my last comment still has the added requirement of marking each object that is reachable/referenced from any other object; once the traversal is complete, it needs to report all objects that were not marked accordingly (i.e. "dangling objects"). While the traversal algorithm can (and should) certainly be highly parallelized, there is still a non-trivial memory requirement for maintaining the data structure of dangling objects.

I'm not planning on addressing that use-case of scanning for dangling objects right now, as it's entirely unrelated to partial clones. But it will be something to consider once features like garbage collection come to a head.

Thanks again @Byron! Please let me know if I've misunderstood anything, but otherwise I think I've got plenty of info to get me started.
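For illustration, a minimal sketch of that third mode under the stated assumptions (hypothetical helper signatures, not gix APIs): mark every object referenced by any other object, then report the unmarked remainder as dangling. The `marked` set is exactly the non-trivial memory requirement mentioned above:

```rust
use std::collections::HashSet;

type ObjectId = [u8; 20];

/// Report all objects that no other object references ("dangling objects").
/// `all_objects` yields every object in the ODB; `referenced_ids` yields the
/// ids an object points to (tree entries, commit parents/root tree, tag target).
fn dangling_objects(
    all_objects: &[ObjectId],
    referenced_ids: impl Fn(&ObjectId) -> Vec<ObjectId>,
) -> Vec<ObjectId> {
    let mut marked = HashSet::new();
    for id in all_objects {
        for referenced in referenced_ids(id) {
            marked.insert(referenced); // the non-trivial memory cost lives here
        }
    }
    all_objects
        .iter()
        .filter(|id| !marked.contains(*id))
        .copied()
        .collect()
}
```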
I think there are a couple of things worth untangling, from my perspective:
With all that said, I am not convinced that connectivity checks integrated into pack resolution are a good idea, and if truly needed, they should be done as a standalone capability. For testing and verification it might also be possible to start out with just using …

It might seem I am changing my mind a lot, which might be due to my utter lack of understanding of how …
Apologies for the delay! I had some other stuff come up, but have also been spending time trying to better understand the intricate details of …

Before I dive into what I've learned about partial clones in …
This is great, but even better is the fact that I did zero of that fearless parallelism you mentioned in #1041 - my naive implementation is single-threaded and written by someone entirely new to …

Now - about what I've learned about partial clones…

When we fetch a pack from a promisor remote, we end up with a … If you read the first bullet of Handling Missing Objects, it says the promisor file has "arbitrary contents". Upon digging into the …

In the same Handling Missing Objects section, the second bullet describes how …

All this looks good; however, it doesn't align with all the behavior I see in …

In …
This unfortunate behavior is explained by the third bullet in Handling Missing Objects... At the end of that bullet, you'll find:
I understand having to resort to this behavior if I had used a partial clone filter that filtered trees in addition to blobs, but …

@Byron I'd love to get an opinion on this. Right now my thinking for a minimal implementation is:
Thanks so much for the update - exciting! I love that you already cooked up a version of the connectivity check and that it apparently runs very fast, too! Too fast, if you ask me, as usually often-run algorithms in …

When looking at the times, I find both very slow anyway, as this is the …

**Missing Object Handling**
I think it's easy to do better than that, as this could simply be (more) configurable. Gitoxide already has its own …

**About potential efficiency**

Integrating a set of promisor objects for lookup would need to be done in the belly of the beast, which is the thread-safe part of the object database. That way, these objects can be shared among all handles, and one typically has a handle per thread at least. Updating promisor objects lazily is the way to go, and it should be possible to integrate it well with everything else. But it's going to be tricky, as the object database is a lawless zone where borrowchk has no power 😅 (and introducing races is easy unless great care is taken). But it's far from impossible as well.

An alternative implementation would be to keep the promisor dataset in the handle itself, so each thread would have to repeat the work necessary to build and maintain it - that's easy to implement as it's borrowchecked, but I guess one should try to do it properly first.

**Next Steps**

I seem to be missing a plan to integrate …
That seems reasonable; I just wanted to add that once an object is missing, one has searched all packs, and any number of these could be promisor packs. For perfect laziness, one would be able to tell the underlying shared ODB to find the given object in one of the promisor packs, whose 'promised' objects will be loaded on demand until the object is found. These promised objects would be kept in a shared read-only cache to accelerate future access. Of course, that also complicates the implementation, and I can imagine starting out with "load all promised objects, then check for membership" as well.
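A sketch of that simple starting point, assuming a `load_promised_ids` closure that stands in for scanning all promisor packs: the set is built lazily by the first caller and then shared read-only across threads:

```rust
use std::collections::HashSet;
use std::sync::OnceLock;

type ObjectId = [u8; 20];

/// Lazily-built, shared set of promised object ids: the first caller pays
/// for building it, all later callers (on any thread) share the result.
struct PromisedObjects {
    cache: OnceLock<HashSet<ObjectId>>,
}

impl PromisedObjects {
    fn new() -> Self {
        Self { cache: OnceLock::new() }
    }

    /// `load_promised_ids` stands in for scanning every promisor pack.
    fn is_promised(
        &self,
        id: &ObjectId,
        load_promised_ids: impl FnOnce() -> HashSet<ObjectId>,
    ) -> bool {
        self.cache.get_or_init(load_promised_ids).contains(id)
    }
}
```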
(emphasis mine) Probably not in parallel, as the ODB isn't allowed to spawn threads yet. It works by using the 'power' of the thread that uses the handle, and the current implementation can leverage this to speed up operations if a trampling horde comes in and needs to load packs - these will be loaded in parallel (if I remember correctly). But at first, I don't think there is a need to be super efficient; it's enough to have a single thread do the work and do so correctly. Trust me, the ODB core is a nightmare as borrowcheck isn't present and it's highly concurrent - things should be kept as simple as possible.

With that said, 'as simple as possible' might also mean to do it like git - assume a missing object is promised as long as there is a single promisor pack in the known set of packs. That seems like a fair assumption and would be my starting point - it saves the whole connectivity-check business and would allow the ODB core to stay mostly unchanged - it just needs a new flag to check if a promisor pack is there or not, and that's easy to implement correctly.

However, implementing the dynamic fetching of objects in such a way that it is thread-safe and correct, and ideally allows multiple threads to fetch objects at the same time - that's going to be an interesting problem to solve efficiently. Getting a first version done should be relatively straightforward though, as each thread (in the code of the handle) could just initiate its own connection and operate without shared state. It's just horribly inefficient to do so… fetch a pack with one object, then refresh the ODB to pick up the object, then search that pack specifically if you can, or re-search all packs. Horrible… for a single object.

You know, I think that even if it was implemented, having this in place would make any ordinary and unaware algorithm so painfully slow that people wouldn't want to wait around for the result anyway. And they'd end up with hundreds of packs of one object as well (unless, maybe, one explodes these into loose objects; that's fair and probably has to be done to be viable). Somehow I feel that on-demand loading of single objects shouldn't even be done, but instead one should make sure that each algorithm can deal with promisor objects. A graph-walk, of course, would be in trouble as it has to discover everything, but maybe that's a special case, too?

With all that said, please feel free to digest this into a potentially new (or changed) list of things you'd want to implement.
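On the 'new flag to check if a promisor is there' point: `git` marks a pack fetched from a promisor remote with a `<pack-name>.promisor` file next to the `.pack` in `objects/pack`, so the check can be a cheap directory scan. A minimal sketch in plain std (no gix API assumed):

```rust
use std::io;
use std::path::Path;

/// Returns true if `<git-dir>/objects/pack` contains at least one
/// `*.promisor` file, i.e. at least one pack came from a promisor remote.
fn has_promisor_pack(git_dir: &Path) -> io::Result<bool> {
    let pack_dir = git_dir.join("objects").join("pack");
    if !pack_dir.is_dir() {
        return Ok(false);
    }
    for entry in std::fs::read_dir(pack_dir)? {
        let path = entry?.path();
        if path.extension().map_or(false, |ext| ext == "promisor") {
            return Ok(true);
        }
    }
    Ok(false)
}
```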
I should clarify one thing here. My first attempt at doing this yielded results like so:
This is twice as slow as …
I also did some tests using one of the larger repositories my company has. It's proprietary, but I can assure you it is quite big. The bare repository consumes 11.5GB and includes well over 100K commits. Here's what things look like with that:
In that same repository, commenting out my call to …
I'd love any insight you may have on the conditions in which …
The more I experiment with this, the more I agree. As you said, controlling this via configuration would be ideal. I can imagine a different git implementation (…).

Take, for example, … Instead of …
I can understand some users dealing with smaller repositories being okay with the default behavior, but being able to configure different behavior would be fantastic IMO. Hopefully all algorithms can be made to be promisor-aware, in which case this is a non-issue.
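To make the configuration idea concrete, a sketch of a policy switch that algorithms could consult on a missing object - the key name `gitoxide.promisor.fetch` and its values are invented for this sketch, not existing gitoxide options:

```rust
/// What to do when a referenced object turns out to be missing. The config
/// key and value strings here are hypothetical.
#[derive(Clone, Copy, Debug, PartialEq)]
enum MissingObjectPolicy {
    /// Fail fast with 'not found - probably promised'.
    Error,
    /// Lazily fetch single objects from the promisor remote, like `git` does.
    LazyFetch,
}

fn policy_from_config(value: Option<&str>) -> MissingObjectPolicy {
    match value {
        Some("lazy-fetch") => MissingObjectPolicy::LazyFetch,
        _ => MissingObjectPolicy::Error, // fail fast by default
    }
}
```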
My core background is mostly C++, so the borrow checker is a new-found luxury to me. I'm more than happy to have a go at doing it properly first. If that proves to be too complicated, then I can fall back to a per-thread (…) implementation.
Happy to hear that! Let me go ahead and polish up what I have (e.g. make it not panic upon any sort of error condition) and then I'll get an MR up.
Yesssss - I like this very much!
I 100% agree here, and my immediate use-cases treat that sort of lazy & inefficient one-object-at-a-time fetching as an anti-goal. I'm more than happy to remove that from the task list. As described in my …
I'll update the task list in the original issue description now, but I'm also looking forward to continuing to iterate on ideas! Thanks again @Byron. I fully understand how much effort it takes to ramp up a new (prospective) contributor to a software project, and I'll do my best to make sure you haven't wasted your time!
There is no cheating, and turning off automatic refresh on miss seems like the right thing to do. I'd expect …

Looking at the distribution of …
I don't think there is any need; if it seems to work and is faster, it's a win I am happy to (tentatively) take :D.
I think …
I just repeat this here to say that I think we share the understanding that the ODB won't ever auto-fetch promised objects, but that there may be an API (not necessarily on the ODB though) to fetch a set of objects and deal with promisor-related things as needed.
I took a look (and updated the formatting a little) and would like to point out that I don't think any time should be spent on obtaining a set of promised objects, as it seems sufficient to assume that every missing (but referenced) object is promised. I further edited the task-list by striking through what I think should change, with annotations, and kindly ask you to acknowledge it by removing these edited portions or substituting them with something that fits better. It's really just an attempt to collaborate on the task list without me repeating/quoting everything here. If it doesn't work, please let me know, otherwise, great :).
That is great to hear you are dedicated to the feature, and I hope that you (and the company backing you) will be able to support partial clones from here on out :). On the bright side, even if that doesn't happen (nobody can predict the future), I hope we get to the point where we can:
With the building blocks above it will be possible for each algorithm to eventually pre-fetch objects as needed, and until then, at least fail relatively gracefully with 'object '' not found - it's probably promised'. From there, in piecemeal fashion, one can add promisor support wherever it is needed.

Edit: I can imagine that we'd want to add a function to …
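That 'fail relatively gracefully' building block could be as small as a lookup error that keeps 'missing' and 'missing but probably promised' apart - a sketch with hypothetical names:

```rust
use std::fmt;

type ObjectId = [u8; 20];

/// Keeps 'missing' and 'missing but probably promised' apart so callers can
/// react differently, or at least print the better message.
#[derive(Debug)]
enum LookupError {
    NotFound(ObjectId),
    ProbablyPromised(ObjectId),
}

impl fmt::Display for LookupError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            LookupError::NotFound(id) => write!(f, "object {:02x?} not found", id),
            LookupError::ProbablyPromised(id) => {
                write!(f, "object {:02x?} not found - it's probably promised", id)
            }
        }
    }
}
```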
@Byron Thanks for the (extensive) help on getting #1081 merged! I've started work on getting a minimal blob-filtered clone going - it was surprisingly not too difficult to get a (rough) first pass functioning as expected. My WIP branch is available here, but I'm just leaving this here as an FYI. There is still much to do on my end, and it isn't yet ready for review.
what is missing @cesfahani? this would most likely be the feature with the most value to speed up git clones in cargo, see rust-lang/cargo#11165.
@soloturn I have this all working in a fork, with some caveats:
The fork I've been most recently working out of is hosted in my company's private git service, but there's no reason I can't push it up to my GitHub fork. I've been meaning to do that - just haven't gotten around to it yet. I'm happy to get back into this now. I have a couple of PRs to make for simple bug fixes, then I'll need some input on how best to expose the API to allow an explicit set of objects to be fetched (without attempting any negotiation rounds, etc.). The way I did this in my fork was a total hack (duplicated a bunch of code), but I really just wanted to get something working. Expect to see some progress from me this week!
not sure if i understand correctly. the most practical flag, at least for rust's cargo, is:
as it would more easily permit checking out different versions. is this supported?
**Summary 💡**
Tracking a minimal implementation of partial clones and their corresponding promisor objects/remotes, as discussed in #1041.
**What is a partial clone?**

Partial clone is a recent(ish) feature of `git` that allows the client to fetch a subset of objects from a remote repository based on criteria (i.e. a "filter") of its choosing. The missing objects are referred to as "promisor objects", and are expected to be able to be provided by "promisor remotes" on demand after the clone, as needed.

The most common use-cases of partial clone are where the client requests a clone with either no historical blobs (e.g. `--filter=blob:none`) or only historical blobs under some size threshold (e.g. `--filter=blob:limit=512k`). Tree objects can also be filtered by a partial clone; however, that use-case is far less common.

**Lessons learned from `git`**

Because partial clone was retrofitted into `git`, there are several performance gaps that have not yet been resolved. Operations like `fetch` and `checkout` behave exactly as one would expect - the missing objects are fetched in a single transaction with the remote. Other operations, such as `blame` and `rebase`, do not do this, and instead end up lazily fetching missing objects one at a time (each with a separate transaction to the remote), which significantly slows things down.

To implement partial clones efficiently, operations that traverse history and require inspecting the contents of blobs and trees need to: …
That said, this feature does not aim to implement the optimized approach to partial clones across the board. However, we would like to see APIs designed to facilitate the optimized approach, and possibly one implementation of it to be used as a reference and proof that things can be made to work as expected.
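To ground the filter syntax mentioned above, a small sketch of parsing the two common blob filter specs (`blob:none` and `blob:limit=<n>[kmg]`), with type names of my own choosing:

```rust
/// The two common partial-clone filters discussed here; `git` can filter
/// trees as well, but that is out of scope, matching this issue.
#[derive(Debug, PartialEq)]
enum BlobFilter {
    /// `blob:none` - omit all blobs not explicitly requested.
    None,
    /// `blob:limit=<bytes>` - omit historical blobs above the size threshold.
    Limit(u64),
}

fn parse_blob_filter(spec: &str) -> Option<BlobFilter> {
    match spec.strip_prefix("blob:")? {
        "none" => Some(BlobFilter::None),
        rest => {
            let bytes = rest.strip_prefix("limit=")?;
            // Accept plain byte counts plus the k/m/g suffixes git allows.
            let (digits, multiplier) = match *bytes.as_bytes().last()? {
                b'k' => (&bytes[..bytes.len() - 1], 1024),
                b'm' => (&bytes[..bytes.len() - 1], 1024 * 1024),
                b'g' => (&bytes[..bytes.len() - 1], 1024 * 1024 * 1024),
                _ => (bytes, 1),
            };
            digits
                .parse::<u64>()
                .ok()
                .map(|n| BlobFilter::Limit(n * multiplier))
        }
    }
}
```

For example, `parse_blob_filter("blob:limit=512k")` yields `Some(BlobFilter::Limit(524288))`.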
**Tasks**

- … connectivity check (e.g. `gix fsck connectivity`)
- Set `remote.<name>.partialclonefilter` …
- Set `remote.<name>.promisor` to `true` if a partial clone filter was provided
- … the `gix` CLI
- … verify with `git` that `partialclonefilter` and `promisor` are set appropriately
- … the `promisor` and `partialclonefilter` config settings on the remotes
- … the `promisor` pack