-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow specifying shallow clones in .config #1171
Comments
Note that this is blocked on libgit2/libgit2#2782. |
And to clarify some more, I would love to implement this! The backing libgit2 support is a prerequisite, however (see libgit2/libgit2#1430 as well) |
Both of those blocked issues in libgit are now fixed :-) |
Ah unfortunately they were closed in favor of a new metabug in libgit2: libgit2/libgit2#3058 I'll update the description accordingly |
Darn, got my hopes up :-) |
This issue is less important to Servo now that crates.io supports larger crate sizes and we are able to shift more of our dependency graph from raw git to crates.io. |
This would be nice for crates.io-index too, not just git dependencies. A current full clone of the index takes 39MB network and 48MB |
@cuviper oh it's worth pointing out that for the index at least we planned on this happening over time (clone being large) and we wanted to protect against that. It should be implemented that at any time we can roll the entire index into one commit (e.g. one huge squash) and force push that up. That, in effect, should reset everyone to a "shallow clone". We've never done that yet so we don't quite have a procedure and/or threshold to do that, but I should probably verify that we can actually do that soon :). Other than that all we'd need to do is briefly turn off crates.io, do the squash, force push, and then turn crates.io back. All instances of Cargo should then pick up the new commit and hard reset to it when they next need to update the registry. |
Hmm, will the history be saved to a branch before the reset? I'm not sure it's important, but it seems like we shouldn't just throw it all away. And FWIW git has a |
Ah excellent point, I think we'd definitely want a backup somewhere. I don't personally know a huge amount about git refspecs, but my gut is that Cargo is cloning all branches of the index today (and all historical versions of Cargo). I think it'd be easy to update that though to only fetch |
Yeah, the refspec just becomes |
I've filed an issue to update that |
Hey, any changes to this? I'm depending on my own forks of glfw & yoga which I'd like to rather not publish to crates.io yet - fetching it takes considerable time. |
I think the progress of this issue is completely related to libgit2/libgit2#5254, because IIRC cargo use the git2 rust library and it has recently made progress on shallow clones. |
Actual, there is more than that, since GitHub limited shallow clone on large repository, I don't think this would be work even There is an related RFC for pure HTTP registry index, maybe that would be a solution. |
example in the wild:
downstream use case: in deep clones cannot be shared between fetches, so deep clones actually increase the server load
in the downside of using the archive API is: there is no simple way to verify the files by the commit hash, tree hash of files + commit object → commit hash as workaround, example c=d41f5ccd7ea91afee4f1a9d20b85dbcede135d3b
git clone --depth 10 https://github.com/rust-lang/cargo
git checkout $c
git log | head -n3 | grep ^Date
# Date: Wed Mar 2 08:41:44 2022 -0600
# -> timezone is -0600
curl https://api.github.com/repos/rust-lang/cargo/commits/$c | jq .commit.author.date
# 2022-03-02T14:41:44Z
# -> no timezone
curl -I -L https://github.com/rust-lang/cargo/archive/$c.tar.gz
# -> no commit object in http headers (maybe someone with a better reputation than me could convince github/gitlab/gitea to implement this ...) |
It is however possible to fetch by the commit hash, so maybe this is fine? If you ask for commit $FOO over TLS, then then the result can be trusted insofar as that github is. |
yepp, but its 2x slower than github's archive API mkdir d
cd d
git init
git remote add o https://github.com/postgres/postgres
time git fetch o 5b68f75e12831cd5b7d8058320c0ca29bbe76067 --depth 1
# real 0m13.084s
time git checkout 5b68f75e12831cd5b7d8058320c0ca29bbe76067
# real 0m3.814s
time wget https://github.com/postgres/postgres/archive/5b68f75e12831cd5b7d8058320c0ca29bbe76067.tar.gz
# real 0m6.431s
TLS/SSL does not help here, because the data is already content-addressed |
@Byron would this only be replacing libgit2 for clones but continue to use it for everywhere else or is this the last item needed to completely switch to gitoxide? If we need to still use libgit2, I wonder what the trade offs look like vs waiting until we can completely switch (build time, binary size, error messages, behavior differences, etc). |
@Byron I am very excited for your grant! I look forward to us working together!
From our conversations with GitHub, doing a shallow clone is fine, but then attempting to do a fetch from that clone creates a lot of work on their servers. Shallow cloning the index is not a viable solution. #9069 is actively making progress to removing git from the index entirely. Happy to talk more about the details, but as you pointed out it's off-topic for this issue.
I believe we already do a bear clone of the index. Is that what you're referring to?
I appreciate the pressure to look at the data on this. My null hypothesis is that The Cargo Team should make a higher bandwidth communication channel with you to fully support the work in your grant! |
It would only replace it for clones and fetches, basically all network operations and worktree updates. Indeed That said, if there is time in this grant period I will be happy to make use of And to be honest, grant or not, I know I won't rest until
I think figuring out how to perform the initial integration where Last but not least, something I forgot to mention: I plan to maintain the |
To clarify, I don't think a deep analysis is needed. However, in the past, we have cared about dependencies being added (e.g. |
Thank you! And likewise :).
Before posting here I dug into the tree of links spreading out from this issue and also noticed that shallow clones and fetches were discouraged. From what I could fathom, it all boils down to this comment by a GitHub engineer 6 years ago. From my shallow (🐌👏) understanding of how shallow clones work I believe it's critical to assure the future fetches are correctly implemented so that the fetch will only include the changes, very similar what happens to fetches into a non-shallow repository. Besides the fetch protocol having special handling of shallow commits I don't see why the server would be slower if the client is correctly implemented. Furthermore the GitHub engineer talked about a series of patches to land that would reduce waste on the server side when handling shallow fetches, and I assume these landed by now. They also highlighted that the client was doing a non-shallow fetch after the first shallow clone which defeats the purpose. That said, I am happy to assume shallow fetches aren't viable, but hope that the previous paragraph can be a reason to re-validate that assumption. Using
Thanks for pointing that out! After removing my possibly years old index and running cargo again, it indeed created a checkout that looks more like a bare clone. The repository isn't 'officially' bare as This idea can certainly be discarded then.
Doing so could be a viable alternative mid-term goal if any work on using |
After diving into cargo's git related code a little more where conveniently most of the relevant one seems to be in assumptionsThe general assumption is that integration stepsThere are four distinct integration steps, with the first two being sufficient to fulfill the mid term and grant goal. The last two are more like stretch goals that I did have in mind when applying for the grant and that I personally want to see realized this year.
getting to
|
This would be nice... having executables execute other executables smells. |
I think both invocations are well-motivated. The one in The other use is What I like about this second opportunity is that it is high-value yet easy to accomplish, and all that's needed to enable a PR is a pre-existing feature toggle to allow switching |
gitoxide integration: fetch This PR is the first step towards resolving #1171. In order to get there, we integrate `gitoxide` into `cargo` in such a way that one can control its usage in nightly via `-Zgitoxide` or `Zgitoxide=<feature>[,featureN]`. Planned features are: * **fetch** - all fetches are done with `gitxide` (this PR) * **shallow_index** - the crates index will be a shallow clone (_planned_) * **shallow_deps** - git dependencies will be a shallow clone (_planned_) * **checkout** - plain checkouts with `gitoxide` (_planned_) The above list is a prediction and might change as we understand the requirements better. ### Testing and Transitioning By default, everything stays as is. However, relevant tests can be re-runwith `gitoxide` using ``` RUSTFLAGS='--cfg always_test_gitoxide' cargo test git ``` There are about 200 tests with 'git' in their name and I plan to enable them one by one. That way the costs for CI stay managable (my first measurement with one test was 2min 30s), while allowing to take one step at a time. Custom tests shall be added once we realize that more coverage is needed. That way we should be able to maintain running `git2` and `gitoxide` side by side until we are willing to switch over to `gitoxide` entirely on stable cargo. Then turning on `git2` might be a feature toggle for a while until we finally remove it from the codebase. _Please see the above paragraph as invitation for discussion, it's merely a basis to explore from and improve upon._ ### Tasks * [x] add feature toggle * [x] setup test system with one currently successful test * [x] implement fetch with `gitoxide` (MVP) * [x] fetch progress * [x] detect spurious errors * [x] enable as many git tests as possible (and ignore what's not possible) * [x] fix all git-related test failures (except for 1: built-in upload-pack, skipped for now) * [x] validate that all HTTP handle options that come from `cargo` specific values are passed to `gitoxide` * [x] a test to validate `git2` code can handle crates-index clones created with `gitoxide` and vice-versa * [x] remove patches that enabled `gitoxide` enabled testing - it's not used anymore * [x] ~~remove all TODOs and use crates-index version of `git-repository`~~ The remaining 2 TODO's are more like questions for the reviewer. * [x] run all tests with gitoxide on the fastest platform as another parallel task * [x] switch to released version * [x] [Tasks from first review round](#11448 (comment)) * [x] create a new `gitoxide` release and refer to the latest version from crates.io (instead of git-dependency) * [x] [address 2nd review round comments](#11448 (comment)) ### Postponed Tasks I suggest to go breadth-first and implement the most valuable features first, and then aim for a broad replacement of `git2`. What's left is details and improved compatibility with the `git2` implementation that will be required once `gitoxide` should become the default implementation on stable to complete the transition. * **built-in support for serving the `file` protocol** (i.e. without using `git`). Simple cases like `clone` can probably be supported quickly, `fetch` needs more work though due to negotiation. * SSH name fallbacks via a native (probably ~~libssh~~ (avoid LGPL) `libssh2` based) transport. Look at [this issue](#2399) for some history. * additional tasks from [this tracking issue](GitoxideLabs/gitoxide#450 (comment)) ### Proposed Workflow I am now using [stacked git](https://stacked-git.github.io) to keep commits meaningful during development. This will also mean that before reviews I will force-push a lot as changes will be bucketed into their respective commits. Once review officially begins I will stop force-pushing and create small commits to address review comments. That way it should be easier to understand how things change over time. Those review-comments can certainly be squashed into one commit before merging. _Please let me know if this is feasible or if there are other ways of working you prefer._ ### Development notes * unrelated: [this line](https://github.com/rust-lang/cargo/blob/9827412fee4f5a88ac85e013edd954b2b63f399b/src/cargo/ops/registry.rs#L620) refers to an issue that has since been resolved in `curl`. * Additional tasks related to a correct fetch implementation are collected in this [tracking issue](GitoxideLabs/gitoxide#450). **These affect how well the HTTP transport can be configured, needs work** * _authentication_ [is quite complex](https://github.com/rust-lang/cargo/blob/37cad5bd7f7dcd2f6d3e45312a99a9d3eec1e2a0/src/cargo/sources/git/utils.rs#L490) and centred around making SSH connections work. This feature is currently the weakest in `gitoxide` as it simply uses `ssh` (the program) and calls it a day. No authentication flows are supported there yet and the goal would be to match `git` there at least (which it might already do by just calling `ssh`). Needs investigation. Once en-par with `git` I think `cargo` can restart the whole fetch operation to try different user names like before. - the built-in `ssh`-program based transport can now understand permission-denied errors, but the capability isn't used after all since a builtin ssh transport is required. * It would be possible to implement `git::Progress` and just ignore most of the calls, but that's known to be too slow as the implementation assumes a `Progress::inc()` call is as fast as an atomic increment and makes no attempt to reduce its calls to it. * learning about [a way to get custom traits in `thiserror`](dtolnay/thiserror#212) could make spurious error checks nicer and less error prone during maintenance. It's not a problem though. * I am using `RUSTFLAGS=--cfg` to influence the entire build and unit-tests as environment variables didn't get through to the binary built and run for tests. ### Questions * The way `gitoxide` is configured the user has the opportunity to override these values using more specific git options, for example using url specific http settings. This looks like a feature to me, but if it's not `gitoxide` needs to provide a way to disable applying these overrides. Please let me know what's desired here - my preference is to allow overrides. * `gitoxide` currently opens repositories similar to how `git` does which respects git specific environment variables. This might be a deviation from how it was before and can be turned off. My preference is to see it as a feature. ### Prerequisite PRs * #11602
Note that upstream shallow cloning functionality is in final review stages over at libgit2/libgit2#6396. Perhaps an interested party can implement this using a patched version + PR here so that when it lands it's ready to go? |
It looks like shallow-cloning on |
I'm interested in taking on this issue as a first contribution. |
My initial thoughts are to move the |
Thanks for sharing! This reminded me to keep track of the Besides that, I really like the generalization of the |
Amazing, thanks everyone for getting this landed! Exciting :) |
Servo has a dep on Skia, which takes a very long time to download. Aside from appearing to hang (since Cargo doesn't show a progress message), this is a rather unnecessary delay for getting a new build running.
It would be nice if we could specify a
--depth
argument in the.config
. (most of the size is due to it's extensive commit history)Alternatively, a default
--depth 50
would work too, though I'm not sure how well that would interact withcargo update
.Status
Blocked on libgit2/libgit2#3058
The text was updated successfully, but these errors were encountered: