Improve Cargo's load time of crates.io index information #6866

Closed
alexcrichton opened this issue Apr 22, 2019 · 0 comments · Fixed by #6880

Comments

@alexcrichton (Member)

Cargo is, in theory, a pretty fast CLI tool that adds negligible overhead when executing various commands like `cargo build`. Ideally, if you run `cargo build` and everything is fresh, the command should be near-instantaneous, taking only a small handful of milliseconds.

Cargo does, however, actually do work, and it's sometimes nontrivial. Currently, when profiling null builds in Cargo, one of the most expensive operations (40% of runtime) is loading crates.io index information. As a brief recap, the crates.io index has a file-per-crate model in a git repository. Each file is a number of lines, where each line is a JSON blob describing one published version. Currently, if your crate uses 4 crates from crates.io, it means that Cargo will read 4 files and parse the JSON blob on every line of each one.
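
For a sense of the shape of that work, here's a minimal sketch (not Cargo's actual types) of parsing one index file line by line, assuming `serde`/`serde_json` and a trimmed-down `IndexLine` struct with only a few of the published fields:

```rust
use serde::Deserialize;

// Trimmed-down model of one line of a crates.io index file; the real
// format also carries `deps`, `features`, and so on.
#[derive(Deserialize, Debug)]
struct IndexLine {
    name: String,
    vers: String,
    cksum: String,
    yanked: bool,
}

// Every line of the file is a standalone JSON blob describing one
// published version, so today Cargo pays for parsing all of them.
fn parse_index_file(contents: &str) -> Result<Vec<IndexLine>, serde_json::Error> {
    contents
        .lines()
        .filter(|line| !line.trim().is_empty())
        .map(|line| serde_json::from_str(line))
        .collect()
}
```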

For a null build, however, this is a ton more work than we would otherwise need to do. We only actually need to parse the exact JSON blobs that are already pinned in a lock file, but that's only discovered later. There are two main things slowing us down right now:

  • We don't actually read each file from disk, but rather through libgit2 (see the sketch after this list). We don't check out the index to conserve disk space, and it also makes the initial clone much faster. Reading from the index, however, involves decompression and seeking inside libgit2 that is somewhat expensive.

  • We parse a ton of JSON that we immediately throw away. Typically only one JSON blob is relevant per file.
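
For illustration, the libgit2 read in the first bullet looks roughly like this with the `git2` crate (the ref name and layout here are assumptions, not Cargo's exact code); every lookup walks the object database and inflates the blob:

```rust
use std::path::Path;
use git2::Repository;

// Read one index file (e.g. "se/rd/serde") straight out of the bare
// registry clone, with no working-tree checkout on disk.
fn read_index_file(repo_path: &Path, rel_path: &Path) -> Result<Vec<u8>, git2::Error> {
    let repo = Repository::open(repo_path)?;
    // Assumed ref; the registry's actual remote/branch may differ.
    let tree = repo.revparse_single("origin/master")?.peel_to_tree()?;
    let entry = tree.get_path(rel_path)?;
    let blob = entry.to_object(&repo)?.peel_to_blob()?;
    // Decompressing and copying the blob is the per-file cost paid on
    // every build that touches the index.
    Ok(blob.content().to_vec())
}
```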

I think that we can optimize both of these steps to improve Cargo's null-build performance. While we could probably make each step faster and more optimized in isolation, I think there's more benefit in simply reducing the number of operations we do in the first place.

Concretely, what I think we can do is keep a per-crate-file cache on disk. When Cargo first loads a crate, it could read it from libgit2 and then cache the information on disk in a more easily machine-readable format. At the very least it could just write the file out to disk itself, invalidating the cached copy whenever the index updates (which does not happen during a null build). At the other extreme we could synthesize a database-like file from the git file contents, which Cargo could index into directly and which would require less JSON parsing, or something like that.

By caching something on disk we should avoid the libgit2 pitfall where decompressing the registry data is expensive. By using a more database-like or Cargo-friendly format than JSON we could not only parse less JSON (by knowing which lines we don't need to parse) but also, in theory, parse what remains faster than JSON. I think the amortized cost of maintaining all this will still be quite good because the work only happens when the index changes or is updated, never on a null build or once a lock file exists.
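
As one possible shape for such a cache (purely illustrative, not necessarily what a real implementation would pick), the file could start with a small offset table mapping each version to the byte range of its raw JSON line, followed by the untouched lines themselves:

```rust
use serde::Deserialize;
use std::io::Write;

#[derive(Deserialize)]
struct VersOnly {
    vers: String,
}

// Write a cache file: a text header of "version offset length" entries,
// a blank separator line, then the raw JSON lines. A reader can then seek
// straight to the one blob it needs and skip the rest.
fn write_cache(raw_index_file: &str, mut out: impl Write) -> std::io::Result<()> {
    let mut header = String::new();
    let mut body = String::new();
    for line in raw_index_file.lines().filter(|l| !l.trim().is_empty()) {
        // Parsing every line here is fine: the cache is rebuilt only when
        // the index updates, never on a null build.
        let v: VersOnly = serde_json::from_str(line).expect("malformed index line");
        header.push_str(&format!("{} {} {}\n", v.vers, body.len(), line.len()));
        body.push_str(line);
        body.push('\n');
    }
    write!(out, "{}\n{}", header, body)
}
```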

alexcrichton added a commit to alexcrichton/cargo that referenced this issue May 3, 2019

Parse less JSON on null builds

This commit fixes a performance pathology in Cargo today. Whenever Cargo
generates a lock file (which happens on all invocations of `cargo build`
for example) Cargo will parse the crates.io index to learn about
dependencies. Currently, however, when it parses a crate it parses the
JSON blob for every single version of the crate. With a lock file,
however, or with incremental builds, only one of these lines of JSON is
relevant. Measured today, Cargo building Cargo parses 3700 JSON
dependency blobs in the registry.

This commit implements an optimization that brings down the number of
parsed JSON lines in the registry to precisely the right number
necessary to build a project. For example, Cargo has 150 crates in its
lock file, so now it only parses 150 JSON lines (a 20x reduction from
3700). This in turn can greatly improve Cargo's null build time. Cargo
building Cargo dropped from 120ms to 60ms on a Linux machine and 400ms
to 200ms on a Mac.

The commit internally has a lot more details about how this is done, but
the general idea is to have a cache, optimized for Cargo to read, that
Cargo maintains automatically.

Closes rust-lang#6866
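
To connect the commit message back to the cache sketch above: with an offset table in front of the raw JSON, loading only the one version pinned in the lock file could look like this (again illustrative, not the format the PR actually uses):

```rust
use serde_json::Value;

// Look up the single locked version in the cache header and deserialize
// only its JSON line, ignoring every other published version.
fn load_locked_version(cache: &str, locked_vers: &str) -> Option<Value> {
    let (header, body) = cache.split_once("\n\n")?;
    for entry in header.lines() {
        let mut parts = entry.split(' ');
        let (vers, off, len) = (parts.next()?, parts.next()?, parts.next()?);
        if vers == locked_vers {
            let off: usize = off.parse().ok()?;
            let len: usize = len.parse().ok()?;
            return serde_json::from_str(&body[off..off + len]).ok();
        }
    }
    None
}
```
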
bors added a commit that referenced this issue May 3, 2019
Parse less JSON on null builds

bors closed this as completed in #6880 on May 3, 2019