Improve Cargo's load time of crates.io index information #6866
alexcrichton added a commit to alexcrichton/cargo that referenced this issue on May 3, 2019:

This commit fixes a performance pathology in Cargo today. Whenever Cargo generates a lock file (which happens on all invocations of `cargo build`, for example) Cargo will parse the crates.io index to learn about dependencies. Currently, however, when it parses a crate it parses the JSON blob for every single version of the crate. With a lock file, or with incremental builds, only one of these lines of JSON is relevant. Measured today, Cargo building Cargo parses 3700 JSON dependencies in the registry.

This commit implements an optimization that brings the number of parsed JSON lines in the registry down to precisely the number necessary to build a project. For example, Cargo has 150 crates in its lock file, so now it only parses 150 JSON lines (a 20x reduction from 3700). This in turn can greatly improve Cargo's null build time. Cargo building Cargo dropped from 120ms to 60ms on a Linux machine and from 400ms to 200ms on a Mac.

The commit internally has a lot more details about how this is done, but the general idea is to have a cache, optimized for Cargo to read, which is maintained automatically by Cargo.

Closes rust-lang#6866
bors added a commit that referenced this issue on May 3, 2019: Parse less JSON on null builds (closes #6866).
Cargo is, in theory, a pretty fast CLI tool that adds negligible overhead when executing commands like `cargo build`. Ideally, if you run `cargo build` and everything is fresh, the command should be near-instantaneous, taking only a small handful of milliseconds.

Cargo does, however, actually do work, and it's sometimes nontrivial. Currently, when profiling null builds in Cargo, one of the most expensive operations (40% of runtime) is loading crates.io index information. As a brief recap, the crates.io index uses a file-per-crate model in a git repository. Each file consists of a number of lines, where each line is a JSON blob describing one published version. Currently, if your crate uses 4 crates from crates.io, Cargo will read 4 files and parse every JSON blob on every line.
For a null build, however, this is far more work than we actually need to do. We only need to parse the exact JSON blobs already recorded in the lock file, but that's only discovered later. There are two main things slowing us down right now:
1. We don't actually read each file from disk, but rather through libgit2. We don't check out the index, which conserves disk space and makes the initial clone much faster, but reading from the index involves decompression and seeking that is somewhat expensive in libgit2.
2. We parse a ton of JSON that we immediately throw away. Typically only one JSON blob per file is relevant.
I think we can optimize both of these to improve Cargo's null build performance. While we could probably make each step faster in isolation, I think there's more benefit in simply reducing the number of operations we do in the first place.
Concretely, what I think we can do is keep a per-crate-file cache on disk. When Cargo loads a crate it could read it from libgit2 once and then cache the information on disk in a more easily machine-readable format. At the very least it could just write the file itself out to disk, invalidating that file whenever the index updates (which does not happen during a null build). At the other extreme we could have a database-like file synthesized from the git file contents, easily indexable by Cargo and requiring less JSON parsing, or something like that.
By caching something on disk we should avoid the libgit2 pitfall where decompressing the registry data is expensive. By using a more database-like or Cargo-friendly format than JSON we could not only parse fewer JSON lines (by knowing which lines we don't need to parse) but could also, in theory, use a faster parser than JSON. I think the amortized cost of maintaining all this will still be quite low, because it is only paid when the index changes or is updated, never on a null build or once a lock file exists.