
(Option to) Fingerprint by file contents instead of mtime #6529

Closed
illicitonion opened this issue Jan 9, 2019 · 65 comments · Fixed by #14137
Labels
A-rebuild-detection (Area: rebuild detection and fingerprinting)
C-feature-request (Category: proposal for a feature. Before PR, ping rust-lang/cargo if this is not `Feature accepted`)
S-needs-design (Status: Needs someone to work further on the design for the feature or fix. NOT YET accepted.)

Comments

@illicitonion (Contributor) commented Jan 9, 2019

Describe the problem you are trying to solve
The primary problem I have is that when building my code on Travis, the actual code in my workspace builds every time, even though much of it hasn't changed and I have target-directory caching turned on. The reason is that Travis makes a fresh clone of my git repo, which doesn't preserve mtimes. This can add about 5 minutes to every Travis run. My project mixes Rust and non-Rust code, so those 5 minutes are added even to runs where no Rust code was touched. I started futzing with mtimes myself, but that seems fragile and doesn't address the root of the problem.

Additionally, edit-undo loops cause re-compilation locally, which is a little annoying.

Describe the solution you'd like
Add a new LocalFingerprint::ContentBased(Digest, PathBuf) variant to

#[derive(Serialize, Deserialize, Hash)]
enum LocalFingerprint {
    Precalculated(String),
    MtimeBased(MtimeSlot, PathBuf),
    EnvBased(String, Option<String>),
}
The new variant would read the content of the PathBuf, pass it through a SipHasher, and mix the result into any aggregate fingerprints. Use this instead of LocalFingerprint::MtimeBased.
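
For illustration, a minimal sketch of how such a digest could be computed (hypothetical helper and type names; std's DefaultHasher, which is SipHash-based, stands in for SipHasher here):

use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::Hasher;
use std::io;
use std::path::{Path, PathBuf};

// Hypothetical digest newtype; a real implementation would mix this value
// into the aggregate fingerprint rather than store it on its own, and would
// want a hasher whose output is stable across cargo versions.
#[derive(Hash, PartialEq, Eq)]
struct Digest(u64);

fn content_digest(path: &Path) -> io::Result<(Digest, PathBuf)> {
    // Read the file named by the PathBuf and hash its bytes.
    let bytes = fs::read(path)?;
    let mut hasher = DefaultHasher::new();
    hasher.write(&bytes);
    Ok((Digest(hasher.finish()), path.to_path_buf()))
}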

Notes
This will probably slow down no-op builds slightly (and in some circumstances, such as with large build script inputs over NFS, significantly), so it may want to be behind a flag (perhaps --fingerprint-strategy={mtime,content}).

This would probably also make more shared caching (that people talk about a lot, most recently at https://internals.rust-lang.org/t/idea-cargo-global-binary-cache/9002) easier.

I'd be happy to implement this if the PR is likely to be accepted :)

This would probably also fix

@illicitonion added the C-feature-request label Jan 9, 2019
@alexcrichton (Member)

Funnily enough this is also highly related to recent discussions on #2426 and #6484, for entirely different reasons (filesystems with second-level mtime granularity, where modifications and builds all happen within the same second).

cc @Eh2406, @matklad, this might be a good tracking issue for this!

@Eh2406 (Contributor) commented Jan 9, 2019

At the moment I would start by removing the mtime-based system and replacing it with a hash-based one (except for the corner cases where we have to use mtime, since we don't know which files to hash until after the build; cc #5918). Then I would see how big a perf impact this has in reality. If it is small, then grand; if not, we need a hybrid system that uses mtime to decide whether to hash.

I was thinking of trying to impl this soon, but would much prefer to help you do it @illicitonion. I seem to recall someone had an abandoned branch with a start... was it @ehuss?

Do we want to use SipHasher or a different hashing scheme that is designed for fingerprinting? I only ask because of this:
"Although the SipHash algorithm is considered to be generally strong, it is not intended for cryptographic purposes. As such, all cryptographic uses of this implementation are strongly discouraged."

@alexcrichton (Member)

We're not really too interested in cryptographic hashing here for its security properties, so I think picking any reasonably fast algorithm should be fine (and I think SipHasher is reasonably fast as well). I would personally prefer we not wholesale switch away from mtimes just yet, but having this as at least a runtime option to choose from is a great start!

@dwijnand (Member)

Would be interesting to compare it to using FxHash.

@ehuss (Contributor) commented Jan 11, 2019

I'm not sure what I did with my old branch, so I'm just going off memory. My original attempt tried to only hash the contents if the mtime was near the start of compile time. However, since it is not known which files to hash until after compilation, it didn't work out too well. I don't know how to solve that problem. I guess one option is to pre-hash every file in the package directory and maybe hash anything that was included from outside after compilation? But that sounds like it could be very problematic. Or maybe hash protection is just not available for the very first compile (or in other words the second compile is not guaranteed to be correct). Maybe another option is to run --emit=dep-info first, which seems to be pretty fast, but there might be some races still.

I don't know of any easy answers here.

@ehuss added the A-rebuild-detection label Jan 31, 2019
@Eh2406 (Contributor) commented Feb 8, 2019

I just came across a long blog post on the troubles with using mtime in build tools. It discusses using only hashes, points out that Go switched to that, but recommends a hybrid approach, which is surprisingly similar to what we are doing with the fingerprint hash as a database.

@fluffysquirrels (Contributor) commented May 22, 2019

That blog post [1] is an excellent reference! It describes some of the problems in using mtimes and hashes in build systems and describes the approach used by the tool redo, roughly:

  • Storing cheap metadata (mtime, size, inode number, owner uid and gid, a sequence number (for outputs only)) for each input and output file.
  • Rebuild whenever any of that metadata changes.

I found it helpful to think of the sequence number for an output as a logical time that we control, so it avoids problems with low timestamp resolution, mtimes going backward, and builds being missed (a rough sketch of the stored fields follows below).

This wouldn't stop the false-positive rebuilds we see on CI systems that don't preserve mtimes, but to avoid those we could check hashes of source files when their metadata changes. One of the problems claimed in [1] is that hashing large build output files (to confirm whether downstream should rebuild) is expensive, but we could benchmark this to examine the trade-offs. Maybe we could get away with hashing source files only?

[1]: "mtime comparison considered harmful"
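
A Unix-only sketch of the redo-style stamp described above (illustrative field and type names, using std's MetadataExt; redo's per-output sequence number is omitted):

use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::Path;

// Cheap per-file metadata; rebuild whenever any field differs from what was
// recorded after the previous build.
#[derive(Debug, PartialEq, Eq)]
struct FileStamp {
    mtime_secs: i64,
    mtime_nanos: i64,
    size: u64,
    inode: u64,
    uid: u32,
    gid: u32,
}

fn stamp(path: &Path) -> io::Result<FileStamp> {
    let md = fs::metadata(path)?;
    Ok(FileStamp {
        mtime_secs: md.mtime(),
        mtime_nanos: md.mtime_nsec(),
        size: md.size(),
        inode: md.ino(),
        uid: md.uid(),
        gid: md.gid(),
    })
}

// A missing file or changed metadata both count as "needs rebuild".
fn is_fresh(path: &Path, recorded: &FileStamp) -> bool {
    stamp(path).map(|s| s == *recorded).unwrap_or(false)
}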

@gerion0 commented Oct 14, 2019

Please also have a look at chash. This is a mechanism for a meaningful fingerprint (based on the AST of the program). Maybe this can be adapted for cargo as well.

@d33tah commented Nov 13, 2019

I was just hit by this bug. Docker doesn't refresh the timestamps of already existing files, and to get dependency builds cached I had hacked around it by doing echo fn main() {} > src/main.rs before copying the real sources. This was pretty painful to debug, and the touch I added to work around it looks clunky.

Steps to reproduce

  1. Create the following files:

Dockerfile

FROM rust
COPY project/Cargo.toml project/Cargo.toml
RUN cd project && mkdir src && echo 'fn main() { println!("hello1"); }' > src/main.rs && cargo build --release
COPY project/src project/src
RUN cd project && cargo run --release

project/Cargo.toml

[package]
name = "hello"
version = "0.1.0"
authors = ["Jacek Wielemborek"]
edition = "2018"

[dependencies]
flate2 = "*"

project/src/main.rs

fn main() {
    println!("Hello2");
}
  2. Run docker build .

Expected:

Hello2 is printed

Actual result:

hello1 is printed

Workaround

Before building, do touch project/src/main.rs

@matklad (Member) commented Jun 20, 2020

I think this also has a chance of speeding up some dev workflows around switching branches. Specifically, I often check out branches temporarily to look at PRs (without building them). Sometimes I also run git switch master && git pull --rebase && git switch my-feature-branch && git rebase master. I think both of these workflows lead to busted mtimes for a bunch of files which haven't actually changed between compilations. For sprawling workspaces (like the rust-analyzer one), I hypothesize that this leads to some unneeded crate rebuilds.

@gilescope (Contributor)

Has anyone got a PR or partial WIP PR on this yet? Having the option to do this would be huge for Docker caching, which becomes more important for enterprises as Rust builds get bigger. Would love to be able to test this out in nightly under a -Z flag.

@Eh2406 (Contributor) commented Jul 31, 2020

If there was progress it would have been linked here. As I recall I thought it could be done straightforwardly, but the fact that Eric's branch did not work suggests I was Dunning-Kruger-ing myself. My memory of the approach was to change the code that reads the mtime to return a new type (mtime, Option<hash>). If the -Z flag is set, fill in the hash. Thread that type all the way through. When we get to the compare step, check the -Z flag for which one to use. If you are offering to try, that is where I would start.
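
A rough sketch of that shape, assuming a hypothetical -Z flag and made-up type names (not actual cargo internals):

use std::time::SystemTime;

// The stamp carries the mtime unconditionally and a content hash only when
// the hypothetical content-fingerprint flag is enabled.
#[derive(Clone, PartialEq, Eq)]
struct Stamp {
    mtime: SystemTime,
    hash: Option<u64>,
}

// At the compare step, pick which field to trust based on the flag.
fn unchanged(recorded: &Stamp, current: &Stamp, use_content_hash: bool) -> bool {
    if use_content_hash {
        match (recorded.hash, current.hash) {
            (Some(a), Some(b)) => a == b,
            // If either side lacks a hash (e.g. the flag was off last build),
            // fall back to comparing mtimes.
            _ => recorded.mtime == current.mtime,
        }
    } else {
        recorded.mtime == current.mtime
    }
}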

@gilescope (Contributor)

Was just checking there's no PR out there to build on. If the mtime isn't the same, we could cheaply check whether the file size is the same before doing an expensive hash-of-contents check; that could cut the perf costs down a bit.
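
A sketch of that cheap-to-expensive cascade (placeholder names; the hash function is passed in rather than chosen here):

use std::fs;
use std::io;
use std::path::Path;
use std::time::SystemTime;

// What was recorded for the file after the last successful build.
struct Recorded {
    mtime: SystemTime,
    len: u64,
    hash: u64,
}

fn needs_rebuild(
    path: &Path,
    expected: &Recorded,
    hash_bytes: impl Fn(&[u8]) -> u64,
) -> io::Result<bool> {
    let md = fs::metadata(path)?;
    if md.modified()? == expected.mtime {
        return Ok(false); // mtime matches: skip hashing entirely
    }
    if md.len() != expected.len {
        return Ok(true); // size differs: contents must differ
    }
    // Same size but a different mtime: only now pay for reading and hashing.
    Ok(hash_bytes(&fs::read(path)?) != expected.hash)
}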

@slonopotamus

I believe this problem has already been studied extensively in the past, so there's no need to reinvent the solution; just pick a good existing one. For example, git's index is known to store mtime+ctime+device+inode+hash.

@gilescope (Contributor) commented Aug 8, 2020

ctime, device and inode will thwart clever tricks like copying the target dir and putting it back on a different machine or in a different directory. When the mtime check fails, it's probably best to rely on intrinsic properties of the content (size/hash). For a non-cryptographic hash, Redox's seahash seems pretty quick.
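
For reference, a minimal use of the seahash crate mentioned here (assuming seahash = "4" as a dependency; not something cargo uses today):

// SeaHash gives a fast, portable 64-bit non-cryptographic hash of the bytes.
fn content_hash(bytes: &[u8]) -> u64 {
    seahash::hash(bytes)
}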

@ibaryshnikov commented Mar 31, 2021

mtime comparison considered harmful, linked above, is an excellent blog post! I ran into one of the problems described there today:

If you accidentally set your system clock ahead by a day and build some stuff, then set your clock back to the present, all the stuff you built during that time will show up as "in the future" and therefore newer than source files you edit today, preventing all rebuilds

@overlookmotel

@scottlamb Can I ask, did you ever publicly release "retimer"? Or would you be able/willing to do so?

@bjorn3 (Member) commented May 8, 2024

I would probably go with sha1 simply because it leaves the option open to take shortcuts when using a git repository, since git already knows the hashes of the files.

I believe git adds a header to blobs before hashing them. While sha1 allows appending trailing bytes with minimal extra work (enabling length-extension attacks), this is not the case for leading bytes. Adding leading bytes requires recomputing the hash from scratch.

@vlovich commented May 8, 2024

I like the idea of using git, but it only knows the sha1 of tracked & unmodified files. You'd still have to detect dirtied / untracked files & handle them correctly. It's probably wiser to just make sure that the algorithm is stored in the metadata per file so that it can be heterogeneous & swappable (e.g. using git's metadata for the majority of files & falling back to something else for untracked / modified files).

As for the speed difference being negligible, I'd caution that codebases will often sit within the page cache, so assuming an I/O penalty for reads may be a faulty assumption. But as I said, you'd want to only bother computing a hash when the file size is the same as it was previously.

The best approach is to have a daemon that monitors for change events live so that you don't even need to bother hashing unless the daemon restarts. That's what Bazel and Buck do.

I do wish filesystems just implemented a 64-bit counter that incremented on any mutation of a file. That would solve all these recurring problems in a super-cheap way, without needing a complicated daemon and without any performance or correctness footguns (you'd just need to save & check a mapping of inode -> version number).

@bjorn3 (Member) commented May 8, 2024

I do wish filesystems just implemented a 64-bit counter that incremented on any mutation of a file.

On Linux there is a counter that is bumped every time a file is recreated, which in practice is how most editors write to a file atomically. The combination of the inode id and i_generation is supposed to uniquely identify a single file for the entire lifetime of the filesystem, even across deletes and recreates. Not every filesystem implements it, though, and it isn't affected by edits that don't recreate the file.
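
std doesn't expose i_generation directly (reading it requires a filesystem-specific ioctl), but the atomic-replace pattern described above is already visible as the inode number changing at the same path. A small Unix-only sketch with illustrative names:

use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::Path;

// Returns true if the file at `path` is no longer the inode we recorded,
// i.e. something replaced it rather than editing it in place.
fn was_replaced(path: &Path, recorded_inode: u64) -> io::Result<bool> {
    Ok(fs::metadata(path)?.ino() != recorded_inode)
}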

@alexpyattaev

On another note, if you look at well-optimized IO logic, such as that found in 1BRC submissions, you can quickly see that any plausible amount of source code can be loaded into RAM in seconds if that is what is needed. Modern machines with NVMe are not IO-bound in any meaningful way, and very few people would build a huge Rust codebase on a machine from the 90s.

@adam-azarchs

Most of our team has their home directories mounted over NFS; NFS can have great throughput but generally has pretty terrible metadata latency. Regardless, even a naïve implementation of sha256 can easily handle 500 MiB/s on a modern system; the hashing time should still be negligible on any reasonably sized codebase, especially since it's trivially parallelizable.
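
A sketch of that parallel hashing, assuming the third-party sha2 and rayon crates purely for illustration:

use rayon::prelude::*;
use sha2::{Digest, Sha256};
use std::fs;
use std::path::PathBuf;

// Hash every file on the rayon thread pool; unreadable files yield None.
fn hash_all(paths: &[PathBuf]) -> Vec<(PathBuf, Option<Vec<u8>>)> {
    paths
        .par_iter()
        .map(|p| {
            let digest = fs::read(p).ok().map(|bytes| Sha256::digest(&bytes).to_vec());
            (p.clone(), digest)
        })
        .collect()
}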

@vlovich commented May 13, 2024

It's a very odd setup to have a git repo on an NFS share. Even so, you'd expect the source code to be sitting in the page cache, which again invalidates the whole "I/O latency dominates" argument, because you're going to be doing a local memcpy+hash. The best choice is to design so that performance is optimized for the cases where hash speed could be the bottleneck.

Speaking of hash algorithms, I came across gxhash, which is on the Pareto frontier: a lot faster than xxh3 at smaller inputs and than ahash at large inputs.

As for whether or not it's embarrassingly parallel, this kind of work can surprisingly be less so, because there's coordination required to send work to and receive results from a background thread. Given that most of the checks will come back unchanged and not require a rebuild, the overhead of that distribution can easily be the bottleneck compared to a fast hash function running inline.

@ClementTsang

@overlookmotel here's a tool that should have similar logic to the retimer tool, though it's a bit of a bandaid solution: https://crates.io/crates/mtime-travel

@overlookmotel

@ClementTsang Thank you! Yes, a bandaid solution, but some kind of solution at least...

@briansmith

We're not really too interested in cryptographic hashing here for its security properties, so I think picking any reasonably fast algorithm should be fine (and I think SipHasher is reasonably fast as well).

I do think it is worth doing a complete security analysis. Consider a file foo.rs that contains a security bug and has hash(foo.rs) = X. Now somebody submits a patch for foo.rs that contains a "fix" for the bug, crafted so that the hash of the patched foo.rs is also X. Then when someone rebuilds the project to get the security "fix", they silently keep using the broken version. This could easily be made to happen if the security bug is intentional and its author plans for it.

I know such a thing can seem far-fetched, but AFAICT this is why Bazel and other things that rely heavily on (distributed) caching for incremental rebuilds use strong hashes.

@bjorn3 (Member) commented Jun 20, 2024

A partially mitigating factor for that attack: non-local dependencies (anything from a registry like crates.io, or a git dependency) aren't rebuilt because their fingerprint changes (in fact the fingerprint isn't checked at all). Instead they are rebuilt because the package identity changes when a different package version (and, in the case of git dependencies, a different git commit) is used, which by definition has to differ to even pull in your changes. As such this attack can only be pulled off for files in the project you are building. I can imagine that we would still want to check the mtime for equality before checking the hash to improve performance, which would fully mitigate the attack, since the attacker can't force the mtime to be identical between the original and the new file. It wouldn't be enough for a distributed cache like Bazel uses, though.

@adam-azarchs

I would like to emphasize once again that if you're using cargo vendor (as our team does, exclusively), cargo is already verifying sha256 sums against .cargo-checksum.json. It just then goes on to ignore that information when deciding whether a file has changed.

@michaelwoerister (Member)

For reference, sccache started using BLAKE3 for fingerprinting a while ago, and I don't think that causes any performance problems. BLAKE3 should be fast enough for this kind of bulk-hashing scenario that the actual hashing doesn't factor into performance too much.
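
For illustration, hashing one file with the blake3 crate might look like the sketch below; this is just an example of the kind of bulk hashing meant here, not necessarily what cargo's implementation does.

use std::fs;
use std::io;
use std::path::Path;

// Read the file and produce a 256-bit BLAKE3 digest of its contents.
fn fingerprint_file(path: &Path) -> io::Result<blake3::Hash> {
    let bytes = fs::read(path)?;
    Ok(blake3::hash(&bytes))
}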

@adam-azarchs

But it's already doing sha256 hashing. Why would we add additional hashing on top of that?

@bjorn3 (Member) commented Jun 21, 2024

The .cargo-checksum.json check is done by cargo, while the file-content fingerprinting would be done by rustc. Also, cargo doesn't actually check whether the source files of non-local dependencies (which includes vendored dependencies) are modified, so there is no hashing that would be saved by rustc using sha256 too. For vendored dependencies, when building the first time, rustc and cargo each do their own independent hashing either way, and when checking whether rebuilding is necessary, cargo will immediately consider the vendored dependency unmodified and check neither .cargo-checksum.json nor the hash calculated by rustc. In other words, it doesn't matter which hash rustc uses; it is never used at all by cargo for vendored dependencies.

@adam-azarchs

Cargo absolutely does verify the checksums of files against .cargo-checksum.json. And I'm pretty sure no one here was actually suggesting adding content hashing to rustc; the request is very specifically to have at least an option for cargo to not use absolute path and mtime as part of the fingerprint that it uses for determining whether rebuilding is required, and use a content hash instead. By the time rustc sees the file in question, it's too late - we've already started the rebuild.

@bjorn3 (Member) commented Jun 22, 2024

Cargo absolutely does verify the checksums of files against .cargo-checksum.json.

It does when building the first time. It does not check it once the crate has already been built; it will immediately consider it as not needing any rebuild, well before it would check .cargo-checksum.json. I actually tried it myself.

And I'm pretty sure no one here was actually suggesting adding content hashing to rustc

The first attempt at implementing this did actually modify rustc to include the source hashes in the dep-info file and then let cargo hash the files again on later builds to check whether they changed, as this is the only race-free way of implementing it. I haven't seen any suggestion to do it differently.

@adam-azarchs

It does when building the first time.

Well, yes, but the problem is that cargo's definition of "the first time" includes some situations that it really shouldn't, and it then still uses the mtime + absolute path as the fingerprint for the resulting outputs, so it still ends up unnecessarily invalidating downstream caches.

The first attempt at implementing this did actually modify rustc to include the source hashes in the dep info file and then lets cargo hash it again on later builds to check if the file is changed as this is the only race free way of implementing it.

Ok, yes, that makes sense (in general; I'm mostly focused on the CI scenario where such races should be impossible). And rustc is already reading all of the file content, so it's relatively cheap to hash as it does so. But it would still be cargo doing the hashing on subsequent builds to check whether a rebuild was required.

I think the most common use case for this feature request is to be able to do useful caching of the target directory in CI builders, e.g. GitHub actions. It's essentially pointless to cache any build outputs from within the workspace because the mtime will be different every time the repo is checked out so the cache will never be valid. We have some workspaces with dozens of crates, and most of the "core" crates that everything else depends on rarely see any changes, so it's frustrating to see cargo rebuilding them every time.

@bjorn3 (Member) commented Jun 22, 2024

Well, yes, but the problem is that cargo's definition of "the first time" includes some situations that it really shouldn't, and then it still uses the mtime + absolute path as the fingerprint for resulting outputs, so it still ends up unnecessarily invalidating downstream caches.

Non-local dependencies (including vendored dependencies) are always assumed to be up to date if any compiled artifacts exist. Cargo will never check the mtime, the absolute path, .cargo-checksum.json, the file hashes in the dep-info file, or anything else involving the actual source files. In fact, the fingerprint for non-local packages doesn't even contain the list of source files (target/debug/.fingerprint/mycrate-hash/dep-lib-mycrate is empty for non-local dependencies). And if compiled artifacts don't exist, it will check .cargo-checksum.json before building, but it will rebuild regardless of whether source files were modified.

@Xaeroxe (Contributor) commented Jun 25, 2024

Hi, I've made a tracking issue for this to go along with my two PRs, one for cargo and one for rustc: #14136

bors added a commit that referenced this issue Oct 4, 2024
initial version of checksum based freshness

Implementation for #14136 and resolves #6529

This PR implements the use of checksums in cargo fingerprints as an alternative to using mtimes. This is most useful on systems with poor mtime implementations.

This has a dependency on rust-lang/rust#126930. It's expected that this will increase the time it takes to declare a build fresh. Still, this loss in performance may be preferable to the issues the ecosystem has had with using mtimes to determine freshness.
@bors closed this as completed in 15fbd2f Oct 8, 2024