(Option to) Fingerprint by file contents instead of mtime #6529
At the moment I would start by removing the mtime-based system and replacing it with a hash-based one (except for the corner cases where we have to use mtime, as we don't know which files to hash until after the build; cc #5918). Then I would see how big a perf impact this has in reality. If it is small, grand; if not, we need a hybrid system that uses mtime to decide whether to hash. I was thinking of trying to implement this soon, but would much prefer to help you do it @illicitonion. I seem to recall someone had an abandoned branch with a start... was it @ehuss? Do we want to use …
We're not really interested in cryptographic hashing here for its security properties, so I think picking any reasonably fast algorithm should be fine (and I think …
Would be interesting to compare it to using FxHash.
I'm not sure what I did with my old branch, so I'm just going off memory. My original attempt tried to only hash the contents if the mtime was near the start of compile time. However, since it is not known which files to hash until after compilation, it didn't work out too well. I don't know how to solve that problem. I guess one option is to pre-hash every file in the package directory and maybe hash anything that was included from outside after compilation? But that sounds like it could be very problematic. Or maybe hash protection is just not available for the very first compile (or in other words, the second compile is not guaranteed to be correct). Maybe another option is to run … I don't know of any easy answers here.
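A minimal sketch of the "only hash when the mtime falls too close to the previous build start" idea described above; the function name, window size, and overall shape are illustrative guesses, not cargo's actual fingerprint code:

```rust
use std::fs;
use std::path::Path;
use std::time::{Duration, SystemTime};

/// Hypothetical helper: decide whether a file's contents need re-hashing.
/// Files whose mtime is clearly older than the previous build cannot have
/// raced with it, so their mtime alone is trusted; only files modified inside
/// the ambiguous window around the build start pay for a content hash.
fn needs_content_check(path: &Path, prev_build_start: SystemTime) -> std::io::Result<bool> {
    // Hypothetical slack to cover coarse mtime resolution on some filesystems.
    let ambiguous_window = Duration::from_secs(2);
    let mtime = fs::metadata(path)?.modified()?;
    Ok(mtime + ambiguous_window >= prev_build_start)
}
```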
I just came across a long blog post on the troubles with using mtime in build tools. It discusses using only hashes, and points out that Go switched to that, but recommends a hybrid approach, which is surprisingly similar to what we are doing with the fingerprint hash as a database.
That blog post [1] is an excellent reference! It describes some of the problems with using mtimes and hashes in build systems, and outlines the approach used by the tool …
I found it helpful to consider the sequence number for an output as a logical time that we control, so it should avoid problems with low resolution, mtime going backward, and builds being missed. This wouldn't stop the false-positive builds we see on CI systems that don't preserve mtimes, but to avoid those we could check hashes of source files when their metadata changes. One of the problems claimed in [1] is that hashing large build output files (to confirm downstream should rebuild) is expensive, but we could benchmark this to examine the trade-offs. Maybe we could get away with hashing source files only?
Please also have a look at chash. This is a mechanism for a meaningful fingerprint (based on the AST of the program). Maybe this can be adapted for cargo as well.
I was just hit by this bug. Docker doesn't refresh timestamps of already existing files, and I hacked around the need to cache dependency builds, so I did …

Steps to reproduce
FROM rust
COPY project/Cargo.toml project/Cargo.toml
RUN cd project && mkdir src && echo 'fn main() { println!("hello1"); }' > src/main.rs && cargo build --release
COPY project/src project/src
RUN cd project && cargo run --release
where project/src/main.rs contains:
fn main() {
println!("Hello2");
}
Expected: the final cargo run --release prints "Hello2".
Actual result: it still prints "hello1", because the binary built in the earlier layer is considered fresh and reused.
Workaround
Before building, do …
I think this also has a chance of speeding up some dev workflows around switching branches. Specifically, I often check out branches temporarily to look at PRs (without building them). Sometimes I also run …
Has anyone got a PR or partial WIP PR on this yet? Having the option to do this would be huge for Docker caching, which becomes more important for enterprises as Rust builds get bigger. Would love to be able to test this out in nightly under a -Z flag.
If there was progress it would have been linked here. As I recall, I thought it could be done straightforwardly, but the fact that Eric's branch did not work suggests I was Dunning-Kruger-ing myself. My memory of the approach was to change the code that reads the …
Was just checking there's no PR out there to build on. If the mtime isn't the same, we could cheaply check whether the file size is the same before doing an expensive hash-of-contents check; that could cut the perf costs down a bit.
I believe this problem was already extensively studied in the past, so there's no need to reinvent the solution; just pick a good existing one. For example, git's index is known to store mtime+ctime+device+inode+hash.
ctime, device and inode will thwart trying to do clever things like copying the target dir and putting it back on a different machine or in a different directory. When the mtime check fails, it's probably best to rely on intrinsic properties of the content (size/hash). For a non-cryptographic hash, Redox's seahash seems pretty quick.
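A rough sketch of that hybrid fallback (mtime, then size, then a content hash), using the `seahash` crate as one possible fast non-cryptographic hash; the `StoredFingerprint` shape is invented for illustration and is not cargo's real data structure:

```rust
use std::fs;
use std::path::Path;
use std::time::SystemTime;

/// Invented shape for what a per-file fingerprint entry could store.
struct StoredFingerprint {
    mtime: SystemTime,
    size: u64,
    content_hash: u64,
}

fn is_dirty(path: &Path, stored: &StoredFingerprint) -> std::io::Result<bool> {
    let meta = fs::metadata(path)?;
    // Cheapest check first: an unchanged mtime means we trust the stored entry.
    if meta.modified()? == stored.mtime {
        return Ok(false);
    }
    // mtime differs (e.g. a fresh checkout): a size mismatch proves the content
    // changed without reading the file.
    if meta.len() != stored.size {
        return Ok(true);
    }
    // Same size but different mtime: only now pay for reading and hashing the contents.
    let contents = fs::read(path)?;
    Ok(seahash::hash(&contents) != stored.content_hash)
}
```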
mtime comparison considered harmful, linked above, is an excellent blog post! I've run into one of the problems described there today: …
@scottlamb Can I ask, did you ever publicly release "retimer"? Or would you be able/willing to do so?
I believe git adds a header to blobs before hashing them. While SHA-1 allows adding trailing bytes with minimal extra work (enabling length-extension attacks), this is not the case for leading bytes. Adding leading bytes requires recomputing from scratch.
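To make the header concrete: git hashes a `"blob <len>\0"` prefix followed by the contents, which is why the leading bytes matter. A small sketch assuming the RustCrypto `sha1` crate:

```rust
use sha1::{Digest, Sha1};

/// git's blob id is SHA-1 over the header "blob <len>\0" plus the contents,
/// so a plain SHA-1 of the file contents alone will not match it.
fn git_blob_sha1(contents: &[u8]) -> [u8; 20] {
    let mut hasher = Sha1::new();
    hasher.update(format!("blob {}\0", contents.len()).as_bytes());
    hasher.update(contents);
    hasher.finalize().into()
}
```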
I like the idea of using git, but it only knows the SHA-1 of tracked & unmodified files. You'd still have to detect dirtied / untracked files & handle them correctly. It's probably wiser to just make sure that the algorithm is stored in the metadata per file so that it can be heterogeneous & swappable (e.g. using git's metadata for the majority of files & falling back to something else for untracked / modified files). As for the speed difference being negligible, I'd caution that codebases will often sit within the page cache, so assuming an I/O penalty for reads may be a faulty assumption. But as I said, you'd want to only bother computing a hash when the file size is the same as it was previously. The best approach is to have a daemon that monitors for change events live so that you don't even need to bother hashing unless the daemon restarts; that's what Bazel and Buck do. I do wish filesystems just implemented a 64-bit counter that incremented on any mutation of a file. That would solve all these recurring problems in a super-cheap way without needing a complicated daemon, and without any performance or correctness footguns (you'd just need to save & check a mapping of inode -> version number).
On Linux there is a counter for every time a file is recreated, which in practice is what most editors do to atomically write to a file. The combination of the inode id and i_generation is supposed to uniquely identify a single file for the entire lifetime of the filesystem, even across deletes and recreates. Not every filesystem implements it, though. It also isn't affected by edits that don't recreate the file.
On another note, if you look at well-optimized I/O logic, such as that found in 1BRC submissions, you can quickly see that any plausible amount of source code can be loaded into RAM in seconds, if that is what is needed. Modern machines with NVMe are not I/O bound in any meaningful way, and very few people would build a huge Rust codebase on a machine from the '90s.
Most of our team has their home directory mounted over NFS; NFS can have great throughput but will generally have pretty terrible metadata latency. Regardless, even a naïve implementation of sha256 can easily handle 500 MiB/s on a modern system; the hashing time should still be negligible on any reasonably-sized codebase, especially since it's trivially parallelizable.
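For a sense of what "trivially parallelizable" looks like in practice, here is a sketch that fans file hashing out over a thread pool (assuming the `rayon` and `sha2` crates; not cargo's actual code):

```rust
use rayon::prelude::*;
use sha2::{Digest, Sha256};
use std::fs;
use std::path::PathBuf;

/// Hash each file's contents on the rayon thread pool; each file is independent,
/// so the work spreads across however many cores are available.
fn hash_files(paths: &[PathBuf]) -> Vec<(PathBuf, std::io::Result<[u8; 32]>)> {
    paths
        .par_iter()
        .map(|path| {
            let digest: std::io::Result<[u8; 32]> =
                fs::read(path).map(|bytes| Sha256::digest(&bytes).into());
            (path.clone(), digest)
        })
        .collect()
}
```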
It's a very odd setup to have a git repo on an NFS share. Even so, you'd expect the source code to be sitting in the page cache, which again invalidates the whole "I/O latency dominates" argument, because you're going to be doing a local memcpy + hash. The best choice is to design such that performance is optimized in the cases where hash speed could be the bottleneck. Speaking of hash algorithms, I came across gxhash, which sits on the Pareto frontier: a lot faster than xxh3 at smaller inputs and than ahash at large inputs. As for whether or not it's embarrassingly parallel, this kind of thing can surprisingly be less so, because there's coordination required to send work to and receive the result from a background thread. Given that most of the checks will find nothing changed and require no rebuild, the overhead of that distribution can easily be the bottleneck vs. having a fast hash function running inline.
@overlookmotel here's a tool that should have similar logic to the …
@ClementTsang Thank you! Yes, a bandaid solution, but some kind of solution at least...
I do think it is worth doing a complete security analysis. Consider a file foo.rs that contains a security bug, and which has hash(foo.rs) = X. Now somebody submits a patch for foo.rs that contains a "fix" for the bug, such that the hash of the patched foo.rs also has the value X. Then when one rebuilds the project to get the security "fix," they silently keep using the broken version. This could be made to happen easily if the security bug is intentional and the developer of the security bug plans for this. I know such a thing can seem far-fetched, but AFAICT this is why Bazel and other tools that rely heavily on (distributed) caching for incremental rebuilds use strong hashes.
A partially mitigating factor for that attack is that for non-local dependencies (anything from a registry like crates.io, or a git dependency), rebuilding doesn't happen because the fingerprint changes (in fact the fingerprint isn't checked at all). Instead, they are rebuilt because the package identity changes: a different package version (or, for git dependencies, a different commit) is used, which by definition has to be different to even pull in your changes. As such, this attack can only be pulled off for files in the project you are building. I imagine we would still want to check the mtime for equality before checking the hash to improve performance, which would fully mitigate the attack, since the attacker can't force the mtime to be identical between the original and the new file. It wouldn't be enough for a distributed cache like Bazel's, though.
I would like to emphasize once again that if you're using …
For reference, sccache started using BLAKE3 for fingerprinting a while ago and I don't think that causes any performance problems. BLAKE3 should be fast enough for this kind of bulk-hashing scenario for the actual hashing to not factor into performance too much.
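Just to show what content fingerprinting with the `blake3` crate roughly looks like (a sketch, not how sccache actually wires it up):

```rust
use std::fs;
use std::path::Path;

/// Fingerprint a file's contents with BLAKE3 and return the hex digest.
fn blake3_fingerprint(path: &Path) -> std::io::Result<String> {
    let contents = fs::read(path)?;
    Ok(blake3::hash(&contents).to_hex().to_string())
}
```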
But, it's already doing sha256 hashing. Why would we add additional hashing on top of that?
The …
Cargo absolutely does verify the checksums of files against .cargo-checksum.json.
It does when building the first time. It does not check it when the crate has already been built; it will decide that no rebuild is needed well before it would ever check .cargo-checksum.json. I actually tried it myself.
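For context, the `.cargo-checksum.json` shipped with a registry crate looks roughly like this (schematic, with placeholder digests): `files` maps each file in the unpacked crate to its SHA-256 digest, and `package` is the digest of the downloaded `.crate` archive.

```json
{
  "files": {
    "Cargo.toml": "<sha256 hex of Cargo.toml>",
    "src/lib.rs": "<sha256 hex of src/lib.rs>"
  },
  "package": "<sha256 hex of the .crate archive>"
}
```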
The first attempt at implementing this actually did modify rustc to include the source hashes in the dep-info file and then let cargo hash the files again on later builds to check whether they changed, as this is the only race-free way of implementing it. I haven't seen any suggestion to do it differently.
Well, yes, but the problem is that …
OK, yes, that makes sense (in general; I'm mostly focused on the CI scenario, where such races should be impossible). And I think the most common use case for this feature request is to be able to do useful caching of the …
Non-local dependencies (including vendored dependencies) are always assumed to be up to date if any compiled artifacts exist. Cargo will never check the mtime, absolute path, …
Hi, I've made a tracking issue for this to go along with my two PRs, one for cargo and one for rustc: #14136
initial version of checksum based freshness (implementation for #14136; resolves #6529). This PR implements the use of checksums in cargo fingerprints as an alternative to using mtimes. This is most useful on systems with poor mtime implementations. It has a dependency on rust-lang/rust#126930. It's expected this will increase the time it takes to declare a build fresh; still, this loss in performance may be preferable to the issues the ecosystem has had with using mtimes to determine freshness.
Describe the problem you are trying to solve
The primary problem I have is that when building my code on Travis, the actual code in my workspace builds every time, even though much of it hasn't changed and I have target-directory caching on. The reason is that Travis makes a new clone of my git repo, which doesn't preserve mtimes. This can add about 5 minutes to every Travis run. My project is mixed Rust and non-Rust code, so this adds 5 minutes to those runs even when no Rust code has been touched. I started futzing with mtimes, but that seems fragile and doesn't address the root of the problem.
Additionally, edit-undo loops cause re-compilation locally, which is a little annoying.
Describe the solution you'd like
Add a new `LocalFingerprint::ContentBased(Digest, PathBuf)` variant to cargo/src/cargo/core/compiler/fingerprint.rs (lines 204 to 209 in b84e625) which reads the contents of the file at the `PathBuf`, passes it through a `SipHasher`, and mixes that into any aggregate fingerprints. Use this instead of `LocalFingerprint::MtimeBased`.

Notes

This will probably slow down no-op builds slightly (in some circumstances, such as with large build-script inputs over NFS, significantly), so it may want to be behind a flag (perhaps `--fingerprint-strategy={mtime,content}`).

This would probably also make more shared caching (something people talk about a lot, most recently at https://internals.rust-lang.org/t/idea-cargo-global-binary-cache/9002) easier.
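A minimal sketch of what the proposed variant and its freshness check could look like; the enum here is abbreviated, `Digest` is a stand-in type, and this is not the actual code in fingerprint.rs:

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::path::PathBuf;

// Stand-in for whatever digest type cargo would actually use.
type Digest = u64;

enum LocalFingerprint {
    // ...existing variants such as MtimeBased elided...
    /// Proposed: fingerprint a path by a digest of its contents.
    ContentBased(Digest, PathBuf),
}

impl LocalFingerprint {
    /// Re-digest the file and compare against the stored digest.
    fn up_to_date(&self) -> std::io::Result<bool> {
        match self {
            LocalFingerprint::ContentBased(stored, path) => {
                let contents = fs::read(path)?;
                // std::hash::SipHasher is deprecated; DefaultHasher (SipHash-based)
                // plays the same role in this sketch.
                let mut hasher = DefaultHasher::new();
                contents.hash(&mut hasher);
                Ok(hasher.finish() == *stored)
            }
        }
    }
}
```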
I'd be happy to implement this if the PR is likely to be accepted :)
This would probably also fix "`cargo check` not rebuilding dependency that has changed" (#10175)?