Checking file integrity: choosing between md5 and sha256 (or other options) #112
@w-jan @hansvancalster @cecileherr @ToonHub any preference? Zenodo metadata provide md5 and Git LFS uses sha256, so we currently compute and test both, while one would suffice. The most important difference between them concerns security (hash collision resistance), see above; both do a perfect job with respect to file integrity verification. But verifying both checksums is absolute overkill; in that case you could as well restrict to sha256 alone, the most secure. What we could do is store both checksums in the future built-in checksums table of
I vote for md5, as security is not an issue here (only file integrity), and the difference in speed between the algorithms is also not important here.
I don't really have a strong opinion on this question; I will follow the majority.
Continuation of @hansvancalster 's suggestion to have a look at xxHash; the specifics of xxHash fit better in this issue.

Discussion of some properties

Independent documentation and testing of xxHash appears to be limited. Most information can be found from the project itself at https://github.com/Cyan4973/xxHash. xxHash provides a collection of non-cryptographic hash functions (XXH32, XXH64 and, recently, XXH3 and its 128-bit variant XXH128). Compared to md5 and sha256 (cryptographic hash functions), this means that no 'obscurity' is introduced in the hash with regard to revealing the original data, IIUC. For verifying file integrity this doesn't matter; what matters more is the sensitivity to small changes, and uniqueness.
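As an aside, XXH32 and XXH64 file hashing can already be done from R with the digest package; a minimal sketch (the file path is a placeholder, and which algorithms are available depends on the installed digest version):

```r
# Compute an XXH64 checksum of a file in R with the digest package.
# The path below is a placeholder.
library(digest)

path <- "some_data_file.zip"
digest(path, algo = "xxhash64", file = TRUE)
```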
In R, XXH3 and XXH128 are not yet implemented for file hashing. Importantly, xxHash has far superior speed to md5 and sha256 (with XXH3/XXH128 even much faster, but those are not available in R for file hashing). See #122 (comment) for timings in R.

Stability of the implementation in R
On the other hand, if we'd have an implementation of xxHash in R, analogous to

Concluding remark

So in the end, more confusion 🤔? I think we should at least support the usage of more recent and faster file hashing algorithms, if not use them by default, and take into account that our preferred choice in
Update after a small experiment with XXH64 and XXH128, using a large file of 2.7 GB, to test how sensitive the hash function is to the smallest possible change (one bit) in a large file. It seems very promising: once the file is loaded from disk, the xxh64 and xxh128 calculations not only take approximately just a second (or less); the hash value is also completely different after flipping one bit (tried three times, with different bits). It is also demonstrated how flipping the bit back restores the original hash values.

Flipping bits of a large file and the effect on the xxHash value

```
$ xxhsum --version
xxhsum 0.7.3 (64-bit x86_64 + SSE2 little endian), GCC 9.3.0, by Yann Collet
$
$ ls -lh hirsute-desktop-amd64.iso
-rw-rw---- 1 floris floris 2,7G mrt 29 16:52 hirsute-desktop-amd64.iso
$
$ xxh64sum hirsute-desktop-amd64.iso # original checksum
d4252122926c351a hirsute-desktop-amd64.iso
$ xxh128sum hirsute-desktop-amd64.iso
d8a90b17b19941634efd03e4633a31bb hirsute-desktop-amd64.iso
$
$ bitflip hirsute-desktop-amd64.iso 73513453 # flip bit
$
$ xxh64sum hirsute-desktop-amd64.iso
b934ff48481a4fe9 hirsute-desktop-amd64.iso
$ xxh128sum hirsute-desktop-amd64.iso
dabffcb9b6ef8d9665d035af0af7d179 hirsute-desktop-amd64.iso
$
$ bitflip hirsute-desktop-amd64.iso 73513453 # restore bit
$
$ xxh64sum hirsute-desktop-amd64.iso
d4252122926c351a hirsute-desktop-amd64.iso
$ xxh128sum hirsute-desktop-amd64.iso
d8a90b17b19941634efd03e4633a31bb hirsute-desktop-amd64.iso
$
$ bitflip hirsute-desktop-amd64.iso 12345654 # flip bit
$
$ xxh64sum hirsute-desktop-amd64.iso
b594dc513f3c7321 hirsute-desktop-amd64.iso
$ xxh128sum hirsute-desktop-amd64.iso
541819a95d06d4b5ffa8fca70d8e9971 hirsute-desktop-amd64.iso
$
$ bitflip hirsute-desktop-amd64.iso 12345654 # restore bit
$
$ xxh64sum hirsute-desktop-amd64.iso
d4252122926c351a hirsute-desktop-amd64.iso
$ xxh128sum hirsute-desktop-amd64.iso
d8a90b17b19941634efd03e4633a31bb hirsute-desktop-amd64.iso
$
$ bitflip hirsute-desktop-amd64.iso 147852369 # flip bit
$
$ xxh64sum hirsute-desktop-amd64.iso
79f15e384d23459e hirsute-desktop-amd64.iso
$ xxh128sum hirsute-desktop-amd64.iso
d7692da02a17da15f8cb4779b8b68f8e hirsute-desktop-amd64.iso
$
$ bitflip hirsute-desktop-amd64.iso 147852369 # restore bit
$
$ xxh64sum hirsute-desktop-amd64.iso
d4252122926c351a hirsute-desktop-amd64.iso
$ xxh128sum hirsute-desktop-amd64.iso
d8a90b17b19941634efd03e4633a31bb hirsute-desktop-amd64.iso
```
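For reference, a hypothetical R equivalent of the bitflip helper used above, with digest's xxhash64 standing in for xxh64sum (I assume the number is a bit index; seek()-based in-place editing may be unreliable on Windows):

```r
# Hypothetical R sketch of the experiment above: flip one bit of a file
# in place, then recompute its xxHash checksum with the digest package.
library(digest)

bitflip <- function(path, bit_index) {
  byte_offset <- bit_index %/% 8  # byte that holds the target bit
  bit_in_byte <- bit_index %% 8   # position of the bit within that byte
  con <- file(path, open = "r+b")
  on.exit(close(con))
  seek(con, where = byte_offset, rw = "read")
  b <- readBin(con, what = "raw", n = 1L)
  flipped <- as.raw(bitwXor(as.integer(b), bitwShiftL(1L, bit_in_byte)))
  seek(con, where = byte_offset, rw = "write")
  writeBin(flipped, con)
  invisible(path)
}

h0 <- digest("hirsute-desktop-amd64.iso", algo = "xxhash64", file = TRUE)
bitflip("hirsute-desktop-amd64.iso", 73513453)  # flip one bit
h1 <- digest("hirsute-desktop-amd64.iso", algo = "xxhash64", file = TRUE)
identical(h0, h1)                               # FALSE: checksum changed
bitflip("hirsute-desktop-amd64.iso", 73513453)  # flip it back to restore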
A bit of elaboration on the file checksum topic as a reference for later - comments most welcome! The discussion originated in inbo/n2khab-preprocessing#50 but is of broader relevance given the future n2khab intentions. Currently we still keep track of both checksums, e.g. with compute_filehashes.R at a7fafb8.
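For illustration, computing both checksums for one file in R could look like this minimal sketch (the file path is a placeholder; compute_filehashes.R itself may differ):

```r
# Compute both checksums for a single file with the digest package.
# The file path is a placeholder.
library(digest)

path <- "some_data_file.zip"
c(md5    = digest(path, algo = "md5",    file = TRUE),
  sha256 = digest(path, algo = "sha256", file = TRUE))
```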
Experiments
First, a small experiment, repeated 3 times, on a 1.2 GiB file. Not shown is the first run, where md5 (because it ran first) took about 15 s; that is simply due to reading the file into memory for the first (and only) time - from my slow HDD, that is.
(Three reprex outputs with the timing results, created on 2021-02-11 by the reprex package (v1.0.0); session info collapsed.)
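The reprexes boil down to timing comparisons of this kind (a minimal sketch; the file path is a placeholder, and timings depend on hardware and file caching):

```r
# Time md5 versus sha256 file hashing on the same (large) file.
# The path is a placeholder; run each timing a few times, since the
# first read pulls the file into the OS cache.
library(digest)

f <- "large_file.bin"
system.time(digest(f, algo = "md5",    file = TRUE))
system.time(digest(f, algo = "sha256", file = TRUE))
```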
So in practice it won't make much difference in terms of calculation time: you need a very large file (as above) to notice the difference (about 1.3 s). Compare with a 95.4 MiB file, where the difference is small (0.1 s).
(Two reprex outputs with the timing results for the 95.4 MiB file, created on 2021-02-11 by the reprex package (v1.0.0).)
So calculation times will differ appreciably between md5 and sha256 only when handling a bunch of (larger) files at once, or for a much larger file (which we currently don't use).
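Checksumming a batch of files at once, where the cumulative difference would show, could look like this sketch (the directory name is hypothetical):

```r
# Checksum every file under a data directory at once.
# The directory name is hypothetical.
library(digest)

files <- list.files("data", recursive = TRUE, full.names = TRUE)
vapply(files, digest, character(1), algo = "md5", file = TRUE)
```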
Which one to choose?

Background information
Some background information comes from Wikipedia (especially here).
- Both are examples of cryptographic hash functions. A hash is not unique by design: because arbitrary-size inputs are mapped to a fixed-size hash, it is theoretically possible - not necessarily feasible - to create different inputs that produce the same hash (= checksum): a hash collision.
- For use in security applications (e.g. when hashing a password), it is important that it is infeasible to calculate a list of colliding inputs - hence to try guessing the possible input.
- Again for security purposes, SHA-2 (of which SHA-256 is one algorithm) has much better collision resistance than MD5 (and also than SHA-1), i.e. in terms of the feasibility of finding collisions. It is used where security is important, e.g.:
On the topic of data integrity:
See also https://en.wikipedia.org/wiki/File_verification, which describes the difference between integrity and authenticity verification.
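In our setting, integrity verification amounts to comparing a freshly computed checksum against a stored reference value; a minimal sketch (function name and arguments are hypothetical):

```r
# Integrity (not authenticity) check: recompute a file's checksum and
# compare it with a stored reference value. Names are hypothetical.
library(digest)

verify_file <- function(path, expected, algo = c("md5", "sha256")) {
  algo <- match.arg(algo)
  identical(digest(path, algo = algo, file = TRUE), expected)
}

# verify_file("some_data_file.zip", "<stored checksum>", algo = "sha256")
```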
Concluding thoughts
So, just for verifying file integrity in a trusted context (such as ours), it does not actually matter which one we choose.
Opinions do differ on this, e.g. in https://stackoverflow.com/q/14139727. It seems to mainly be a question of: do you want it to be secure as well? The following states it rather well IMO: