-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactoring and extending the Digest module with better hash functions #12307
Conversation
This seems lovely to me, as I looked at It's good to have I think something like |
|
||
(** {1 Generic interface} *) | ||
|
||
module type S = sig |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be nice to add a val id : string
or val : name : string
made to hold a lowercased ascii identifier for the digest function. This allows to easily index them by name via first-class modules. See for example here for something similar (for now not suggesting to add the lookup API, only the names).
stdlib/digest.ml
Outdated
external update: state -> string -> int -> int -> unit = "caml_blake2_update" | ||
external final: state -> int -> t = "caml_blake2_final" | ||
external unsafe_string: int -> string -> string -> int -> int -> t | ||
= "caml_blake2_string" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about tweaking that to a HASH
interface without the blake specific int
s. And provide Digest.Make
that derives S
from that. This could be useful for the larger eco-system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, the implementation is structured like this, with a Make
functor parameterized by a low-level hash function, but I haven't exported it yet, and I'm not sure it would really be usable.
stdlib/digest.mli
Outdated
val get_digest : state -> t | ||
(** [get_digest st] returns the digest of all the data that has | ||
been added to [st]. | ||
The rolling digest [st] should not be used afterwards. *) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this due to destructively padding to 128 bytes at the end?
Then this could be made "safe" by padding a copy of the state. A copy function might be useful by itself.
Alternatively, if we want to restrict ourselves to more basic scenarios, various tricks could be used to hint at or enforce this linear discipline. For example with_state : (state -> unit) -> t
similar to In_channel.with_open_text
etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be made safe for reuse, but I'm not sure there's a use case. Opinions welcome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is interesting, it provides two things that I would like to have:
- a path towards a non-MD5 Digest implementation (this PR does only a first step, for compatibility reasons)
- a better OCaml-side API for hashes where input can be fed gradually to a hasher state
I broadly agree with the approach and would be in favor of merging eventually. I have not reviewed the Blake2 implementation in details -- I think that someone should check that it corresponds to its published pseudo-code.
Note to reviewers: the documentation needs more work, I'm not happy with the wording in several places. |
I do appreciate this work. Thanks a lot. I only have one question / remark: why is the "incremental" API mutable (is that for performance reasons? if it has been measured, with what usage pattern)? If it stays mutable, a function |
I see some legitimate questions about the new incremental hash API, and they make me think twice about my API choice. Originally, this PR did not have an incremental API, it was only about providing multiple implementations of the existing API, so that we can gradually move away from MD5. The incremental API was added following reviewer's comments, and I didn't think hard about the API, just using the low-level, imperative, linear-use API that is typical of C crypto libraries and that I used previously in my Cryptokit library. To me it's an obvious choice, but not to everyone, apparently. Since PRs are the wrong place to evolve a new API, I decided to remove the incremental API from this PR. Now we're back to the multiple implementations of the old, nonincremental API that I'd like to push in 5.2. I'm asking reviewers to please focus on this API and not ask for more functions until we get something merged. The original code with the questionable incremental API is kept at https://github.com/xavierleroy/ocaml/tree/blake2b-incremental . |
During the last triaging meeting it was agreed that I would review this PR. I was told that more-than-basic cryptographic knowledge wasn’t necessary. Did I understand correctly @gasche? The longest task will be checking the C implementation with regard to the BLAKE2 RFC. Since said RFC provides a reference implementation, I was wondering why that implementation was not the one used? |
Thanks a lot for your help!
I took the BLAKE2b implementation from my Cryptokit library https://github.com/xavierleroy/cryptokit/blob/master/src/blake2.c , but it's essentially the RFC 7693 reference code with a few style changes to match the style of the rest of Cryptokit.
For symmetric cryptographic functions, it's quite hard to make a honest mistake and still pass the test vectors, owing to the "pseudo-random" nature of these functions. Well, the (non-cryptographic) buffering code could trigger a small buffer overflow yet pass the tests, so maybe this needs special attention. So, if I were you I'd focus on the test vectors, which I'm not completely happy about : the 512-bit vectors are taken from public sources and/or running openssl, but the 256-bit and 128-bit vectors are home-made from the RFC implementation. Also, there's a "self test" in the RFC that I didn't implement and may be worth adding somewhere. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As advised, I did not review each and every line of cryptographic computations, but I:
- looked particularly hard to convince myself of the absence of overflows of the buffer;
- checked that I obtained the same results for the test vectors on 256-bit and 128-bit with the GNU coreutils implementation of
b2sum
; - propose a patch adding the BLAKE2b self-test from the RFC at OlivierNicole@117bedb.
…gest.MD5 Also: use In_channel.with_open_bin.
`Digest.from_hex` remains available as an alias for `Digest.of_hex`.
"length" makes it clearer that the unit is bytes, not bits.
… size As pointed out during review, an unsigned type makes more sense. RFC 7693 uses size_t throughout; this is safe and nicely consistent.
Thanks a million @OlivierNicole for the review and for the "self test", which adds significant confidence! Your remark about unsigned vs signed integers is well taken. Going back to the RFC code, they use |
Thank you, the change for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had previously reviewed the interface, but not the crypto details. Together with Olivier's review of the actual code (the hardest part), we now have a full review, maybe slightly more.
I would be keen to see this PR merged in 5.2 as it introduces the Digest. submodules that people can use for forward-compatibility if they want a specific implementation.
Other avenues of extensions have been discussed and could come in further PRs:
- rolling/incremental hashes
- a bit more runtime information in each module, for example the id/name of the algorithm used, as suggested by @dbuenzli
Changes
Outdated
@@ -258,6 +258,9 @@ Working version | |||
Guillaume Munch-Maccagnoni, KC Sivaramakrishnan, Stefan Muenzel, | |||
Xavier Leroy) | |||
|
|||
- #12307: Add BLAKE2b hashing and an MD5 submodule to the Digest module. | |||
(Xavier Leroy, review by Olivier Nicole) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I suspect that more reviewers deserve a mention: myself, wikku (Wiktor Kuchta), Daniel and, possibly Hannes.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. Note that reviewing is one thing and just commenting is another.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @xavierleroy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I checked that the implementation matches Cryptokit's modulo the nice modernisations and the API change works for me.
Given the effect on the PR's diff, a whitespace-only commit amending the indentation of the modules in stdlib/digest.ml could be direct-pushed after merging?
runtime/caml/blake2.h
Outdated
|
||
#ifdef CAML_INTERNALS | ||
|
||
#include "mlvalues.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
#include "mlvalues.h" | |
#include "misc.h" |
(just CAMLextern
is needed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Done in d2a89ee
stdlib/digest.ml
Outdated
|
||
module BLAKE2 (X: sig val hash_length : int end) : S = struct | ||
|
||
type t = string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't L64-L133 be indented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no good excuse to keep this code un-indented at this point (but it makes editing simpler during development), so I re-indented in 715b727.
stdlib/digest.ml
Outdated
|
||
(* MD5 hashing *) | ||
|
||
module MD5 = struct | ||
|
||
type t = string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, it's diff-noisy but this ought to be indented, too, at this final stage? (the GH interface at least makes space change clear, and there's patdiff and the --ignore-
family of git-diff
options elsewhere)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 715b727 as well.
Co-authored-by: David Allsopp <david.allsopp@metastack.com>
FTR: if I'm not the one who merges this PR, I'd like it to be squash-merged, as the history is too non-linear. |
I was about to squash, but it occurs to me that it might be nice to keep the git-blame of the test @OlivierNicole wrote intact, unless he's not bothered? It would just be a final rebase squashing all the other commits together into one, but leaving that commit separate. |
I definitely want @OlivierNicole to be credited, but I though he could just appear as co-author, as the Github "squash-merge" button does by default. |
I'm don't mind either way. Squashing is fine (although I have the impression that it gives me too much credit, given that @xavierleroy did most of the work). But thanks for asking! |
Three cheers! |
The Digest standard library module provides functions to compute 128-bit digests (a.k.a. hashes) of strings, files, and other sequences of bytes. It's been used from the very beginning of OCaml to record dependencies over .cmi compiled interfaces in .cmi, .cmo and .cmx compiled files, supporting the infamous "make inconsistent assumptions on" link-time error.
The hashing function used is MD5. Back in 1996 it was the recommended cryptographic hash function, fast and reputedly secure. Today, it is known to be completely insecure, with easy collision attacks and chosen-prefix attacks. For many applications such as checksumming .cmi files, there is no security expectation and this is not a problem. But it can be a problem if users wrongly assume digests to be unforgeable. Also, any use of MD5 raises alarms in a security review, even if it is used for checksumming and not as a cryptographic hash function.
This PR does not (yet) propose to change the existing Digest functions to use another, more secure hashing function. Instead, it lays the ground work to do so.
The first commit introduces a signature
Digest.S
for hash functions, providing the same functions as Digest, and one implementation of this signature,Digest.MD5
, which contains the same MD5 implementation that we have today forDigest
, with a guarantee that it will always use MD5 in the future. This way, existing OCaml code that relies onDigest
implementing MD5 can start usingDigest.MD5
instead ofDigest
and be assured not to break in the near future.The second commit adds modules
Digest.BLAKE2_512
,Digest.BLAKE2_256
, andDigest.BLAKE2_128
, using the BLAKE2b hash function specialized to 512, 256 or 128-bit hashes. BLAKE2b is an instance of a modern, cryptographically-safe hash function. (I'll explain later why I find it particularly interesting compared to other obvious choices like SHA-256.)As soon as BLAKE2b is provided as submodules of
Digest
, we can start experimenting with it, e.g. try to use it for .cmi checksumming instead of MD5.If everything goes well, in 2-3 major releases from now, we could consider changing the default
Digest
implementation from MD5 to, for instance, BLAKE2_128. (I have a hunch that most users ofDigest
don't care about the hashing algorithm used but do depend on hashes being 16-byte long.)