-
Notifications
You must be signed in to change notification settings - Fork 712
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feature: reusable fingerprinting interface #4628
Conversation
409288b
to
f595aa8
Compare
/* Originally we represented the hash as the raw bytes of the digest. | ||
* However, the JA4 hash is composed of a string prefix and two digests, | ||
* so cannot reasonably be represented as raw bytes. | ||
* Keep the old behavior for the old method. | ||
*/ | ||
unsigned int legacy_hash_format : 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love this. The other options I considered:
- Remove the old method. After all, it is marked unstable. But we know it's currently in use.
- Translate the hex back to raw bytes in the old method. That seemed messy, since we'd translate to hex just to immediately translate back, but maybe that's better than these flags.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, personally I like the second option? I prefer complicating the legacy stuff to complicating the new stuff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I agree, but I'd prefer to fix it in a follow-up PR.
Customers are actively using the legacy stuff, but not actively using the new stuff. snprintf / sscanf are a quick way to handle hex without complicating the code, but they're not exactly efficient. With a quick timing test, I've got the same number of operations taking 12s with the extra unnecessary snprintf + sscanf that took 9s before. Not a crazy slowdown, but also really not great.
I can get it down to 10s if I use the stuffer hex methods (+ LTO), but the stuffer hex methods are currently only for testing. Making them available for non-testing would complicate this PR. I can also actually keep the timing at 9s if I inline my own hex conversion logic, so I'd also want to investigate whether the stuffer hex methods are doing something inefficient.
api/unstable/fingerprint.h
Outdated
/** | ||
* Calculates the raw string for a fingerprint. | ||
* | ||
* @param fingerprint The fingerprint to be used for the raw string | ||
* @param max_output_size The maximum size of data that may be written to `output`. | ||
* If `output` is too small, an S2N_ERR_T_USAGE error will occur. | ||
* @param output The location that the requested hash will be written to. | ||
* @param output_size Output variable to be set to the actual size of the data | ||
* written to `output`. | ||
* @returns S2N_SUCCESS on success, S2N_FAILURE on failure. | ||
*/ | ||
S2N_API int s2n_fingerprint_get_raw(struct s2n_fingerprint *fingerprint, | ||
uint32_t max_output_size, uint8_t *output, uint32_t *output_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've gone back and forth on including or not including the raw / full string. Currently no customers are using it, but customers did originally ask for it. With the s2n_fingerprint_hash struct I don't think calculating it adds much complexity, but it does still add some complexity. If we want to remove the complexity, we also have to remove the old version of the method, but again this API is marked unstable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems very possible that people might have some new usecase for this in the future, so I like having it included now to ensure that it always fits in well with the feature.
.hash = S2N_HASH_MD5, | ||
/* The hash string is a single MD5 digest represented as hex */ | ||
.hash_str_size = MD5_DIGEST_LENGTH * S2N_HEX_PER_BYTE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#4578 (comment) So I now remember why I had the hash size on the struct instead of the hash itself :) But two separate fields works just fine.
api/unstable/fingerprint.h
Outdated
/** | ||
* Calculates the raw string for a fingerprint. | ||
* | ||
* @param fingerprint The fingerprint to be used for the raw string | ||
* @param max_output_size The maximum size of data that may be written to `output`. | ||
* If `output` is too small, an S2N_ERR_T_USAGE error will occur. | ||
* @param output The location that the requested hash will be written to. | ||
* @param output_size Output variable to be set to the actual size of the data | ||
* written to `output`. | ||
* @returns S2N_SUCCESS on success, S2N_FAILURE on failure. | ||
*/ | ||
S2N_API int s2n_fingerprint_get_raw(struct s2n_fingerprint *fingerprint, | ||
uint32_t max_output_size, uint8_t *output, uint32_t *output_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems very possible that people might have some new usecase for this in the future, so I like having it included now to ensure that it always fits in well with the feature.
/* Originally we represented the hash as the raw bytes of the digest. | ||
* However, the JA4 hash is composed of a string prefix and two digests, | ||
* so cannot reasonably be represented as raw bytes. | ||
* Keep the old behavior for the old method. | ||
*/ | ||
unsigned int legacy_hash_format : 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, personally I like the second option? I prefer complicating the legacy stuff to complicating the new stuff.
} | ||
|
||
#[test] | ||
fn raw_may_allocate_memory() -> Result<(), Box<dyn Error>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ooo, these tests are fun, I like them.
S2N_API int s2n_fingerprint_set_client_hello(struct s2n_fingerprint *fingerprint, struct s2n_client_hello *ch); | ||
|
||
/** | ||
* Get the size of the fingerprint hash. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd prefer short
over hash
? It's less descriptive but I'd expect that to be to our benefit as we add more fingerprint types, e.g. if the short representation of some fingerprint in the future doesn't actually rely on a hash.
Description of changes:
Adds a new fingerprint struct. The reasoning behind this struct is:
The new methods behave like the old methods, except:
This change is backwards compatible and should not break anything, even though fingerprinting is unstable, because we don't NEED to break anything yet. We may want to remove the legacy methods someday though.
Call-outs:
This PR is a little big, but I didn't want to split the Rust and C APIs because it's important that our C API can translate to reasonable, usable Rust. However, if it's difficult to review, I can split it.
Testing:
New unit tests for the new methods. The s2n_fingerprint_ja3_test continues to pass, as do the wireshark comparison fingerprint tests in the pcap crate.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.