[ingest] Re-introduce the Hash Processor with consistent keys across all nodes #34085
Comments
Pinging @elastic/es-core-infra
Unblocked by #40416
Pinging @elastic/es-core-features
Once #46241 is merged, we may want to consider making this processor async.
@jakelandis Can you clarify why we would want to go async here? I would expect calculating the hash to be fast, and not worth context switching off the processing thread.
At job-1 we did something similar on high-volume ingestion (not to ES), and calculating the hash was by far the most expensive of the transformations applied to the data. Once this is implemented, running it with Rally would be helpful for making that decision.
It might be the most expensive, but that doesn't mean it's so expensive that it needs to be done after a couple of context switches. Recall, the use case for the hash processor is to anonymize fields like names. Those are not huge fields.
I'm not sure whether context switches or cryptographic hashes (for small data) are more expensive. However, I do think we should measure it and let that guide the decision. EDIT: for clarity, I am not saying we should choose async if it is marginally faster in the tests. IIRC, at job-1 the overhead was substantial: 5-10ms per doc on enterprise-class bare metal, which adds up when multiplied by millions of docs. I would just want to confirm that we are not introducing a large performance difference.
I am concerned about spending time on details that are intuitively not expected to be a problem, and that we should perhaps only revisit if they prove to be a problem in the wild.
I ran a quick JMH benchmark to see if my concerns were justified. During that process, I remembered that at job-1 we used SHA256withRSA, not HmacSHA256 (the default). HmacSHA256 is plenty fast enough, and we don't need to explore going async here. (SHA256withRSA is indeed slow, but not relevant here since it is not supported.)
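For a rough sense of the cost being discussed, here is an illustrative micro-benchmark. This is a Python sketch using the standard-library `hmac` module, not the JMH/Java setup mentioned above, so absolute numbers will differ; the key and field value are hypothetical:

```python
import hmac
import hashlib
import timeit

key = b"cluster-wide-secret"        # hypothetical processor key
value = b"jane.doe@example.com"     # a small field, as in the anonymization use case

def hash_field() -> str:
    """HmacSHA256 of a single small field value."""
    return hmac.new(key, value, hashlib.sha256).hexdigest()

n = 100_000
seconds = timeit.timeit(hash_field, number=n)
print(f"~{seconds / n * 1e6:.2f} us per field")
```

On typical hardware this lands in the sub-microsecond to low-microsecond range per field, consistent with the conclusion that the hash itself is cheap relative to per-document overheads in the millisecond range.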
Is this issue still being worked on? |
Team, I'm wondering if someone can verify whether this example is workable/safe purely for the purpose of generating unique document ids for de-duplication (i.e., not safe for cryptographic purposes, per the original statement of this issue)?
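For reference, a keyless content hash of the kind this question alludes to might look like the following sketch (the field names are hypothetical). Because there is no key, it is deterministic on every node and usable as a de-duplication id, but it provides no anonymization:

```python
import hashlib

def dedup_id(doc: dict, fields=("host", "timestamp", "message")) -> str:
    """Derive a deterministic document id from identifying fields.

    Keyless SHA-256: the same document always yields the same id on any
    node, which is what de-duplication needs. NOT suitable for hiding
    the field values, since anyone can recompute the hash.
    """
    # Join fields with a separator unlikely to appear in the values.
    joined = "\x1f".join(str(doc.get(f, "")) for f in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

Whether this is "safe" depends only on collision resistance for id purposes, not on key secrecy, which is why it sidesteps the key-consistency problem that blocked the processor.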
This has been open for quite a while, and we haven't made much progress on it due to focus on other areas. For now I'm going to close this as something we aren't planning on implementing. We can re-open it later if needed.
A Hash processor was introduced (#31087), and subsequently reverted once it was realized that if the keys (local to each node) ever got out of sync, it could cause silent functional issues: the same value would hash to different outputs depending on which node processed it.
Since the key is sensitive data, we cannot simply store it in the cluster state. After some internal discussions we have identified two possible strategies to mitigate the out-of-sync issue.
a) Introduce the concept of a consistent setting. Keep the key stored in the keystore and keep an id of sorts (a hash of the key?) either in cluster state, or require that id to be included in the configuration for the hash processor. What to do when the setting is inconsistent will require some more discussion.
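Strategy (a) could be sketched as below. The names, the id derivation (a plain hash of the key, as floated above), and the id length are all hypothetical; the point is only that the id can be published to cluster state because it reveals nothing useful about the key:

```python
import hashlib

def key_id(key: bytes) -> str:
    """Non-sensitive fingerprint of a keystore key.

    Safe to store in cluster state or in the processor config:
    recovering the key from it requires inverting SHA-256.
    """
    return hashlib.sha256(key).hexdigest()[:16]

def is_consistent(local_key: bytes, expected_id: str) -> bool:
    """Check the node-local key against the cluster-wide expected id."""
    return key_id(local_key) == expected_id

local_key = b"secret-from-local-keystore"   # hypothetical keystore contents
expected_id = key_id(local_key)             # what cluster state would record
```

What to do on a mismatch (reject the document, fail the pipeline, or refuse to start the processor) is exactly the open question noted above.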
b) Introduce the concept of encrypted settings stored in the cluster state. More discussion here: #32727