Introduce a Hashing Processor #31087

Merged
merged 10 commits into from
Jun 29, 2018

talevy
Contributor

@talevy talevy commented Jun 5, 2018

It is useful to have a processor similar to
https://www.elastic.co/guide/en/logstash/6.0/plugins-filters-fingerprint.html
in Elasticsearch. A processor that leverages a variety of hashing algorithms
to create cryptographically-secure one-way hashes of values in documents.

supersedes #30790

TODO:

  • rest tests maybe

follow-up PR: add documentation #31694

@talevy talevy added the WIP and :Data Management/Ingest Node labels Jun 5, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@joshbressers

I think replicating the logstash functionality would be a good first step. I see a lot of value in having a consistent experience for the users. I think having similar functionality out of the gate then evolving both processors would be a big win. My current understanding is the logstash functionality is acceptable for GDPR requirements.

@gingerwizard

@talevy I assume that in the first iteration the salt/HMAC key will just be held as part of the ingest processor definition itself?

@talevy
Contributor Author

talevy commented Jun 13, 2018

@gingerwizard any private keys will probably need to be stored in the keystore.

talevy added a commit to talevy/elasticsearch that referenced this pull request Jun 13, 2018
It makes sense to introduce new Security ingest
processors (example: elastic#31087), and this change would
give them a good place to be written.
@talevy talevy force-pushed the hash-processor branch 3 times, most recently from 2494c30 to b7e6898 Compare June 18, 2018 23:23
@talevy talevy requested review from jkakavas and martijnvg June 26, 2018 00:36
@talevy talevy added review and removed WIP labels Jun 26, 2018
Member

@martijnvg martijnvg left a comment

From the ingest side it looks good. I left two minor comments. If docs and an integ test are added, then it LGTM.

* A processor that hashes the contents of a field (or fields) using various hashing algorithms
*/
public final class HashProcessor extends AbstractProcessor {
enum Method {
Member

nit: maybe move this enum below the factory class?

public void execute(IngestDocument document) {
Map<String, String> hashedFieldValues = fields.stream().map(f -> {
try {
String value = document.getFieldValue(f, String.class);
Member

Should there be missing field support?

@talevy
Contributor Author

talevy commented Jun 27, 2018

thanks for the review @martijnvg, I'll follow up soon!

@talevy
Contributor Author

talevy commented Jun 28, 2018

Hi @martijnvg, thank you for the review. I've added the ignore_missing flag and added the salt param back. The salt param can be used by users, and is also leveraged by our test infrastructure to verify hash values.

I could not find a great place in the security documentation to add docs for this and I do not want to block the PR because of this (I am going on vacation next week). For this reason, I will open an issue to add documentation for when I come back, what do you think?
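For readers following along, the ignore_missing behavior described above might look roughly like the sketch below. This is a hypothetical illustration, not the `HashProcessor` from this PR: the class name, the use of a plain `Map` as the document, and the salted SHA-256 stand-in hash are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and hashing scheme are assumptions, not the PR's code.
final class IgnoreMissingSketch {

    // Returns a copy of `doc` in which each listed field is replaced by a salted hash.
    // A missing field is skipped when ignoreMissing is true, otherwise it is an error.
    static Map<String, Object> hashFields(Map<String, Object> doc, List<String> fields,
                                          String salt, boolean ignoreMissing) {
        Map<String, Object> result = new HashMap<>(doc);
        for (String field : fields) {
            Object value = doc.get(field);
            if (value == null) {
                if (ignoreMissing) {
                    continue; // leave the document untouched for this field
                }
                throw new IllegalArgumentException("field [" + field + "] is missing");
            }
            result.put(field, sha256Hex(salt + value));
        }
        return result;
    }

    // Stand-in hash (SHA-256, hex-encoded); the PR itself describes a pbkdf2hmac scheme.
    static String sha256Hex(String input) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] bytes = digest.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : bytes) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A fixed salt makes the output reproducible, which is what lets a test assert on an exact hash value.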

Member

@martijnvg martijnvg left a comment

LGTM, assuming the PR build is green. Currently it fails because of a checkstyle violation.

> I could not find a great place in the security documentation to add docs for this and I do not want to block the PR because of this (I am going on vacation next week). For this reason, I will open an issue to add documentation for when I come back, what do you think?

I'm ok with this.

@@ -277,7 +277,7 @@ public static Hasher resolve(String name) {

public abstract boolean verify(SecureString data, char[] hash);

-static final class SaltProvider {
+public static final class SaltProvider {
Member

We removed SaltProvider in soon to be merged #31234 and opted for generating a random byte array from SecureRandom and then Base64 encoding that to a string, so probably you need to do something similar here too. https://github.com/elastic/elasticsearch/pull/31234/files#diff-ebc23bc2cb194fa926b2cdafaedef9d4R565
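The approach described in this comment (random bytes from `SecureRandom`, then Base64-encoded to a string) can be sketched as follows. This is a minimal illustration under assumptions, not the code from #31234; the class name, method name, and the 32-byte length used in `main` are made up for the example.

```java
import java.security.SecureRandom;
import java.util.Base64;

// Hypothetical sketch of the approach described above; not the code from #31234.
final class RandomSaltSketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    // Fills a byte array of the requested size with random bytes
    // and returns it as a Base64-encoded string.
    static String generateSalt(int numBytes) {
        byte[] salt = new byte[numBytes];
        RANDOM.nextBytes(salt);
        return Base64.getEncoder().encodeToString(salt);
    }

    public static void main(String[] args) {
        // The 32-byte length is an arbitrary choice for this example.
        System.out.println(generateSalt(32));
    }
}
```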

Contributor Author

ah, it is also private! I see, thanks for the heads up. How soon will that be merged? One of us will have to change their code pre-pushing 😄

@talevy talevy deleted the hash-processor branch June 29, 2018 16:30
@talevy talevy added the v7.0.0 label Jun 29, 2018
talevy added a commit to talevy/elasticsearch that referenced this pull request Jun 29, 2018
It is useful to have a processor similar to
logstash-filter-fingerprint
in Elasticsearch. A processor that leverages a variety of hashing algorithms
to create cryptographically-secure one-way hashes of values in documents.

This processor introduces a pbkdf2hmac hashing scheme to fields in documents
for indexing
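
As a rough idea of what a PBKDF2-with-HMAC hash of a field value can look like using only the JDK, here is a minimal sketch. It is not the merged implementation: the class and method names, the choice of `PBKDF2WithHmacSHA256`, the use of the field value as the PBKDF2 password, and the Base64 output encoding are all assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

// Hypothetical sketch; parameter choices are assumptions, not the PR's defaults.
final class Pbkdf2FieldHashSketch {

    // Derives keyLengthBits of output from the field value and salt using PBKDF2,
    // and returns the derived bytes Base64-encoded. The salt must not be empty.
    static String hash(String fieldValue, String salt, int iterations, int keyLengthBits) {
        PBEKeySpec spec = new PBEKeySpec(
            fieldValue.toCharArray(),
            salt.getBytes(StandardCharsets.UTF_8),
            iterations,
            keyLengthBits);
        try {
            SecretKeyFactory factory = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256");
            byte[] derived = factory.generateSecret(spec).getEncoded();
            return Base64.getEncoder().encodeToString(derived);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("PBKDF2 is unavailable on this JVM", e);
        } finally {
            spec.clearPassword(); // wipe the copy of the value held by the spec
        }
    }
}
```

The same value, salt, and iteration count always yield the same output, so any node that hashes a field must be configured identically for the hashes to be comparable.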
@talevy talevy added v6.4.0 and removed review labels Jun 29, 2018
dnhatn added a commit that referenced this pull request Jun 29, 2018
* master:
  Mute 'Test typed keys parameter for suggesters' as we await a fix.
  Build test: Thread linger
  Fix gradle4.8 deprecation warnings (#31654)
  Mute FileRealmTests#testAuthenticateCaching with an @AwaitsFix.
  Mute TransportChangePasswordActionTests#testIncorrectPasswordHashingAlgorithm with an @AwaitsFix.
  Build: Fix naming conventions task   (#31681)
  Introduce a Hashing Processor (#31087)
jasontedor added a commit that referenced this pull request Jul 1, 2018
* elastic/6.x:
  Enable setting client path prefix to / (#30119)
  [DOCS] Secure settings specified per node (#31621)
  Build test: Thread linger
  Build: Fix naming conventions task   (#31681)
  Introduce a Hashing Processor (#31087)
jasontedor added a commit to martijnvg/elasticsearch that referenced this pull request Jul 1, 2018
* elastic/ccr: (30 commits)
  Enable setting client path prefix to / (elastic#30119)
  [DOCS] Secure settings specified per node (elastic#31621)
  has_parent builder: exception message/param fix (elastic#31182)
  TEST: Randomize soft-deletes settings (elastic#31585)
  Mute 'Test typed keys parameter for suggesters' as we await a fix.
  Build test: Thread linger
  Fix gradle4.8 deprecation warnings (elastic#31654)
  Mute FileRealmTests#testAuthenticateCaching with an @AwaitsFix.
  Mute TransportChangePasswordActionTests#testIncorrectPasswordHashingAlgorithm with an @AwaitsFix.
  Build: Fix naming conventions task   (elastic#31681)
  Introduce a Hashing Processor (elastic#31087)
  Do not check for object existence when deleting repository index files (elastic#31680)
  Remove extra check for object existence in repository-gcs read object (elastic#31661)
  Support multiple system store types (elastic#31650)
  [Test] Clean up some repository-s3 tests (elastic#31601)
  [Docs] Use capital letters in section headings (elastic#31678)
  muted tests that will be replaced by the shard follow task refactoring: elastic#31581
  [DOCS] Add PQL language Plugin (elastic#31237)
  Merge AzureStorageService and AzureStorageServiceImpl and clean up tests (elastic#31607)
  TEST: Fix test task invocation (elastic#31657)
  ...
talevy added a commit to talevy/elasticsearch that referenced this pull request Jul 18, 2018
@talevy
Contributor Author

talevy commented Jul 18, 2018

There are concerns that this implementation, as merged, can easily lead to incorrect usage without clear feedback. For example, secret keys may be inconsistent between ESKeyStores, which will lead to inconsistent hashing across ingests depending on which node operated on the documents.

For this reason, and others, this change is to be reverted.
PR for 6.x(6.4): #32179
PR for master: #32178
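
To make the key-consistency concern concrete, the sketch below hashes the same value with two different keys, as would happen if two nodes held different secrets. It uses plain HMAC-SHA256 from the JDK as a stand-in for the processor's keyed hashing; the class name, key strings, and sample value are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical illustration of the concern above; not the processor's code.
final class KeyMismatchSketch {

    // Computes HMAC-SHA256 of `value` under `key` and returns it hex-encoded.
    static String hmacSha256Hex(String key, String value) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            byte[] digest = mac.doFinal(value.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String value = "alice@example.com";
        // Two nodes whose keystores hold different keys (made-up values).
        String onNodeA = hmacSha256Hex("key-on-node-a", value);
        String onNodeB = hmacSha256Hex("key-on-node-b", value);
        // The same input hashes differently depending on which node handled it.
        System.out.println(onNodeA.equals(onNodeB));
    }
}
```

Nothing in the hash output signals that a key mismatch occurred, which is the "without clear feedback" part of the concern.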

talevy added a commit that referenced this pull request Jul 18, 2018
dnhatn added a commit that referenced this pull request Jul 19, 2018
* 6.x:
  Fix rollup on date fields that don't support epoch_millis (#31890)
  Revert "Introduce a Hashing Processor (#31087)" (#32179)
  [test] use randomized runner in packaging tests (#32109)
  Painless: Fix caching bug and clean up addPainlessClass. (#32142)
  Fix BwC Tests looking for UUID Pre 6.4 (#32158) (#32169)
  Call setReferences() on custom referring tokenfilters in _analyze (#32157)
  Add more contexts to painless execute api (#30511)
  Add EC2 credential test for repository-s3 (#31918)
  Fix CP for namingConventions when gradle home has spaces (#31914)
  Convert Version to Java - clusterformation part1 (#32009)
  Fix Java 11 javadoc compile problem
  Improve docs for search preferences (#32098)
  Configurable password hashing algorithm/cost(#31234) (#32092)
  [DOCS] Update TLS on Docker for 6.3
  ESIndexLevelReplicationTestCase doesn't support replicated failures but it's good to know what they are
  Switch distribution to new style Requests (#30595)
  Build: Skip jar tests if jar disabled
  Build: Move shadow customizations into common code (#32014)
  Painless: Add PainlessClassBuilder (#32141)
  Fix accidental duplication of bwc test for script behavior
  Handle missing values in painless (#30975) (#31903)
  Build: Make additional test deps of check (#32015)
  Painless: Fix Bug with Duplicate PainlessClasses (#32110)
  Adjust translog after versionType removed in 7.0 (#32020)
  Disable C2 from using AVX-512 on JDK 10 (#32138)
  [Rollup] Add new capabilities endpoint for concrete rollup indices (#32111)
  Mute :qa:mixed-cluster indices.stats/10_index/Index - all’
  [ML] Wait for aliases in multi-node tests (#32086)
  Ensure to release translog snapshot in primary-replica resync (#32045)
  Docs: Fix missing example script quote (#32010)
  Add Index UUID to `/_stats` Response (#31871) (#32113)
  [ML] Move analyzer dependencies out of categorization config (#32123)
  [ML][DOCS] Add missing 6.3.0 release notes (#32099)
  Updates the build to gradle 4.9 (#32087)
  Update monitoring template version to 6040099 (#32088)
  Fix put mappings java API documentation (#31955)
  Add exclusion option to `keep_types` token filter (#32012)
dnhatn added a commit that referenced this pull request Jul 20, 2018
* master:
  Painless: Simplify Naming in Lookup Package (#32177)
  Handle missing values in painless (#32207)
  add support for write index resolution when creating/updating documents (#31520)
  ECS Task IAM profile credentials ignored in repository-s3 plugin (#31864)
  Remove indication of future multi-homing support (#32187)
  Rest test - allow for snapshots to take 0 milliseconds
  Make x-pack-core generate a pom file
  Rest HL client: Add put watch action (#32026)
  Build: Remove pom generation for plugin zip files (#32180)
  Fix comments causing errors with Java 11
  Fix rollup on date fields that don't support epoch_millis (#31890)
  Detect and prevent configuration that triggers a Gradle bug (#31912)
  [test] port linux package packaging tests (#31943)
  Revert "Introduce a Hashing Processor (#31087)" (#32178)
  Remove empty @return from JavaDoc
  Adjust SSLDriver behavior for JDK11 changes (#32145)
  [test] use randomized runner in packaging tests (#32109)
  Add support for field aliases. (#32172)
  Painless: Fix caching bug and clean up addPainlessClass. (#32142)
  Call setReferences() on custom referring tokenfilters in _analyze (#32157)
  Fix BwC Tests looking for UUID Pre 6.4 (#32158)
  Improve docs for search preferences (#32159)
  use before instead of onOrBefore
  Add more contexts to painless execute api (#30511)
  Add EC2 credential test for repository-s3 (#31918)
  A replica can be promoted and started in one cluster state update (#32042)
  Fix Java 11 javadoc compile problem
  Fix CP for namingConventions when gradle home has spaces (#31914)
  Fix `range` queries on `_type` field for singe type indices (#31756)
  [DOCS] Update TLS on Docker for 6.3 (#32114)
  ESIndexLevelReplicationTestCase doesn't support replicated failures but it's good to know what they are
  Remove versionType from translog (#31945)
  Switch distribution to new style Requests (#30595)
  Build: Skip jar tests if jar disabled
  Painless: Add PainlessClassBuilder (#32141)
  Build: Make additional test deps of check (#32015)
  Disable C2 from using AVX-512 on JDK 10 (#32138)
  Build: Move shadow customizations into common code (#32014)
  Painless: Fix Bug with Duplicate PainlessClasses (#32110)
  Remove empty @param from Javadoc
  Re-disable packaging tests on suse boxes
  Docs: Fix missing example script quote (#32010)
  [ML] Wait for aliases in multi-node tests (#32086)
  [ML] Move analyzer dependencies out of categorization config (#32123)
  Ensure to release translog snapshot in primary-replica resync (#32045)
  Handle TokenizerFactory  TODOs (#32063)
  Relax TermVectors API to work with textual fields other than TextFieldType (#31915)
  Updates the build to gradle 4.9 (#32087)
  Mute :qa:mixed-cluster indices.stats/10_index/Index - all’
  Check that client methods match API defined in the REST spec (#31825)
  Enable testing in FIPS140 JVM (#31666)
  Fix put mappings java API documentation (#31955)
  Add exclusion option to `keep_types` token filter (#32012)
  [Test] Modify assert statement for ssl handshake (#32072)
@cdahlqvist

cdahlqvist commented Sep 21, 2018

A hashing function that, like the Logstash fingerprint filter, supports MURMUR3, MD5, SHA1 and SHA256 would be very useful for creating pipelines that avoid duplicates without requiring Logstash in the mix. For this use case, the secrecy of the key is probably less important than when hashing from a security perspective.
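
The deduplication idea can be sketched with an unkeyed SHA-256 digest over selected field values, with the result used as a document ID so that identical content maps to the same ID. This is a minimal sketch under assumptions: the class name, the length-prefix encoding of fields, and the hex output are choices made for this example and do not describe the Logstash filter or any Elasticsearch processor.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

// Hypothetical sketch of content fingerprinting for deduplication.
final class ContentFingerprintSketch {

    // Hashes the field values in order. Each value is prefixed with its byte length
    // so that ["ab", "c"] and ["a", "bc"] produce different fingerprints.
    static String fingerprint(List<String> fieldValues) {
        MessageDigest digest = newSha256();
        for (String value : fieldValues) {
            byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
            digest.update(ByteBuffer.allocate(4).putInt(bytes.length).array());
            digest.update(bytes);
        }
        return toHex(digest.digest());
    }

    // Plain SHA-256 of a single string, hex-encoded.
    static String sha256Hex(String input) {
        return toHex(newSha256().digest(input.getBytes(StandardCharsets.UTF_8)));
    }

    private static MessageDigest newSha256() {
        try {
            return MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    private static String toHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Because no secret key is involved, every node computes the same fingerprint for the same content, which sidesteps the keystore consistency issue raised earlier in this thread.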

danhermann pushed a commit to danhermann/elasticsearch that referenced this pull request Sep 24, 2019
It is useful to have a processor similar to
logstash-filter-fingerprint
in Elasticsearch. A processor that leverages a variety of hashing algorithms
to create cryptographically-secure one-way hashes of values in documents.

This processor introduces a pbkdf2hmac hashing scheme to fields in documents
for indexing
@AydinChavez

> There are concerns that this implementation, as merged, can easily lead to incorrect usage without clear feedback. For example, secret keys may be inconsistent between ESKeyStores, which will lead to inconsistent hashing across ingests depending on which node operated on the documents.
>
> For this reason, and others, this change is to be reverted.
> PR for 6.x(6.4): #32179
> PR for master: #32178

Is there any progress on this topic? Incorrect usage is quite a broad topic, and configuration issues can arise anywhere. I don't think it should block progress on this one. As long as the functional part is working, it should be fine.

@danhermann
Contributor

Note that a processor suitable for content fingerprinting was added in #68415, though it is not designed for the content anonymization use case.

Labels
:Data Management/Ingest Node, >feature

10 participants