Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add enrich processor #48039

Merged
merged 165 commits into from
Oct 15, 2019
Merged

Add enrich processor #48039

merged 165 commits into from
Oct 15, 2019

Conversation

martijnvg
Copy link
Member

This PR adds a new ingest processor, named enrich processor, that allows document being ingested to be enriched with data from other indices.

Besides a new enrich processor, this PR adds several APIs to manage an enrich policy. An enrich policy is in charge of making the data from other indices available to the enrich processor in an efficient manner.

Closes #32789

martijnvg and others added 30 commits April 5, 2019 09:31
This allows the transport client use this class in enrich APIs.

Relates to #40997
There is no need to create a enrich store component for the transport
layer since the inner components of the store are either present in the
master node calls or via an already injected ClusterService. This commit
cleans up the class, adds the forthcoming delete call and tests the new
code.
This commit wires up the Rest calls and Transport calls for PUT enrich
policy, as well as tests and rest spec additions.
move put policy api yaml test to this rest module.

The main benefit is that all tests will then be run when running:
`./gradlew -p x-pack/plugin/enrich check`

The rest qa module starts a node with default distribution and basic
license.

This qa module will also be used for adding different rest tests (not yaml),
for example rest tests needed for #41532

Also when we are going to work on security integration then we can
add a security qa module under the qa folder. Also at some point
we should add a multi node qa module.
The enrich processor performs a lookup in a locally allocated
enrich index shard using a field value from the document being enriched.
If there is a match then the _source of the enrich document is fetched.
The document being enriched then gets the decorate values from the
enrich document based on the configured decorate fields in the pipeline.

Note that the usage of the _source field is temporary until the enrich
source field that is part of #41521 is merged into the enrich branch.
Using the _source field involves significant decompression which not
desired for enrich use cases.

The policy contains the information what field in the enrich index
to query and what fields are available to decorate a document being
enriched with.

The enrich processor has the following configuration options:
* `policy_name` - the name of the policy this processor should use
* `enrich_key` - the field in the document being enriched that holds to lookup value
* `ignore_missing` - Whether to allow the key field to be missing
* `enrich_values` - a list of fields to decorate the document being enriched with.
                    Each entry holds a source field and a target field.
                    The source field indicates what decorate field to use that is available in the policy.
                    The target field controls the field name to use in the document being enriched.
                    The source and target fields can be the same.

Example pipeline config:

```
{
   "processors": [
      {
         "policy_name": "my_policy",
         "enrich_key": "host_name",
         "enrich_values": [
            {
              "source": "globalRank",
              "target": "global_rank"
            }
         ]
      }
   ]
}
```

In the above example documents are being enriched with a global rank value.
For each document that has match in the enrich index based on its host_name field,
the document gets an global rank field value, which is fetched from the `globalRank`
field in the enrich index and saved as `global_rank` in the document being enriched.

This is PR is part one of #41521
This commit wires up the Rest calls and Transport calls for listing all
enrich policies, as well  as tests and rest spec additions.
This commit wires up the Rest calls and Transport calls for DELETE enrich
policy, as well as tests and rest spec additions.
Adds the foundation of the execution logic to execute an enrich policy. Validates
the source index existence as well as mappings, creates a new enrich index for
the policy, reindexes the source index into the new enrich index, and swaps the 
enrich alias for the policy to the new index.
…41839)

its own helper method to determine alias / policy base name.

This way both the enrich processor and policy runner use the same logic
to determine the alias to use.

Relates to #32789
)

Reindex uses scroll searches to read the source data. It is more efficient
to read more data in one search scroll round then several. I think 10000
is a good sweet spot.

Relates to #32789
The enrich key field is being kept track in _meta field by the policy runner.
The ingest processor uses the field name defined in enrich index _meta field and
not in the policy. This will avoid problems if policy is changed without
a new enrich index being created.

This also complete decouples EnrichPolicy from ExactMatchProcessor.

The following scenario results in failure without this change:
1) Create policy
2) Execute policy
3) Create pipeline with enrich processor
4) Use pipeline
5) Update enrich key in policy
6) Use pipeline, which then fails.
martijnvg and others added 8 commits October 14, 2019 19:44
This PR also includes HLRC docs for the enrich stats api.

Relates to #32789
Prior to this change the `target_field` would always be a json array
field in the document being ingested. This to take into account that
multiple enrich documents could be inserted into the `target_field`.

However the default `max_matches` is `1`. Meaning that by default
only a single enrich document would be added to `target_field` json
array field.

This commit changes this; if `max_matches` is set to `1` then the single
document would be added as a json object to the `target_field` and
if it is configured to a higher value then the enrich documents will be
added as a json array (even if a single enrich document happens to be
enriched).
This PR adds the ability to run the enrich policy execution task in the background, 
returning a task id instead of waiting for the completed operation.
@martijnvg martijnvg added >feature release highlight :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v8.0.0 v7.5.0 labels Oct 15, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-features (:Core/Features/Ingest)

@martijnvg martijnvg merged commit d941e1b into master Oct 15, 2019
martijnvg added a commit that referenced this pull request Oct 15, 2019
which adds a new ingest processor, named enrich processor,
that allows document being ingested to be enriched with data from other indices.

Besides a new enrich processor, this PR adds several APIs to manage an enrich policy.
An enrich policy is in charge of making the data from other indices available to the enrich processor in an efficient manner.

Closes #32789
@colings86 colings86 deleted the enrich branch May 27, 2020 07:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >feature release highlight v7.5.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ingest] Enrich documents prior to indexing
7 participants