Automatic tuning of mappings of data streams #87469

jpountz · 2022-06-07T16:23:06Z

I've been in a few discussions about improving index templates to change the index/search trade-off via runtime fields and doc-value-only fields. These discussions are hard to move forward because the right trade-off depends on how end users leverage their data, which isn't known ahead of time. The same fields might be used very differently depending on whether and how end users leverage SIEM or alerting, e.g. whether there are custom rules.

Since we can't know ahead of time how end users will leverage the data, and since this information can change over time, I'm considering making data streams able to tune their index templates based on usage. The high-level idea I have in mind is that data streams could look at usage statistics upon rollover and update the index template if there is a mismatch between search-time usage of the fields and how they are mapped. This way, the next index should hopefully get mappings that better fit the sort of searches that run.
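To make the rollover step concrete, here is a minimal sketch of merging per-field suggestions into a composable index template body before the next backing index is created. The function name `retune_template` and the `(section, mapping)` suggestion format are hypothetical; only the `template.mappings.properties` and `template.mappings.runtime` layout comes from the index template format. Nothing here talks to a cluster.

```python
import copy


def retune_template(template: dict, suggestions: dict) -> dict:
    """Return a copy of `template` with per-field suggestions merged in.

    `suggestions` (hypothetical format) maps a field name to a tuple
    (section, mapping), where section is "properties" for regular mapped
    fields or "runtime" for runtime fields.
    """
    updated = copy.deepcopy(template)  # never mutate the caller's template
    mappings = updated.setdefault("template", {}).setdefault("mappings", {})
    for field, (section, mapping) in suggestions.items():
        other = "runtime" if section == "properties" else "properties"
        # A field should live in exactly one section, so drop the old entry.
        mappings.get(other, {}).pop(field, None)
        mappings.setdefault(section, {})[field] = mapping
    return updated
```

Only the template changes, so existing backing indices keep their mappings and the new mappings take effect with the next index created by rollover.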

Some interesting things we could do that way:

Switch between fully-indexed, doc-value-only fields and runtime fields based on usage.
Switch between text and match_only_text fields depending on how frequently positional queries run.
Switch between keyword and wildcard depending on the cardinality of the field and whether users run infix queries.
Enable eager_global_ordinals on fields that are frequently used for terms/composite aggregations.
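The switches listed above could be expressed as a small decision function. This is only a sketch under stated assumptions: the `FieldUsage` summary, its counters, and every threshold are made up for illustration, and a real implementation would have to derive them from whatever usage statistics are available. The mapping parameters it emits (`match_only_text`, `wildcard`, `index: false`, `eager_global_ordinals`) are existing mapping options.

```python
from dataclasses import dataclass

# Illustrative thresholds only; none of these values come from the discussion.
POSITIONAL_RATIO = 0.05      # share of searches that need positions
INFIX_RATIO = 0.05           # share of searches that are infix (*foo*) queries
HIGH_CARDINALITY = 100_000   # distinct values above which wildcard may pay off
INDEXED_RATIO = 0.01         # share of searches that filter on the field
EAGER_ORDINALS_RATIO = 0.20  # share of searches running terms/composite aggs


@dataclass
class FieldUsage:
    """Hypothetical per-field usage summary since the last rollover."""
    current_type: str
    cardinality: int = 0
    filter_queries: int = 0
    positional_queries: int = 0
    infix_queries: int = 0
    terms_aggs: int = 0
    doc_value_reads: int = 0


def suggest_mapping(usage: FieldUsage, total_searches: int) -> tuple:
    """Return (section, mapping): section is "properties" or "runtime"."""
    total = max(total_searches, 1)  # avoid division by zero

    if usage.current_type in ("text", "match_only_text"):
        needs_positions = usage.positional_queries / total > POSITIONAL_RATIO
        return ("properties", {"type": "text" if needs_positions else "match_only_text"})

    if usage.current_type in ("keyword", "wildcard"):
        infix_heavy = usage.infix_queries / total > INFIX_RATIO
        if infix_heavy and usage.cardinality > HIGH_CARDINALITY:
            return ("properties", {"type": "wildcard"})

        touched = (usage.filter_queries + usage.infix_queries
                   + usage.terms_aggs + usage.doc_value_reads)
        if touched == 0:
            # Never used at search time: compute it lazily as a runtime field.
            return ("runtime", {"type": "keyword"})

        mapping = {"type": "keyword"}
        if (usage.filter_queries + usage.infix_queries) / total <= INDEXED_RATIO:
            mapping["index"] = False  # doc-value-only field
        if usage.terms_aggs / total > EAGER_ORDINALS_RATIO:
            mapping["eager_global_ordinals"] = True
        return ("properties", mapping)

    # Other field types are left unchanged in this sketch.
    return ("properties", {"type": usage.current_type})
```

Ratios rather than absolute counts are used so that the same thresholds behave similarly on busy and quiet data streams; whether that is the right normalization is an open question.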

More thoughts/notes:

This wouldn't work for all use-cases. E.g. some users might prefer slow searches over slower ingestion. So it might need a parameter on index templates to opt in to this behavior?
Would we need a way to disable automatic tuning at the field level? E.g. please keep host.name indexed even if it hasn't been used for filtering recently.
The field usage statistics API gives us information we can use to downgrade mappings, e.g. disabling indexing. How can we figure out if/when we should upgrade a field, e.g. enabling indexing, since we don't have usage data for data structures that don't exist?
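As a rough illustration of the downgrade side, the sketch below sums per-shard counters from a field usage stats response for one index and flags candidates. The assumed response shape (`shards[].stats.fields.<name>` with `any`, `inverted_index.terms`, `points` and `doc_values` counters) should be checked against the Elasticsearch version in use. The helper names, the `mapped_fields` argument, and the `pinned` set (standing in for a field-level opt-out such as keeping host.name indexed) are hypothetical.

```python
from collections import defaultdict


def aggregate_field_usage(index_stats: dict) -> dict:
    """Sum usage counters across all shards of one index."""
    totals = defaultdict(lambda: {"any": 0, "index_reads": 0, "doc_values": 0})
    for shard in index_stats.get("shards", []):
        for name, stats in shard["stats"]["fields"].items():
            field = totals[name]
            field["any"] += stats.get("any", 0)
            # Term lookups and point (BKD) reads both rely on indexing.
            field["index_reads"] += (stats.get("inverted_index", {}).get("terms", 0)
                                     + stats.get("points", 0))
            field["doc_values"] += stats.get("doc_values", 0)
    return dict(totals)


def downgrade_candidates(totals: dict, mapped_fields, pinned=frozenset()) -> dict:
    """Map field name -> suggested downgrade, skipping pinned and metadata fields."""
    unused = {"any": 0, "index_reads": 0, "doc_values": 0}
    plan = {}
    for name in sorted(mapped_fields):
        if name.startswith("_") or name in pinned:
            continue
        usage = totals.get(name, unused)  # absent from stats means never read
        if usage["any"] == 0:
            plan[name] = "runtime"
        elif usage["index_reads"] == 0:
            plan[name] = "doc-value-only"
    return plan
```

The sketch also shows the asymmetry raised above: a field that is already doc-value-only or runtime has no index counters that could ever become non-zero, so these numbers alone cannot justify an upgrade.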


elasticmachine · 2022-06-07T16:23:09Z

Pinging @elastic/es-data-management (Team:Data Management)

ruflin · 2022-06-09T06:58:03Z

I really like the basic idea behind this: "If you use it often, we will make it faster for you". At first, I'm thinking of this as a flag that can be enabled so we can start playing with it.

We recently had a related discussion about having a switch between mapping fields as runtime fields and indexing them, to go from slow queries to fast queries and the other way around. Having the above would make that switch unnecessary and only optimise the parts that have to be optimised.

> This way, the next index should hopefully get mappings that better fit the sort of searches that run.

This is a topic that keeps turning in my head. Should there also be an option to make old data faster? If users now start to query a lot on the 1 month old data for the keyword foo, should there be a way to create doc values for this field for old data? Or as we talk about time series, do all optimisations always only happen on new data?

jpountz · 2022-06-09T14:17:59Z

> Should there also be an option to make old data faster?

It's not completely impossible, but it's hard to do without reindexing and I would suggest making it a separate discussion.

jpountz added >feature discuss :Data Management/Data streams Data streams and their lifecycles labels Jun 7, 2022

elasticmachine added the Team:Data Management Meta label for data/management team label Jun 7, 2022

jpountz removed the discuss label Jul 18, 2022

jpountz mentioned this issue Mar 16, 2023

[Fleet] Set dynamic field mapping to false or runtime elastic/kibana#128072

Open

ruflin mentioned this issue Apr 12, 2024

Optimize mappings/storage/query of datasets elastic/elastic-package#1764

Open
