Automatic tuning of mappings of data streams #87469

Open
jpountz opened this issue Jun 7, 2022 · 3 comments
Labels
:Data Management/Data streams Data streams and their lifecycles >feature Team:Data Management Meta label for data/management team

Comments

@jpountz
Contributor

jpountz commented Jun 7, 2022

I've been in a few discussions about improving index templates to change the index/search trade-off via runtime fields and doc-value-only fields. These discussions are hard to move forward because the right trade-off depends on how end users leverage their data, which isn't known ahead of time. The same fields might be used very differently depending on whether and how end users leverage SIEM or alerting, e.g. whether they have custom rules.

Since we can't know ahead of time how end users will leverage the data, and since this information can change over time, I'm considering making data streams able to tune their index templates based on usage. The high-level idea I have in mind is that data streams could look at usage statistics upon rollover and update the index template if there is a mismatch between search-time usage of the fields and how they are mapped. This way, the next index should hopefully get mappings that better fit the sort of searches that run.

Some interesting things we could do that way:

  • Switch between fully-indexed, doc-value-only fields and runtime fields based on usage.
  • Switch between text and match_only_text fields depending on how frequently positional queries run.
  • Switch between keyword and wildcard depending on the cardinality of the field and whether users run infix queries.
  • Enable eager_global_ordinals on fields that are frequently used for terms/composite aggregations.
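
The switches listed above could be sketched as a function that takes a field's current mapping plus usage counters gathered since the last rollover, and returns the mapping to put in the template for the next backing index. This is only a rough illustration: the `FieldUsage` type, its counter names and the thresholds are all hypothetical, and runtime fields are left out. Only the field types (`text`, `match_only_text`, `keyword`, `wildcard`) and the `index` / `eager_global_ordinals` mapping parameters are existing Elasticsearch options.

```python
from dataclasses import dataclass


@dataclass
class FieldUsage:
    """Hypothetical per-field counters gathered since the last rollover."""
    filter_queries: int = 0      # lookups that would use the inverted index
    positional_queries: int = 0  # phrase/span queries that need positions
    infix_queries: int = 0       # wildcard/regexp queries such as *foo*
    ordinal_aggs: int = 0        # terms/composite aggregations
    cardinality: int = 0         # estimated number of distinct values


# Assumed thresholds, picked arbitrarily for illustration.
FREQUENT = 100
HIGH_CARDINALITY = 1_000_000


def tune_field(mapping: dict, usage: FieldUsage) -> dict:
    """Return the mapping to use for this field in the next backing index."""
    tuned = dict(mapping)
    ftype = tuned.get("type")

    # text <-> match_only_text, driven by how often positional queries run.
    if ftype in ("text", "match_only_text"):
        frequent = usage.positional_queries >= FREQUENT
        tuned["type"] = "text" if frequent else "match_only_text"
        return tuned

    if ftype not in ("keyword", "wildcard"):
        return tuned  # other field types are left alone in this sketch

    # keyword <-> wildcard, driven by cardinality and infix queries.
    wants_wildcard = (usage.infix_queries >= FREQUENT
                      and usage.cardinality >= HIGH_CARDINALITY)
    tuned["type"] = "wildcard" if wants_wildcard else "keyword"

    if wants_wildcard:
        # Drop the keyword-specific parameters this sketch manages.
        tuned.pop("index", None)
        tuned.pop("eager_global_ordinals", None)
        return tuned

    # Fully indexed vs doc-value-only: stop indexing when nobody filters.
    if usage.filter_queries == 0:
        tuned["index"] = False
    else:
        tuned.pop("index", None)

    # Eager global ordinals for fields that are aggregated on frequently.
    if usage.ordinal_aggs >= FREQUENT:
        tuned["eager_global_ordinals"] = True
    else:
        tuned.pop("eager_global_ordinals", None)
    return tuned
```

For example, `tune_field({"type": "keyword"}, FieldUsage())` yields `{"type": "keyword", "index": False}`, i.e. a doc-value-only keyword field for the next index.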

More thoughts/notes:

  • This wouldn't work for all use-cases. E.g. some users might prefer slower searches over slower ingestion. So it might need a parameter on index templates to opt in to this behavior?
  • Would we need a way to disable automatic tuning at the field level? E.g. please keep host.name indexed even if it hasn't been used for filtering recently.
  • The field usage statistics API gives us information we can use to downgrade mappings, e.g. disabling indexing. How can we figure out if/when we should upgrade a field, e.g. enabling indexing, since we have no usage data for data structures that don't exist?
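
To make the notes above concrete, here is a minimal sketch of how an opt-in flag and a field-level override might gate downgrades. It assumes per-field counters that resemble the `inverted_index`, `points` and `doc_values` sections of the field usage stats response, already summed across shards. The `auto_tune` template flag and the `pinned` set are hypothetical names, not existing settings. Note that it only proposes downgrades; it has no answer to the upgrade question.

```python
def downgrade_candidates(template: dict, usage: dict, pinned: set) -> list:
    """List fields whose indexing could be disabled in the next backing index."""
    # Hypothetical opt-in: templates that don't ask for tuning are left alone.
    if not template.get("auto_tune", False):
        return []

    candidates = []
    for field, stats in usage.items():
        # Hypothetical field-level override, e.g. "keep host.name indexed".
        if field in pinned:
            continue
        # Reads that went through the index structures (terms or points).
        index_reads = (stats.get("inverted_index", {}).get("terms", 0)
                       + stats.get("points", 0))
        if index_reads == 0:
            candidates.append(field)
    return sorted(candidates)


# Assumed shape of the aggregated counters, for illustration only.
usage = {
    "host.name": {"inverted_index": {"terms": 0}, "points": 0, "doc_values": 3},
    "user.id": {"inverted_index": {"terms": 0}, "points": 0, "doc_values": 12},
    "event.code": {"inverted_index": {"terms": 42}, "points": 0, "doc_values": 0},
}
print(downgrade_candidates({"auto_tune": True}, usage, pinned={"host.name"}))
```

Here `host.name` is skipped because it is pinned, and `event.code` is skipped because its inverted index was actually read.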
@jpountz jpountz added >feature discuss :Data Management/Data streams Data streams and their lifecycles labels Jun 7, 2022
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Jun 7, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@ruflin
Member

ruflin commented Jun 9, 2022

I really like the basic idea behind this: "If you use it often, we will make it faster for you". At first, I'm thinking of this as a flag that can be enabled so we can start playing with it.

We recently had a related discussion about having a switch between mapping runtime fields and index runtime fields, to go from slow queries to fast queries and the other way around. With the above, we could skip this option and only optimise the parts that have to be optimised.

This way, the next index should hopefully get mappings that better fit the sort of searches that run.

This is a topic that keeps turning in my head. Should there also be an option to make old data faster? If users now start to query the one-month-old data a lot on the keyword field foo, should there be a way to create doc values for this field on old data? Or, since we are talking about time series, do all optimisations only ever happen on new data?

@jpountz
Contributor Author

jpountz commented Jun 9, 2022

Should there also be an option to make old data faster?

It's not completely impossible, but it's hard to do without reindexing and I would suggest making it a separate discussion.


3 participants