[FEATURE] Optimize bucket level monitor querying aliases to query only those indices that can contain relevant docs #1710

Open
eirsep opened this issue Oct 21, 2024 · 4 comments
Assignees
Labels
enhancement New feature or request

Comments

@eirsep
Member

eirsep commented Oct 21, 2024

Is your feature request related to a problem?
Bucket level monitors are periodic jobs that execute an aggregation search query on a set of indices.
If an alias is configured as the data source, bucket-level monitors currently execute their aggregation queries against all indices in the alias, even those that fall outside the query's time range. This can cause significant performance degradation, especially with large numbers of indices or indices residing in colder storage tiers.
Time range of bucket level monitor
If the aggregation search query has a time range filter, it can use the search parameter period_end. The user writes it verbatim as {{period_end}}, and it is replaced with the time of monitor execution when the query runs.
The example below illustrates this.

Example search query

{
       "size": 0,
       "query": {
         "bool": {
           "filter": [{
             "range": {
               "timestamp": {
                 "from": "{{period_end}}||-1h",
                 "to": "{{period_end}}",
                 "include_lower": true,
                 "include_upper": true,
                 "format": "epoch_millis",
                 "boost": 1
               }
             }
           }],
           "adjust_pure_negative": true,
           "boost": 1
         }
       },
       "aggregations": .....
     }

Here the user queries the last hour of data every time the monitor executes: the interval starts at "from": "{{period_end}}||-1h" and ends at "to": "{{period_end}}".
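To make the interval concrete, here is a minimal sketch of how the two bounds resolve once the execution time is substituted for {{period_end}}. The helper name `resolve_range` and the sample execution time are assumptions for illustration, not part of the alerting plugin.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(moment: datetime) -> int:
    """Convert an aware datetime to epoch milliseconds using integer arithmetic."""
    return (moment - EPOCH) // timedelta(milliseconds=1)


def resolve_range(period_end: datetime, lookback: timedelta) -> dict:
    """Hypothetical helper: bounds of the range filter after substitution.

    Mirrors "from": "{{period_end}}||-1h" and "to": "{{period_end}}"
    when lookback is one hour.
    """
    return {
        "from": to_epoch_millis(period_end - lookback),
        "to": to_epoch_millis(period_end),
    }


# Assumed execution time, for illustration only.
period_end = datetime(2024, 10, 21, 17, 0, tzinfo=timezone.utc)
bounds = resolve_range(period_end, timedelta(hours=1))
```

The start of this interval (`bounds["from"]`) is the value the proposed index selection would compare against index creation times.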

This enhancement applies to aliases that roll over and ingest time series data.
This enhancement proposes optimizing monitor execution by resolving aliases to only those indices that potentially contain data within the query's time range. This optimization will be applied when the aggregation query includes a time range filter using the period_end search parameter.

By limiting the number of indices queried, we can significantly reduce query execution time and improve overall monitor performance.

Benefits:

  • Improved monitor execution speed.
  • Reduced load on the cluster, especially for cold-tier indices.
  • More efficient resource utilization.

What solution would you like?
1. Check whether the bucket-level monitor's data source is an alias.
2. If it is an alias, check whether the query specifies a time frame.
3. If a time frame interval is present, fetch only two kinds of indices from that alias:
   1. Indices created after the start of the time frame interval.
   2. The one index created chronologically just before the indices fetched in 1.

   (For example, if the time frame is 1 hour and the current time is 5 pm, the interval starts at 4 pm and ends at 5 pm. We need the indices created after 4 pm plus the one index created just before that, probably around 3:30 pm, because it will contain the 4 pm data.)

That way we filter out warm indices and any other indices that don't hold data from that interval.
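The selection rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the plugin's implementation: `IndexInfo`, `select_indices`, and the index names are hypothetical, and the creation dates are assumed to be already available from resolving the alias.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class IndexInfo:
    """Hypothetical record: an index behind the alias and its creation time."""
    name: str
    created: datetime


def select_indices(indices: list[IndexInfo], interval_start: datetime) -> list[str]:
    """Keep indices created after interval_start, plus the newest one created before it."""
    ordered = sorted(indices, key=lambda idx: idx.created)
    after = [idx for idx in ordered if idx.created > interval_start]
    before = [idx for idx in ordered if idx.created <= interval_start]
    # The newest index created at or before the interval start was the write
    # index when the interval began, so it may still hold in-range documents.
    return [idx.name for idx in before[-1:] + after]


def at(hour: int, minute: int = 0) -> datetime:
    """Build a sample UTC time on an assumed day."""
    return datetime(2024, 10, 21, hour, minute, tzinfo=timezone.utc)


# Mirrors the 4 pm to 5 pm example: one index rolled over at 3:30 pm, one at 4:20 pm.
indices = [
    IndexInfo("mylogs-000003", at(16, 20)),
    IndexInfo("mylogs-000001", at(10)),
    IndexInfo("mylogs-000002", at(15, 30)),
]
selected = select_indices(indices, interval_start=at(16))
```

With these sample values, `mylogs-000001` is skipped, while the 3:30 pm and 4:20 pm indices are kept. If no index was created after the interval start, the rule falls back to the single most recent index.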
What alternatives have you considered?
A clear and concise description of any alternative solutions or features you've considered.

@shwetathareja
Member

@eirsep Thanks for the proposal. If I understand correctly, you are looking for can_match-style behavior to skip shards that don't fall into the primary sort range.

@eirsep
Member Author

eirsep commented Oct 21, 2024

@shwetathareja that's right, but can_match is still a pre-filter phase of a search query. When querying UltraWarm indices, it would require the UltraWarm nodes to download the indices onto the cluster before executing can_match.

In this case I am simply deriving the candidate indices while resolving the alias: time series data only has monotonically increasing timestamps, so picking the indices by creation date would suffice.

@shwetathareja
Member

@eirsep - Index creation times might be misleading if the customer is running a backfill, or if ingestion was delayed by an issue on the user's client or on the OpenSearch service side. It is better to rely on the timestamps of the data actually ingested.

But that's a good point: the lowest timestamp across shards could be populated in an index property once the index is marked read-only.

@eirsep
Member Author

eirsep commented Oct 22, 2024

> @eirsep - Index creation times might be misleading if the customer is running a backfill, or if ingestion was delayed by an issue on the user's client or on the OpenSearch service side. It is better to rely on the timestamps of the data actually ingested.
>
> But that's a good point: the lowest timestamp across shards could be populated in an index property once the index is marked read-only.

@shwetathareja
Let's consider the backfill case.
Suppose the customer is ingesting data from 1 October into an index named mylogs-2024-10-22-00001 (i.e. an index created on 22 October).
The backfilled logs would carry 1 October timestamps, so the bucket-level monitor would not pick up those docs for aggregation anyway if the time frame is, say, 10 hours.

So essentially only the selection of indices from the alias is based on index creation time; the actual query still executes against the documents' timestamp field.

Does that address the scenario you pointed out?
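
To illustrate the backfill argument, here is a small sketch that mirrors the inclusive range filter from the example query. The function `in_time_frame` and all sample timestamps are assumptions for illustration; in practice the filtering is done by the search query itself.

```python
from datetime import datetime, timedelta, timezone


def in_time_frame(doc_timestamp: datetime, period_end: datetime, lookback: timedelta) -> bool:
    """Mimic the range filter: period_end - lookback <= timestamp <= period_end."""
    return period_end - lookback <= doc_timestamp <= period_end


# Assumed monitor run on 22 October with a 10 hour time frame.
period_end = datetime(2024, 10, 22, 12, 0, tzinfo=timezone.utc)
lookback = timedelta(hours=10)

# Both documents live in an index created on 22 October (hypothetical values).
backfilled_doc = datetime(2024, 10, 1, 8, 0, tzinfo=timezone.utc)
live_doc = datetime(2024, 10, 22, 9, 0, tzinfo=timezone.utc)

backfilled_matches = in_time_frame(backfilled_doc, period_end, lookback)
live_matches = in_time_frame(live_doc, period_end, lookback)
```

Even though the index is selected by its creation date, the backfilled document falls outside the range and is excluded, while the live document is included.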

@eirsep eirsep removed the untriaged label Oct 24, 2024
@eirsep eirsep self-assigned this Oct 24, 2024