Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible regression of date histogram starting from 2.12 #13171

Closed
bowenlan-amzn opened this issue Apr 11, 2024 · 1 comment · Fixed by #13317
Closed

Possible regression of date histogram starting from 2.12 #13171

bowenlan-amzn opened this issue Apr 11, 2024 · 1 comment · Fixed by #13317
Assignees
Labels
bug Something isn't working Performance This is for any performance related enhancements or bugs Search:Aggregations

Comments

@bowenlan-amzn
Copy link
Member

bowenlan-amzn commented Apr 11, 2024

GitHub issue: #13087
pmc workload: https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/pmc

Regression query
note: by default, the request cache is disabled on the target index

{
      "name": "articles_monthly_agg_uncached",
      "operation-type": "search",
      "body": {
        "size": 0,
        "aggs": {
          "articles_over_time": {
            "date_histogram": {
              "field": "timestamp",
              "calendar_interval": "month"
            }
          }
        }
      }
    }
    

Reproduce the regression

1 r6g.xlarge node, 5p0r
Using the opensearch benchmark tool

rerun 3 times, get relatively consistent results. One example result:

|                                                 Min Throughput | articles_monthly_agg_uncached |       14.77 |  ops/s |
|                                                Mean Throughput | articles_monthly_agg_uncached |       14.79 |  ops/s |
|                                              Median Throughput | articles_monthly_agg_uncached |       14.79 |  ops/s |
|                                                 Max Throughput | articles_monthly_agg_uncached |        14.8 |  ops/s |
|                                        50th percentile latency | articles_monthly_agg_uncached |     10604.5 |     ms |
|                                        90th percentile latency | articles_monthly_agg_uncached |     12003.5 |     ms |
|                                        99th percentile latency | articles_monthly_agg_uncached |       12364 |     ms |
|                                       100th percentile latency | articles_monthly_agg_uncached |     12400.4 |     ms |
|                                   50th percentile service time | articles_monthly_agg_uncached |     67.1243 |     ms |
|                                   90th percentile service time | articles_monthly_agg_uncached |     71.8665 |     ms |
|                                   99th percentile service time | articles_monthly_agg_uncached |     76.3119 |     ms |
|                                  100th percentile service time | articles_monthly_agg_uncached |     78.9056 |     ms |
|                                                     error rate | articles_monthly_agg_uncached |           0 |      % |

For 2.11, the result is

|                                                 Min Throughput | articles_monthly_agg_uncached |       20.01 |   ops/s |
|                                                Mean Throughput | articles_monthly_agg_uncached |       20.01 |   ops/s |
|                                              Median Throughput | articles_monthly_agg_uncached |       20.01 |   ops/s |
|                                                 Max Throughput | articles_monthly_agg_uncached |       20.01 |   ops/s |
|                                        50th percentile latency | articles_monthly_agg_uncached |     18.1556 |      ms |
|                                        90th percentile latency | articles_monthly_agg_uncached |     18.5772 |      ms |
|                                        99th percentile latency | articles_monthly_agg_uncached |     18.6992 |      ms |
|                                       100th percentile latency | articles_monthly_agg_uncached |     18.8191 |      ms |
|                                   50th percentile service time | articles_monthly_agg_uncached |     17.4722 |      ms |
|                                   90th percentile service time | articles_monthly_agg_uncached |     17.5867 |      ms |
|                                   99th percentile service time | articles_monthly_agg_uncached |     17.7151 |      ms |
|                                  100th percentile service time | articles_monthly_agg_uncached |     17.9092 |      ms |

Note: service time is the query took time.

Impact

The code for the optimization was released in 2.12.
The pre-conditions for this optimization include:

  • the date histogram doesn’t have parent or sub aggregation
  • the query part is in effect a match all or a range query on same time field
  • total bucket number < 1024

Profiling results

Using async profiler, 60s when sending the query in a while true loop

2.11

https://github.com/bowenlan-amzn/file-share/blob/75bc67384e4c9fe83792ed1f41b7d49e95768281/211flamegraph.html (GH issue doesn't support html upload, you may need to download and see)

2.12

https://github.com/bowenlan-amzn/file-share/blob/75bc67384e4c9fe83792ed1f41b7d49e95768281/212flamegraph.html

Qualitative Analysis

By default, aggregation iterate over the documents and poll the value of the aggregated field from DocValues and bucketize it. From the profiling result of 2.11, the time consuming parts are reading doc value, doing rounding of the value and adding the value to the results.8

In our optimization, we don’t iterate doc values. Instead we create the bucket ranges first. Then use PointRangeQuery to get the doc count for each bucket. From the profiling result of 2.12, the time consuming part is the ponit count of PointRangeQuery. And inside it, the visitDocValues method takes most of the time. This visitDocValues is to iterate the points of leaf node in BKD and this is the most heavy part when using BKD.

With respect to doing aggregation on one segment, first define some expressions for later deduction

N_R: number of ranges or buckets
N_C: number of leaf nodes that need visitDocValues.
N_Leaf: total number of leaf nodes that have matching documents
N_Doc: total number of documents that matches

The approximate time cost of default method
T_default = C_default x N_Doc
C_default is the cost of handling a doc in default method or DocValues iteration, the heavy part is related to how doc value is read from the file and get added into the result.

The approximate time cost of optimized method
T_optim = C_optim x ((N_C/N_Leaf) x N_Doc) C_optim is the cost of handling a doc in optimized method or BKD traversal, the heavy part is related to how point value is read read from leaf nodes and a good way to count them into the result.

The problem is to decide when to use the optimized method

First important element is N_C. N_C is related to N_R, generally more ranges of smaller interval lead to more leaf node visitDocValues. Considering the ranges rewritten from date histogram are connected with each other, we are probably traverse many leaf nodes twice, while ideally we should only do it once. I am looking into a way to not doing range query one by one but just one bkd traversal for all ranges. We cannot avoid traverse some leaf nodes when using the optimized method, so how to calculate the N_C/N_Leaf is the key. In 1-D scenario, assume we are doing multi-range traversal, N_C is bounded by N_R+1.

Now the question become how to estimate N_Leaf, if it's a segment level match all scenario, N_Leaf is about (segment's maxDoc/512). If the top level query has a range query, N_leaf calculation is not straightforward. There's an existing light-weighted method of PointValues, estimatePointCount, that can tell how many points falls in the range. Similarly, we can also write a custom intersector to tell how many leaf nodes fall into the range query of top level query. Since this won't involve any leaf node visitDocValues, only tree traversal, I don't see a performance issue here.

On the other hand, I distinguish C_default and C_optim because they may have noticeable difference. Currently in IndexOrDocValuesQuery, C_default = 8 x C_optim, which means using DV is considered 8 times costly than using BKD in terms of verifing the matched documents of the lead iterator in a conjunction. The number is probably different in our case since the logic here are very different. We will carry out some experiments to get their ratio.

First Experiment

The pmc dataset has 123 months (2006/2/1 ~ 2016/4/1) of data.
Based on the approximate time cost formulas, we can see for the optimized method, the monthly aggregation leads to a higher N_R. Considering the time field is 1 dimensional, an approximate of N_C is 2xN_R. So if we decrease the N_R by increasing the time interval of the agg query, we should observe the performance of optimized method becomes better.

I increase the time interval of the aggregation query from month to year and 1000d, and manually run the query multiple times against the cluster. This is the query took time in ms.

month 180d year 1000d
default 16 11 12 10
optimized 63 16 8 4

Fix Plan

  1. Short Term

Update the current threshold 1024 to 24, which is the a relatively safer number in terms of pmc workload.
Add a cluster setting that can be used to easily turn off the optimization. #13179

  1. Long Term

Develop a clever BKD traversal that only traverse once for any number of bucket ranges and able to count the points in a light weight. #13317

@bowenlan-amzn bowenlan-amzn self-assigned this Apr 11, 2024
@bowenlan-amzn bowenlan-amzn added bug Something isn't working Search:Aggregations Performance This is for any performance related enhancements or bugs and removed untriaged labels Apr 11, 2024
bowenlan-amzn added a commit to bowenlan-amzn/OpenSearch that referenced this issue Apr 13, 2024
Short term solution for pmc workload regression opensearch-project#13171

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>
@rishabh6788
Copy link
Contributor

As mentioned we are seeing huge spike in date_histogram_hourly_agg query in big5 workload.
Screenshot 2024-04-18 at 6 42 49 PM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Performance This is for any performance related enhancements or bugs Search:Aggregations
Projects
Status: Done
Status: Done
2 participants