Sort leaves on search according to the primary numeric sort field #44021
Conversation
This change pre-sorts the index reader leaves (segments) prior to search when the primary sort is a numeric field eligible for the distance feature optimization. It also adds a tie-breaker on `_doc` to the rewritten sort in order to compensate for the fact that leaves will be collected in a random order. I ran this patch on the http_logs benchmark and the results are very promising:

```
| 50th percentile latency       | desc_sort_timestamp | 220.706  | 136544   | 136324   | ms    |
| 90th percentile latency       | desc_sort_timestamp | 244.847  | 162084   | 161839   | ms    |
| 99th percentile latency       | desc_sort_timestamp | 316.627  | 172005   | 171688   | ms    |
| 100th percentile latency      | desc_sort_timestamp | 335.306  | 173325   | 172989   | ms    |
| 50th percentile service time  | desc_sort_timestamp | 218.369  | 1968.11  | 1749.74  | ms    |
| 90th percentile service time  | desc_sort_timestamp | 244.182  | 2447.2   | 2203.02  | ms    |
| 99th percentile service time  | desc_sort_timestamp | 313.176  | 2950.85  | 2637.67  | ms    |
| 100th percentile service time | desc_sort_timestamp | 332.924  | 2959.38  | 2626.45  | ms    |
| error rate                    | desc_sort_timestamp | 0        | 0        | 0        | %     |
| Min Throughput                | asc_sort_timestamp  | 0.801824 | 0.800855 | -0.00097 | ops/s |
| Median Throughput             | asc_sort_timestamp  | 0.802595 | 0.801104 | -0.00149 | ops/s |
| Max Throughput                | asc_sort_timestamp  | 0.803282 | 0.801351 | -0.00193 | ops/s |
| 50th percentile latency       | asc_sort_timestamp  | 220.761  | 824.098  | 603.336  | ms    |
| 90th percentile latency       | asc_sort_timestamp  | 251.741  | 853.984  | 602.243  | ms    |
| 99th percentile latency       | asc_sort_timestamp  | 368.761  | 893.943  | 525.182  | ms    |
| 100th percentile latency      | asc_sort_timestamp  | 431.042  | 908.85   | 477.808  | ms    |
| 50th percentile service time  | asc_sort_timestamp  | 218.547  | 820.757  | 602.211  | ms    |
| 90th percentile service time  | asc_sort_timestamp  | 249.578  | 849.886  | 600.308  | ms    |
| 99th percentile service time  | asc_sort_timestamp  | 366.317  | 888.894  | 522.577  | ms    |
| 100th percentile service time | asc_sort_timestamp  | 430.952  | 908.401  | 477.45   | ms    |
| error rate                    | asc_sort_timestamp  | 0        | 0        | 0        | %     |
```

So roughly 10x faster for the descending sort and 2-3x faster in the ascending case. Note that I indexed http_logs with a single client in order to simulate real time-based indices where documents are indexed in timestamp order. Relates elastic#37043
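The leaf pre-sorting idea can be sketched with plain data. This is a hypothetical model, not the PR's actual code: the names `Segment` and `presort` are invented for illustration, and a real Lucene segment is stood in for by its ordinal and the minimum value of the primary numeric sort field.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: each "segment" exposes the minimum value of the
// primary numeric sort field, and segments are visited in the order most
// likely to surface competitive hits first.
public class LeafPresortSketch {

    // Stand-in for a Lucene segment: its ordinal and the minimum value
    // of the indexed timestamp field inside it.
    static final class Segment {
        final int ord;
        final long minTimestamp;
        Segment(int ord, long minTimestamp) {
            this.ord = ord;
            this.minTimestamp = minTimestamp;
        }
    }

    // Sort segments by their per-segment minimum; reversed for a descending
    // sort so the segment holding the largest values is searched first.
    static List<Segment> presort(List<Segment> leaves, boolean reverse) {
        Comparator<Segment> cmp = Comparator.comparingLong(s -> s.minTimestamp);
        if (reverse) {
            cmp = cmp.reversed();
        }
        List<Segment> sorted = new ArrayList<>(leaves);
        sorted.sort(cmp);
        return sorted;
    }

    public static void main(String[] args) {
        List<Segment> leaves = new ArrayList<>();
        leaves.add(new Segment(0, 100));
        leaves.add(new Segment(1, 300));
        leaves.add(new Segment(2, 200));
        // Descending timestamp sort: visit the segment with min=300 first.
        System.out.println(presort(leaves, true).get(0).ord);
    }
}
```

For append-only time-based indices, segments are naturally clustered by timestamp, which is why this ordering lets whole segments be skipped early.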
Pinging @elastic/es-search
```java
// Add a tiebreak on _doc in order to be able to search
// the leaves in any order. This is needed since we reorder
// the leaves based on the minimum value in each segment.
newSortFields[newSortFields.length - 1] = SortField.FIELD_SCORE;
```
Should we use `FIELD_DOC` here?
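The effect of the tie-break the reviewer is asking about can be illustrated with plain data (the names here are hypothetical, not the PR's code): without a secondary key on the doc id, documents with equal primary sort values come back in whatever order the leaves happened to be visited; adding the doc id as a tie-breaker makes the result order deterministic.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of why a tie-break on the doc id matters when
// leaves are searched out of order: ties on the primary sort value are
// broken deterministically by doc id instead of by segment visit order.
public class DocTiebreakSketch {

    static final class Hit {
        final long value; // primary numeric sort value
        final int docId;  // global doc id, a stand-in for Lucene's _doc
        Hit(long value, int docId) {
            this.value = value;
            this.docId = docId;
        }
    }

    // Primary key: the numeric sort value; secondary key: the doc id.
    static List<Hit> sortHits(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.<Hit>comparingLong(h -> h.value)
                .thenComparingInt(h -> h.docId));
        return sorted;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit(5, 12)); // collected from a later leaf first...
        hits.add(new Hit(5, 3));  // ...but doc 3 must still sort before doc 12
        System.out.println(sortHits(hits).get(0).docId);
    }
}
```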
```java
for (LeafReaderContext ctx : leaves) {
    PointValues values = ctx.reader().getPointValues(sortField.getField());
    if (values == null) {
        minValues[ctx.ord] = (long) sortField.getMissingValue();
```
Can we also use the `missingValue` defined on line 448?
```java
    }
}
Comparator<LeafReaderContext> comparator = Comparator.comparingLong(l -> minValues[l.ord]);
if (sortField.getReverse()) {
```
I would think that in the case of the reverse sort we should look at the `maxValues`, not the `minValues`?
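The reviewer's suggestion can be sketched with plain data (hypothetical names, not the PR's code): an ascending sort would order leaves by their per-segment minimum, while a descending sort would order them by their per-segment maximum, largest first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of min-vs-max leaf ordering: for an ascending sort,
// visit leaves by their per-segment minimum; for a descending sort, by
// their per-segment maximum, largest first.
public class MinMaxLeafOrder {

    static final class Leaf {
        final int ord;
        final long min;
        final long max;
        Leaf(int ord, long min, long max) {
            this.ord = ord;
            this.min = min;
            this.max = max;
        }
    }

    static List<Leaf> order(List<Leaf> leaves, boolean reverse) {
        Comparator<Leaf> cmp = reverse
                // Descending sort: the leaf whose max is largest can contain
                // the most competitive (largest) values, so visit it first.
                ? Comparator.<Leaf>comparingLong(l -> l.max).reversed()
                // Ascending sort: the leaf with the smallest min goes first.
                : Comparator.<Leaf>comparingLong(l -> l.min);
        List<Leaf> sorted = new ArrayList<>(leaves);
        sorted.sort(cmp);
        return sorted;
    }
}
```

Note the two keys can disagree: a leaf with a small minimum may still hold the largest maximum, which is exactly the case the reviewer is raising.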
@jimczi Thanks for the PR. The speedups indeed look very promising.
I have two main questions:

- Do we have to sort leaves every time during optimization? I assume that for data such as logs it is indeed very reasonable to do. But what about data such as `population` of `geonames`, where a segment may contain random data from anywhere in the index?
- In `QueryPhase::createLeafSorter`, should we always look only at the min value of a segment? Does it make sense to look at the max values per segment when the sort is descending?
@jimczi Jim, I wanted to confirm with you that you are ok with my modifications of your PR in the last commit.
* Optimize sort on numeric long and date fields (#39770). Optimize sort on numeric long and date fields when the system property `es.search.long_sort_optimized` is true.
* Skip optimization if the index has duplicate data (#43121). Skip sort optimization if the index has 50% or more data with the same value. When an index has a lot of docs with the same value, sort optimization doesn't make sense, as `DistanceFeatureQuery` will produce the same scores for these docs and Lucene will use the second sort to tie-break. This could be slower than the usual sorting.
* Sort leaves on search according to the primary numeric sort field (#44021). This change pre-sorts the index reader leaves (segments) prior to search when the primary sort is a numeric field eligible for the distance feature optimization. It also adds a tie-breaker on `_doc` to the rewritten sort in order to compensate for the fact that leaves will be collected in a random order. On the http_logs benchmark this is roughly 10x faster for the descending sort and 2-3x faster in the ascending case (full results in the PR description above). Note that I indexed http_logs with a single client in order to simulate real time-based indices where documents are indexed in timestamp order. Relates #37043
* Remove nested collector in docs response. As we don't use `cancellableCollector` anymore, it should be removed from the expected docs response.
* Use collector manager for search when necessary (#45829). When we optimize sort, we sort segments by their min/max value. As a collector expects to receive segments in order, we can not use a single collector for sorted segments. Thus for such a case we use a collector manager, where a dedicated collector is created for every segment.
* Use shared TopFieldCollector manager. Use a shared `TopFieldCollector` manager for sort optimization. This collector manager is able to exchange the minimum competitive score between collectors.
* Correct calculation of avg value to avoid overflow
* Optimize calculating if index has duplicate data
This rewrites a long sort as a `DistanceFeatureQuery`, which can efficiently skip non-competitive blocks and segments of documents. Depending on the dataset, the speedups can be 2-10x. The optimization can be disabled by setting the system property `es.search.rewrite_sort` to `false`. The optimization is skipped when an index has 50% or more data with the same value.

The optimization is done through:

1. Rewriting the sort as a `DistanceFeatureQuery`, which can efficiently skip non-competitive blocks and segments of documents.
2. Sorting segments according to the primary numeric sort field (#44021). This allows us to skip non-competitive segments.
3. Using a collector manager. When we optimize sort, we sort segments by their min/max value. As a collector expects to receive segments in order, we can not use a single collector for sorted segments. We use a collector manager, where a dedicated collector is created for every segment.
4. Using Lucene's shared `TopFieldCollector` manager. This collector manager is able to exchange the minimum competitive score between collectors, which allows us to efficiently skip whole segments that don't contain competitive scores.
5. When an index is force-merged to a single segment, interleaving old and new segments (#48533) allows this optimization as well, as blocks with non-competitive docs can be skipped.

Closes #37043
Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
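The collector-manager step (a dedicated collector per segment, with results merged afterwards) can be modelled with plain collections. The names below are hypothetical and this is not Lucene's actual `CollectorManager` API, just the shape of the idea: each segment contributes its own top-N candidates, and a reduce step merges them into one global top-N, so segments may be visited in any order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical model of "one collector per segment, then merge": per-segment
// top-N collection followed by a global reduce, independent of visit order.
public class PerSegmentTopN {

    // Collect the n smallest values of one segment (ascending sort).
    static List<Long> collectSegment(long[] segmentValues, int n) {
        // Max-heap keeps the current worst candidate on top for eviction.
        PriorityQueue<Long> worstFirst = new PriorityQueue<>(Comparator.reverseOrder());
        for (long v : segmentValues) {
            worstFirst.offer(v);
            if (worstFirst.size() > n) {
                worstFirst.poll(); // drop the current worst (largest) value
            }
        }
        List<Long> top = new ArrayList<>(worstFirst);
        top.sort(Comparator.naturalOrder());
        return top;
    }

    // Merge the per-segment results into a global top-n.
    static List<Long> reduce(List<List<Long>> perSegment, int n) {
        List<Long> all = new ArrayList<>();
        for (List<Long> part : perSegment) {
            all.addAll(part);
        }
        all.sort(Comparator.naturalOrder());
        return all.subList(0, Math.min(n, all.size()));
    }
}
```

In the real implementation the shared manager additionally propagates the minimum competitive score between the per-segment collectors, so later segments can be skipped entirely rather than merely merged last.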
Backport for #48804
Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
This change pre-sorts the index reader leaves (segments) prior to search when the primary sort is a numeric field eligible for the distance feature optimization. It also adds a tie-breaker on `_doc` to the rewritten sort in order to compensate for the fact that leaves will be collected in a random order.
I ran this patch on the http_logs benchmark and the results are very promising:
So roughly 10x faster for the descending sort and 2-3x faster in the ascending case. Note that I indexed http_logs with a single client in order to simulate real time-based indices where documents are indexed in timestamp order.
Relates #37043