Optimize sort #6766
Hi @skadilover, welcome to the community. Would you introduce yourself? Where are you from? The optimizations you described above sound interesting. Are you interested in trying them out, iterating and eventually incorporating them into Velox? |
@mbasmanova
Finally, I currently need some help and guidance on how to benchmark the sort operator. I have not found any examples of benchmarking an operator; can anyone give me some advice? |
Welcome. What's your name? I only see the GitHub handle skadilover. I'm curious whether the optimizations to Presto you mentioned earlier are available in the prestodb repo or if these are internal enhancements. |
Fantastic. Looking forward to the PR. |
Check out velox/functions/prestosql/aggregates/benchmarks/ReduceAgg.cpp This is a benchmark for reduce_agg aggregate function which benchmarks a mini query (table scan + aggregation). I suggest you write similar one to benchmark table scan + order-by. One may comment that this doesn't benchmark order by in isolation, but that's Ok. After all there is no point to speed a portion of the query that's not using significant amount of CPU time to begin with. E2e query speed up is what we are after. Hope this helps. |
This makes a lot of sense. Excited to see this optimization coming to Velox. |
My name is Heng Jiang.
They are not available in the prestodb repo; these are internal enhancements. |
Next week is our National Day holiday (10 days). After the holiday, I will continue to organize the code and submit the PR as soon as possible.
@skadilover Nice to meet you, Heng. Enjoy the holiday. Looking forward to hearing again from you in a couple of weeks. |
@mbasmanova @skadilover thanks, -yuan |
Hi guys, sorry about the delay to follow up on this thread. I've recently helped review this great paper about sorting optimization (which eventually got published at TODS); linking it here for folks who are interested in the subject: https://arxiv.org/pdf/2209.08420.pdf This sounds all very interesting. It would be super helpful if as a first step we established a microbenchmark to be able to tweak and reproduce these optimizations. A few specific questions:
|
@pedroerp Glad to hear your response to this issue! The paper mentioned above is very useful. PrefixSort is not just a normalized-key technique: it is different from velox's existing storage structure (RowContainer). To facilitate communication, I submitted PR #7230 (it is not 100% ready yet; some code will be adjusted based on my E2E test results), and the design document is also being prepared (it will take a few days). If you have time, could you help me take a look at this PR first? @mbasmanova @pedroerp |
@pedroerp Yes, for some specific columns we may refer to the offset-value encoding format, but this work may be done later. I hope to optimize the single-column bigint/varchar scenario first, especially in the bigint scenario. radix-sort
Directly store min/max instead of T() (what velox does now)
Like question 1, I think this is an encoding issue. I will discuss it case by case in the subsequent PRs. |
@pedroerp |
This is a good example. Maybe come up with something similar to this that generates and sorts a dataset, then we could experiment with your optimization and see how much they improve performance? https://github.com/facebookincubator/velox/blob/main/velox/exec/benchmarks/MergeBenchmark.cpp |
Thanks a lot. This example looks like it could work; I will try it soon.
@skadilover Wondering if you had a chance to review Arrow Row format: https://docs.rs/arrow-row/latest/arrow_row/index.html "Rows are normalized for sorting, and can therefore be very efficiently compared, using memcmp under the hood, or used in non-comparison sorts such as radix sort. This makes the row format ideal for implementing efficient multi-column sorting, grouping, aggregation, windowing and more, as described in more detail in this blog post." |
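The normalized-key idea quoted above can be sketched in a few lines. This is an illustrative example, not Arrow's or Velox's actual code: for a signed 64-bit integer, flipping the sign bit and storing the bytes big-endian yields an encoding whose memcmp order matches the numeric order.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Encode an int64 so that byte-wise unsigned comparison (memcmp) of the
// encodings matches numeric order. Two steps: flip the sign bit so that
// negatives sort below positives, then store big-endian so the most
// significant byte is compared first.
std::array<uint8_t, 8> encodeInt64(int64_t value) {
  uint64_t bits = static_cast<uint64_t>(value) ^ (1ULL << 63);  // flip sign bit
  std::array<uint8_t, 8> out;
  for (int i = 0; i < 8; ++i) {
    out[i] = static_cast<uint8_t>(bits >> (56 - 8 * i));  // big-endian
  }
  return out;
}
```

Encodings of multiple keys can then be concatenated, which is what makes multi-column comparison a single memcmp.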
Some additional references: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1080&rep=rep1&type=pdf |
@mbasmanova Thanks for the references above. The encoding techniques (including "normalized key", "memory compare", etc.) can be very efficient, and we implemented them in our internal presto before. In the PrefixSort optimization, I hope to implement them step by step. This week I plan to prepare a design document (including some research E2E tests and benchmarks) to describe my plan in detail. |
@skadilover Looking forward to hearing more from you on this topic. Thanks. |
Overview

PrefixSort's optimization plan includes two optimization points: normalized-key optimization and CPU-cache optimization.

Normalized Key For Multi Sort Keys

The core idea of the normalized key is: assuming the sort keys can be encoded into a binary string while the sorting semantics remain unchanged, then all sort keys can be compared as a whole (see [1]), and memcmp can additionally be sped up with SIMD. In the implementation submitted in PR #7230, in the scenario of two BIGINT columns, we get a performance improvement of 200%-300%. PrefixSort currently supports BIGINT; our current research suggests the other fixed-length types can be implemented as well. The VARCHAR scenario is a little more complicated. We also studied the implementation of duckDB, which uses a somewhat special method: range sort is used to process varchar. Further research and testing of the encoding format will be conducted to improve this part of the design.

Memory Data Structure

In order to implement the above-mentioned normalized-key technique in velox, we consider redesigning a data structure to store the normalized key, for the following two reasons:
First, we review the current OrderBy operator implementation of velox, as shown in the figure below: The general process of implementation is:
Going back to the implementation of the OrderBy operator, we can see that the compare method in the sort loop needs to access two memory blocks (the address vector and the row container) at the same time. As the memory grows, memory-access costs significantly affect the efficiency of sort execution. The solution to this problem is to store the sort key and row number separately from the rest of the row. Take duckDB as an example:
In the E2E test of single-column BIGINT, we get a 30%-40% gain. In fact, it would be faster to use reinterpret_cast to get the value, but for a unified implementation of the normalized key, memcmp is used here for comparison; for long keys, memcmp is slower than reinterpret_cast. In the implementation, the template memcmp from duckDB is used to replace std::memcmp. There is some improvement, but it is still slower than the reinterpret_cast implementation.

Conclusion and plan

By optimizing the memory data structure and using the normalized-key technique, PrefixSort shows performance improvements over the original implementation in both single-column and multi-column cases. The work is planned in 4 stages to reach complete functionality.
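The layout described above can be sketched as follows. The names are illustrative, not Velox's actual types: each sort entry keeps the normalized key inline next to a row index (a stand-in for the row pointer), so the comparator reads one contiguous array instead of dereferencing into the row container on every comparison.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

// One contiguous record per row: normalized key bytes plus a row index.
struct PrefixEntry {
  uint8_t prefix[8];  // memcmp-comparable normalized key
  uint32_t rowIndex;  // stand-in for the row pointer into the container
};

// Build an entry from an unsigned key, stored big-endian so memcmp works.
PrefixEntry makeEntry(uint64_t key, uint32_t rowIndex) {
  PrefixEntry e;
  for (int i = 0; i < 8; ++i) {
    e.prefix[i] = static_cast<uint8_t>(key >> (56 - 8 * i));
  }
  e.rowIndex = rowIndex;
  return e;
}

// Sort entries by prefix; return the row indices in sorted order.
std::vector<uint32_t> sortByPrefix(std::vector<PrefixEntry> entries) {
  std::sort(entries.begin(), entries.end(),
            [](const PrefixEntry& a, const PrefixEntry& b) {
              return std::memcmp(a.prefix, b.prefix, sizeof(a.prefix)) < 0;
            });
  std::vector<uint32_t> order;
  order.reserve(entries.size());
  for (const auto& e : entries) {
    order.push_back(e.rowIndex);
  }
  return order;
}
```

Because key and row index move together, the sort keeps its working set in cache; only the final gather of payload columns touches the row container.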
How To Reproduce the Optimization

Just run PrefixSortBenchmark.cpp and OrderByTest.cpp.

Reference:
[1] https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1080&rep=rep1&type=pdf |
@mbasmanova Updates on the topic above. |
@skadilover Thank you for detailed description of the proposed design. I have some questions to make sure my understanding is correct. Does this design involve making a copy of sorting keys and row pointers for all rows using continuous allocation? I'm thinking about the table with 2 columns (normalized key and row ptr) in the diagram. |
Yes, there is a copy.
@skadilover Got it. What are your thoughts on the limit of the normalized key size? There can be lots of columns and some columns can be variable-width, hence, I assume we need to put a hard limit on the "prefix" of the normalized key. Also, can you extend the design to support complex types as well? |
Therefore, we'd better provide a config option for users to set the limit of the prefix size; in our practice there is no limit by default.
If necessary, we can store only some of the potential sort keys in the prefix and use the row ptr to compare the rest (simply calling RowContainer's compare method). For types like varchar, if a complex type cannot obtain benefits through memcmp, I feel that using only pointers for comparison would be better. |
@mbasmanova The PR closed above is not used; we still use #7230 to merge the code. |
@mbasmanova Are there more questions about the topic? Could you help me review the code? |
@skadilover Thank you for working on this. Took a quick look and left some initial comments on the PR. |
I think we need to have a sensible limit by default. I'm also interested in figuring out support for varchar and row() keys as these are common. |
I will check it later.
Our previous method of dealing with varchar was to: 1. analyze the maximum length of the data before sorting and then process every value according to that fixed size; 2. only retain the prefix for values that exceed the length limit. I hope to find a better way; currently I am still doing relevant research work. |
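The two-step varchar scheme described above can be sketched with a hypothetical helper (not the actual implementation): encode each string into a fixed width chosen from the data and capped by a limit, zero-padding shorter strings so they sort first and truncating longer ones to a prefix. Note that two equal truncated encodings are only "maybe equal" and would need a tie-break comparison on the full stored value.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Encode a string into a fixed-width, memcmp-comparable buffer.
// 0x00 padding sorts before any real byte, so shorter strings come first;
// bytes beyond `width` are dropped (prefix-only, needs a tie-break).
std::vector<uint8_t> encodeVarcharPrefix(const std::string& s, size_t width) {
  std::vector<uint8_t> out(width, 0);
  std::memcpy(out.data(), s.data(), std::min(s.size(), width));
  return out;
}
```

This sketch ignores edge cases such as embedded 0x00 bytes in the data, which a real encoding would have to handle.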
@mbasmanova |
…lized keys (#7624) Summary: Part of #6766 1. Add PrefixSortAlgorithm and unit tests. The reason for introducing a custom quick-sort implementation instead of using std::sort directly is that the sort keys that can be normalized in 'PrefixSort' are placed in a continuous memory block; during the run of the sorting algorithm the data itself is moved (not only the data addresses, to avoid losing data locality), and std::sort does not support custom swap methods. 2. Use the quick-sort algorithm described in the “Engineering a Sort Function” paper, see [1]. May explore radix sort as a follow-up. [1] https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.8162&rep=rep1&type=pdf Pull Request resolved: #7624 Reviewed By: xiaoxmeng Differential Revision: D52155526 Pulled By: mbasmanova fbshipit-source-id: fa3f59c33f14f2fcf3dc80e1589426d142aa98d7
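The "custom swap" motivation above can be illustrated with a minimal sketch (not the PR's actual code): records of a size known only at runtime live back to back in one buffer, so the sort must swap whole records with memcpy, which std::sort's iterator model does not allow. A simple Lomuto-partition quicksort with a record-sized swap:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Quicksort over fixed-size records packed contiguously in a byte buffer.
// Comparison is memcmp over the whole record; swap moves the record bytes.
class ContiguousQuickSort {
 public:
  ContiguousQuickSort(uint8_t* data, size_t recordSize, size_t numRecords)
      : data_(data), size_(recordSize), n_(numRecords), tmp_(recordSize) {}

  void sort() { quickSort(0, static_cast<long>(n_) - 1); }

 private:
  uint8_t* at(size_t i) { return data_ + i * size_; }

  // The custom swap std::sort cannot express: copy whole records.
  void swapRecords(size_t i, size_t j) {
    std::memcpy(tmp_.data(), at(i), size_);
    std::memcpy(at(i), at(j), size_);
    std::memcpy(at(j), tmp_.data(), size_);
  }

  void quickSort(long lo, long hi) {
    if (lo >= hi) return;
    uint8_t* pivot = at(hi);  // Lomuto partition, pivot is the last record
    long i = lo - 1;
    for (long j = lo; j < hi; ++j) {
      if (std::memcmp(at(j), pivot, size_) < 0) {
        swapRecords(++i, j);
      }
    }
    swapRecords(i + 1, hi);  // place pivot
    quickSort(lo, i);
    quickSort(i + 2, hi);
  }

  uint8_t* data_;
  size_t size_;
  size_t n_;
  std::vector<uint8_t> tmp_;  // scratch record for swaps
};
```

A production version would add the usual refinements (median-of-three pivots, insertion sort for small ranges) as in the "Engineering a Sort Function" paper cited above.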
Summary: When processing normalization encoding in PrefixSort, we cannot simply equate null with the max or min value, because then we could not distinguish null from the max/min value in a nulls-first scenario; we have to add a null byte to the normalized encoding. The changes: 1. Add CompareFlags (ascending, nullsFirst, isNull) to the encode() method. 2. Add a new method encodeNoNulls to handle cases where there are no nulls; supports uint64_t/int64_t/uint32_t/int32_t/float/double/Timestamp. Design doc: https://docs.google.com/document/d/1wa1lbbR-bhf0eg1mSaH7JUzeG7vhwz94a6ElUTK0J8k/edit?usp=sharing Part of #6766 Pull Request resolved: #8350 Reviewed By: pedroerp Differential Revision: D53348718 Pulled By: mbasmanova fbshipit-source-id: 6f7887fb9d09b17af6a786de82fc00e116156a62
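The null-byte scheme in that summary can be sketched as follows. This is an illustrative helper (the name and layout are assumptions, not the PR's API), covering ascending order only: one leading byte places nulls before or after all non-null values, so memcmp can still order whole keys without confusing null with min/max.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Encode a nullable int64 into 9 memcmp-comparable bytes.
// Null byte: with nulls-first, null (0) sorts below every non-null (1);
// with nulls-last, null (2) sorts above every non-null (1).
std::array<uint8_t, 9> encodeNullable(bool isNull, bool nullsFirst, int64_t value) {
  std::array<uint8_t, 9> out{};
  out[0] = isNull ? (nullsFirst ? 0 : 2) : 1;
  if (!isNull) {
    uint64_t bits = static_cast<uint64_t>(value) ^ (1ULL << 63);  // sign flip
    for (int i = 0; i < 8; ++i) {
      out[1 + i] = static_cast<uint8_t>(bits >> (56 - 8 * i));  // big-endian
    }
  }
  return out;
}
```

Without the extra byte, a nulls-first encoding of null as INT64_MIN would be indistinguishable from a real INT64_MIN value.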
…cubator#8146) Summary: Add PrefixSort with unit tests & benchmarks: 1. PrefixSort is used to improve in-memory sort performance, using memcmp to compare binary strings (normalized encoded sort keys, called the 'prefix') when sorting. This PR adds 'extract rows to prefix' and 'sort' for fixed-width types (int32, int64, float, double, timestamp, etc.) as a basic PR; more types will be added as a follow-up. 2. As basic work, this PR adds unit tests & benchmarks for scalar types, including single/multi keys, fuzz cases, etc. 3. Benchmarks, std-sort vs. prefix-sort:
```
============================================================================
[...]ec/benchmarks/PrefixSortBenchmark.cpp    relative  time/iter   iters/s
============================================================================
StdSort_Bigint100k                                       34.30ms      29.16
PrefixSort_Bigint100k                                    27.16ms      36.81
StdSort_Bigint1M                                        465.00ms       2.15
PrefixSort_Bigint1M                                     330.97ms       3.02
StdSort_Bigint10M                                          7.96s    125.68m
PrefixSort_Bigint10M                                       3.87s    258.28m
StdSort_Varchar100k                                      62.02ms      16.12
PrefixSort_Varchar100k                                   61.64ms      16.22
StdSort_Varchar1M                                          1.31s    763.74m
PrefixSort_Varchar1M                                       1.27s    785.51m
StdSort_Varchar10M                                        23.06s     43.36m
PrefixSort_Varchar10M                                     22.67s     44.11m
StdSort_BigintBigint100k                                 48.61ms      20.57
PrefixSort_BigintBigint100k                              30.70ms      32.58
StdSort_BigintBigint1M                                  631.11ms       1.58
PrefixSort_BigintBigint1M                               379.40ms       2.64
StdSort_BigintBigint10M                                   11.38s     87.86m
PrefixSort_BigintBigint10M                                 4.42s    226.42m
StdSort_BigintVarchar100k                                65.36ms      15.30
PrefixSort_BigintVarchar100k                             56.52ms      17.69
StdSort_BigintVarchar1M                                    1.29s    777.33m
PrefixSort_BigintVarchar1M                                 1.11s    901.87m
StdSort_BigintVarchar10M                                  22.03s     45.39m
PrefixSort_BigintVarchar10M                               18.00s     55.56m
```
For cases where all keys can be normalized (Bigint, BigintBigint), once the row count exceeds 1M you can see obvious optimization effects, about 1x-2x. For cases that cannot be normalized (Varchar), since the binary-string optimization logic is skipped, the performance stays the same.
One more thing: we use std::memcmp in this PR. If the cost is worth it, we can align the prefix buffer to 4 or 8 bytes so that int or long comparisons can be used; the performance gain would then be much more obvious. Part of facebookincubator#6766 Pull Request resolved: facebookincubator#8146 Differential Revision: D54247347 Pulled By: mbasmanova
Summary: PrefixSort is used to improve in-memory sort performance, using memcmp to compare binary string (normalized encoded sort keys called 'prefix') when sorting. This PR adds 'extract rows to prefix' and 'sort' for fixed width types (int32, int64, float, double, timestamp etc.). More types will be added in a follow-up. Add benchmark to compare std::sort vs. prefix sort using 1, 2, 3, and 4 bigint sorting keys. The performance of sorting up to 1000 rows is the same. When sorting more than 1K rows prefix-sort is faster, the gains increase as the number of rows sorted and the number of sorting keys increases. The presence of the payload columns doesn't affect the performance. ``` ============================================================================ [...]ec/benchmarks/PrefixSortBenchmark.cpp relative time/iter iters/s ============================================================================ StdSort_no-payload_1_bigint_0.01k 49.40ns 20.24M PrefixSort 100.00% 49.40ns 20.24M StdSort_no-payload_2_bigint_0.01k 87.85ns 11.38M PrefixSort 99.904% 87.94ns 11.37M StdSort_no-payload_3_bigint_0.01k 65.04ns 15.37M PrefixSort 99.350% 65.47ns 15.27M StdSort_no-payload_4_bigint_0.01k 69.19ns 14.45M PrefixSort 99.566% 69.49ns 14.39M StdSort_no-payload_1_bigint_0.015k 47.93ns 20.86M PrefixSort 100.02% 47.92ns 20.87M StdSort_no-payload_2_bigint_0.015k 54.22ns 18.44M PrefixSort 99.921% 54.26ns 18.43M StdSort_no-payload_3_bigint_0.015k 61.00ns 16.39M PrefixSort 99.958% 61.02ns 16.39M StdSort_no-payload_4_bigint_0.015k 57.38ns 17.43M PrefixSort 99.870% 57.46ns 17.40M StdSort_no-payload_1_bigint_0.02k 47.82ns 20.91M PrefixSort 99.914% 47.86ns 20.90M StdSort_no-payload_2_bigint_0.02k 83.94ns 11.91M PrefixSort 100.00% 83.93ns 11.91M StdSort_no-payload_3_bigint_0.02k 128.43ns 7.79M PrefixSort 100.17% 128.21ns 7.80M StdSort_no-payload_4_bigint_0.02k 165.29ns 6.05M PrefixSort 99.997% 165.30ns 6.05M StdSort_no-payload_1_bigint_0.05k 77.34ns 12.93M PrefixSort 99.425% 77.79ns 12.86M 
StdSort_no-payload_2_bigint_0.05k 113.79ns 8.79M PrefixSort 99.786% 114.03ns 8.77M StdSort_no-payload_3_bigint_0.05k 152.81ns 6.54M PrefixSort 99.696% 153.27ns 6.52M StdSort_no-payload_4_bigint_0.05k 185.93ns 5.38M PrefixSort 100.02% 185.90ns 5.38M StdSort_no-payload_1_bigint_0.1k 87.39ns 11.44M PrefixSort 99.499% 87.83ns 11.39M StdSort_no-payload_2_bigint_0.1k 139.93ns 7.15M PrefixSort 162.24% 86.25ns 11.59M StdSort_no-payload_3_bigint_0.1k 186.27ns 5.37M PrefixSort 186.72% 99.76ns 10.02M StdSort_no-payload_4_bigint_0.1k 234.01ns 4.27M PrefixSort 187.97% 124.49ns 8.03M StdSort_no-payloads_1_bigint_1k 173.31ns 5.77M PrefixSort 136.72% 126.76ns 7.89M StdSort_no-payloads_2_bigint_1k 249.77ns 4.00M PrefixSort 199.49% 125.20ns 7.99M StdSort_no-payloads_3_bigint_1k 314.18ns 3.18M PrefixSort 219.49% 143.14ns 6.99M StdSort_no-payloads_4_bigint_1k 348.38ns 2.87M PrefixSort 203.28% 171.38ns 5.84M StdSort_no-payloads_1_bigint_10k 251.90ns 3.97M PrefixSort 165.99% 151.76ns 6.59M StdSort_no-payloads_2_bigint_10k 363.09ns 2.75M PrefixSort 253.07% 143.47ns 6.97M StdSort_no-payloads_3_bigint_10k 483.58ns 2.07M PrefixSort 293.67% 164.67ns 6.07M StdSort_no-payloads_4_bigint_10k 593.29ns 1.69M PrefixSort 312.83% 189.65ns 5.27M StdSort_no-payloads_1_bigint_100k 330.44ns 3.03M PrefixSort 192.57% 171.59ns 5.83M StdSort_no-payloads_2_bigint_100k 470.79ns 2.12M PrefixSort 293.67% 160.31ns 6.24M StdSort_no-payloads_3_bigint_100k 607.15ns 1.65M PrefixSort 303.88% 199.80ns 5.01M StdSort_no-payloads_4_bigint_100k 706.03ns 1.42M PrefixSort 315.15% 224.03ns 4.46M StdSort_no-payloads_1_bigint_1000k 452.05ns 2.21M PrefixSort 204.92% 220.60ns 4.53M StdSort_no-payloads_2_bigint_1000k 645.35ns 1.55M PrefixSort 306.42% 210.61ns 4.75M StdSort_no-payloads_3_bigint_1000k 818.78ns 1.22M PrefixSort 328.01% 249.62ns 4.01M StdSort_no-payloads_4_bigint_1000k 981.65ns 1.02M PrefixSort 343.79% 285.54ns 3.50M StdSort_2payloads_1_bigint_1k 177.21ns 5.64M PrefixSort 139.94% 126.63ns 7.90M 
StdSort_2payloads_2_bigint_1k 248.46ns 4.02M PrefixSort 199.08% 124.80ns 8.01M StdSort_2payloads_3_bigint_1k 313.66ns 3.19M PrefixSort 218.48% 143.56ns 6.97M StdSort_2payloads_4_bigint_1k 359.17ns 2.78M PrefixSort 208.57% 172.21ns 5.81M StdSort_2payloads_1_bigint_10k 254.83ns 3.92M PrefixSort 168.28% 151.43ns 6.60M StdSort_2payloads_2_bigint_10k 363.92ns 2.75M PrefixSort 254.35% 143.08ns 6.99M StdSort_2payloads_3_bigint_10k 475.61ns 2.10M PrefixSort 288.96% 164.60ns 6.08M StdSort_2payloads_4_bigint_10k 594.78ns 1.68M PrefixSort 314.12% 189.35ns 5.28M StdSort_2payloads_1_bigint_100k 349.99ns 2.86M PrefixSort 205.28% 170.49ns 5.87M StdSort_2payloads_2_bigint_100k 489.03ns 2.04M PrefixSort 307.77% 158.89ns 6.29M StdSort_2payloads_3_bigint_100k 607.88ns 1.65M PrefixSort 305.62% 198.90ns 5.03M StdSort_2payloads_4_bigint_100k 715.90ns 1.40M PrefixSort 321.73% 222.52ns 4.49M StdSort_2payloads_1_bigint_1000k 574.14ns 1.74M PrefixSort 262.05% 219.10ns 4.56M StdSort_2payloads_2_bigint_1000k 796.05ns 1.26M PrefixSort 377.73% 210.75ns 4.75M StdSort_2payloads_3_bigint_1000k 1.02us 975.65K PrefixSort 411.02% 249.37ns 4.01M StdSort_2payloads_4_bigint_1000k 1.16us 858.48K PrefixSort 408.37% 285.24ns 3.51M StdSort_2payloads_1_bigint_0.01k 49.37ns 20.26M PrefixSort 99.124% 49.81ns 20.08M StdSort_2payloads_2_bigint_0.01k 89.08ns 11.23M PrefixSort 99.958% 89.12ns 11.22M StdSort_2payloads_3_bigint_0.01k 64.40ns 15.53M PrefixSort 99.991% 64.41ns 15.53M StdSort_2payloads_4_bigint_0.01k 86.56ns 11.55M PrefixSort 100.34% 86.26ns 11.59M StdSort_2payloads_1_bigint_0.015k 48.17ns 20.76M PrefixSort 100.11% 48.12ns 20.78M StdSort_2payloads_2_bigint_0.015k 55.44ns 18.04M PrefixSort 99.994% 55.45ns 18.03M StdSort_2payloads_3_bigint_0.015k 61.17ns 16.35M PrefixSort 99.988% 61.18ns 16.35M StdSort_2payloads_4_bigint_0.015k 57.55ns 17.38M PrefixSort 99.895% 57.61ns 17.36M StdSort_2payloads_1_bigint_0.02k 47.93ns 20.86M PrefixSort 99.916% 47.97ns 20.85M StdSort_2payloads_2_bigint_0.02k 84.10ns 11.89M 
PrefixSort 100.38% 83.78ns 11.94M StdSort_2payloads_3_bigint_0.02k 126.53ns 7.90M PrefixSort 100.05% 126.47ns 7.91M StdSort_2payloads_4_bigint_0.02k 164.44ns 6.08M PrefixSort 99.935% 164.55ns 6.08M StdSort_2payloads_1_bigint_0.05k 77.86ns 12.84M PrefixSort 99.171% 78.51ns 12.74M StdSort_2payloads_2_bigint_0.05k 118.10ns 8.47M PrefixSort 100.53% 117.48ns 8.51M StdSort_2payloads_3_bigint_0.05k 152.74ns 6.55M PrefixSort 100.02% 152.71ns 6.55M StdSort_2payloads_4_bigint_0.05k 184.56ns 5.42M PrefixSort 99.925% 184.70ns 5.41M StdSort_2payloads_1_bigint_0.1k 88.01ns 11.36M PrefixSort 100.46% 87.60ns 11.42M StdSort_2payloads_2_bigint_0.1k 138.22ns 7.24M PrefixSort 159.92% 86.43ns 11.57M StdSort_2payloads_3_bigint_0.1k 187.49ns 5.33M PrefixSort 187.96% 99.75ns 10.03M StdSort_2payloads_4_bigint_0.1k 232.52ns 4.30M PrefixSort 188.98% 123.04ns 8.13M StdSort_no-payloads_1_varchar_1k 292.32ns 3.42M PrefixSort 100.47% 290.96ns 3.44M StdSort_no-payloads_2_varchar_1k 380.98ns 2.62M PrefixSort 98.623% 386.29ns 2.59M StdSort_no-payloads_3_varchar_1k 456.15ns 2.19M PrefixSort 98.302% 464.03ns 2.16M StdSort_no-payloads_4_varchar_1k 520.84ns 1.92M PrefixSort 98.186% 530.46ns 1.89M StdSort_no-payloads_1_varchar_10k 422.83ns 2.37M PrefixSort 99.186% 426.30ns 2.35M StdSort_no-payloads_2_varchar_10k 495.10ns 2.02M PrefixSort 98.218% 504.08ns 1.98M StdSort_no-payloads_3_varchar_10k 584.89ns 1.71M PrefixSort 99.079% 590.33ns 1.69M StdSort_no-payloads_4_varchar_10k 667.37ns 1.50M PrefixSort 98.887% 674.88ns 1.48M StdSort_no-payloads_1_varchar_100k 605.27ns 1.65M PrefixSort 99.425% 608.78ns 1.64M StdSort_no-payloads_2_varchar_100k 741.11ns 1.35M PrefixSort 99.107% 747.78ns 1.34M StdSort_no-payloads_3_varchar_100k 890.60ns 1.12M PrefixSort 99.089% 898.78ns 1.11M StdSort_no-payloads_4_varchar_100k 1.11us 903.14K PrefixSort 104.50% 1.06us 943.76K StdSort_no-payloads_1_varchar_1000k 1.22us 822.83K PrefixSort 99.534% 1.22us 818.99K StdSort_no-payloads_2_varchar_1000k 1.52us 656.78K PrefixSort 
99.353% 1.53us 652.53K StdSort_no-payloads_3_varchar_1000k 1.78us 560.23K PrefixSort 98.862% 1.81us 553.86K StdSort_no-payloads_4_varchar_1000k 1.93us 519.34K PrefixSort 99.159% 1.94us 514.97K ``` Part of #6766 Pull Request resolved: #8146 Reviewed By: Yuhta Differential Revision: D54247347 Pulled By: mbasmanova fbshipit-source-id: b358925e6b702de5bb2eee55df24827b51268923
@skadilover How does prefix sort compare with radix sort and Google's SIMD quicksort? On fixed-length integer types, a combined MSB/LSB radix sort can easily be 3x faster than std::sort at the scale of millions of numbers and up. Do you have benchmarks comparing prefix sort with these algorithms? My current impression is that prefix sort may be good for sorting strings, and in general, comparison-based sorting algorithms can't beat radix-style sorting on fixed-width integers at large scale. |
The core idea of this optimization is to use binary strings to optimize multi-column sort; quick-sort is used just to speed things up. |
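For reference, the radix sort raised in the question above can be sketched as a least-significant-digit (LSD) byte-wise sort. This is a generic illustration, not Velox code: eight stable counting-sort passes, one per byte, which is the non-comparison alternative the normalized-key encoding also enables.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// LSD radix sort on unsigned 64-bit keys: one stable counting-sort pass
// per byte, least significant byte first.
void radixSortU64(std::vector<uint64_t>& keys) {
  std::vector<uint64_t> buf(keys.size());
  for (int shift = 0; shift < 64; shift += 8) {
    size_t count[257] = {0};
    for (uint64_t k : keys) {
      ++count[((k >> shift) & 0xFF) + 1];  // histogram, offset by one
    }
    for (int b = 0; b < 256; ++b) {
      count[b + 1] += count[b];  // prefix sums -> bucket start offsets
    }
    for (uint64_t k : keys) {
      buf[count[(k >> shift) & 0xFF]++] = k;  // stable scatter
    }
    keys.swap(buf);  // output of this pass is input to the next
  }
}
```

Signed or floating-point keys would first be mapped to memcmp-comparable unsigned encodings (sign-bit flip, etc.), which is exactly what the prefix encoding provides.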
We will also try to optimize sort, initial proposal is:
Secondly, we currently spill input under memory pressure. We could also request all the memory in the addInput function and test whether we can get the memory before requesting; if not, try to sort with PrefixSort and spill.
Will add a sort benchmark and sort spill benchmark to track the performance. |
@jinchengchenghh |
Yes, because this issue's title is Optimize sort, the same topic as mine, I added it here. I would like to apply the second optimization after prefix sort is merged; otherwise it would be hard to enable PrefixSort, because the optimization requires that sort does not request memory. |
…cubator#8146) Summary: PrefixSort is used to improve in-memory sort performance, using memcmp to compare binary string (normalized encoded sort keys called 'prefix') when sorting. This PR adds 'extract rows to prefix' and 'sort' for fixed width types (int32, int64, float, double, timestamp etc.). More types will be added in a follow-up. Add benchmark to compare std::sort vs. prefix sort using 1, 2, 3, and 4 bigint sorting keys. The performance of sorting up to 1000 rows is the same. When sorting more than 1K rows prefix-sort is faster, the gains increase as the number of rows sorted and the number of sorting keys increases. The presence of the payload columns doesn't affect the performance.

```
============================================================================
[...]ec/benchmarks/PrefixSortBenchmark.cpp  relative  time/iter  iters/s
============================================================================
StdSort_no-payload_1_bigint_0.01k    49.40ns   20.24M
PrefixSort    100.00%   49.40ns   20.24M
StdSort_no-payload_2_bigint_0.01k    87.85ns   11.38M
PrefixSort    99.904%   87.94ns   11.37M
StdSort_no-payload_3_bigint_0.01k    65.04ns   15.37M
PrefixSort    99.350%   65.47ns   15.27M
StdSort_no-payload_4_bigint_0.01k    69.19ns   14.45M
PrefixSort    99.566%   69.49ns   14.39M
StdSort_no-payload_1_bigint_0.015k   47.93ns   20.86M
PrefixSort    100.02%   47.92ns   20.87M
StdSort_no-payload_2_bigint_0.015k   54.22ns   18.44M
PrefixSort    99.921%   54.26ns   18.43M
StdSort_no-payload_3_bigint_0.015k   61.00ns   16.39M
PrefixSort    99.958%   61.02ns   16.39M
StdSort_no-payload_4_bigint_0.015k   57.38ns   17.43M
PrefixSort    99.870%   57.46ns   17.40M
StdSort_no-payload_1_bigint_0.02k    47.82ns   20.91M
PrefixSort    99.914%   47.86ns   20.90M
StdSort_no-payload_2_bigint_0.02k    83.94ns   11.91M
PrefixSort    100.00%   83.93ns   11.91M
StdSort_no-payload_3_bigint_0.02k    128.43ns  7.79M
PrefixSort    100.17%   128.21ns  7.80M
StdSort_no-payload_4_bigint_0.02k    165.29ns  6.05M
PrefixSort    99.997%   165.30ns  6.05M
StdSort_no-payload_1_bigint_0.05k    77.34ns   12.93M
PrefixSort    99.425%   77.79ns   12.86M
StdSort_no-payload_2_bigint_0.05k    113.79ns  8.79M
PrefixSort    99.786%   114.03ns  8.77M
StdSort_no-payload_3_bigint_0.05k    152.81ns  6.54M
PrefixSort    99.696%   153.27ns  6.52M
StdSort_no-payload_4_bigint_0.05k    185.93ns  5.38M
PrefixSort    100.02%   185.90ns  5.38M
StdSort_no-payload_1_bigint_0.1k     87.39ns   11.44M
PrefixSort    99.499%   87.83ns   11.39M
StdSort_no-payload_2_bigint_0.1k     139.93ns  7.15M
PrefixSort    162.24%   86.25ns   11.59M
StdSort_no-payload_3_bigint_0.1k     186.27ns  5.37M
PrefixSort    186.72%   99.76ns   10.02M
StdSort_no-payload_4_bigint_0.1k     234.01ns  4.27M
PrefixSort    187.97%   124.49ns  8.03M
StdSort_no-payloads_1_bigint_1k      173.31ns  5.77M
PrefixSort    136.72%   126.76ns  7.89M
StdSort_no-payloads_2_bigint_1k      249.77ns  4.00M
PrefixSort    199.49%   125.20ns  7.99M
StdSort_no-payloads_3_bigint_1k      314.18ns  3.18M
PrefixSort    219.49%   143.14ns  6.99M
StdSort_no-payloads_4_bigint_1k      348.38ns  2.87M
PrefixSort    203.28%   171.38ns  5.84M
StdSort_no-payloads_1_bigint_10k     251.90ns  3.97M
PrefixSort    165.99%   151.76ns  6.59M
StdSort_no-payloads_2_bigint_10k     363.09ns  2.75M
PrefixSort    253.07%   143.47ns  6.97M
StdSort_no-payloads_3_bigint_10k     483.58ns  2.07M
PrefixSort    293.67%   164.67ns  6.07M
StdSort_no-payloads_4_bigint_10k     593.29ns  1.69M
PrefixSort    312.83%   189.65ns  5.27M
StdSort_no-payloads_1_bigint_100k    330.44ns  3.03M
PrefixSort    192.57%   171.59ns  5.83M
StdSort_no-payloads_2_bigint_100k    470.79ns  2.12M
PrefixSort    293.67%   160.31ns  6.24M
StdSort_no-payloads_3_bigint_100k    607.15ns  1.65M
PrefixSort    303.88%   199.80ns  5.01M
StdSort_no-payloads_4_bigint_100k    706.03ns  1.42M
PrefixSort    315.15%   224.03ns  4.46M
StdSort_no-payloads_1_bigint_1000k   452.05ns  2.21M
PrefixSort    204.92%   220.60ns  4.53M
StdSort_no-payloads_2_bigint_1000k   645.35ns  1.55M
PrefixSort    306.42%   210.61ns  4.75M
StdSort_no-payloads_3_bigint_1000k   818.78ns  1.22M
PrefixSort    328.01%   249.62ns  4.01M
StdSort_no-payloads_4_bigint_1000k   981.65ns  1.02M
PrefixSort    343.79%   285.54ns  3.50M
StdSort_2payloads_1_bigint_1k        177.21ns  5.64M
PrefixSort    139.94%   126.63ns  7.90M
StdSort_2payloads_2_bigint_1k        248.46ns  4.02M
PrefixSort    199.08%   124.80ns  8.01M
StdSort_2payloads_3_bigint_1k        313.66ns  3.19M
PrefixSort    218.48%   143.56ns  6.97M
StdSort_2payloads_4_bigint_1k        359.17ns  2.78M
PrefixSort    208.57%   172.21ns  5.81M
StdSort_2payloads_1_bigint_10k       254.83ns  3.92M
PrefixSort    168.28%   151.43ns  6.60M
StdSort_2payloads_2_bigint_10k       363.92ns  2.75M
PrefixSort    254.35%   143.08ns  6.99M
StdSort_2payloads_3_bigint_10k       475.61ns  2.10M
PrefixSort    288.96%   164.60ns  6.08M
StdSort_2payloads_4_bigint_10k       594.78ns  1.68M
PrefixSort    314.12%   189.35ns  5.28M
StdSort_2payloads_1_bigint_100k      349.99ns  2.86M
PrefixSort    205.28%   170.49ns  5.87M
StdSort_2payloads_2_bigint_100k      489.03ns  2.04M
PrefixSort    307.77%   158.89ns  6.29M
StdSort_2payloads_3_bigint_100k      607.88ns  1.65M
PrefixSort    305.62%   198.90ns  5.03M
StdSort_2payloads_4_bigint_100k      715.90ns  1.40M
PrefixSort    321.73%   222.52ns  4.49M
StdSort_2payloads_1_bigint_1000k     574.14ns  1.74M
PrefixSort    262.05%   219.10ns  4.56M
StdSort_2payloads_2_bigint_1000k     796.05ns  1.26M
PrefixSort    377.73%   210.75ns  4.75M
StdSort_2payloads_3_bigint_1000k     1.02us    975.65K
PrefixSort    411.02%   249.37ns  4.01M
StdSort_2payloads_4_bigint_1000k     1.16us    858.48K
PrefixSort    408.37%   285.24ns  3.51M
StdSort_2payloads_1_bigint_0.01k     49.37ns   20.26M
PrefixSort    99.124%   49.81ns   20.08M
StdSort_2payloads_2_bigint_0.01k     89.08ns   11.23M
PrefixSort    99.958%   89.12ns   11.22M
StdSort_2payloads_3_bigint_0.01k     64.40ns   15.53M
PrefixSort    99.991%   64.41ns   15.53M
StdSort_2payloads_4_bigint_0.01k     86.56ns   11.55M
PrefixSort    100.34%   86.26ns   11.59M
StdSort_2payloads_1_bigint_0.015k    48.17ns   20.76M
PrefixSort    100.11%   48.12ns   20.78M
StdSort_2payloads_2_bigint_0.015k    55.44ns   18.04M
PrefixSort    99.994%   55.45ns   18.03M
StdSort_2payloads_3_bigint_0.015k    61.17ns   16.35M
PrefixSort    99.988%   61.18ns   16.35M
StdSort_2payloads_4_bigint_0.015k    57.55ns   17.38M
PrefixSort    99.895%   57.61ns   17.36M
StdSort_2payloads_1_bigint_0.02k     47.93ns   20.86M
PrefixSort    99.916%   47.97ns   20.85M
StdSort_2payloads_2_bigint_0.02k     84.10ns   11.89M
PrefixSort    100.38%   83.78ns   11.94M
StdSort_2payloads_3_bigint_0.02k     126.53ns  7.90M
PrefixSort    100.05%   126.47ns  7.91M
StdSort_2payloads_4_bigint_0.02k     164.44ns  6.08M
PrefixSort    99.935%   164.55ns  6.08M
StdSort_2payloads_1_bigint_0.05k     77.86ns   12.84M
PrefixSort    99.171%   78.51ns   12.74M
StdSort_2payloads_2_bigint_0.05k     118.10ns  8.47M
PrefixSort    100.53%   117.48ns  8.51M
StdSort_2payloads_3_bigint_0.05k     152.74ns  6.55M
PrefixSort    100.02%   152.71ns  6.55M
StdSort_2payloads_4_bigint_0.05k     184.56ns  5.42M
PrefixSort    99.925%   184.70ns  5.41M
StdSort_2payloads_1_bigint_0.1k      88.01ns   11.36M
PrefixSort    100.46%   87.60ns   11.42M
StdSort_2payloads_2_bigint_0.1k      138.22ns  7.24M
PrefixSort    159.92%   86.43ns   11.57M
StdSort_2payloads_3_bigint_0.1k      187.49ns  5.33M
PrefixSort    187.96%   99.75ns   10.03M
StdSort_2payloads_4_bigint_0.1k      232.52ns  4.30M
PrefixSort    188.98%   123.04ns  8.13M
StdSort_no-payloads_1_varchar_1k     292.32ns  3.42M
PrefixSort    100.47%   290.96ns  3.44M
StdSort_no-payloads_2_varchar_1k     380.98ns  2.62M
PrefixSort    98.623%   386.29ns  2.59M
StdSort_no-payloads_3_varchar_1k     456.15ns  2.19M
PrefixSort    98.302%   464.03ns  2.16M
StdSort_no-payloads_4_varchar_1k     520.84ns  1.92M
PrefixSort    98.186%   530.46ns  1.89M
StdSort_no-payloads_1_varchar_10k    422.83ns  2.37M
PrefixSort    99.186%   426.30ns  2.35M
StdSort_no-payloads_2_varchar_10k    495.10ns  2.02M
PrefixSort    98.218%   504.08ns  1.98M
StdSort_no-payloads_3_varchar_10k    584.89ns  1.71M
PrefixSort    99.079%   590.33ns  1.69M
StdSort_no-payloads_4_varchar_10k    667.37ns  1.50M
PrefixSort    98.887%   674.88ns  1.48M
StdSort_no-payloads_1_varchar_100k   605.27ns  1.65M
PrefixSort    99.425%   608.78ns  1.64M
StdSort_no-payloads_2_varchar_100k   741.11ns  1.35M
PrefixSort    99.107%   747.78ns  1.34M
StdSort_no-payloads_3_varchar_100k   890.60ns  1.12M
PrefixSort    99.089%   898.78ns  1.11M
StdSort_no-payloads_4_varchar_100k   1.11us    903.14K
PrefixSort    104.50%   1.06us    943.76K
StdSort_no-payloads_1_varchar_1000k  1.22us    822.83K
PrefixSort    99.534%   1.22us    818.99K
StdSort_no-payloads_2_varchar_1000k  1.52us    656.78K
PrefixSort    99.353%   1.53us    652.53K
StdSort_no-payloads_3_varchar_1000k  1.78us    560.23K
PrefixSort    98.862%   1.81us    553.86K
StdSort_no-payloads_4_varchar_1000k  1.93us    519.34K
PrefixSort    99.159%   1.94us    514.97K
```

Part of facebookincubator#6766

Pull Request resolved: facebookincubator#8146

Reviewed By: Yuhta

Differential Revision: D54247347

Pulled By: mbasmanova

fbshipit-source-id: b358925e6b702de5bb2eee55df24827b51268923
Summary: According to the benchmark, for data sets larger than 0.5k, PrefixSort outperforms std::sort with performance improvements ranging from approximately 250% to over 500%. Here's a summary of the benchmark results:

| Dataset Size | PrefixSort Improvement (No Payload) | PrefixSort Improvement (With Payload) |
|--------------|-------------------------------------|---------------------------------------|
| 0.5k         | 248.97% - 287.43%                   | 249.71% - 289.74%                     |
| 1k           | 214.44% - 310.92%                   | 215.03% - 315.14%                     |
| 10k          | 216.21% - 255.38%                   | 217.88% - 256.88%                     |
| 100k         | 279.81% - 318.26%                   | 284.89% - 295.21%                     |
| 1000k        | 304.36% - 351.31%                   | 454.04% - 514.28%                     |

follow-up #8350

Part of #6766

Pull Request resolved: #10946

Reviewed By: Yuhta

Differential Revision: D62373593

Pulled By: mbasmanova

fbshipit-source-id: b8594e05cc6aee736d09db1695db770b84e9d4bd
Description
We are replacing Presto in our system with Velox, but we had done a lot of in-house performance work on top of Presto, and we found a performance gap between vanilla Velox and our optimized Presto. For the sort scenario specifically, our analysis suggests the gap comes from differences in how the following are handled:
Null value check
When a null value is stored in the RowContainer in Velox, it is stored as a null flag plus a default value T(). When comparing, the null flag is read first, and only then the data.
The improvement we made is to store the min or max of T (chosen according to the compare flags) instead, so that comparisons read the data directly, eliminating the overhead of reading the null flag.
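A minimal sketch of this null-encoding idea follows; CompareFlags and encodeNullableKey are hypothetical names for illustration, not Velox's actual RowContainer API. A null is replaced by the min or max of T, chosen from the sort flags, so compare() only ever reads the value.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for the sort flags of one key.
struct CompareFlags {
  bool ascending;
  bool nullsFirst;
};

// Encode a nullable bigint sort key so that compare() can read the value
// directly, without checking a separate null flag first.
int64_t encodeNullableKey(const int64_t* value, const CompareFlags& flags) {
  if (value != nullptr) {
    return *value;
  }
  // NULL sorts to the small end of the value order exactly when the
  // nulls-first side coincides with the ascending direction.
  const bool nullIsSmallest = (flags.nullsFirst == flags.ascending);
  return nullIsSmallest ? std::numeric_limits<int64_t>::min()
                        : std::numeric_limits<int64_t>::max();
}
```

The obvious caveat, and the reason an explicit null flag is the fully general solution, is that a real value equal to the sentinel can no longer be distinguished from NULL.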
Prefix sort
In compare(), the data is read directly from the underlying data structure, and both Presto and Velox incur extra memory-access overhead there. In Presto, one compare reads data from several different columns. In Velox, the data is stored row-wise, and besides the sort keys the RowContainer also holds null flags, normalized data, and the payload. As a result, the data actually needed for sorting (the sort keys) is spread across a larger memory block than necessary, causing additional cache misses.
The optimization we did was to introduce a prefix column in PagesIndex/RowContainer, which is a vector&lt;long&gt; or vector&lt;byte[]&gt;:
In compare(), the prefixes are compared first; only if they are equal is the data read from the underlying data structure. If the prefix is selective enough, sort performance improves.
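As an illustration, here is a minimal sketch of the prefix idea. The helper names (encodePrefix, compareRows) are hypothetical, not Velox's actual PrefixSort API: each bigint key is normalized into a big-endian byte string with the sign bit flipped, so that memcmp over prefixes agrees with numeric order, and only equal prefixes fall through to the expensive row compare.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Normalize a bigint key into a big-endian byte string with the sign bit
// flipped, so unsigned byte-wise comparison (memcmp) matches signed order.
std::array<uint8_t, 8> encodePrefix(int64_t key) {
  uint64_t bits = static_cast<uint64_t>(key) ^ (1ull << 63);
  std::array<uint8_t, 8> out{};
  for (int i = 7; i >= 0; --i) {
    out[i] = static_cast<uint8_t>(bits & 0xff);
    bits >>= 8;
  }
  return out;
}

// Compare two rows by prefix first; only equal prefixes need to fall back
// to reading the full keys out of the underlying row storage.
int compareRows(int64_t a, int64_t b) {
  const auto pa = encodePrefix(a);
  const auto pb = encodePrefix(b);
  if (int c = std::memcmp(pa.data(), pb.data(), pa.size())) {
    return c;
  }
  return 0; // prefixes equal: would read the remaining data here
}
```

In a real implementation the prefix would cover one or more leading sort keys and the fallback would consult the RowContainer; here a single key stands in for both.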
In our system, we previously implemented "range sort" on top of the column-based data structure in PagesIndex (Presto). The general logic is:
Suppose there is a sort statement ORDER BY a, b, c. First sort column a. If column a contains duplicate values, the duplicates form ranges; then sort column b, but only within those ranges, and process column c in the same way.
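The steps above can be sketched over plain column vectors; this is a simplified, hypothetical rangeSort, not Presto's PagesIndex implementation. Row indices are sorted by the first key, ranges of duplicates are collected, and only those ranges are re-sorted by each subsequent key.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

using Column = std::vector<int64_t>;

// Sort row indices by successive key columns, touching only the ranges of
// rows still tied on all previous keys.
std::vector<size_t> rangeSort(const std::vector<Column>& keys) {
  const size_t numRows = keys.empty() ? 0 : keys[0].size();
  std::vector<size_t> idx(numRows);
  std::iota(idx.begin(), idx.end(), 0);
  // Ranges [begin, end) that still need sorting by the current key.
  std::vector<std::pair<size_t, size_t>> ranges{{0, numRows}};
  for (const Column& key : keys) {
    std::vector<std::pair<size_t, size_t>> next;
    for (auto [begin, end] : ranges) {
      std::sort(
          idx.begin() + begin,
          idx.begin() + end,
          [&](size_t a, size_t b) { return key[a] < key[b]; });
      // Collect runs of duplicate key values; only they need the next key.
      size_t runStart = begin;
      for (size_t i = begin + 1; i <= end; ++i) {
        if (i == end || key[idx[i]] != key[idx[runStart]]) {
          if (i - runStart > 1) {
            next.emplace_back(runStart, i);
          }
          runStart = i;
        }
      }
    }
    ranges = std::move(next);
    if (ranges.empty()) {
      break; // no ties left: later keys cannot change the order
    }
  }
  return idx;
}
```

The early exit when no ties remain is where the win comes from: highly selective leading keys mean later keys are barely touched.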
However, this algorithm is based on a column-based data structure, while RowContainer currently stores data row-based. Can RowContainer support column-based storage? To summarize the issues:
Design doc: https://docs.google.com/document/d/1wa1lbbR-bhf0eg1mSaH7JUzeG7vhwz94a6ElUTK0J8k/edit?usp=sharing