Optimize sort #6766
Hi @skadilover, welcome to the community. Would you introduce yourself? Where are you from? The optimizations you described above sound interesting. Are you interested in trying them out, iterating and eventually incorporating them into Velox? |
@mbasmanova
Finally, I currently need some help and guidance on how to benchmark the sort operator. I have not found any examples of benchmarking an operator; can anyone give me some advice? |
Welcome. What's your name? I only see the GitHub handle skadilover. I'm curious whether the optimizations to Presto you mentioned earlier are available in the prestodb repo or if these are internal enhancements. |
Fantastic. Looking forward to the PR. |
Check out velox/functions/prestosql/aggregates/benchmarks/ReduceAgg.cpp This is a benchmark for reduce_agg aggregate function which benchmarks a mini query (table scan + aggregation). I suggest you write similar one to benchmark table scan + order-by. One may comment that this doesn't benchmark order by in isolation, but that's Ok. After all there is no point to speed a portion of the query that's not using significant amount of CPU time to begin with. E2e query speed up is what we are after. Hope this helps. |
This makes a lot of sense. Excited to see this optimization coming to Velox. |
My name is Heng Jiang.
They are not available in the prestodb repo; these are internal enhancements. |
Next week is our National Day holiday (10 days). After the holiday, I will continue to organize the code and submit the PR as soon as possible.
@skadilover Nice to meet you, Heng. Enjoy the holiday. Looking forward to hearing again from you in a couple of weeks. |
@mbasmanova @skadilover thanks, -yuan |
Hi guys, sorry about the delay to follow up on this thread. I've recently helped review this great paper about sorting optimization (which eventually got published at TODS); linking it here for folks who are interested in the subject: https://arxiv.org/pdf/2209.08420.pdf This sounds all very interesting. It would be super helpful if as a first step we established a microbenchmark to be able to tweak and reproduce these optimizations. A few specific questions:
|
@pedroerp Glad to hear your response to this issue! The paper mentioned above is very useful. PrefixSort is not just a normalized-key technique: it is different from velox's existing storage structure (RowContainer). To facilitate communication, I submitted PR #7230 (it is not 100% ready yet; some code will be adjusted based on my E2E test results), and the design document is also being prepared (it will take a few days). If you have time, could you help me take a look at this PR first? @mbasmanova @pedroerp |
@pedroerp Yes, for some specific columns we may refer to the offset-value encoding format, but this work may be done later. I hope to optimize the single-column bigint/varchar scenario first, especially in the bigint scenario. radix-sort
Directly store min/max instead of T() (what velox does now)
Like question 1, I think this is an encoding issue. I will discuss it case by case in the subsequent PRs. |
@pedroerp |
This is a good example. Maybe come up with something similar to this that generates and sorts a dataset, then we could experiment with your optimization and see how much they improve performance? https://github.com/facebookincubator/velox/blob/main/velox/exec/benchmarks/MergeBenchmark.cpp |
Thanks a lot. This example looks like it could work; I will try it soon.
@skadilover Wondering if you had a chance to review Arrow Row format: https://docs.rs/arrow-row/latest/arrow_row/index.html "Rows are normalized for sorting, and can therefore be very efficiently compared, using memcmp under the hood, or used in non-comparison sorts such as radix sort. This makes the row format ideal for implementing efficient multi-column sorting, grouping, aggregation, windowing and more, as described in more detail in this blog post." |
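The normalized-key idea quoted above can be sketched in a few lines. This is an illustrative example, not Arrow's or Velox's actual code: for a signed 64-bit integer, flipping the sign bit and storing the bytes big-endian yields an encoding whose memcmp order matches the numeric order.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Encode an int64 so that byte-wise unsigned comparison (memcmp) of the
// encodings matches numeric order. Two steps: flip the sign bit so that
// negatives sort below positives, then store big-endian so the most
// significant byte is compared first.
std::array<uint8_t, 8> encodeInt64(int64_t value) {
  uint64_t bits = static_cast<uint64_t>(value) ^ (1ULL << 63);  // flip sign bit
  std::array<uint8_t, 8> out;
  for (int i = 0; i < 8; ++i) {
    out[i] = static_cast<uint8_t>(bits >> (56 - 8 * i));  // big-endian
  }
  return out;
}
```

Encodings of multiple keys can then be concatenated, which is what makes multi-column comparison a single memcmp.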
Some additional references: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1080&rep=rep1&type=pdf |
@mbasmanova Thanks for the references above. The encoding techniques (including "normalized key", "memory compare", etc.) can be very efficient, and we implemented them in our internal presto before. In the PrefixSort optimization, I hope to implement them step by step. This week I plan to prepare a design document (including some research E2E tests and benchmarks) to describe my plan in detail. |
@skadilover Looking forward to hearing more from you on this topic. Thanks. |
Overview

PrefixSort's optimization plan includes two optimization points: normalized-key optimization and CPU-cache optimization.

Normalized Key For Multi Sort Keys

The core idea of the normalized key is: assuming the sort keys can be encoded into a binary string while the sorting semantics remain unchanged, then all sort keys can be compared as a whole (see [1]), and memcmp can additionally be sped up with SIMD. In the implementation submitted in PR #7230, in the scenario of two BIGINT columns, we get a performance improvement of 200%-300%. PrefixSort currently supports BIGINT; our current research suggests the other fixed-length types can be implemented as well. The VARCHAR scenario is a little more complicated. We also studied the implementation of duckDB, which uses a somewhat special method: range sort is used to process varchar. Further research and testing of the encoding format will be conducted to improve this part of the design.

Memory Data Structure

In order to implement the above-mentioned normalized-key technique in velox, we consider redesigning a data structure to store the normalized key, for the following two reasons:
First, we review the current OrderBy operator implementation of velox, as shown in the figure below: The general process of implementation is:
Going back to the implementation of the OrderBy operator, we can see that the compare method in the sort loop needs to access two memory blocks (the address vector and the row container) at the same time. As the memory grows, memory-access costs significantly affect the efficiency of sort execution. The solution to this problem is to store the sort key and row number separately from the rest of the row. Take duckDB as an example:
In the E2E test of single-column BIGINT, we get a 30%-40% gain. In fact, it would be faster to use reinterpret_cast to get the value, but for a unified implementation of the normalized key, memcmp is used here for comparison; for long keys, memcmp is slower than reinterpret_cast. In the implementation, the template memcmp from duckDB is used to replace std::memcmp. There is some improvement, but it is still slower than the reinterpret_cast implementation.

Conclusion and plan

By optimizing the memory data structure and using the normalized-key technique, PrefixSort shows performance improvements over the original implementation in both single-column and multi-column cases. The work is planned in 4 stages to reach complete functionality.
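The layout described above can be sketched as follows. The names are illustrative, not Velox's actual types: each sort entry keeps the normalized key inline next to a row index (a stand-in for the row pointer), so the comparator reads one contiguous array instead of dereferencing into the row container on every comparison.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

// One contiguous record per row: normalized key bytes plus a row index.
struct PrefixEntry {
  uint8_t prefix[8];  // memcmp-comparable normalized key
  uint32_t rowIndex;  // stand-in for the row pointer into the container
};

// Build an entry from an unsigned key, stored big-endian so memcmp works.
PrefixEntry makeEntry(uint64_t key, uint32_t rowIndex) {
  PrefixEntry e;
  for (int i = 0; i < 8; ++i) {
    e.prefix[i] = static_cast<uint8_t>(key >> (56 - 8 * i));
  }
  e.rowIndex = rowIndex;
  return e;
}

// Sort entries by prefix; return the row indices in sorted order.
std::vector<uint32_t> sortByPrefix(std::vector<PrefixEntry> entries) {
  std::sort(entries.begin(), entries.end(),
            [](const PrefixEntry& a, const PrefixEntry& b) {
              return std::memcmp(a.prefix, b.prefix, sizeof(a.prefix)) < 0;
            });
  std::vector<uint32_t> order;
  order.reserve(entries.size());
  for (const auto& e : entries) {
    order.push_back(e.rowIndex);
  }
  return order;
}
```

Because key and row index move together, the sort keeps its working set in cache; only the final gather of payload columns touches the row container.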
How To Reproduce the Optimization

Just run PrefixSortBenchmark.cpp and OrderByTest.cpp.

Reference:
[1] https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1080&rep=rep1&type=pdf |
@mbasmanova Updates on the topic above. |
@skadilover Thank you for detailed description of the proposed design. I have some questions to make sure my understanding is correct. Does this design involve making a copy of sorting keys and row pointers for all rows using continuous allocation? I'm thinking about the table with 2 columns (normalized key and row ptr) in the diagram. |
Yes, there is a copy.
@skadilover Got it. What are your thoughts on the limit of the normalized key size? There can be lots of columns and some columns can be variable-width, hence, I assume we need to put a hard limit on the "prefix" of the normalized key. Also, can you extend the design to support complex types as well? |
Therefore, we'd better provide a config option for users to set the limit of the prefix size; in our practice there is no limit by default.
If necessary, we can store only some of the potential sort keys in the prefix and use the row ptr to compare the rest (simply calling RowContainer's compare method). For types like varchar, if a complex type cannot obtain benefits through memcmp, I feel that using only pointers for comparison would be better. |
@mbasmanova The PR closed above is not used; we still use #7230 to merge the code. |
@mbasmanova Are there more questions about the topic? Could you help me review the code? |
@skadilover Thank you for working on this. Took a quick look and left some initial comments on the PR. |
I think we need to have a sensible limit by default. I'm also interested in figuring out support for varchar and row() keys as these are common. |
I will check it later.
Our previous method of dealing with varchar was to: 1. analyze the maximum length of the data before sorting and then process every value according to that fixed size; 2. only retain the prefix for values that exceed the length limit. I hope to find a better way; currently I am still doing relevant research work. |
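The two-step varchar scheme described above can be sketched with a hypothetical helper (not the actual implementation): encode each string into a fixed width chosen from the data and capped by a limit, zero-padding shorter strings so they sort first and truncating longer ones to a prefix. Note that two equal truncated encodings are only "maybe equal" and would need a tie-break comparison on the full stored value.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Encode a string into a fixed-width, memcmp-comparable buffer.
// 0x00 padding sorts before any real byte, so shorter strings come first;
// bytes beyond `width` are dropped (prefix-only, needs a tie-break).
std::vector<uint8_t> encodeVarcharPrefix(const std::string& s, size_t width) {
  std::vector<uint8_t> out(width, 0);
  std::memcpy(out.data(), s.data(), std::min(s.size(), width));
  return out;
}
```

This sketch ignores edge cases such as embedded 0x00 bytes in the data, which a real encoding would have to handle.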
@mbasmanova |
…lized keys (#7624) Summary: Part of #6766 1. Add PrefixSortAlgorithm and unit tests. The reason for introducing a custom quick-sort implementation instead of using std::sort directly is that the sort keys that can be normalized in 'PrefixSort' are placed in a continuous memory block; during the run of the sorting algorithm the data itself is moved (not only the data addresses, to avoid losing data locality), and std::sort does not support custom swap methods. 2. Use the quick-sort algorithm described in the “Engineering a Sort Function” paper, see [1]. May explore radix sort as a follow-up. [1] https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.8162&rep=rep1&type=pdf Pull Request resolved: #7624 Reviewed By: xiaoxmeng Differential Revision: D52155526 Pulled By: mbasmanova fbshipit-source-id: fa3f59c33f14f2fcf3dc80e1589426d142aa98d7
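The "custom swap" motivation above can be illustrated with a minimal sketch (not the PR's actual code): records of a size known only at runtime live back to back in one buffer, so the sort must swap whole records with memcpy, which std::sort's iterator model does not allow. A simple Lomuto-partition quicksort with a record-sized swap:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Quicksort over fixed-size records packed contiguously in a byte buffer.
// Comparison is memcmp over the whole record; swap moves the record bytes.
class ContiguousQuickSort {
 public:
  ContiguousQuickSort(uint8_t* data, size_t recordSize, size_t numRecords)
      : data_(data), size_(recordSize), n_(numRecords), tmp_(recordSize) {}

  void sort() { quickSort(0, static_cast<long>(n_) - 1); }

 private:
  uint8_t* at(size_t i) { return data_ + i * size_; }

  // The custom swap std::sort cannot express: copy whole records.
  void swapRecords(size_t i, size_t j) {
    std::memcpy(tmp_.data(), at(i), size_);
    std::memcpy(at(i), at(j), size_);
    std::memcpy(at(j), tmp_.data(), size_);
  }

  void quickSort(long lo, long hi) {
    if (lo >= hi) return;
    uint8_t* pivot = at(hi);  // Lomuto partition, pivot is the last record
    long i = lo - 1;
    for (long j = lo; j < hi; ++j) {
      if (std::memcmp(at(j), pivot, size_) < 0) {
        swapRecords(++i, j);
      }
    }
    swapRecords(i + 1, hi);  // place pivot
    quickSort(lo, i);
    quickSort(i + 2, hi);
  }

  uint8_t* data_;
  size_t size_;
  size_t n_;
  std::vector<uint8_t> tmp_;  // scratch record for swaps
};
```

A production version would add the usual refinements (median-of-three pivots, insertion sort for small ranges) as in the "Engineering a Sort Function" paper cited above.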
Summary: When processing normalization encoding in PrefixSort, we cannot simply equate null with the max or min value, because then we could not distinguish null from the max/min value in a nulls-first scenario; we have to add a null byte to the normalized encoding. The changes: 1. Add CompareFlags (ascending, nullsFirst, isNull) to the encode() method. 2. Add a new method encodeNoNulls to handle cases where there are no nulls; supports uint64_t/int64_t/uint32_t/int32_t/float/double/Timestamp. Design doc: https://docs.google.com/document/d/1wa1lbbR-bhf0eg1mSaH7JUzeG7vhwz94a6ElUTK0J8k/edit?usp=sharing Part of #6766 Pull Request resolved: #8350 Reviewed By: pedroerp Differential Revision: D53348718 Pulled By: mbasmanova fbshipit-source-id: 6f7887fb9d09b17af6a786de82fc00e116156a62
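The null-byte scheme in that summary can be sketched as follows. This is an illustrative helper (the name and layout are assumptions, not the PR's API), covering ascending order only: one leading byte places nulls before or after all non-null values, so memcmp can still order whole keys without confusing null with min/max.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Encode a nullable int64 into 9 memcmp-comparable bytes.
// Null byte: with nulls-first, null (0) sorts below every non-null (1);
// with nulls-last, null (2) sorts above every non-null (1).
std::array<uint8_t, 9> encodeNullable(bool isNull, bool nullsFirst, int64_t value) {
  std::array<uint8_t, 9> out{};
  out[0] = isNull ? (nullsFirst ? 0 : 2) : 1;
  if (!isNull) {
    uint64_t bits = static_cast<uint64_t>(value) ^ (1ULL << 63);  // sign flip
    for (int i = 0; i < 8; ++i) {
      out[1 + i] = static_cast<uint8_t>(bits >> (56 - 8 * i));  // big-endian
    }
  }
  return out;
}
```

Without the extra byte, a nulls-first encoding of null as INT64_MIN would be indistinguishable from a real INT64_MIN value.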
…cubator#8146) Summary: Add PrefixSort with unit tests & benchmarks: 1. PrefixSort is used to improve in-memory sort performance, using memcmp to compare binary strings (normalized encoded sort keys, called the 'prefix') when sorting. This PR adds 'extract rows to prefix' and 'sort' for fixed-width types (int32, int64, float, double, timestamp, etc.) as a basic PR; more types will be added as a follow-up. 2. As basic work, this PR adds unit tests & benchmarks for scalar types, including single/multi keys, fuzz cases, etc. 3. Benchmarks, std-sort vs. prefix-sort:
```
============================================================================
[...]ec/benchmarks/PrefixSortBenchmark.cpp    relative  time/iter   iters/s
============================================================================
StdSort_Bigint100k                                       34.30ms      29.16
PrefixSort_Bigint100k                                    27.16ms      36.81
StdSort_Bigint1M                                        465.00ms       2.15
PrefixSort_Bigint1M                                     330.97ms       3.02
StdSort_Bigint10M                                          7.96s    125.68m
PrefixSort_Bigint10M                                       3.87s    258.28m
StdSort_Varchar100k                                      62.02ms      16.12
PrefixSort_Varchar100k                                   61.64ms      16.22
StdSort_Varchar1M                                          1.31s    763.74m
PrefixSort_Varchar1M                                       1.27s    785.51m
StdSort_Varchar10M                                        23.06s     43.36m
PrefixSort_Varchar10M                                     22.67s     44.11m
StdSort_BigintBigint100k                                 48.61ms      20.57
PrefixSort_BigintBigint100k                              30.70ms      32.58
StdSort_BigintBigint1M                                  631.11ms       1.58
PrefixSort_BigintBigint1M                               379.40ms       2.64
StdSort_BigintBigint10M                                   11.38s     87.86m
PrefixSort_BigintBigint10M                                 4.42s    226.42m
StdSort_BigintVarchar100k                                65.36ms      15.30
PrefixSort_BigintVarchar100k                             56.52ms      17.69
StdSort_BigintVarchar1M                                    1.29s    777.33m
PrefixSort_BigintVarchar1M                                 1.11s    901.87m
StdSort_BigintVarchar10M                                  22.03s     45.39m
PrefixSort_BigintVarchar10M                               18.00s     55.56m
```
For cases where all keys can be normalized (Bigint, BigintBigint), once the row count exceeds 1M you can see obvious optimization effects, about 1x-2x. For cases that cannot be normalized (Varchar), since the binary-string optimization logic is skipped, the performance stays the same.
One more thing: we use std::memcmp in this PR. If the cost is worth it, we can align the prefix buffer to 4 or 8 bytes so that int or long comparisons can be used; the performance gain would then be much more obvious. Part of facebookincubator#6766 Pull Request resolved: facebookincubator#8146 Differential Revision: D54247347 Pulled By: mbasmanova
Summary: PrefixSort is used to improve in-memory sort performance, using memcmp to compare binary string (normalized encoded sort keys called 'prefix') when sorting. This PR adds 'extract rows to prefix' and 'sort' for fixed width types (int32, int64, float, double, timestamp etc.). More types will be added in a follow-up. Add benchmark to compare std::sort vs. prefix sort using 1, 2, 3, and 4 bigint sorting keys. The performance of sorting up to 1000 rows is the same. When sorting more than 1K rows prefix-sort is faster, the gains increase as the number of rows sorted and the number of sorting keys increases. The presence of the payload columns doesn't affect the performance. ``` ============================================================================ [...]ec/benchmarks/PrefixSortBenchmark.cpp relative time/iter iters/s ============================================================================ StdSort_no-payload_1_bigint_0.01k 49.40ns 20.24M PrefixSort 100.00% 49.40ns 20.24M StdSort_no-payload_2_bigint_0.01k 87.85ns 11.38M PrefixSort 99.904% 87.94ns 11.37M StdSort_no-payload_3_bigint_0.01k 65.04ns 15.37M PrefixSort 99.350% 65.47ns 15.27M StdSort_no-payload_4_bigint_0.01k 69.19ns 14.45M PrefixSort 99.566% 69.49ns 14.39M StdSort_no-payload_1_bigint_0.015k 47.93ns 20.86M PrefixSort 100.02% 47.92ns 20.87M StdSort_no-payload_2_bigint_0.015k 54.22ns 18.44M PrefixSort 99.921% 54.26ns 18.43M StdSort_no-payload_3_bigint_0.015k 61.00ns 16.39M PrefixSort 99.958% 61.02ns 16.39M StdSort_no-payload_4_bigint_0.015k 57.38ns 17.43M PrefixSort 99.870% 57.46ns 17.40M StdSort_no-payload_1_bigint_0.02k 47.82ns 20.91M PrefixSort 99.914% 47.86ns 20.90M StdSort_no-payload_2_bigint_0.02k 83.94ns 11.91M PrefixSort 100.00% 83.93ns 11.91M StdSort_no-payload_3_bigint_0.02k 128.43ns 7.79M PrefixSort 100.17% 128.21ns 7.80M StdSort_no-payload_4_bigint_0.02k 165.29ns 6.05M PrefixSort 99.997% 165.30ns 6.05M StdSort_no-payload_1_bigint_0.05k 77.34ns 12.93M PrefixSort 99.425% 77.79ns 12.86M 
StdSort_no-payload_2_bigint_0.05k 113.79ns 8.79M PrefixSort 99.786% 114.03ns 8.77M StdSort_no-payload_3_bigint_0.05k 152.81ns 6.54M PrefixSort 99.696% 153.27ns 6.52M StdSort_no-payload_4_bigint_0.05k 185.93ns 5.38M PrefixSort 100.02% 185.90ns 5.38M StdSort_no-payload_1_bigint_0.1k 87.39ns 11.44M PrefixSort 99.499% 87.83ns 11.39M StdSort_no-payload_2_bigint_0.1k 139.93ns 7.15M PrefixSort 162.24% 86.25ns 11.59M StdSort_no-payload_3_bigint_0.1k 186.27ns 5.37M PrefixSort 186.72% 99.76ns 10.02M StdSort_no-payload_4_bigint_0.1k 234.01ns 4.27M PrefixSort 187.97% 124.49ns 8.03M StdSort_no-payloads_1_bigint_1k 173.31ns 5.77M PrefixSort 136.72% 126.76ns 7.89M StdSort_no-payloads_2_bigint_1k 249.77ns 4.00M PrefixSort 199.49% 125.20ns 7.99M StdSort_no-payloads_3_bigint_1k 314.18ns 3.18M PrefixSort 219.49% 143.14ns 6.99M StdSort_no-payloads_4_bigint_1k 348.38ns 2.87M PrefixSort 203.28% 171.38ns 5.84M StdSort_no-payloads_1_bigint_10k 251.90ns 3.97M PrefixSort 165.99% 151.76ns 6.59M StdSort_no-payloads_2_bigint_10k 363.09ns 2.75M PrefixSort 253.07% 143.47ns 6.97M StdSort_no-payloads_3_bigint_10k 483.58ns 2.07M PrefixSort 293.67% 164.67ns 6.07M StdSort_no-payloads_4_bigint_10k 593.29ns 1.69M PrefixSort 312.83% 189.65ns 5.27M StdSort_no-payloads_1_bigint_100k 330.44ns 3.03M PrefixSort 192.57% 171.59ns 5.83M StdSort_no-payloads_2_bigint_100k 470.79ns 2.12M PrefixSort 293.67% 160.31ns 6.24M StdSort_no-payloads_3_bigint_100k 607.15ns 1.65M PrefixSort 303.88% 199.80ns 5.01M StdSort_no-payloads_4_bigint_100k 706.03ns 1.42M PrefixSort 315.15% 224.03ns 4.46M StdSort_no-payloads_1_bigint_1000k 452.05ns 2.21M PrefixSort 204.92% 220.60ns 4.53M StdSort_no-payloads_2_bigint_1000k 645.35ns 1.55M PrefixSort 306.42% 210.61ns 4.75M StdSort_no-payloads_3_bigint_1000k 818.78ns 1.22M PrefixSort 328.01% 249.62ns 4.01M StdSort_no-payloads_4_bigint_1000k 981.65ns 1.02M PrefixSort 343.79% 285.54ns 3.50M StdSort_2payloads_1_bigint_1k 177.21ns 5.64M PrefixSort 139.94% 126.63ns 7.90M 
StdSort_2payloads_2_bigint_1k 248.46ns 4.02M PrefixSort 199.08% 124.80ns 8.01M StdSort_2payloads_3_bigint_1k 313.66ns 3.19M PrefixSort 218.48% 143.56ns 6.97M StdSort_2payloads_4_bigint_1k 359.17ns 2.78M PrefixSort 208.57% 172.21ns 5.81M StdSort_2payloads_1_bigint_10k 254.83ns 3.92M PrefixSort 168.28% 151.43ns 6.60M StdSort_2payloads_2_bigint_10k 363.92ns 2.75M PrefixSort 254.35% 143.08ns 6.99M StdSort_2payloads_3_bigint_10k 475.61ns 2.10M PrefixSort 288.96% 164.60ns 6.08M StdSort_2payloads_4_bigint_10k 594.78ns 1.68M PrefixSort 314.12% 189.35ns 5.28M StdSort_2payloads_1_bigint_100k 349.99ns 2.86M PrefixSort 205.28% 170.49ns 5.87M StdSort_2payloads_2_bigint_100k 489.03ns 2.04M PrefixSort 307.77% 158.89ns 6.29M StdSort_2payloads_3_bigint_100k 607.88ns 1.65M PrefixSort 305.62% 198.90ns 5.03M StdSort_2payloads_4_bigint_100k 715.90ns 1.40M PrefixSort 321.73% 222.52ns 4.49M StdSort_2payloads_1_bigint_1000k 574.14ns 1.74M PrefixSort 262.05% 219.10ns 4.56M StdSort_2payloads_2_bigint_1000k 796.05ns 1.26M PrefixSort 377.73% 210.75ns 4.75M StdSort_2payloads_3_bigint_1000k 1.02us 975.65K PrefixSort 411.02% 249.37ns 4.01M StdSort_2payloads_4_bigint_1000k 1.16us 858.48K PrefixSort 408.37% 285.24ns 3.51M StdSort_2payloads_1_bigint_0.01k 49.37ns 20.26M PrefixSort 99.124% 49.81ns 20.08M StdSort_2payloads_2_bigint_0.01k 89.08ns 11.23M PrefixSort 99.958% 89.12ns 11.22M StdSort_2payloads_3_bigint_0.01k 64.40ns 15.53M PrefixSort 99.991% 64.41ns 15.53M StdSort_2payloads_4_bigint_0.01k 86.56ns 11.55M PrefixSort 100.34% 86.26ns 11.59M StdSort_2payloads_1_bigint_0.015k 48.17ns 20.76M PrefixSort 100.11% 48.12ns 20.78M StdSort_2payloads_2_bigint_0.015k 55.44ns 18.04M PrefixSort 99.994% 55.45ns 18.03M StdSort_2payloads_3_bigint_0.015k 61.17ns 16.35M PrefixSort 99.988% 61.18ns 16.35M StdSort_2payloads_4_bigint_0.015k 57.55ns 17.38M PrefixSort 99.895% 57.61ns 17.36M StdSort_2payloads_1_bigint_0.02k 47.93ns 20.86M PrefixSort 99.916% 47.97ns 20.85M StdSort_2payloads_2_bigint_0.02k 84.10ns 11.89M 
PrefixSort 100.38% 83.78ns 11.94M StdSort_2payloads_3_bigint_0.02k 126.53ns 7.90M PrefixSort 100.05% 126.47ns 7.91M StdSort_2payloads_4_bigint_0.02k 164.44ns 6.08M PrefixSort 99.935% 164.55ns 6.08M StdSort_2payloads_1_bigint_0.05k 77.86ns 12.84M PrefixSort 99.171% 78.51ns 12.74M StdSort_2payloads_2_bigint_0.05k 118.10ns 8.47M PrefixSort 100.53% 117.48ns 8.51M StdSort_2payloads_3_bigint_0.05k 152.74ns 6.55M PrefixSort 100.02% 152.71ns 6.55M StdSort_2payloads_4_bigint_0.05k 184.56ns 5.42M PrefixSort 99.925% 184.70ns 5.41M StdSort_2payloads_1_bigint_0.1k 88.01ns 11.36M PrefixSort 100.46% 87.60ns 11.42M StdSort_2payloads_2_bigint_0.1k 138.22ns 7.24M PrefixSort 159.92% 86.43ns 11.57M StdSort_2payloads_3_bigint_0.1k 187.49ns 5.33M PrefixSort 187.96% 99.75ns 10.03M StdSort_2payloads_4_bigint_0.1k 232.52ns 4.30M PrefixSort 188.98% 123.04ns 8.13M StdSort_no-payloads_1_varchar_1k 292.32ns 3.42M PrefixSort 100.47% 290.96ns 3.44M StdSort_no-payloads_2_varchar_1k 380.98ns 2.62M PrefixSort 98.623% 386.29ns 2.59M StdSort_no-payloads_3_varchar_1k 456.15ns 2.19M PrefixSort 98.302% 464.03ns 2.16M StdSort_no-payloads_4_varchar_1k 520.84ns 1.92M PrefixSort 98.186% 530.46ns 1.89M StdSort_no-payloads_1_varchar_10k 422.83ns 2.37M PrefixSort 99.186% 426.30ns 2.35M StdSort_no-payloads_2_varchar_10k 495.10ns 2.02M PrefixSort 98.218% 504.08ns 1.98M StdSort_no-payloads_3_varchar_10k 584.89ns 1.71M PrefixSort 99.079% 590.33ns 1.69M StdSort_no-payloads_4_varchar_10k 667.37ns 1.50M PrefixSort 98.887% 674.88ns 1.48M StdSort_no-payloads_1_varchar_100k 605.27ns 1.65M PrefixSort 99.425% 608.78ns 1.64M StdSort_no-payloads_2_varchar_100k 741.11ns 1.35M PrefixSort 99.107% 747.78ns 1.34M StdSort_no-payloads_3_varchar_100k 890.60ns 1.12M PrefixSort 99.089% 898.78ns 1.11M StdSort_no-payloads_4_varchar_100k 1.11us 903.14K PrefixSort 104.50% 1.06us 943.76K StdSort_no-payloads_1_varchar_1000k 1.22us 822.83K PrefixSort 99.534% 1.22us 818.99K StdSort_no-payloads_2_varchar_1000k 1.52us 656.78K PrefixSort 
99.353% 1.53us 652.53K StdSort_no-payloads_3_varchar_1000k 1.78us 560.23K PrefixSort 98.862% 1.81us 553.86K StdSort_no-payloads_4_varchar_1000k 1.93us 519.34K PrefixSort 99.159% 1.94us 514.97K ``` Part of #6766 Pull Request resolved: #8146 Reviewed By: Yuhta Differential Revision: D54247347 Pulled By: mbasmanova fbshipit-source-id: b358925e6b702de5bb2eee55df24827b51268923
@skadilover How does prefix sort compare with radix sort and Google's SIMD quicksort? On fixed-length integer types, a combined MSB/LSB radix sort can easily be 3x faster than std::sort at the scale of millions of numbers and up. Do you have benchmarks comparing prefix sort with these algorithms? My current impression is that prefix sort may be good for sorting strings, and in general, comparison-based sorting algorithms can't beat radix-style sorting on fixed-width integers at large scale. |
The core idea of this optimization is to use binary strings to optimize multi-column sort; quick-sort is used just to speed things up. |
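For reference, the radix sort raised in the question above can be sketched as a least-significant-digit (LSD) byte-wise sort. This is a generic illustration, not Velox code: eight stable counting-sort passes, one per byte, which is the non-comparison alternative the normalized-key encoding also enables.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// LSD radix sort on unsigned 64-bit keys: one stable counting-sort pass
// per byte, least significant byte first.
void radixSortU64(std::vector<uint64_t>& keys) {
  std::vector<uint64_t> buf(keys.size());
  for (int shift = 0; shift < 64; shift += 8) {
    size_t count[257] = {0};
    for (uint64_t k : keys) {
      ++count[((k >> shift) & 0xFF) + 1];  // histogram, offset by one
    }
    for (int b = 0; b < 256; ++b) {
      count[b + 1] += count[b];  // prefix sums -> bucket start offsets
    }
    for (uint64_t k : keys) {
      buf[count[(k >> shift) & 0xFF]++] = k;  // stable scatter
    }
    keys.swap(buf);  // output of this pass is input to the next
  }
}
```

Signed or floating-point keys would first be mapped to memcmp-comparable unsigned encodings (sign-bit flip, etc.), which is exactly what the prefix encoding provides.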
We will also try to optimize sort, initial proposal is:
Secondly, we currently spill input under memory pressure. We could also request all the memory in the addInput function and test whether we can get the memory before requesting; if not, try to sort with PrefixSort and spill.
Will add a sort benchmark and sort spill benchmark to track the performance. |
@jinchengchenghh |
Yes, because this issue's title is Optimize sort, the same topic as mine, I added it here. I would like to apply the second optimization after prefix sort is merged; otherwise it would be hard to enable PrefixSort, because the optimization requires that sort does not request memory. |
…cubator#8146) Summary: PrefixSort is used to improve in-memory sort performance, using memcmp to compare binary string (normalized encoded sort keys called 'prefix') when sorting. This PR adds 'extract rows to prefix' and 'sort' for fixed width types (int32, int64, float, double, timestamp etc.). More types will be added in a follow-up. Add benchmark to compare std::sort vs. prefix sort using 1, 2, 3, and 4 bigint sorting keys. The performance of sorting up to 1000 rows is the same. When sorting more than 1K rows prefix-sort is faster, the gains increase as the number of rows sorted and the number of sorting keys increases. The presence of the payload columns doesn't affect the performance.

```
============================================================================
[...]ec/benchmarks/PrefixSortBenchmark.cpp  relative  time/iter  iters/s
============================================================================
StdSort_no-payload_1_bigint_0.01k    49.40ns   20.24M
PrefixSort    100.00%   49.40ns   20.24M
StdSort_no-payload_2_bigint_0.01k    87.85ns   11.38M
PrefixSort    99.904%   87.94ns   11.37M
StdSort_no-payload_3_bigint_0.01k    65.04ns   15.37M
PrefixSort    99.350%   65.47ns   15.27M
StdSort_no-payload_4_bigint_0.01k    69.19ns   14.45M
PrefixSort    99.566%   69.49ns   14.39M
StdSort_no-payload_1_bigint_0.015k   47.93ns   20.86M
PrefixSort    100.02%   47.92ns   20.87M
StdSort_no-payload_2_bigint_0.015k   54.22ns   18.44M
PrefixSort    99.921%   54.26ns   18.43M
StdSort_no-payload_3_bigint_0.015k   61.00ns   16.39M
PrefixSort    99.958%   61.02ns   16.39M
StdSort_no-payload_4_bigint_0.015k   57.38ns   17.43M
PrefixSort    99.870%   57.46ns   17.40M
StdSort_no-payload_1_bigint_0.02k    47.82ns   20.91M
PrefixSort    99.914%   47.86ns   20.90M
StdSort_no-payload_2_bigint_0.02k    83.94ns   11.91M
PrefixSort    100.00%   83.93ns   11.91M
StdSort_no-payload_3_bigint_0.02k    128.43ns  7.79M
PrefixSort    100.17%   128.21ns  7.80M
StdSort_no-payload_4_bigint_0.02k    165.29ns  6.05M
PrefixSort    99.997%   165.30ns  6.05M
StdSort_no-payload_1_bigint_0.05k    77.34ns   12.93M
PrefixSort    99.425%   77.79ns   12.86M
StdSort_no-payload_2_bigint_0.05k    113.79ns  8.79M
PrefixSort    99.786%   114.03ns  8.77M
StdSort_no-payload_3_bigint_0.05k    152.81ns  6.54M
PrefixSort    99.696%   153.27ns  6.52M
StdSort_no-payload_4_bigint_0.05k    185.93ns  5.38M
PrefixSort    100.02%   185.90ns  5.38M
StdSort_no-payload_1_bigint_0.1k     87.39ns   11.44M
PrefixSort    99.499%   87.83ns   11.39M
StdSort_no-payload_2_bigint_0.1k     139.93ns  7.15M
PrefixSort    162.24%   86.25ns   11.59M
StdSort_no-payload_3_bigint_0.1k     186.27ns  5.37M
PrefixSort    186.72%   99.76ns   10.02M
StdSort_no-payload_4_bigint_0.1k     234.01ns  4.27M
PrefixSort    187.97%   124.49ns  8.03M
StdSort_no-payloads_1_bigint_1k      173.31ns  5.77M
PrefixSort    136.72%   126.76ns  7.89M
StdSort_no-payloads_2_bigint_1k      249.77ns  4.00M
PrefixSort    199.49%   125.20ns  7.99M
StdSort_no-payloads_3_bigint_1k      314.18ns  3.18M
PrefixSort    219.49%   143.14ns  6.99M
StdSort_no-payloads_4_bigint_1k      348.38ns  2.87M
PrefixSort    203.28%   171.38ns  5.84M
StdSort_no-payloads_1_bigint_10k     251.90ns  3.97M
PrefixSort    165.99%   151.76ns  6.59M
StdSort_no-payloads_2_bigint_10k     363.09ns  2.75M
PrefixSort    253.07%   143.47ns  6.97M
StdSort_no-payloads_3_bigint_10k     483.58ns  2.07M
PrefixSort    293.67%   164.67ns  6.07M
StdSort_no-payloads_4_bigint_10k     593.29ns  1.69M
PrefixSort    312.83%   189.65ns  5.27M
StdSort_no-payloads_1_bigint_100k    330.44ns  3.03M
PrefixSort    192.57%   171.59ns  5.83M
StdSort_no-payloads_2_bigint_100k    470.79ns  2.12M
PrefixSort    293.67%   160.31ns  6.24M
StdSort_no-payloads_3_bigint_100k    607.15ns  1.65M
PrefixSort    303.88%   199.80ns  5.01M
StdSort_no-payloads_4_bigint_100k    706.03ns  1.42M
PrefixSort    315.15%   224.03ns  4.46M
StdSort_no-payloads_1_bigint_1000k   452.05ns  2.21M
PrefixSort    204.92%   220.60ns  4.53M
StdSort_no-payloads_2_bigint_1000k   645.35ns  1.55M
PrefixSort    306.42%   210.61ns  4.75M
StdSort_no-payloads_3_bigint_1000k   818.78ns  1.22M
PrefixSort    328.01%   249.62ns  4.01M
StdSort_no-payloads_4_bigint_1000k   981.65ns  1.02M
PrefixSort    343.79%   285.54ns  3.50M
StdSort_2payloads_1_bigint_1k        177.21ns  5.64M
PrefixSort    139.94%   126.63ns  7.90M
StdSort_2payloads_2_bigint_1k        248.46ns  4.02M
PrefixSort    199.08%   124.80ns  8.01M
StdSort_2payloads_3_bigint_1k        313.66ns  3.19M
PrefixSort    218.48%   143.56ns  6.97M
StdSort_2payloads_4_bigint_1k        359.17ns  2.78M
PrefixSort    208.57%   172.21ns  5.81M
StdSort_2payloads_1_bigint_10k       254.83ns  3.92M
PrefixSort    168.28%   151.43ns  6.60M
StdSort_2payloads_2_bigint_10k       363.92ns  2.75M
PrefixSort    254.35%   143.08ns  6.99M
StdSort_2payloads_3_bigint_10k       475.61ns  2.10M
PrefixSort    288.96%   164.60ns  6.08M
StdSort_2payloads_4_bigint_10k       594.78ns  1.68M
PrefixSort    314.12%   189.35ns  5.28M
StdSort_2payloads_1_bigint_100k      349.99ns  2.86M
PrefixSort    205.28%   170.49ns  5.87M
StdSort_2payloads_2_bigint_100k      489.03ns  2.04M
PrefixSort    307.77%   158.89ns  6.29M
StdSort_2payloads_3_bigint_100k      607.88ns  1.65M
PrefixSort    305.62%   198.90ns  5.03M
StdSort_2payloads_4_bigint_100k      715.90ns  1.40M
PrefixSort    321.73%   222.52ns  4.49M
StdSort_2payloads_1_bigint_1000k     574.14ns  1.74M
PrefixSort    262.05%   219.10ns  4.56M
StdSort_2payloads_2_bigint_1000k     796.05ns  1.26M
PrefixSort    377.73%   210.75ns  4.75M
StdSort_2payloads_3_bigint_1000k     1.02us    975.65K
PrefixSort    411.02%   249.37ns  4.01M
StdSort_2payloads_4_bigint_1000k     1.16us    858.48K
PrefixSort    408.37%   285.24ns  3.51M
StdSort_2payloads_1_bigint_0.01k     49.37ns   20.26M
PrefixSort    99.124%   49.81ns   20.08M
StdSort_2payloads_2_bigint_0.01k     89.08ns   11.23M
PrefixSort    99.958%   89.12ns   11.22M
StdSort_2payloads_3_bigint_0.01k     64.40ns   15.53M
PrefixSort    99.991%   64.41ns   15.53M
StdSort_2payloads_4_bigint_0.01k     86.56ns   11.55M
PrefixSort    100.34%   86.26ns   11.59M
StdSort_2payloads_1_bigint_0.015k    48.17ns   20.76M
PrefixSort    100.11%   48.12ns   20.78M
StdSort_2payloads_2_bigint_0.015k    55.44ns   18.04M
PrefixSort    99.994%   55.45ns   18.03M
StdSort_2payloads_3_bigint_0.015k    61.17ns   16.35M
PrefixSort    99.988%   61.18ns   16.35M
StdSort_2payloads_4_bigint_0.015k    57.55ns   17.38M
PrefixSort    99.895%   57.61ns   17.36M
StdSort_2payloads_1_bigint_0.02k     47.93ns   20.86M
PrefixSort    99.916%   47.97ns   20.85M
StdSort_2payloads_2_bigint_0.02k     84.10ns   11.89M
PrefixSort    100.38%   83.78ns   11.94M
StdSort_2payloads_3_bigint_0.02k     126.53ns  7.90M
PrefixSort    100.05%   126.47ns  7.91M
StdSort_2payloads_4_bigint_0.02k     164.44ns  6.08M
PrefixSort    99.935%   164.55ns  6.08M
StdSort_2payloads_1_bigint_0.05k     77.86ns   12.84M
PrefixSort    99.171%   78.51ns   12.74M
StdSort_2payloads_2_bigint_0.05k     118.10ns  8.47M
PrefixSort    100.53%   117.48ns  8.51M
StdSort_2payloads_3_bigint_0.05k     152.74ns  6.55M
PrefixSort    100.02%   152.71ns  6.55M
StdSort_2payloads_4_bigint_0.05k     184.56ns  5.42M
PrefixSort    99.925%   184.70ns  5.41M
StdSort_2payloads_1_bigint_0.1k      88.01ns   11.36M
PrefixSort    100.46%   87.60ns   11.42M
StdSort_2payloads_2_bigint_0.1k      138.22ns  7.24M
PrefixSort    159.92%   86.43ns   11.57M
StdSort_2payloads_3_bigint_0.1k      187.49ns  5.33M
PrefixSort    187.96%   99.75ns   10.03M
StdSort_2payloads_4_bigint_0.1k      232.52ns  4.30M
PrefixSort    188.98%   123.04ns  8.13M
StdSort_no-payloads_1_varchar_1k     292.32ns  3.42M
PrefixSort    100.47%   290.96ns  3.44M
StdSort_no-payloads_2_varchar_1k     380.98ns  2.62M
PrefixSort    98.623%   386.29ns  2.59M
StdSort_no-payloads_3_varchar_1k     456.15ns  2.19M
PrefixSort    98.302%   464.03ns  2.16M
StdSort_no-payloads_4_varchar_1k     520.84ns  1.92M
PrefixSort    98.186%   530.46ns  1.89M
StdSort_no-payloads_1_varchar_10k    422.83ns  2.37M
PrefixSort    99.186%   426.30ns  2.35M
StdSort_no-payloads_2_varchar_10k    495.10ns  2.02M
PrefixSort    98.218%   504.08ns  1.98M
StdSort_no-payloads_3_varchar_10k    584.89ns  1.71M
PrefixSort    99.079%   590.33ns  1.69M
StdSort_no-payloads_4_varchar_10k    667.37ns  1.50M
PrefixSort    98.887%   674.88ns  1.48M
StdSort_no-payloads_1_varchar_100k   605.27ns  1.65M
PrefixSort    99.425%   608.78ns  1.64M
StdSort_no-payloads_2_varchar_100k   741.11ns  1.35M
PrefixSort    99.107%   747.78ns  1.34M
StdSort_no-payloads_3_varchar_100k   890.60ns  1.12M
PrefixSort    99.089%   898.78ns  1.11M
StdSort_no-payloads_4_varchar_100k   1.11us    903.14K
PrefixSort    104.50%   1.06us    943.76K
StdSort_no-payloads_1_varchar_1000k  1.22us    822.83K
PrefixSort    99.534%   1.22us    818.99K
StdSort_no-payloads_2_varchar_1000k  1.52us    656.78K
PrefixSort    99.353%   1.53us    652.53K
StdSort_no-payloads_3_varchar_1000k  1.78us    560.23K
PrefixSort    98.862%   1.81us    553.86K
StdSort_no-payloads_4_varchar_1000k  1.93us    519.34K
PrefixSort    99.159%   1.94us    514.97K
```

Part of facebookincubator#6766

Pull Request resolved: facebookincubator#8146

Reviewed By: Yuhta

Differential Revision: D54247347

Pulled By: mbasmanova

fbshipit-source-id: b358925e6b702de5bb2eee55df24827b51268923
Summary: According to the benchmark, for data sets larger than 0.5k, PrefixSort outperforms std::sort with performance improvements ranging from approximately 250% to over 500%. Here's a summary of the benchmark results:

| Dataset Size | PrefixSort Improvement (No Payload) | PrefixSort Improvement (With Payload) |
|--------------|-------------------------------------|---------------------------------------|
| 0.5k         | 248.97% - 287.43%                   | 249.71% - 289.74%                     |
| 1k           | 214.44% - 310.92%                   | 215.03% - 315.14%                     |
| 10k          | 216.21% - 255.38%                   | 217.88% - 256.88%                     |
| 100k         | 279.81% - 318.26%                   | 284.89% - 295.21%                     |
| 1000k        | 304.36% - 351.31%                   | 454.04% - 514.28%                     |

follow-up #8350

Part of #6766

Pull Request resolved: #10946

Reviewed By: Yuhta

Differential Revision: D62373593

Pulled By: mbasmanova

fbshipit-source-id: b8594e05cc6aee736d09db1695db770b84e9d4bd
Description
We are replacing Presto in our system with Velox, but we had done a lot of in-house performance work on top of Presto, and we found a performance gap between vanilla Velox and our optimized Presto. For the sort scenario specifically, our analysis suggests the gap comes from differences in how the following are handled:
Null value check
When a null value is stored in the RowContainer in Velox, it is stored as a null flag plus a default value T(). When comparing, the null flag is read first, and only then the data.
The improvement we made is to store the min or max of T (chosen according to the compare flags) instead, so that comparisons read the data directly, eliminating the overhead of reading the null flag.
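A minimal sketch of this null-encoding idea follows; CompareFlags and encodeNullableKey are hypothetical names for illustration, not Velox's actual RowContainer API. A null is replaced by the min or max of T, chosen from the sort flags, so compare() only ever reads the value.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for the sort flags of one key.
struct CompareFlags {
  bool ascending;
  bool nullsFirst;
};

// Encode a nullable bigint sort key so that compare() can read the value
// directly, without checking a separate null flag first.
int64_t encodeNullableKey(const int64_t* value, const CompareFlags& flags) {
  if (value != nullptr) {
    return *value;
  }
  // NULL sorts to the small end of the value order exactly when the
  // nulls-first side coincides with the ascending direction.
  const bool nullIsSmallest = (flags.nullsFirst == flags.ascending);
  return nullIsSmallest ? std::numeric_limits<int64_t>::min()
                        : std::numeric_limits<int64_t>::max();
}
```

The obvious caveat, and the reason an explicit null flag is the fully general solution, is that a real value equal to the sentinel can no longer be distinguished from NULL.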
Prefix sort
In compare(), the data is read directly from the underlying data structure, and both Presto and Velox incur extra memory-access overhead there. In Presto, one compare reads data from several different columns. In Velox, the data is stored row-wise, and besides the sort keys the RowContainer also holds null flags, normalized data, and the payload. As a result, the data actually needed for sorting (the sort keys) is spread across a larger memory block than necessary, causing additional cache misses.
The optimization we did was to introduce a prefix column in PagesIndex/RowContainer, which is a vector&lt;long&gt; or vector&lt;byte[]&gt;:
In compare(), the prefixes are compared first; only if they are equal is the data read from the underlying data structure. If the prefix is selective enough, sort performance improves.
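As an illustration, here is a minimal sketch of the prefix idea. The helper names (encodePrefix, compareRows) are hypothetical, not Velox's actual PrefixSort API: each bigint key is normalized into a big-endian byte string with the sign bit flipped, so that memcmp over prefixes agrees with numeric order, and only equal prefixes fall through to the expensive row compare.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Normalize a bigint key into a big-endian byte string with the sign bit
// flipped, so unsigned byte-wise comparison (memcmp) matches signed order.
std::array<uint8_t, 8> encodePrefix(int64_t key) {
  uint64_t bits = static_cast<uint64_t>(key) ^ (1ull << 63);
  std::array<uint8_t, 8> out{};
  for (int i = 7; i >= 0; --i) {
    out[i] = static_cast<uint8_t>(bits & 0xff);
    bits >>= 8;
  }
  return out;
}

// Compare two rows by prefix first; only equal prefixes need to fall back
// to reading the full keys out of the underlying row storage.
int compareRows(int64_t a, int64_t b) {
  const auto pa = encodePrefix(a);
  const auto pb = encodePrefix(b);
  if (int c = std::memcmp(pa.data(), pb.data(), pa.size())) {
    return c;
  }
  return 0; // prefixes equal: would read the remaining data here
}
```

In a real implementation the prefix would cover one or more leading sort keys and the fallback would consult the RowContainer; here a single key stands in for both.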
In our system, we previously implemented "range sort" on top of the column-based data structure in PagesIndex (Presto). The general logic is:
Suppose there is a sort statement ORDER BY a, b, c. First sort column a. If column a contains duplicate values, the duplicates form ranges; then sort column b, but only within those ranges, and process column c in the same way.
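The steps above can be sketched over plain column vectors; this is a simplified, hypothetical rangeSort, not Presto's PagesIndex implementation. Row indices are sorted by the first key, ranges of duplicates are collected, and only those ranges are re-sorted by each subsequent key.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

using Column = std::vector<int64_t>;

// Sort row indices by successive key columns, touching only the ranges of
// rows still tied on all previous keys.
std::vector<size_t> rangeSort(const std::vector<Column>& keys) {
  const size_t numRows = keys.empty() ? 0 : keys[0].size();
  std::vector<size_t> idx(numRows);
  std::iota(idx.begin(), idx.end(), 0);
  // Ranges [begin, end) that still need sorting by the current key.
  std::vector<std::pair<size_t, size_t>> ranges{{0, numRows}};
  for (const Column& key : keys) {
    std::vector<std::pair<size_t, size_t>> next;
    for (auto [begin, end] : ranges) {
      std::sort(
          idx.begin() + begin,
          idx.begin() + end,
          [&](size_t a, size_t b) { return key[a] < key[b]; });
      // Collect runs of duplicate key values; only they need the next key.
      size_t runStart = begin;
      for (size_t i = begin + 1; i <= end; ++i) {
        if (i == end || key[idx[i]] != key[idx[runStart]]) {
          if (i - runStart > 1) {
            next.emplace_back(runStart, i);
          }
          runStart = i;
        }
      }
    }
    ranges = std::move(next);
    if (ranges.empty()) {
      break; // no ties left: later keys cannot change the order
    }
  }
  return idx;
}
```

The early exit when no ties remain is where the win comes from: highly selective leading keys mean later keys are barely touched.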
However, this algorithm is based on a column-based data structure, while RowContainer currently stores data row-based. Can RowContainer support column-based storage? To summarize the issues:
Design doc: https://docs.google.com/document/d/1wa1lbbR-bhf0eg1mSaH7JUzeG7vhwz94a6ElUTK0J8k/edit?usp=sharing