
Optimize sort #6766

Open
skadilover opened this issue Sep 27, 2023 · 38 comments
Labels
enhancement New feature or request

Comments

@skadilover
Contributor

skadilover commented Sep 27, 2023

Description

We are replacing Presto with Velox in our system, but we have done extensive in-house optimization of Presto for performance. We found a performance gap between replacing Presto directly with Velox and using our optimized Presto. For the Sort scenario specifically, our analysis suggests the gap comes from differences in the following areas:

  1. Null value check
    When a null value is stored in the RowContainer in Velox, it is stored as a null flag plus a T() default value. When comparing, the null flag is read first, and then the data.
    Our improvement is to store the min/max of T according to the compareFlags, and read the data directly when comparing, eliminating the overhead of reading the null flag.

  2. Prefix sort
    In compare(), the data is read directly from the underlying data structure, which incurs extra memory-access overhead in both Presto and Velox. In Presto, a compare reads data from different columns; in Velox, the data is stored row-wise, and besides the sort key the RowContainer also contains null flags, normalized data, and the payload. The data needed for sorting (the sort key) is therefore spread over a larger memory block than strictly required, causing additional cache misses.
    Our optimization introduces a prefix column in PagesIndex/RowContainer, which is a vector< long > or vector< byte[] >:

  1. For Byte/Integer/Bigint, the value is stored directly in the prefix.
  2. For Varchar, the first 8 bytes of the string are stored in the prefix.
  3. For multiple columns, a fixed-length byte[] stores the prefix. Byte/Integer/Bigint values are written into the prefix until the first Varchar is encountered, at which point the remaining bytes of the prefix are filled from the Varchar and the remaining sort keys are ignored.
    In compare(), the prefix is compared first; only if the prefixes are equal is the data read from the underlying data structure. If the prefix has high selectivity, sort performance improves.
  3. Range sort
    In our system, we previously implemented "range sort" on top of the column-based data structure in PagesIndex (Presto). The general logic is:
    Suppose there is a sort statement ORDER BY a, b, c. First sort column a. If column a contains duplicate values, use those duplicates to construct ranges, then sort column b only within each range, and process column c the same way.
    However, this algorithm relies on a column-based data structure, while RowContainer currently stores data row-wise. Could RowContainer support column-based storage?
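As a standalone illustration (the `Row` struct and `rangeSort` helper are hypothetical, not the PagesIndex implementation), the range-sort procedure just described can be sketched as:

```cpp
#include <algorithm>
#include <cassert>
#include <tuple>
#include <vector>

// Sketch of "range sort" for ORDER BY a, b, c: sort by the leading key first,
// then sort only within each run of duplicate leading-key values.
struct Row {
  int a;
  int b;
  int c;
};

inline void rangeSort(std::vector<Row>& rows) {
  // First sort the whole array by the leading key 'a'.
  std::sort(rows.begin(), rows.end(),
            [](const Row& l, const Row& r) { return l.a < r.a; });
  // Then, within each range of equal 'a' values, sort by the remaining keys.
  // (A full implementation would recurse per column; tying b and c together
  // here gives the same final order.)
  auto begin = rows.begin();
  while (begin != rows.end()) {
    auto end = std::find_if(begin, rows.end(),
                            [&](const Row& r) { return r.a != begin->a; });
    std::sort(begin, end, [](const Row& l, const Row& r) {
      return std::tie(l.b, l.c) < std::tie(r.b, r.c);
    });
    begin = end;
  }
}
```

The payoff is that the secondary sorts touch only the small ranges of duplicates rather than the whole dataset.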

To summarize the issues:

  1. Are the above improvements to Sort reasonable?
  2. Currently, Velox's sort implementation is simply std::sort. Does Velox plan to optimize sort performance in the future?
  3. In practice, how can we quickly incorporate these optimizations into Velox?

Design doc: https://docs.google.com/document/d/1wa1lbbR-bhf0eg1mSaH7JUzeG7vhwz94a6ElUTK0J8k/edit?usp=sharing

@skadilover skadilover added the enhancement New feature or request label Sep 27, 2023
@mbasmanova
Contributor

CC: @oerling @pedroerp

@mbasmanova
Contributor

Hi @skadilover, welcome to the community. Would you introduce yourself? Where are you from?

The optimizations you described above sound interesting. Are you interested in trying them out, iterating and eventually incorporating them into Velox?

@skadilover
Contributor Author

skadilover commented Sep 28, 2023

@mbasmanova
I am an engineer from Alibaba, we are trying to use Velox to speed up Presto jobs.
Very glad that the community is interested in these ideas. Add something that was not mentioned in the above description:

  1. The core idea of 'prefix sort' is similar to the 'binary string comparison' described in the DuckDB article (https://duckdb.org/2021/08/27/external-sorting.html#binary-string-comparison): 'It encodes all columns in the ORDER BY clause into a single binary sequence'; in our case this is called the 'prefix'.
  2. Based on "remove null check" and "prefix sort", I have completed a prototype. In a single-column BIGINT scenario, an E2E test of the sort operator showed a 40% performance improvement. I am organizing the code in the hope of integrating it into the community.
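For a single BIGINT key, the 'binary string comparison' idea can be sketched as follows (a standalone illustration of the encoding described in the DuckDB article, not the prototype's actual code):

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Encode a signed 64-bit key so that an unsigned byte-wise memcmp of the
// encodings orders values the same way as the signed comparison:
// flip the sign bit (so negatives sort below positives under unsigned
// compare) and store the bytes big-endian (so the most significant byte
// is compared first).
inline std::array<uint8_t, 8> encodeBigint(int64_t value) {
  uint64_t bits = static_cast<uint64_t>(value) ^ (1ull << 63);  // flip sign bit
  std::array<uint8_t, 8> out;
  for (int i = 0; i < 8; ++i) {
    out[i] = static_cast<uint8_t>(bits >> (56 - 8 * i));  // big-endian bytes
  }
  return out;
}
```

Multiple keys can then be concatenated into one binary sequence and compared with a single memcmp.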

Finally, I currently need some help and guidance on how to benchmark the sort operator. I have not found any examples of benchmarking an operator. Can anyone give me some advice?

@mbasmanova
Contributor

I am an engineer from Alibaba, we are trying to use Velox to speed up Presto jobs.

Welcome. What's your name? I only see GitHub handle skadilover.

I'm curious whether the optimizations to Presto you mentioned earlier are available in the prestodb repo or if these are internal enhancements.

@mbasmanova
Contributor

Based on "remove null check" and "prefix sort", I have completed a prototype. In a single-column BIGINT scenario, an E2E test of the sort operator showed a 40% performance improvement. I am organizing the code in the hope of integrating it into the community.

Fantastic. Looking forward to the PR.

@mbasmanova
Contributor

Finally, I currently need some help and guidance on how to benchmark the sort operator. I have not found any examples of benchmarking an operator. Can anyone give me some advice?

Check out velox/functions/prestosql/aggregates/benchmarks/ReduceAgg.cpp

This is a benchmark for the reduce_agg aggregate function; it benchmarks a mini query (table scan + aggregation). I suggest you write a similar one to benchmark table scan + order-by. One may object that this doesn't benchmark order-by in isolation, but that's OK. After all, there is no point speeding up a portion of a query that isn't using a significant amount of CPU time to begin with. E2E query speedup is what we are after.
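While setting up a folly-based Velox benchmark like the one above, a quick standalone harness can give a rough local signal (a sketch with plain std::chrono; `timeSortMillis` is a hypothetical helper, not a Velox or folly API):

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <random>
#include <vector>

// Roughly time std::sort over n random int64 keys. A real Velox benchmark
// would drive a table scan + order-by plan through folly::Benchmark, as
// suggested above; this only gives a quick sanity signal while iterating.
inline double timeSortMillis(size_t n) {
  std::mt19937_64 rng(42);  // fixed seed for reproducible input
  std::vector<int64_t> data(n);
  for (auto& v : data) {
    v = static_cast<int64_t>(rng());
  }
  auto start = std::chrono::steady_clock::now();
  std::sort(data.begin(), data.end());
  auto end = std::chrono::steady_clock::now();
  assert(std::is_sorted(data.begin(), data.end()));
  return std::chrono::duration<double, std::milli>(end - start).count();
}
```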

Hope this helps.

@mbasmanova
Contributor

The core idea of 'prefix sort' is similar to the 'binary string comparison' described in the DuckDB article (https://duckdb.org/2021/08/27/external-sorting.html#binary-string-comparison): 'It encodes all columns in the ORDER BY clause into a single binary sequence'; in our case this is called the 'prefix'.

This makes a lot of sense. Excited to see this optimization coming to Velox.

CC: @amitkdutta @aditi-pandit

@skadilover
Contributor Author

I am an engineer from Alibaba, we are trying to use Velox to speed up Presto jobs.

Welcome. What's your name? I only see GitHub handle skadilover.

My name is Heng Jiang.

I'm curious whether the optimizations to Presto you mentioned earlier are available in the prestodb repo or if these are internal enhancements.

Not available in the prestodb repo, these are internal enhancements.

@skadilover
Contributor Author

Fantastic. Looking forward to the PR.

Next week is our National Day holiday (10 days). After the holiday, I will continue to organize the code and submit the PR as soon as possible.

@mbasmanova
Contributor

@skadilover Nice to meet you, Heng. Enjoy the holiday. Looking forward to hearing again from you in a couple of weeks.

@zhouyuan
Contributor

zhouyuan commented Oct 11, 2023

@mbasmanova @skadilover
here's one related patch replacing std::sort with TimSort:
#6745
TimSort behaves very well on real-world data; here's a good intro: https://hackernoon.com/timsort-the-fastest-sorting-algorithm-youve-never-heard-of-36b28417f399
We got up to ~2x performance on customer workloads.

thanks, -yuan

@pedroerp
Contributor

pedroerp commented Oct 23, 2023

Hi guys, sorry about the delay to follow up on this thread. I've recently helped review this great paper about sorting optimization (which eventually got published at TODS); linking it here for folks who are interested in the subject:

https://arxiv.org/pdf/2209.08420.pdf

This sounds all very interesting. It would be super helpful if as a first step we established a microbenchmark to be able to tweak and reproduce these optimizations.

A few specific questions:

  1. Have you guys also looked at offset-value coding? Reading the paper, it seems like a great idea, but I have never seen a real system implementing it - the Google folks on the paper claim they use it internally though.

  2. "The improvement we made is that we use the min/max of T according to compareFlags when storing, "
    How does this work if you actually have the min/max values as part of the data?

  3. "Prefix sort" sounds like the same idea as what we have implemented for StringViews. Does that not help in this situation, or do we only need a similar optimization for other variable-length types?

@skadilover
Contributor Author

skadilover commented Oct 25, 2023

Hi guys, sorry about the delay to follow up on this thread. I've recently helped review this great paper about sorting optimization (which eventually got published at TODS); linking it here for folks who are interested in the subject:

https://arxiv.org/pdf/2209.08420.pdf

@pedroerp Glad to hear your response to this issue! The paper mentioned above is very useful.

PrefixSort is not just a normalized-key technique: unlike Velox's existing storage structure (the RowContainer), it encodes the sort keys separately, as DuckDB does. At the same time, to avoid the swap method degrading performance when the sort keys are particularly long, we only encode the first few columns as the prefix.

To facilitate communication, I submitted PR #7230 (it is not 100% ready yet; some code will be adjusted based on my E2E test results), and the design document is also in preparation (it will take a few days). If you have time, could you help me take a look at this PR first? @mbasmanova @pedroerp

@skadilover
Contributor Author

skadilover commented Oct 25, 2023

  1. Have you guys also looked at offset-value coding? Reading the paper, it seems like a great idea, but I have never seen a real system implementing it - the Google folks on the paper claim they use it internally though.

@pedroerp Yes, for some specific columns we may adopt the offset-value encoding format, but that work may come later. I hope to optimize the single-column bigint/varchar scenarios first, especially radix sort in the bigint scenario.

  1. "The improvement we made is that we use the min/max of T according to compareFlags when storing, "
    How does this work if you actually have the min/max values as part of the data?

Directly store min/max instead of T() (what velox does now)
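A standalone sketch of this substitution (hypothetical helper, not Velox code). Note the caveat raised in the question: a sentinel cannot be distinguished from a genuine min/max value appearing in the data, which is why the thread later moves to a separate null byte in the encoding:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Instead of storing a null flag plus T() and checking the flag on every
// compare, store a sentinel chosen from the compareFlags so that a plain
// value comparison already orders nulls correctly:
//  - ascending nulls-first: null must sort before every value -> INT64_MIN
//  - ascending nulls-last:  null must sort after every value  -> INT64_MAX
// Caveat: this conflates null with a real INT64_MIN/INT64_MAX in the data.
inline int64_t encodeNullable(std::optional<int64_t> v, bool nullsFirst) {
  if (v.has_value()) {
    return *v;
  }
  return nullsFirst ? std::numeric_limits<int64_t>::min()
                    : std::numeric_limits<int64_t>::max();
}
```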

  1. "Prefix sort" this sounds like the the same idea as what we have implemented for StringViews. Does that not help in this situation, or we only need a similar optimization for other variable-length types?

As with question 1, I think this is an encoding issue. I will discuss it case by case in the subsequent PRs.

@skadilover
Contributor Author

This sounds all very interesting. It would be super helpful if as a first step we established a microbenchmark to be able to tweak and reproduce these optimizations.

@pedroerp
At present, I can only verify the optimization results through E2E testing (our framework is similar to presto-cpp). The sort optimization results are affected by the size of the data set; I am currently using a 1TB TPC-H table. Regarding building a similar micro-benchmark, could you give me some reference examples? That would be very helpful for my current work.

@pedroerp
Contributor

Regarding building similar scenarios microbenchmark, could you give me some reference examples? That would be very helpful in my current job

This is a good example. Maybe come up with something similar that generates and sorts a dataset; then we could experiment with your optimizations and see how much they improve performance?

https://github.com/facebookincubator/velox/blob/main/velox/exec/benchmarks/MergeBenchmark.cpp

@skadilover
Contributor Author

Regarding building similar scenarios microbenchmark, could you give me some reference examples? That would be very helpful in my current job

This is a good example. Maybe come up with something similar that generates and sorts a dataset; then we could experiment with your optimizations and see how much they improve performance?

https://github.com/facebookincubator/velox/blob/main/velox/exec/benchmarks/MergeBenchmark.cpp

Thanks a lot. This example looks like it could work; I will try it soon.

@mbasmanova
Contributor

@skadilover Wondering if you had a chance to review Arrow Row format: https://docs.rs/arrow-row/latest/arrow_row/index.html

"Rows are normalized for sorting, and can therefore be very efficiently compared, using memcmp under the hood, or used in non-comparison sorts such as radix sort. This makes the row format ideal for implementing efficient multi-column sorting, grouping, aggregation, windowing and more, as described in more detail in this blog post."

@mbasmanova
Contributor

Some additional references: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1080&rep=rep1&type=pdf

@skadilover
Contributor Author

skadilover commented Oct 30, 2023

Some additional references: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1080&rep=rep1&type=pdf

@mbasmanova Thanks for the references above.

The encoding techniques (including "normalized key", "memory compare", etc.) can be very efficient, and we implemented them in our internal Presto before. In the PrefixSort optimization, I hope to implement them step by step. This week I plan to prepare a design document (including some research E2E tests and benchmarks) to describe my plan in detail.

@mbasmanova
Contributor

@skadilover Looking forward to hearing more from you on this topic. Thanks.

@skadilover
Contributor Author

skadilover commented Nov 8, 2023

Overview

PrefixSort's optimization plan includes two optimization points: normalized-key optimization and CPU-cache optimization.

Normalized Key For Multi Sort Keys

The core idea of the normalized key is: assuming the sort keys can be encoded into a binary string while the sorting semantics stay unchanged, all sort keys can be treated as a single whole for sorting (see [1]); memcmp can then also be accelerated with SIMD. In the implementation of the submitted PR #7230, in the two-BIGINT-column scenario, we get a performance improvement of 200%-300%.
image
The benchmark test results are as follows:
image

PrefixSort currently supports BIGINT, and our research so far indicates that the other fixed-length types can be implemented as well. The VARCHAR scenario is a little more complicated. We also studied DuckDB's implementation, which uses a somewhat special method: range sort is used to process varchar. Further research and testing of the encoding format will be conducted to improve this part of the design:
image

Memory Data Structure

To implement the above normalized-key technique in Velox, we considered designing a new data structure to store the normalized key, for the following two reasons:

  1. The sort algorithm requires an iterator with "random access iterator" semantics, see [3][4], while RowContainer currently uses non-contiguous allocation to store row data. A random access iterator implemented on top of non-contiguous memory has performance problems and requires complex pointer mapping.
  2. Currently, the normalized-key field in RowContainer does not serve sorting and does not carry compareFlags semantic information. In addition, the handling of null values also affects the implementation of the compare method.

First, we review the current OrderBy operator implementation of velox, as shown in the figure below:
image

The general process of the implementation is:

  1. Get the row-pointer vector std::vector<char*> returnRows via the listRow() method of RowContainer. This step is needed because the sort algorithm requires a contiguous memory block on which to implement its iterator.
  2. Hand the begin/end iterators of returnRows to std::sort. The custom compare method provided to std::sort first gets the row ptr from returnRows, then uses the ptr to access the data in the RowContainer.
    From the perspective of the CPU cache, the problem with this implementation is:
    Research on sort algorithms generally assumes that memory accesses are constant-time and cheap. In engineering practice this assumption does not hold, and memory-access efficiency must be considered. Here is a classic example, see [2]:
    image

Returning to the implementation of the OrderBy operator, we can see that the compare method in the sort loop must access two memory blocks (the address vector and the RowContainer) at the same time. As the memory grows, memory-access problems significantly affect the efficiency of sort execution. The solution is to store the sort key and the row number separately. Take DuckDB as an example:
image
In this solution, the performance of the compare method improves, while the performance of the swap method decreases (swapping whole sort keys is slower than swapping only row addresses). When the number of sort columns grows beyond a certain point, the overhead of swap exceeds the benefit to compare, so a fallback mechanism is needed. The design of PrefixSort is as follows:
image
PrefixSort uses the row ptr in place of the row number. The purposes of this are:

  1. There can be a fallback mechanism when the size of the sort keys exceeds a certain value.
  2. Using the row ptr also fits the current Velox operator implementation better and facilitates the subsequent output operations.
  3. Data in RowContainer can be reused without re-storing the payload data.

In the E2E test with a single BIGINT column, we get a 30%-40% gain. It would actually be faster to use reinterpret_cast to read the values, but for a unified normalized-key implementation, memcmp is used for comparison here; for long keys, memcmp is slower than reinterpret_cast. In the implementation, DuckDB's templated memcmp is used in place of std::memcmp, which helps somewhat, but it is still slower than the reinterpret_cast implementation.
In addition, directly using std::vector to store the keys during prototype testing achieved even greater performance improvements. However, considering the maintenance cost of the overall code and the possibility of later switching the sort algorithm to radix sort, we are not currently pursuing this optimization:

image
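The entry layout and compare-with-fallback described above can be sketched as follows (a standalone illustration; `kPrefixBytes`, `SortEntry`, and the plain-int64 row stub are hypothetical, not the PR's actual types):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Each sort entry holds the normalized prefix bytes plus the original row
// pointer. compare() is a memcmp over the contiguous prefix; only on a tie
// does it dereference the row pointer back into row storage (in Velox this
// fallback would call RowContainer::compare() on the remaining keys).
constexpr size_t kPrefixBytes = 8;  // hypothetical prefix size

struct SortEntry {
  uint8_t prefix[kPrefixBytes];
  const void* rowPtr;  // points back into row storage for output / fallback
};

inline bool prefixLess(const SortEntry& l, const SortEntry& r) {
  int c = std::memcmp(l.prefix, r.prefix, kPrefixBytes);
  if (c != 0) {
    return c < 0;
  }
  // Tie on the prefix: fall back to the underlying rows (stubbed as int64_t).
  return *static_cast<const int64_t*>(l.rowPtr) <
         *static_cast<const int64_t*>(r.rowPtr);
}
```

Swapping entries moves the prefix bytes along with the pointer, which keeps compares cache-friendly at the cost of heavier swaps; this is exactly the tradeoff motivating the fallback when keys get long.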

Conclusion and plan

By optimizing the in-memory data structure and using the normalized-key technique, PrefixSort improves performance over the original implementation in both the single-column and multi-column cases. The work is planned in 4 stages to reach complete functionality.

  • [done] PR#1: (a) implement the PrefixSort framework code; (b) support BIGINT; (c) implement the benchmark testing framework
  • [ongoing] PR#2: support all basic types, including (INT/SORT/TIME type/DOUBLE...)
  • [ongoing] PR#3: support the VARCHAR type
  • [suspended] PR#4: support radix sort

How To Reproduce Optimization

Run PrefixSortBenchmark.cpp and OrderByTest.cpp.

reference:

[1] https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.1080&rep=rep1&type=pdf
[2] https://courses.cs.washington.edu/courses/cse373/21sp/lectures/lecture24.pdf
[3] https://en.cppreference.com/w/cpp/iterator/random_access_iterator
[4] https://en.cppreference.com/w/cpp/algorithm/sort

@skadilover
Contributor Author

skadilover commented Nov 8, 2023

@mbasmanova Updates on the topic above.
See also PR #7230.

@mbasmanova
Contributor

@skadilover Thank you for the detailed description of the proposed design. I have some questions to make sure my understanding is correct. Does this design involve making a copy of the sort keys and row pointers for all rows using contiguous allocation? I'm thinking about the table with 2 columns (normalized key and row ptr) in the diagram.

@skadilover
Contributor Author

@skadilover Thank you for the detailed description of the proposed design. I have some questions to make sure my understanding is correct. Does this design involve making a copy of the sort keys and row pointers for all rows using contiguous allocation? I'm thinking about the table with 2 columns (normalized key and row ptr) in the diagram.

Yes, there is a copy.

@mbasmanova
Contributor

@skadilover Got it. What are your thoughts on the limit of the normalized key size? There can be lots of columns and some columns can be variable-width, hence, I assume we need to put a hard limit on the "prefix" of the normalized key. Also, can you extend the design to support complex types as well?

@skadilover
Contributor Author

skadilover commented Nov 9, 2023

@skadilover Got it. What are your thoughts on the limit of the normalized key size? There can be lots of columns and some columns can be variable-width, hence, I assume we need to put a hard limit on the "prefix" of the normalized key.

  1. If the type of the column to be sorted can be normalized and has a small fixed size, such as BIGINT, I think there is no need to limit the size of the prefix. The reason is that, as the benchmark results show, the execution time in this scenario does not increase significantly as the number of columns grows.

  2. If only one part can be normalized, such as a StringView, the remaining parts need to be compared through pointers. From a design perspective, if the columns after that can be normalized, we could continue with memcmp after the pointer comparison, but this goes against our original intention of treating the prefix as a whole and using memcmp. In this scenario I think it is better to store only the leading normalized columns in the prefix. When VARCHAR scenarios are supported in the future, the choice will be made based on benchmark results. In the current code design, we can flexibly define the behavior through PrefixSortLayout (which holds the relevant information).
    For variable-width types we have the two choices below. I prefer plan B right now, because in real scenarios there may not be that many columns that need to be sorted:
    image

  3. In some special scenarios, our customers know the data distribution. For example, when the first sort column has high selectivity, there is no need to store the remaining sort columns in the prefix (especially when many columns remain). We hope customers can control the size of the prefix through a SQL hint, leaving room for optimization in specific scenarios; this could also be applied to some adaptive techniques.

  4. swap (unlike with only a row ptr, we swap the whole prefix), copy, and contiguous allocation may involve overhead; this also argues for a hard limit on the "prefix" of the normalized key.

Therefore we'd better provide a config option for users to set the limit on the prefix size; in our practice there is no limit by default.

Also, can you extend the design to support complex types as well?

If necessary, we can store some potentially useful sort prefixes in the prefix and use the row ptr to compare (simply calling RowContainer's compare method). As with Varchar, if a complex type cannot benefit from memcmp, I feel it is better to use only pointer comparison.

@mbasmanova

@skadilover
Contributor Author

skadilover commented Nov 9, 2023

@mbasmanova the PR closed above is not used; we still use #7230 to merge the code.

@skadilover
Contributor Author

@mbasmanova Are there any more questions about this topic? Could you help me review the code?

@mbasmanova
Contributor

@skadilover Thank you for working on this. Took a quick look and left some initial comments on the PR.

@mbasmanova
Contributor

@skadilover

Therefore we'd better provide a config option for users to set the limit on the prefix size; in our practice there is no limit by default.

I think we need to have a sensible limit by default. I'm also interested in figuring out support for varchar and row() keys as these are common.

@skadilover
Contributor Author

skadilover commented Nov 14, 2023

I think we need to have a sensible limit by default.

I will check it later.

I'm also interested in figuring out support for varchar and row() keys as these are common.

Our previous method of dealing with varchar was: 1. analyze the maximum length of the data before sorting and process according to that fixed size; 2. for values exceeding the length limit, retain only the prefix. I hope to find a better way; I am still doing the relevant research.
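The fixed-size treatment described here can be sketched as follows (a standalone illustration; `kVarcharPrefix` and `encodeVarcharPrefix` are hypothetical names, not the eventual Velox code):

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstring>
#include <string>

// Pad or truncate each string to a fixed width so the encodings can be
// compared with memcmp. Zero padding keeps shorter strings ordered before
// their extensions ("ab" < "abc"). Because longer strings are truncated,
// equal prefixes must fall back to comparing the full strings.
constexpr size_t kVarcharPrefix = 8;

inline std::array<unsigned char, kVarcharPrefix> encodeVarcharPrefix(
    const std::string& s) {
  std::array<unsigned char, kVarcharPrefix> out{};  // zero-padded
  std::memcpy(out.data(), s.data(), std::min(s.size(), kVarcharPrefix));
  return out;
}
```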

@mbasmanova

@skadilover
Contributor Author

@mbasmanova
As we discussed before, I have started to split the original PR into 3 sub-PRs, to be merged according to their dependencies. Could you help me review this PR first?

skadilover added a commit to skadilover/velox that referenced this issue Dec 14, 2023
Add PrefixSortAlgorithm and a unit test. The reason for introducing a custom quick-sort implementation rather than using std::sort directly is that the sort keys that can be normalized in 'PrefixSort' are placed in a contiguous memory block; during the run of the sorting algorithm the data itself is moved (not just the data addresses, to avoid losing data locality), and std::sort does not support custom swap methods.
Use the quick-sort algorithm described in the "Engineering a Sort Function" paper, see [1]. May explore using radix sort as a follow-up.
[1] https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.8162&rep=rep1&type=pdf.
facebook-github-bot pushed a commit that referenced this issue Dec 14, 2023
…lized keys (#7624)

Summary:
Part of #6766

1. Add PrefixSortAlgorithm and a unit test. The reason for introducing a custom quick-sort implementation rather than using std::sort directly is that the sort keys that can be normalized in 'PrefixSort' are placed in a contiguous memory block; during the run of the sorting algorithm the data itself is moved (not just the data addresses, to avoid losing data locality), and std::sort does not support custom swap methods.
2. Use the quick-sort algorithm described in the "Engineering a Sort Function" paper, see [1]. May explore using radix sort as a follow-up.

[1] https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.8162&rep=rep1&type=pdf.

Pull Request resolved: #7624

Reviewed By: xiaoxmeng

Differential Revision: D52155526

Pulled By: mbasmanova

fbshipit-source-id: fa3f59c33f14f2fcf3dc80e1589426d142aa98d7
@mbasmanova mbasmanova changed the title Sort optimization suggestions Optimize sort Jan 19, 2024
facebook-github-bot pushed a commit that referenced this issue Feb 2, 2024
Summary:
When processing the normalization encoding in PrefixSort, we cannot simply equate null with the max or min value: doing so makes null indistinguishable from the max/min value in a nulls-first scenario, so we have to add a null byte to the normalized encoding.
The changes:
1. Add CompareFlags (ascending, nullsFirst, isNull) to the encode() method
2. Add a new method encodeNoNulls to handle cases where there are no nulls; supports uint64_t/int64_t/uint32_t/int32_t/float/double/Timestamp
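The null-byte scheme can be illustrated with a standalone sketch (`encodeNullableBigint` is a hypothetical helper, not the PR's actual encode() signature):

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>
#include <optional>

// Prepend one byte that orders nulls before or after all non-null values
// depending on nullsFirst, followed by the sign-flipped big-endian value
// bytes (left as zeros when null). Unlike a min/max sentinel, this keeps
// null distinguishable from a genuine INT64_MIN/INT64_MAX in the data.
inline std::array<uint8_t, 9> encodeNullableBigint(
    std::optional<int64_t> v,
    bool nullsFirst) {
  std::array<uint8_t, 9> out{};
  if (!v.has_value()) {
    out[0] = nullsFirst ? 0 : 2;  // null sorts first (0) or last (2)
    return out;
  }
  out[0] = 1;  // every non-null value sorts between the two null sentinels
  uint64_t bits = static_cast<uint64_t>(*v) ^ (1ull << 63);  // flip sign bit
  for (int i = 0; i < 8; ++i) {
    out[1 + i] = static_cast<uint8_t>(bits >> (56 - 8 * i));  // big-endian
  }
  return out;
}
```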

Design doc: https://docs.google.com/document/d/1wa1lbbR-bhf0eg1mSaH7JUzeG7vhwz94a6ElUTK0J8k/edit?usp=sharing

Part of #6766

Pull Request resolved: #8350

Reviewed By: pedroerp

Differential Revision: D53348718

Pulled By: mbasmanova

fbshipit-source-id: 6f7887fb9d09b17af6a786de82fc00e116156a62
mbasmanova pushed a commit to mbasmanova/velox-1 that referenced this issue Feb 27, 2024
…cubator#8146)

Summary:
Add PrefixSort and unit tests & benchmarks for it:
1. PrefixSort is used to improve in-memory sort performance, using memcmp to compare binary strings (normalized encoded sort keys, called the 'prefix') when sorting. This PR adds 'extract rows to prefix' and 'sort' for fixed-width types (int32, int64, float, double, timestamp, etc.) as a basic PR; more types will be added as a follow-up.
2. As basic work, this PR adds unit tests & benchmarks for scalar types, including single/multi keys, fuzz cases, etc.

3. Benchmarks, std-sort vs prefix-sort:
```
============================================================================
[...]ec/benchmarks/PrefixSortBenchmark.cpp     relative  time/iter   iters/s
============================================================================
StdSort_Bigint100k                                         34.30ms     29.16
PrefixSort_Bigint100k                                      27.16ms     36.81
StdSort_Bigint1M                                          465.00ms      2.15
PrefixSort_Bigint1M                                       330.97ms      3.02
StdSort_Bigint10M                                            7.96s   125.68m
PrefixSort_Bigint10M                                         3.87s   258.28m
StdSort_Varchar100k                                        62.02ms     16.12
PrefixSort_Varchar100k                                     61.64ms     16.22
StdSort_Varchar1M                                            1.31s   763.74m
PrefixSort_Varchar1M                                         1.27s   785.51m
StdSort_Varchar10M                                          23.06s    43.36m
PrefixSort_Varchar10M                                       22.67s    44.11m
StdSort_BigintBigint100k                                   48.61ms     20.57
PrefixSort_BigintBigint100k                                30.70ms     32.58
StdSort_BigintBigint1M                                    631.11ms      1.58
PrefixSort_BigintBigint1M                                 379.40ms      2.64
StdSort_BigintBigint10M                                     11.38s    87.86m
PrefixSort_BigintBigint10M                                   4.42s   226.42m
StdSort_BigintVarchar100k                                  65.36ms     15.30
PrefixSort_BigintVarchar100k                               56.52ms     17.69
StdSort_BigintVarchar1M                                      1.29s   777.33m
PrefixSort_BigintVarchar1M                                   1.11s   901.87m
StdSort_BigintVarchar10M                                    22.03s    45.39m
PrefixSort_BigintVarchar10M                                 18.00s    55.56m
```

For the cases where all keys can be normalized (Bigint, BigintBigint), once the row count exceeds 1M there is an obvious optimization effect, roughly 1x-2x faster.

For the case that cannot be normalized (Varchar), the binary-string optimization is skipped, so performance stays the same.

One more thing: we use std::memcmp in this PR. If the cost is worth it, we can align the prefix buffer to 4 or 8 bytes so that we can use int or long compares, and the gains should be even more pronounced.

Part of facebookincubator#6766

Pull Request resolved: facebookincubator#8146

Differential Revision: D54247347

Pulled By: mbasmanova
facebook-github-bot pushed a commit that referenced this issue Mar 13, 2024
Summary:
PrefixSort is used to improve in-memory sort performance, using memcmp to compare binary string (normalized encoded sort keys called 'prefix') when sorting. This PR adds 'extract rows to prefix'  and 'sort' for fixed width types (int32, int64, float, double, timestamp etc.). More types will be added in a follow-up.

Add benchmark to compare std::sort vs. prefix sort using 1, 2, 3, and 4 bigint sorting keys.

The performance of sorting up to 1000 rows is the same. When sorting more than 1K rows, prefix sort is faster, and the gains increase with the number of rows sorted and the number of sorting keys. The presence of payload columns doesn't affect performance.

```
============================================================================
[...]ec/benchmarks/PrefixSortBenchmark.cpp     relative  time/iter   iters/s
============================================================================
StdSort_no-payload_1_bigint_0.01k                          49.40ns    20.24M
PrefixSort                                      100.00%    49.40ns    20.24M
StdSort_no-payload_2_bigint_0.01k                          87.85ns    11.38M
PrefixSort                                      99.904%    87.94ns    11.37M
StdSort_no-payload_3_bigint_0.01k                          65.04ns    15.37M
PrefixSort                                      99.350%    65.47ns    15.27M
StdSort_no-payload_4_bigint_0.01k                          69.19ns    14.45M
PrefixSort                                      99.566%    69.49ns    14.39M
StdSort_no-payload_1_bigint_0.015k                         47.93ns    20.86M
PrefixSort                                      100.02%    47.92ns    20.87M
StdSort_no-payload_2_bigint_0.015k                         54.22ns    18.44M
PrefixSort                                      99.921%    54.26ns    18.43M
StdSort_no-payload_3_bigint_0.015k                         61.00ns    16.39M
PrefixSort                                      99.958%    61.02ns    16.39M
StdSort_no-payload_4_bigint_0.015k                         57.38ns    17.43M
PrefixSort                                      99.870%    57.46ns    17.40M
StdSort_no-payload_1_bigint_0.02k                          47.82ns    20.91M
PrefixSort                                      99.914%    47.86ns    20.90M
StdSort_no-payload_2_bigint_0.02k                          83.94ns    11.91M
PrefixSort                                      100.00%    83.93ns    11.91M
StdSort_no-payload_3_bigint_0.02k                         128.43ns     7.79M
PrefixSort                                      100.17%   128.21ns     7.80M
StdSort_no-payload_4_bigint_0.02k                         165.29ns     6.05M
PrefixSort                                      99.997%   165.30ns     6.05M
StdSort_no-payload_1_bigint_0.05k                          77.34ns    12.93M
PrefixSort                                      99.425%    77.79ns    12.86M
StdSort_no-payload_2_bigint_0.05k                         113.79ns     8.79M
PrefixSort                                      99.786%   114.03ns     8.77M
StdSort_no-payload_3_bigint_0.05k                         152.81ns     6.54M
PrefixSort                                      99.696%   153.27ns     6.52M
StdSort_no-payload_4_bigint_0.05k                         185.93ns     5.38M
PrefixSort                                      100.02%   185.90ns     5.38M
StdSort_no-payload_1_bigint_0.1k                           87.39ns    11.44M
PrefixSort                                      99.499%    87.83ns    11.39M
StdSort_no-payload_2_bigint_0.1k                          139.93ns     7.15M
PrefixSort                                      162.24%    86.25ns    11.59M
StdSort_no-payload_3_bigint_0.1k                          186.27ns     5.37M
PrefixSort                                      186.72%    99.76ns    10.02M
StdSort_no-payload_4_bigint_0.1k                          234.01ns     4.27M
PrefixSort                                      187.97%   124.49ns     8.03M
StdSort_no-payloads_1_bigint_1k                           173.31ns     5.77M
PrefixSort                                      136.72%   126.76ns     7.89M
StdSort_no-payloads_2_bigint_1k                           249.77ns     4.00M
PrefixSort                                      199.49%   125.20ns     7.99M
StdSort_no-payloads_3_bigint_1k                           314.18ns     3.18M
PrefixSort                                      219.49%   143.14ns     6.99M
StdSort_no-payloads_4_bigint_1k                           348.38ns     2.87M
PrefixSort                                      203.28%   171.38ns     5.84M
StdSort_no-payloads_1_bigint_10k                          251.90ns     3.97M
PrefixSort                                      165.99%   151.76ns     6.59M
StdSort_no-payloads_2_bigint_10k                          363.09ns     2.75M
PrefixSort                                      253.07%   143.47ns     6.97M
StdSort_no-payloads_3_bigint_10k                          483.58ns     2.07M
PrefixSort                                      293.67%   164.67ns     6.07M
StdSort_no-payloads_4_bigint_10k                          593.29ns     1.69M
PrefixSort                                      312.83%   189.65ns     5.27M
StdSort_no-payloads_1_bigint_100k                         330.44ns     3.03M
PrefixSort                                      192.57%   171.59ns     5.83M
StdSort_no-payloads_2_bigint_100k                         470.79ns     2.12M
PrefixSort                                      293.67%   160.31ns     6.24M
StdSort_no-payloads_3_bigint_100k                         607.15ns     1.65M
PrefixSort                                      303.88%   199.80ns     5.01M
StdSort_no-payloads_4_bigint_100k                         706.03ns     1.42M
PrefixSort                                      315.15%   224.03ns     4.46M
StdSort_no-payloads_1_bigint_1000k                        452.05ns     2.21M
PrefixSort                                      204.92%   220.60ns     4.53M
StdSort_no-payloads_2_bigint_1000k                        645.35ns     1.55M
PrefixSort                                      306.42%   210.61ns     4.75M
StdSort_no-payloads_3_bigint_1000k                        818.78ns     1.22M
PrefixSort                                      328.01%   249.62ns     4.01M
StdSort_no-payloads_4_bigint_1000k                        981.65ns     1.02M
PrefixSort                                      343.79%   285.54ns     3.50M
StdSort_2payloads_1_bigint_1k                             177.21ns     5.64M
PrefixSort                                      139.94%   126.63ns     7.90M
StdSort_2payloads_2_bigint_1k                             248.46ns     4.02M
PrefixSort                                      199.08%   124.80ns     8.01M
StdSort_2payloads_3_bigint_1k                             313.66ns     3.19M
PrefixSort                                      218.48%   143.56ns     6.97M
StdSort_2payloads_4_bigint_1k                             359.17ns     2.78M
PrefixSort                                      208.57%   172.21ns     5.81M
StdSort_2payloads_1_bigint_10k                            254.83ns     3.92M
PrefixSort                                      168.28%   151.43ns     6.60M
StdSort_2payloads_2_bigint_10k                            363.92ns     2.75M
PrefixSort                                      254.35%   143.08ns     6.99M
StdSort_2payloads_3_bigint_10k                            475.61ns     2.10M
PrefixSort                                      288.96%   164.60ns     6.08M
StdSort_2payloads_4_bigint_10k                            594.78ns     1.68M
PrefixSort                                      314.12%   189.35ns     5.28M
StdSort_2payloads_1_bigint_100k                           349.99ns     2.86M
PrefixSort                                      205.28%   170.49ns     5.87M
StdSort_2payloads_2_bigint_100k                           489.03ns     2.04M
PrefixSort                                      307.77%   158.89ns     6.29M
StdSort_2payloads_3_bigint_100k                           607.88ns     1.65M
PrefixSort                                      305.62%   198.90ns     5.03M
StdSort_2payloads_4_bigint_100k                           715.90ns     1.40M
PrefixSort                                      321.73%   222.52ns     4.49M
StdSort_2payloads_1_bigint_1000k                          574.14ns     1.74M
PrefixSort                                      262.05%   219.10ns     4.56M
StdSort_2payloads_2_bigint_1000k                          796.05ns     1.26M
PrefixSort                                      377.73%   210.75ns     4.75M
StdSort_2payloads_3_bigint_1000k                            1.02us   975.65K
PrefixSort                                      411.02%   249.37ns     4.01M
StdSort_2payloads_4_bigint_1000k                            1.16us   858.48K
PrefixSort                                      408.37%   285.24ns     3.51M
StdSort_2payloads_1_bigint_0.01k                           49.37ns    20.26M
PrefixSort                                      99.124%    49.81ns    20.08M
StdSort_2payloads_2_bigint_0.01k                           89.08ns    11.23M
PrefixSort                                      99.958%    89.12ns    11.22M
StdSort_2payloads_3_bigint_0.01k                           64.40ns    15.53M
PrefixSort                                      99.991%    64.41ns    15.53M
StdSort_2payloads_4_bigint_0.01k                           86.56ns    11.55M
PrefixSort                                      100.34%    86.26ns    11.59M
StdSort_2payloads_1_bigint_0.015k                          48.17ns    20.76M
PrefixSort                                      100.11%    48.12ns    20.78M
StdSort_2payloads_2_bigint_0.015k                          55.44ns    18.04M
PrefixSort                                      99.994%    55.45ns    18.03M
StdSort_2payloads_3_bigint_0.015k                          61.17ns    16.35M
PrefixSort                                      99.988%    61.18ns    16.35M
StdSort_2payloads_4_bigint_0.015k                          57.55ns    17.38M
PrefixSort                                      99.895%    57.61ns    17.36M
StdSort_2payloads_1_bigint_0.02k                           47.93ns    20.86M
PrefixSort                                      99.916%    47.97ns    20.85M
StdSort_2payloads_2_bigint_0.02k                           84.10ns    11.89M
PrefixSort                                      100.38%    83.78ns    11.94M
StdSort_2payloads_3_bigint_0.02k                          126.53ns     7.90M
PrefixSort                                      100.05%   126.47ns     7.91M
StdSort_2payloads_4_bigint_0.02k                          164.44ns     6.08M
PrefixSort                                      99.935%   164.55ns     6.08M
StdSort_2payloads_1_bigint_0.05k                           77.86ns    12.84M
PrefixSort                                      99.171%    78.51ns    12.74M
StdSort_2payloads_2_bigint_0.05k                          118.10ns     8.47M
PrefixSort                                      100.53%   117.48ns     8.51M
StdSort_2payloads_3_bigint_0.05k                          152.74ns     6.55M
PrefixSort                                      100.02%   152.71ns     6.55M
StdSort_2payloads_4_bigint_0.05k                          184.56ns     5.42M
PrefixSort                                      99.925%   184.70ns     5.41M
StdSort_2payloads_1_bigint_0.1k                            88.01ns    11.36M
PrefixSort                                      100.46%    87.60ns    11.42M
StdSort_2payloads_2_bigint_0.1k                           138.22ns     7.24M
PrefixSort                                      159.92%    86.43ns    11.57M
StdSort_2payloads_3_bigint_0.1k                           187.49ns     5.33M
PrefixSort                                      187.96%    99.75ns    10.03M
StdSort_2payloads_4_bigint_0.1k                           232.52ns     4.30M
PrefixSort                                      188.98%   123.04ns     8.13M
StdSort_no-payloads_1_varchar_1k                          292.32ns     3.42M
PrefixSort                                      100.47%   290.96ns     3.44M
StdSort_no-payloads_2_varchar_1k                          380.98ns     2.62M
PrefixSort                                      98.623%   386.29ns     2.59M
StdSort_no-payloads_3_varchar_1k                          456.15ns     2.19M
PrefixSort                                      98.302%   464.03ns     2.16M
StdSort_no-payloads_4_varchar_1k                          520.84ns     1.92M
PrefixSort                                      98.186%   530.46ns     1.89M
StdSort_no-payloads_1_varchar_10k                         422.83ns     2.37M
PrefixSort                                      99.186%   426.30ns     2.35M
StdSort_no-payloads_2_varchar_10k                         495.10ns     2.02M
PrefixSort                                      98.218%   504.08ns     1.98M
StdSort_no-payloads_3_varchar_10k                         584.89ns     1.71M
PrefixSort                                      99.079%   590.33ns     1.69M
StdSort_no-payloads_4_varchar_10k                         667.37ns     1.50M
PrefixSort                                      98.887%   674.88ns     1.48M
StdSort_no-payloads_1_varchar_100k                        605.27ns     1.65M
PrefixSort                                      99.425%   608.78ns     1.64M
StdSort_no-payloads_2_varchar_100k                        741.11ns     1.35M
PrefixSort                                      99.107%   747.78ns     1.34M
StdSort_no-payloads_3_varchar_100k                        890.60ns     1.12M
PrefixSort                                      99.089%   898.78ns     1.11M
StdSort_no-payloads_4_varchar_100k                          1.11us   903.14K
PrefixSort                                      104.50%     1.06us   943.76K
StdSort_no-payloads_1_varchar_1000k                         1.22us   822.83K
PrefixSort                                      99.534%     1.22us   818.99K
StdSort_no-payloads_2_varchar_1000k                         1.52us   656.78K
PrefixSort                                      99.353%     1.53us   652.53K
StdSort_no-payloads_3_varchar_1000k                         1.78us   560.23K
PrefixSort                                      98.862%     1.81us   553.86K
StdSort_no-payloads_4_varchar_1000k                         1.93us   519.34K
PrefixSort                                      99.159%     1.94us   514.97K
```

Part of #6766

Pull Request resolved: #8146

Reviewed By: Yuhta

Differential Revision: D54247347

Pulled By: mbasmanova

fbshipit-source-id: b358925e6b702de5bb2eee55df24827b51268923
@yingsu00
Collaborator

@skadilover How does prefix sort compare with radix sort and Google's SIMD quicksort? On fixed-length integer types, a combined MSB/LSB radix sort can easily be 3x faster than std::sort at the scale of millions of rows and up. Do you have benchmarks comparing prefix sort with these algorithms? My current impression is that prefix sort may be good for sorting strings, but in general compare-based sorting algorithms can't beat radix-style sorting on fixed-width integers at large scale.

@skadilover
Contributor Author

> @skadilover How does prefix sort compare with radix sort and Google's SIMD quicksort? On fixed-length integer types, a combined MSB/LSB radix sort can easily be 3x faster than std::sort at the scale of millions of rows and up. Do you have benchmarks comparing prefix sort with these algorithms? My current impression is that prefix sort may be good for sorting strings, but in general compare-based sorting algorithms can't beat radix-style sorting on fixed-width integers at large scale.

The core idea of this optimization is to use binary-string comparison to speed up multi-column sorts; quick-sort is just the algorithm used on top of it.

@jinchengchenghh
Contributor

jinchengchenghh commented Jun 5, 2024

We will also try to optimize sort. The initial proposal is:
Currently, the common usage is to request space for each row individually.
First, we can request memory for all the rows at once. newRow covers only the fixed-size columns; maybe we can optimize string columns the same way.

```
for (auto row = 0; row < input->size(); ++row) {
  char* newRow = inputData_->newRow();

  for (auto i = 0; i < inputs_.size(); ++i) {
    inputData_->store(decodedInputs_[i], row, newRow, i);
  }

  addNewRow(groups[row], newRow);
}
```

Second, we currently spill input under memory pressure. We could instead request all the memory in the addInput function, testing whether the memory is available before requesting it; if not, sort with PrefixSort and spill the output, as Spark does, so that the sort itself does not request any memory in that case.
What Spark does is:

  1. Insert a record and check whether memory is sufficient; if not, sort and spill: https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L491
  2. Spill and sort: https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L234

Will add a sort benchmark and a sort-spill benchmark to track the performance.
The first PR is #10041.
Please let me know if this is acceptable. Thanks!

@skadilover
Contributor Author

@jinchengchenghh
I think the feature you proposed doesn't conflict with the prefix-sort optimization.
You could file a separate issue and let the code maintainers know; that may make things progress faster.

@jinchengchenghh
Contributor

Yes. Since this issue's title is "Optimize sort", the same topic as mine, I added it here. I would like to apply the second optimization after prefix sort is merged; otherwise it will be hard to enable prefix sort, because that optimization requires that the sort not request any memory.

Joe-Abraham pushed a commit to Joe-Abraham/velox that referenced this issue Jun 7, 2024
PrefixSort                                      208.57%   172.21ns     5.81M
StdSort_2payloads_1_bigint_10k                            254.83ns     3.92M
PrefixSort                                      168.28%   151.43ns     6.60M
StdSort_2payloads_2_bigint_10k                            363.92ns     2.75M
PrefixSort                                      254.35%   143.08ns     6.99M
StdSort_2payloads_3_bigint_10k                            475.61ns     2.10M
PrefixSort                                      288.96%   164.60ns     6.08M
StdSort_2payloads_4_bigint_10k                            594.78ns     1.68M
PrefixSort                                      314.12%   189.35ns     5.28M
StdSort_2payloads_1_bigint_100k                           349.99ns     2.86M
PrefixSort                                      205.28%   170.49ns     5.87M
StdSort_2payloads_2_bigint_100k                           489.03ns     2.04M
PrefixSort                                      307.77%   158.89ns     6.29M
StdSort_2payloads_3_bigint_100k                           607.88ns     1.65M
PrefixSort                                      305.62%   198.90ns     5.03M
StdSort_2payloads_4_bigint_100k                           715.90ns     1.40M
PrefixSort                                      321.73%   222.52ns     4.49M
StdSort_2payloads_1_bigint_1000k                          574.14ns     1.74M
PrefixSort                                      262.05%   219.10ns     4.56M
StdSort_2payloads_2_bigint_1000k                          796.05ns     1.26M
PrefixSort                                      377.73%   210.75ns     4.75M
StdSort_2payloads_3_bigint_1000k                            1.02us   975.65K
PrefixSort                                      411.02%   249.37ns     4.01M
StdSort_2payloads_4_bigint_1000k                            1.16us   858.48K
PrefixSort                                      408.37%   285.24ns     3.51M
StdSort_2payloads_1_bigint_0.01k                           49.37ns    20.26M
PrefixSort                                      99.124%    49.81ns    20.08M
StdSort_2payloads_2_bigint_0.01k                           89.08ns    11.23M
PrefixSort                                      99.958%    89.12ns    11.22M
StdSort_2payloads_3_bigint_0.01k                           64.40ns    15.53M
PrefixSort                                      99.991%    64.41ns    15.53M
StdSort_2payloads_4_bigint_0.01k                           86.56ns    11.55M
PrefixSort                                      100.34%    86.26ns    11.59M
StdSort_2payloads_1_bigint_0.015k                          48.17ns    20.76M
PrefixSort                                      100.11%    48.12ns    20.78M
StdSort_2payloads_2_bigint_0.015k                          55.44ns    18.04M
PrefixSort                                      99.994%    55.45ns    18.03M
StdSort_2payloads_3_bigint_0.015k                          61.17ns    16.35M
PrefixSort                                      99.988%    61.18ns    16.35M
StdSort_2payloads_4_bigint_0.015k                          57.55ns    17.38M
PrefixSort                                      99.895%    57.61ns    17.36M
StdSort_2payloads_1_bigint_0.02k                           47.93ns    20.86M
PrefixSort                                      99.916%    47.97ns    20.85M
StdSort_2payloads_2_bigint_0.02k                           84.10ns    11.89M
PrefixSort                                      100.38%    83.78ns    11.94M
StdSort_2payloads_3_bigint_0.02k                          126.53ns     7.90M
PrefixSort                                      100.05%   126.47ns     7.91M
StdSort_2payloads_4_bigint_0.02k                          164.44ns     6.08M
PrefixSort                                      99.935%   164.55ns     6.08M
StdSort_2payloads_1_bigint_0.05k                           77.86ns    12.84M
PrefixSort                                      99.171%    78.51ns    12.74M
StdSort_2payloads_2_bigint_0.05k                          118.10ns     8.47M
PrefixSort                                      100.53%   117.48ns     8.51M
StdSort_2payloads_3_bigint_0.05k                          152.74ns     6.55M
PrefixSort                                      100.02%   152.71ns     6.55M
StdSort_2payloads_4_bigint_0.05k                          184.56ns     5.42M
PrefixSort                                      99.925%   184.70ns     5.41M
StdSort_2payloads_1_bigint_0.1k                            88.01ns    11.36M
PrefixSort                                      100.46%    87.60ns    11.42M
StdSort_2payloads_2_bigint_0.1k                           138.22ns     7.24M
PrefixSort                                      159.92%    86.43ns    11.57M
StdSort_2payloads_3_bigint_0.1k                           187.49ns     5.33M
PrefixSort                                      187.96%    99.75ns    10.03M
StdSort_2payloads_4_bigint_0.1k                           232.52ns     4.30M
PrefixSort                                      188.98%   123.04ns     8.13M
StdSort_no-payloads_1_varchar_1k                          292.32ns     3.42M
PrefixSort                                      100.47%   290.96ns     3.44M
StdSort_no-payloads_2_varchar_1k                          380.98ns     2.62M
PrefixSort                                      98.623%   386.29ns     2.59M
StdSort_no-payloads_3_varchar_1k                          456.15ns     2.19M
PrefixSort                                      98.302%   464.03ns     2.16M
StdSort_no-payloads_4_varchar_1k                          520.84ns     1.92M
PrefixSort                                      98.186%   530.46ns     1.89M
StdSort_no-payloads_1_varchar_10k                         422.83ns     2.37M
PrefixSort                                      99.186%   426.30ns     2.35M
StdSort_no-payloads_2_varchar_10k                         495.10ns     2.02M
PrefixSort                                      98.218%   504.08ns     1.98M
StdSort_no-payloads_3_varchar_10k                         584.89ns     1.71M
PrefixSort                                      99.079%   590.33ns     1.69M
StdSort_no-payloads_4_varchar_10k                         667.37ns     1.50M
PrefixSort                                      98.887%   674.88ns     1.48M
StdSort_no-payloads_1_varchar_100k                        605.27ns     1.65M
PrefixSort                                      99.425%   608.78ns     1.64M
StdSort_no-payloads_2_varchar_100k                        741.11ns     1.35M
PrefixSort                                      99.107%   747.78ns     1.34M
StdSort_no-payloads_3_varchar_100k                        890.60ns     1.12M
PrefixSort                                      99.089%   898.78ns     1.11M
StdSort_no-payloads_4_varchar_100k                          1.11us   903.14K
PrefixSort                                      104.50%     1.06us   943.76K
StdSort_no-payloads_1_varchar_1000k                         1.22us   822.83K
PrefixSort                                      99.534%     1.22us   818.99K
StdSort_no-payloads_2_varchar_1000k                         1.52us   656.78K
PrefixSort                                      99.353%     1.53us   652.53K
StdSort_no-payloads_3_varchar_1000k                         1.78us   560.23K
PrefixSort                                      98.862%     1.81us   553.86K
StdSort_no-payloads_4_varchar_1000k                         1.93us   519.34K
PrefixSort                                      99.159%     1.94us   514.97K
```

Part of facebookincubator#6766

Pull Request resolved: facebookincubator#8146

Reviewed By: Yuhta

Differential Revision: D54247347

Pulled By: mbasmanova

fbshipit-source-id: b358925e6b702de5bb2eee55df24827b51268923
facebook-github-bot pushed a commit that referenced this issue Sep 12, 2024
Summary:
According to the benchmark, for datasets larger than 0.5k rows, PrefixSort outperforms std::sort, with performance improvements ranging from roughly 250% to over 500%. Here's a summary of the benchmark results:

| Dataset Size | PrefixSort Improvement (No Payload) | PrefixSort Improvement (With Payload) |
|--------------|-------------------------------------|-------------------------------------|
| 0.5k         | 248.97% - 287.43%                   | 249.71% - 289.74%                   |
| 1k           | 214.44% - 310.92%                   | 215.03% - 315.14%                   |
| 10k          | 216.21% - 255.38%                   | 217.88% - 256.88%                   |
| 100k         | 279.81% - 318.26%                   | 284.89% - 295.21%                   |
| 1000k        | 304.36% - 351.31%                   | 454.04% - 514.28%                   |
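For context, the core trick behind PrefixSort for fixed-width keys is to normalize each sort key into a byte string that can be compared with a plain byte-wise comparison, avoiding per-column dispatch during compare(). A minimal sketch for a single bigint key follows; this is a hypothetical illustration of the technique, and `encodePrefix` is not Velox's actual API:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the prefix-sort idea: encode a signed 64-bit
// sort key into bytes whose lexicographic (memcmp-style) order matches
// the numeric order. Flipping the sign bit maps int64 order onto uint64
// order; storing the bytes big-endian then makes byte-wise comparison
// equivalent to integer comparison.
std::array<uint8_t, 8> encodePrefix(int64_t value) {
  // Sign-bit flip: INT64_MIN -> 0x00...00, -1 -> 0x7F...FF, 0 -> 0x80...00.
  uint64_t bits = static_cast<uint64_t>(value) ^ (uint64_t{1} << 63);
  std::array<uint8_t, 8> out;
  for (int i = 0; i < 8; ++i) {
    // Most significant byte first (big-endian).
    out[i] = static_cast<uint8_t>(bits >> (56 - 8 * i));
  }
  return out;
}
```

`std::array`'s `operator<` performs the byte-wise lexicographic comparison, which for ascending bigint keys agrees with numeric order (e.g. `encodePrefix(-1) < encodePrefix(0)`). Nulls can be folded into the same encoding by mapping them to the minimum or maximum key according to the compare flags, as described in the issue, so comparisons need no separate null-flag read.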

follow-up #8350
Part of #6766

Pull Request resolved: #10946

Reviewed By: Yuhta

Differential Revision: D62373593

Pulled By: mbasmanova

fbshipit-source-id: b8594e05cc6aee736d09db1695db770b84e9d4bd