Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Research performance improvements in N-way merging #2148

Closed
Tracked by #1568
yjshen opened this issue Apr 4, 2022 · 9 comments
Closed
Tracked by #1568

Research performance improvements in N-way merging #2148

yjshen opened this issue Apr 4, 2022 · 9 comments

Comments

@yjshen
Copy link
Member

yjshen commented Apr 4, 2022

No description provided.

@jackwener
Copy link
Member

I'm interested in this. Is there more information/context about it?

@yjshen
Copy link
Member Author

yjshen commented Apr 5, 2022

Hi @jackwener, currently @richox has two efforts trying to achieve this:

The first one intended to use peek_mut instead of pop in SortPreserveMerging for less comparison but seems suffering performance regression at the moment. #2134

The second attempt tries to use BTree instead of the current MinHeap and is still a work in progress. https://github.com/richox/arrow-datafusion/tree/sort_memory_2_peekmut_pop

Perhaps you guys could discuss this to see if it's possible to come up with a faster solution?

@yjshen
Copy link
Member Author

yjshen commented Apr 5, 2022

I created this issue from an item @alamb has listed in the umbrella PR. Maybe @alamb can also share some insights here.

@alamb
Copy link
Contributor

alamb commented Apr 5, 2022 via email

@alamb
Copy link
Contributor

alamb commented Apr 5, 2022

Thoughts on N-way merging:

We (Influxdb IOx in general and myself in particular) are very interested in this as well because the SortPreservingMerge is one of the key bottlenecks we see when sorting out data.

Here is what I was thinking about how to proceed:

  1. Create a benchmark for merging (including multi-column keys and variable length (Utf8) keys)
  2. Spike out some tests

Areas for investigation / things to spike out:

  1. Use row-format sort key (similar to what @yjshen has done in Buffer records in row format in memory for SortExec #2146) so that the comparisons are done by comparing [u8] rather than array access
  2. Use "Cascade Merge" rather than N-Way merge, as hinted at in the DuckDB blog: https://duckdb.org/2021/08/27/external-sorting.html
  3. Figure out how to parallelize both the merge (the DuckDB blog has some hints) as well as the creation of RecordBatches from the inputs. I have thought about this but need more time to think through how it would work.

cc @tustvold

@tustvold
Copy link
Contributor

tustvold commented Apr 5, 2022

One other thing to throw into the mix would be to optimise sorts of dictionary encoded columns. If the dictionary is sorted, the savings could be significant as you only need to compare the integer keys.

Even if the dictionary isn't sorted it might be faster to sort the dictionary first, and then sort the now sorted keys.

Just an idea as at least in the case of IOx, we will only be sorting on dictionary encoded string columns and not plain columns.

@yjshen
Copy link
Member Author

yjshen commented Apr 6, 2022

Use row-format sort key so that the comparisons are done by comparing [u8] rather than array access

I filed an issue for this #2150

@jackwener
Copy link
Member

jackwener commented Apr 6, 2022

During my school days, I tried to implement an external sorting algorithm using the loser tree, I don't know if this is a good idea because I didn't go to compare. In principle, it will perform better than min heap.

@tustvold
Copy link
Contributor

Closing this ticket as I believe it is not tracking anything anymore, feel free to reopen if I am mistaken.

SortPreservingMerge is now implemented as an n-way tournament tree making use of an order-preserving row encoding for multi-column sorts, and specialized cursors for single column sorts. I'm not aware of any major low-hanging fruit to make it run faster

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants