Research performance improvements in N-way merging #2148

yjshen · 2022-04-04T05:01:56Z

No description provided.

jackwener · 2022-04-05T13:03:07Z

I'm interested in this. Is there more information/context about it?

yjshen · 2022-04-05T13:47:59Z

Hi @jackwener, currently @richox has two efforts trying to achieve this:

The first one intended to use peek_mut instead of pop in SortPreservingMerge to reduce the number of comparisons, but it seems to suffer from a performance regression at the moment. #2134

The second attempt tries to use BTree instead of the current MinHeap and is still a work in progress. https://github.com/richox/arrow-datafusion/tree/sort_memory_2_peekmut_pop

Perhaps you guys could discuss this to see if it's possible to come up with a faster solution?
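
For context on the first idea, here is a minimal sketch of a heap-based merge that replaces the root through peek_mut instead of doing a pop followed by a push. It only uses `std::collections::BinaryHeap`; the function name `merge_with_peek_mut` and the plain `Vec<i64>` inputs are assumptions made for illustration and are not taken from #2134 or from DataFusion.

```rust
use std::cmp::Reverse;
use std::collections::binary_heap::PeekMut;
use std::collections::BinaryHeap;

/// Merge already-sorted inputs using a min-heap of (value, input index, offset).
/// Hypothetical helper for illustration only.
pub fn merge_with_peek_mut(inputs: &[Vec<i64>]) -> Vec<i64> {
    // `Reverse` turns the std max-heap into a min-heap.
    let mut heap = BinaryHeap::new();
    for (idx, input) in inputs.iter().enumerate() {
        if let Some(&first) = input.first() {
            heap.push(Reverse((first, idx, 0usize)));
        }
    }

    let mut out = Vec::with_capacity(inputs.iter().map(Vec::len).sum());
    while let Some(mut top) = heap.peek_mut() {
        let Reverse((value, idx, pos)) = *top;
        out.push(value);
        match inputs[idx].get(pos + 1) {
            // Overwrite the root in place; the heap restores its order
            // with a single sift-down when `top` is dropped.
            Some(&next) => *top = Reverse((next, idx, pos + 1)),
            // This input is exhausted, so remove the root entirely.
            None => {
                PeekMut::pop(top);
            }
        }
    }
    out
}
```

With pop followed by push, each output value costs a sift-down and a sift-up; replacing the root through peek_mut needs only one sift-down. Whether that translates into a real speedup is exactly what would need benchmarking.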

yjshen · 2022-04-05T13:54:24Z

I created this issue from an item @alamb has listed in the umbrella PR. Maybe @alamb can also share some insights here.

alamb · 2022-04-05T14:09:56Z

Yes, I will provide some more coherent thoughts later today

alamb · 2022-04-05T20:52:12Z

Thoughts on N-way merging:

We (InfluxDB IOx in general and myself in particular) are very interested in this as well, because SortPreservingMerge is one of the key bottlenecks we see when sorting data.

Here is what I was thinking about how to proceed:

1. Create a benchmark for merging (including multi-column keys and variable length (Utf8) keys)
2. Spike out some tests

Areas for investigation / things to spike out:

- Use a row-format sort key (similar to what @yjshen has done in "Buffer records in row format in memory for SortExec", #2146) so that comparisons are done by comparing [u8] rather than by array access
- Use "Cascade Merge" rather than N-Way merge, as hinted at in the DuckDB blog: https://duckdb.org/2021/08/27/external-sorting.html
- Figure out how to parallelize both the merge (the DuckDB blog has some hints) and the creation of RecordBatches from the inputs. I have thought about this but need more time to think through how it would work.
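
As a rough illustration of the cascade idea, the sketch below repeatedly merges sorted runs two at a time until one run remains, instead of doing a single N-way merge. It assumes "cascade merge" means this kind of pairwise merging; the names `merge_two` and `cascade_merge` and the `Vec<i64>` runs are hypothetical, and nothing here is taken from DuckDB or DataFusion.

```rust
/// Two-way merge of two sorted slices (ties take the left input first).
fn merge_two(a: &[i64], b: &[i64]) -> Vec<i64> {
    let mut out = Vec::with_capacity(a.len() + b.len());
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        if a[i] <= b[j] {
            out.push(a[i]);
            i += 1;
        } else {
            out.push(b[j]);
            j += 1;
        }
    }
    // At most one of these tails is non-empty.
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}

/// Merge sorted runs pairwise, level by level, until a single run is left.
/// Hypothetical helper for illustration only.
pub fn cascade_merge(mut runs: Vec<Vec<i64>>) -> Vec<i64> {
    while runs.len() > 1 {
        runs = runs
            .chunks(2)
            .map(|pair| match pair {
                [a, b] => merge_two(a, b),
                // An odd run out is carried to the next level unchanged.
                [a] => a.clone(),
                _ => unreachable!("chunks(2) yields one or two runs"),
            })
            .collect();
    }
    runs.pop().unwrap_or_default()
}
```

Each two-way merge needs a single comparison per output row and the merges within one level are independent of each other, which is one possible starting point for parallelizing. The cost is that every row is copied once per level.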

cc @tustvold

tustvold · 2022-04-05T21:50:11Z

One other thing to throw into the mix would be to optimise sorts of dictionary encoded columns. If the dictionary is sorted, the savings could be significant as you only need to compare the integer keys.

Even if the dictionary isn't sorted, it might be faster to sort the dictionary first and then sort on the remapped integer keys.

Just an idea as at least in the case of IOx, we will only be sorting on dictionary encoded string columns and not plain columns.
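
A minimal sketch of the dictionary idea, assuming a simplified column with no nulls: the dictionary values are sorted once, the keys are remapped to the new positions, and the rows are then ordered with integer comparisons only. `DictColumn`, `sort_dictionary` and `sorted_rows` are hypothetical names; this does not use the Arrow `DictionaryArray` API.

```rust
/// Hypothetical stand-in for a dictionary-encoded string column:
/// `keys[i]` is an index into `values`. Nulls are ignored in this sketch.
pub struct DictColumn {
    pub values: Vec<String>,
    pub keys: Vec<u32>,
}

/// Build an equivalent column whose dictionary is sorted, so that comparing
/// two keys gives the same ordering as comparing the strings they refer to.
pub fn sort_dictionary(col: &DictColumn) -> DictColumn {
    // order[new_key] = old_key, with old keys sorted by their string value.
    let mut order: Vec<u32> = (0..col.values.len() as u32).collect();
    order.sort_by(|&a, &b| col.values[a as usize].cmp(&col.values[b as usize]));

    // remap[old_key] = new_key.
    let mut remap = vec![0u32; order.len()];
    for (new_key, &old_key) in order.iter().enumerate() {
        remap[old_key as usize] = new_key as u32;
    }

    DictColumn {
        values: order.iter().map(|&k| col.values[k as usize].clone()).collect(),
        keys: col.keys.iter().map(|&k| remap[k as usize]).collect(),
    }
}

/// Sort the rows of the column while comparing only integer keys.
pub fn sorted_rows(col: &DictColumn) -> Vec<String> {
    let sorted = sort_dictionary(col);
    let mut keys = sorted.keys.clone();
    keys.sort_unstable(); // no string comparisons happen here
    keys.into_iter()
        .map(|k| sorted.values[k as usize].clone())
        .collect()
}
```

The string comparisons are paid once per distinct dictionary value rather than once per row comparison, so the benefit should grow with the ratio of rows to distinct values.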

yjshen · 2022-04-06T02:52:52Z

> Use row-format sort key so that the comparisons are done by comparing [u8] rather than array access

I filed an issue for this #2150
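
To make the quoted idea concrete, here is a minimal sketch of an order-preserving byte encoding for a two-column key (a string followed by an i64). The function name `encode_key`, the column types, and the NUL terminator are assumptions for illustration; this is not the format proposed in #2150. The sketch also assumes the string contains no NUL bytes.

```rust
/// Encode (name, ts) so that comparing the resulting byte strings gives the
/// same order as comparing the tuples. Hypothetical helper for illustration.
pub fn encode_key(name: &str, ts: i64) -> Vec<u8> {
    // Assumption of this sketch: 0x00 is free to use as a terminator.
    assert!(!name.as_bytes().contains(&0), "sketch assumes no NUL bytes");

    let mut key = Vec::with_capacity(name.len() + 1 + 8);
    // UTF-8 bytes already compare in the same order as `str`.
    key.extend_from_slice(name.as_bytes());
    // The terminator sorts below every other byte, so "a" sorts before "ab".
    key.push(0);
    // Flip the sign bit and use big-endian so that negative values sort
    // before positive ones under unsigned byte comparison.
    key.extend_from_slice(&((ts as u64) ^ (1u64 << 63)).to_be_bytes());
    key
}
```

Once every row has such a key, a multi-column comparison is a single byte-slice comparison, with no per-column type dispatch in the merge loop.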

jackwener · 2022-04-06T14:51:08Z

During my school days, I tried to implement an external sorting algorithm using a loser tree. I don't know if this is a good idea, because I never benchmarked it against the alternatives. In principle, it should perform better than a min heap.
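
For readers unfamiliar with the structure, here is a minimal loser tree sketch over in-memory sorted runs. The `LoserTree` type and its layout (`tree[0]` holds the overall winner, `tree[1..k]` hold the loser at each internal node) are assumptions chosen for illustration and do not come from any DataFusion code.

```rust
/// Hypothetical loser-tree merger over sorted runs (illustration only).
pub struct LoserTree<'a> {
    runs: &'a [Vec<i64>],
    pos: Vec<usize>,  // next unread offset in each run
    tree: Vec<usize>, // tree[0] = winner; tree[n] = loser at internal node n
}

impl<'a> LoserTree<'a> {
    pub fn new(runs: &'a [Vec<i64>]) -> Self {
        let k = runs.len();
        let mut lt = LoserTree { runs, pos: vec![0; k], tree: vec![0; k.max(1)] };
        if k == 0 {
            return lt;
        }
        // winners[n] = run winning the subtree at node n; leaves are nodes k..2k.
        let mut winners = vec![0usize; 2 * k];
        for i in 0..k {
            winners[k + i] = i;
        }
        for n in (1..k).rev() {
            let (l, r) = (winners[2 * n], winners[2 * n + 1]);
            let (w, loser) = if lt.beats(l, r) { (l, r) } else { (r, l) };
            winners[n] = w;
            lt.tree[n] = loser;
        }
        lt.tree[0] = winners[1];
        lt
    }

    fn head(&self, run: usize) -> Option<i64> {
        self.runs[run].get(self.pos[run]).copied()
    }

    /// True if run `a` should be emitted before run `b`.
    /// An exhausted run never beats anything; ties go to the lower run index.
    fn beats(&self, a: usize, b: usize) -> bool {
        match (self.head(a), self.head(b)) {
            (Some(x), Some(y)) => (x, a) < (y, b),
            (Some(_), None) => true,
            (None, _) => false,
        }
    }
}

impl<'a> Iterator for LoserTree<'a> {
    type Item = i64;

    fn next(&mut self) -> Option<i64> {
        let k = self.runs.len();
        if k == 0 {
            return None;
        }
        let mut winner = self.tree[0];
        let value = self.head(winner)?; // winner exhausted means all runs are
        self.pos[winner] += 1;
        // Replay from the winner's leaf to the root: one comparison per level.
        let mut node = (winner + k) / 2;
        while node >= 1 {
            if self.beats(self.tree[node], winner) {
                std::mem::swap(&mut self.tree[node], &mut winner);
            }
            node /= 2;
        }
        self.tree[0] = winner;
        Some(value)
    }
}
```

After each output row only the path from one leaf to the root is replayed, with one comparison per level, whereas a binary heap sift-down typically compares against both children at each level. That is the usual argument for the loser tree; whether it wins in practice would still need a benchmark.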

tustvold · 2023-04-29T19:46:12Z

Closing this ticket as I believe it is not tracking anything anymore, feel free to reopen if I am mistaken.

SortPreservingMerge is now implemented as an n-way tournament tree, making use of an order-preserving row encoding for multi-column sorts and specialized cursors for single-column sorts. I'm not aware of any major low-hanging fruit to make it run faster.

yjshen mentioned this issue Apr 4, 2022

[EPIC] Memory Limited Sort (Externalized / Spill) #1568

tustvold closed this as completed Apr 29, 2023
