Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a merge sort kernel #4529

Closed
vincev opened this issue Jul 15, 2023 · 4 comments
Closed

Add a merge sort kernel #4529

vincev opened this issue Jul 15, 2023 · 4 comments
Labels
development-process Related to development process of arrow-rs enhancement Any new improvement worthy of a entry in the changelog

Comments

@vincev
Copy link

vincev commented Jul 15, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I have an aggregation function that sorts a number of small arrays in parallel and then in the final aggregation step produces a big sorted array from all the smaller sorted arrays.

Currently I am doing a concat, that creates a big array from the smaller sorted arrays, followed by a sort that creates another array, it would be probably more efficient if there was a way to merge the already sorted arrays to a final array in one step.

Describe the solution you'd like

Would be nice to have a function like:

pub fn merge_sort(arrays: &[&dyn Array], options: Option<SortOption>) -> Result<ArrayRef, ArrowError>;

that takes a number of sorted arrays and produces a sorted array by merging them.

Describe alternatives you've considered

I am doing concat followed by sort.

Additional context

@vincev vincev added the enhancement Any new improvement worthy of a entry in the changelog label Jul 15, 2023
@tustvold
Copy link
Contributor

tustvold commented Jul 15, 2023

Perhaps we might upstream some of the SortPreservingMerge / SortExec logic from DataFusion 🤔

@vincev
Copy link
Author

vincev commented Jul 17, 2023

Perhaps we might upstream some of the SortPreservingMerge / SortExec logic from DataFusion 🤔

That looks interesting, I didn't know about tournament trees, if no one else has time to work on this I may give it a try later this week.

I was thinking it would also be good to have a limit parameter to do partial sort.

@vincev
Copy link
Author

vincev commented Jul 23, 2023

I did some testing comparing concat then sort against parallel sort then merge using the looser tree approach used in DataFusion:

Running 8 threads.
concat_then_sort: Sorted 100663296 values in 2048 arrays 2.179s
sort_then_merge:  Sorted 100663296 values in 2048 arrays 5.351s

concat then sort is more than twice as fast than sort then merge with UInt64Arrays.

I am going to close this issue but let me know if you would like more information.

@vincev vincev closed this as completed Jul 23, 2023
@tustvold
Copy link
Contributor

tustvold commented Jul 23, 2023

This is consistent with my findings in apache/datafusion#6163, thank you for investigating

@tustvold tustvold added the development-process Related to development process of arrow-rs label Jul 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
development-process Related to development process of arrow-rs enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

2 participants