Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unstable (lex)sort_to_indices #4545

Open
comphead opened this issue Jul 19, 2023 · 17 comments
Open

Unstable (lex)sort_to_indices #4545

comphead opened this issue Jul 19, 2023 · 17 comments
Labels

Comments

@comphead
Copy link
Contributor

comphead commented Jul 19, 2023

Describe the bug

Arrow ordering uses the unstable sort which doesn't guarantee sort order between same elements, this causing undesired behavior for systems that expects stable sort

To Reproduce

    #[test]
    fn test_sort_primitives2() {
        let n = 20;

        let mut arr: [(u32, i32); 20] = [(0, 0); 20];

        (0 .. n).map(|d| Some(d % 2)).enumerate().for_each(|v| arr[v.0] = (v.0.try_into().unwrap(), v.1.unwrap()));
        //println!("{:?}", arr);
        sort_unstable_by::<(u32, i32), _>(&mut arr, n as usize, |a, b| cmp(a.1, b.1).reverse());
        println!("{:?}", arr.iter().filter(|v| v.1 == 1).map(|v| v.0).collect::<Vec<_>>());
    }

The test outputs array indices for 1 and this is correct, in sync with input data

[1, 3, 5, 7, 9, 11, 13, 15, 17, 19]

For n = 21 which shouldn't change the output, but instead indices are messed

[15, 1, 3, 5, 7, 9, 11, 13, 17, 19]

Expected behavior

Additional context

Related to #2871

@comphead comphead added the bug label Jul 19, 2023
@comphead
Copy link
Contributor Author

@sunchao @viirya

@tustvold you proposed to use row format in DF SortExec in #2871 which looks promising, appreciate any info how I can start on it

@comphead
Copy link
Contributor Author

@tustvold in this particular use case it is 1 sorted column and sort_to_indices is used. This method uses non stable sorting which behaves unexpectedly in our use case which expects stable sort, do you think this can improved?

@tustvold
Copy link
Contributor

tustvold commented Jul 21, 2023

Yes, I would consider this a bug that sort_to_indices is not stable. This should be a simple case of updating the comparator function to fallback to comparing the indices if equal. Hopefully this won't regress performance too significantly. I will have a play

Edit: I can't see an easy way to fix this without it significantly regressing performance...

tustvold added a commit to tustvold/arrow-rs that referenced this issue Jul 21, 2023
@comphead
Copy link
Contributor Author

comphead commented Jul 21, 2023

Yes, I would consider this a bug that sort_to_indices is not stable. This should be a simple case of updating the comparator function to fallback to comparing the indices if equal. Hopefully this won't regress performance too significantly. I will have a play

Edit: I can't see an easy way to fix this without it significantly regressing performance...

My naive approach was also to do the same with double sort, thinking if its possible to create a stable sort key but preserve single sorting.

like cmp(concat(b.1, b.0), concat(a.1, a0))

@tustvold tustvold changed the title Unstable sorting for i32 Unstable (lex)sort_to_indices Jul 21, 2023
@tustvold
Copy link
Contributor

I mean a short-term hack would be to include a row number column, and include this in the sort. The performance would not be ideal though...

@comphead
Copy link
Contributor Author

for the test above I was able to make correct behavior with comparator

        sort_unstable_by::<(u32, i32), _>(&mut arr, n, |a, b| {if a.0 != b.0 {a.0.cmp(&b.0)} else {cmp(a.1, b.1).reverse()}});

this looks like less expensive than double sort

@comphead
Copy link
Contributor Author

if you think this is something we can go with, I'll create a PR

@tustvold
Copy link
Contributor

tustvold commented Jul 21, 2023

for the test above I was able to make correct behavior with comparator

I experimented with this and other approaches, see tustvold@b9adfea

The problem is this regresses performance by a factor of 4...

@comphead
Copy link
Contributor Author

stable sort is expectedly more expensive than unstable

as alternatives can we consider:
stable sort methods along with unstable ones?
or change the sorting behavior by some input or global parameter or config?

@alamb
Copy link
Contributor

alamb commented Jul 22, 2023

I feel like someone (maybe @sundy-li ) originally changed the sort in arrow to be unstable precisely for performance reasons

I agree with @comphead that we should not force a slower stable sort on everyone and instead let users decide for themselves if they want faster unstable or slower stable sorts

Possibly related: #552 and #572 (documented it)

@comphead
Copy link
Contributor Author

comphead commented Jul 22, 2023

any way how to let users to decide? I havent found the arrow-rs config or runtime parameters to define the behavior based on user preferences

UPD: can we use different features in arrow-rs for stable and unstable sort?

@tustvold
Copy link
Contributor

Another aspect of this is that is sort_to_indices were stable, it would be relatively straightforward to use it to implement lexsort without needing to use DynComparator... 🤔

@alamb
Copy link
Contributor

alamb commented Sep 5, 2023

Update here is that @tustvold reports:

I couldn't devise a way to make it not regress performance

Thus I am not sure what to do with this issue for now

@comphead
Copy link
Contributor Author

comphead commented Sep 5, 2023

Update here is that @tustvold reports:

I couldn't devise a way to make it not regress performance

Thus I am not sure what to do with this issue for now

Probably the better is to move this stable sort implementations to datafusion level?
In Datafusion its easier to make another external config key and let user decide which sort mode to use.

@maxburke
Copy link
Contributor

maxburke commented Oct 13, 2023

Datafusion is a pretty heavy dependency to bring in for only a stable sort. Is the stability something that can maybe be added to SortOptions, to let users opt in when desired, defaulting to unstable when it isn't?

@tustvold
Copy link
Contributor

tustvold commented Oct 13, 2023

Would arrow-row work for your use-case?

I.e. https://docs.rs/arrow-row/latest/arrow_row/#lexsort

But using sort instead of sort_unstable.

We could consider adding stable variants, my major reservation is paying for the additional codegen

@maxburke
Copy link
Contributor

Would arrow-row work for your use-case?

I.e. https://docs.rs/arrow-row/latest/arrow_row/#lexsort

But using sort instead of sort_unstable

I think so? I'll give it a try...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants