-
Notifications
You must be signed in to change notification settings - Fork 770
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cleanup sort #4613
Cleanup sort #4613
Conversation
@@ -799,6 +799,15 @@ pub trait AsArray: private::Sealed { | |||
self.as_list_opt().expect("list array") | |||
} | |||
|
|||
/// Downcast this to a [`FixedSizeBinaryArray`] returning `None` if not possible | |||
fn as_fixed_size_binary_opt(&self) -> Option<&FixedSizeBinaryArray>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This trait is sealed so this isn't a breaking change
|
||
/// Compare two `Array`s based on the ordering defined in [build_compare] | ||
fn cmp_array(a: &dyn Array, b: &dyn Array) -> Ordering { | ||
let cmp_op = build_compare(a, b).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was getting called per array comparison, doing all this dispatch and allocation logic every time! The performance would have been catastrophic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it was called as part of sorting ListArray -- so that might become substantially faster
} | ||
}, | ||
DataType::Dictionary(_, _) => { | ||
let value_null_first = if options.descending { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic is moved into a child_rank function
) -> Result<UInt32Array, ArrowError> { | ||
let rank = child_rank(array.values().as_ref(), options)?; | ||
let offsets = array.value_offsets(); | ||
let mut valids = value_indices |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The duplication in extracting these valids arrays is somewhat unfortunate, I have some ideas of how to fix it, but wanted to keep the PR size down
@@ -1980,7 +1526,7 @@ mod tests { | |||
nulls_first: false, | |||
}), | |||
None, | |||
vec![2, 3, 1, 4, 5, 0], | |||
vec![2, 3, 1, 4, 0, 5], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This applies #4603 to boolean arrays which I missed by accident, yet another benefit of consolidating the code 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0
and 5
here are the indexes of nulls. Makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a really nice consolidation @tustvold -- 👍
I found the github diff hard to review, but in the editor it is quite clear that the implementations were consolidated into a single type specialized function (sort_impl
) rather than manually copy/pasted.
} | ||
DataType::Duration(TimeUnit::Second) => { | ||
sort_primitive::<DurationSecondType, _>(values, v, n, cmp, &options, limit) | ||
let (v, n) = partition_validity(array); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since partition_validity
uses u32
for indexes, doesn't that mean this code can not sort arrays with more than 4B values? Maybe for things like LargeList
this could be a problem 🤔
However, it seems like this was the case prior to this PR as well, so it probably isn't an issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point, but I think it somewhat inherent to the current design. FWIW acero also only makes use of 32-bit indices
.into_iter() | ||
.map(|index| (index, values.value(index as usize))) | ||
.collect::<Vec<(u32, bool)>>(); | ||
sort_impl(options, &mut valids, &null_indices, limit, |a, b| a.cmp(&b)).into() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👌 sort_impl
is very nice
|
||
/// Compare two `Array`s based on the ordering defined in [build_compare] | ||
fn cmp_array(a: &dyn Array, b: &dyn Array) -> Ordering { | ||
let cmp_op = build_compare(a, b).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it was called as part of sorting ListArray -- so that might become substantially faster
@@ -1980,7 +1526,7 @@ mod tests { | |||
nulls_first: false, | |||
}), | |||
None, | |||
vec![2, 3, 1, 4, 5, 0], | |||
vec![2, 3, 1, 4, 0, 5], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0
and 5
here are the indexes of nulls. Makes sense
Which issue does this PR close?
Part of #4545
Rationale for this change
Cleans up the sort kernel to make the code easier to maintain.
Also changes list sorting to use the same rank-based approach used for dictionaries, which should be substantially faster, although I don't have benchmarks to quantify this.
The benchmarks are quite noisy, but this seems to possibly introduce a slight performance improvement.
What changes are included in this PR?
Are there any user-facing changes?