-
Notifications
You must be signed in to change notification settings - Fork 313
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding _by, by_key, largest variants of k_smallest #654
Conversation
Hi there! Thanks for tackling this - I'm happy you went for it. I hope I don't miss the forest for the trees, but I am pondering whether Moreover I think that we could benefit from inlining some functions:
|
It would do for the current implementation, but in the future I wanted to be able to do something like, fn k_smallest_static<'a, T, I, F, const K: usize>(iter: I, order: F) -> impl Iterator<Item = T>
where
T: 'a,
I: Iterator<Item = T>,
F: Fn(&T, &T) -> std::cmp::Ordering,
{
let mut storage: [MaybeUninit<T>; K] = /* ... */;
let num_elements = capped_heapsort(iter, &mut storage, order);
storage
.into_iter()
.take(num_elements)
.map(|e| unsafe { e.assume_init() })
} and have it be usable without any dependency on
I generally try to write code that's robust against the caller making mistakes, although I'd agree with you that that's created overhead here, so I've changed the flow of creating and consuming the heap with the main differences being
|
Hi, thanks for the feedback.
Noble motives, I like that (pretty much, actually). However, I'd still argue to simply go with Then, I'd imagine the algorithm
On top, I'd suggest more inlining:
|
Being generic over storage at the algorithm level is a neater strategy (see rust-itertools#654)
I think this is a much better strategy than what I had in mind, so I've restructured the heap logic to use a plain Vec for now with an eye to replacing it with a trait in the future. That also neatly eliminated all of the
While I don't have benchmarks, I can't imagine many circumstances where heapify would be slower, since it works in
Most of the reason for that check being placed in
I'm afraid I don't follow -
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, looks much better already.
I hope I'm not bothering you too much, but please do the following:
- Inline all auxiliary functions that are called exactly once (in particular
from_iter
,heapify
,push_pop
(which should not usecompare
, but just use the comparator directly without going throughOption
),unwrap_sorted
,pop
). Many of them manipulate internal data structures, and if they're inlined it's easier to grasp what they can and cannot assume about the current state. - After doing the above, I think we do not need
MaxHeap::len
anymore, and use a local variable instead. Maybe more possibilities for simplification arise. - Make
MaxHeap::storage
a&mut [T]
(that can be backed by aVec
,ArrayVec
,Array
, and possibly others). I think we want to have contiguous memory, but that's about it. If everything is inlined, we may even be able to avoidstruct MaxHeap
altogether, being able to simply deal with local variables.
Probably this should also include tests (maybe via quickcheck, comparing |
Being generic over storage at the algorithm level is a neater strategy (see rust-itertools#654)
I've been making updates to this as time allows, and I've just pushed the following changes
I've not added any new tests, but will see about doing so when I get more time. Re; keyed sorts, I'm afraid I don't quite understand the part of your comment about the
There is, AFAIK, no way to do this time efficiently using only a If we only have one of them, my inclination would be that it should be the first one, because I expect the use cases where callers care about the extra space required would be rare enough that it wouldn't be a burden to use the 2nd version explicitly. I also don't see the return type change being too much of a problem, because I expect that almost all uses of the function will want to iterate over the results rather than collect them. (And that Additionally, in the standard library, However, the call is up to you, and I'd appreciate if you could specify whether you want the first, second or both options included. |
Well described. Please go for the space efficient solution. |
Done. With regard to tests, would you be able to be more specific about what you think should be covered by quickcheck? I assume you don't mean all six variations separately, since they're mostly implemented in terms of each other. |
Maybe you can look what's already there and extend this. If there are no quicktests yet, you could e.g. collect into a vector, sort the vector, and get the first/last `k` items - and assert that the result is the same as if you'd done `k_smallest` (or one of its cousins).
I'd appreciate if you'd write tests for each variation. You're right, they are implemented in terms of each other, but the tests shouldn't assume this.
|
I've just added a quickcheck that tests all 6 variations against fully sorting the input data, let me know if you think there should be anything else added |
The build failed earlier because of an accidental reference to |
What else is this PR waiting on? |
Hi there, first of all: Thanks @ejmount for staying in touch. I saw that you implement Writing an own binary heap turns out to be tricky, so I'd lean to go with a safe implementation for now, sacrificing the holes for the more general variants, but possibly state this in the documentation. We can then try later to write a safe custom heap. That is, before merging I'd suggest:
Before continuing on this, I'd ask @jswrenn and/or @scottmcm for their opinion on this. |
This came back to my attention recently and I've made the changes you've suggested for AFAICT, the code as it stands is identical, in both functionalty and performance, from the POV of any existing users. Assuming my upate just now addresses your concerns in the previous post, are there any other blockers for this being merged? An earlier comment is still flagged as outstanding, but I don't think it's still applicable |
Great work! I was just looking for this functionality as I need |
@douweschulte I don't know the details about this one. It's definitely something I'd like to see merged at some point though. |
I would like to revive this (most-loved) PR, and update this myself if needed. Based on #654 (comment): |
I have some additional info now as well. Since the above comment I have copy pasted this PRs code into a couple of my own projects. I have only used the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this. I skimmed this. I'd like some changes, mainly to make the code more compact to avoid jumping back and forth while reading (local functions, narrowly scope some variables).
On top, I suggest to return Vec
in all cases. (Seems canonical with what we already have.)
@Philippe-Cholet I'd like to hand this over to you, as I do not have enough capacity right now. Please let me know if that's not ok.
src/lib.rs
Outdated
Self: Sized, | ||
Self::Item: Ord, | ||
{ | ||
self.k_smallest_by(k, k_smallest::reverse_cmp(Self::Item::cmp)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this return self.k_smallest_by(k, |x| x)
? (Would probably allow us to use reverse_cmp
in only k_largest_by
.)
@phimuemue I assigned myself here so I agree to take this off your hands. @ejmount Hello, I was not sure you would still be around so it's nice to see you here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not adding much since phimuemue already did a bunch here.
I did not look the documentation of the new methods yet. I'll do it at the end.
Being generic over storage at the algorithm level is a neater strategy (see rust-itertools#654)
I've rebased all of this branch on top of the latest master, so that should clear up any formatting conflicts. (And that's very handy that I don't have to be careful not to check in unrelated formatting, since I run fmt by default) I've also done my best to address all the review comments you and @phimuemue have left, let me know if anything still needs to be changed. I'm not sure if GH understands what just happened to the history, but here is the combined diff since phimuemue's last comment for convenience. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After the changes here, I think we are good 🎉.
If it does not bother you, add some periods to sentences (I'm probably picky but it would help reading).
I'll update the documentation of Itertools
methods later (like links, formatting preferences).
GH history seems fine to me. However, at the end, I would prefer if you squashed your commits.
src/k_smallest.rs
Outdated
// Also avoids unexpected behaviour with restartable iterators | ||
iter.for_each(|val| { | ||
// `for_each` is potentially more performant for deeply nested iterators, see its docs. | ||
if is_less_than(&val, &mut storage[0]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
&storage[0]
is enough.
I thought clippy unnecessary_mut_passed
would catch that, but no. Maybe because it's a closure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't even flag it if it the closure is changed to take &I::Item
specifically. Wonder if that's a clippy bug
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code for the lint seems to deliberately ignore closures. I searched "closure unnecessary_mut_passed" but nothing helpful.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #654 +/- ##
==========================================
+ Coverage 94.38% 94.61% +0.22%
==========================================
Files 48 48
Lines 6665 6701 +36
==========================================
+ Hits 6291 6340 +49
+ Misses 374 361 -13 ☔ View full report in Codecov by Sentry. |
Github seems to want @phimuemue's original comment checked off before this can land - I'm not sure if there's anything I can do about that on my end given all their comments have been resolved in the meantime. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this 🎉
@phimuemue It seems a previous review of yours is blocking us.
I'll update documentation of Itertools methods after this (just in case you review).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Philippe-Cholet, thanks @ejmount. I have only capacity to skim this, but looks good to me.
OT: I saw you kept VecIntoIter
- fine with me, although I wonder why k_smallest
and min_set
should return different types. - Maybe we revisit this somewhen in the future.
@phimuemue Prior to this and now, EDIT: |
This PR resolves #586 by adding the following public functions
I used a custom binary heap to implement them (and rewrite
k_smallest
) so that the logic could be agnostic as to where the memory came from. This is because I was originally motivated to make this PR by implementingk_smallest
for const sizes so as to not require an allocator. Using a custom heap implementation makes that extension straightforward, but I've not included it just now to make this first PR more manageable. Unfortunately, this requires an MSRV of 1.36, although the clippy config seems to indicate that's already the case.(The heap type is not publicly exposed to crate users, it is only implementation detail.)
The quickcheck test for
k_smallest
is completely untouched and still passes, so I am confident the heap logic is correct. I've also added doctests to some but not all of the new functions, which also pass, Let me know if you feel more documentation (in- or externally facing) is useful.