-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize drain_filter #44355
Optimize drain_filter #44355
Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @aturon (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information. |
Do you have any benchmark numbers to indicate what kind of performance impact this has? |
No, but I'm happy to make some. |
Note: A similar optimization could have been made to Execute predicate on all elements -> Drop elements it was false for into Execute predicate on element -> Drop element immediately if false -> Execute predicate on next element -> ... |
Drop order isn't something I'd be particularly worried about changing IMO. |
@sfackler, wouldn't that be visible to users though if they catch panics? Suppose the predicate panics on the last element. Or is that not a typical concern for things like this? It's probably unspecified how this behaves, but I think changing behavior nevertheless means the perf increase has to be more significant. |
It could technically be a breaking change if somehow the predicate had different behavior based on whether or not elements in the vector had been dropped yet or not. I'm having a hard time thinking of a use case for such a construct though, not to mention it's kind of bad design in the first place as it relies on side effects that aren't guaranteed by any written documentation. |
@vitalyd "breaking change" is explicitly smaller in scope than "any change that is physically possible to observe". Otherwise we couldn't do reasonable seeming things like changing the size of a type or adding more information to a |
(Obligatory https://xkcd.com/1172/) |
@sfackler, FWIW I agree - it's a healthy approach that will make Rust better over the long haul (for an example of a language/ecosystem that's somewhat petrified of making changes, we can look at Java :)). At the same time, it's important to not alienate/frustrate users with "willy nilly" changes that break something but don't add any real value otherwise. Not saying that's proposed here, just a general comment. So a change that may break code at runtime, such as different drop order in a rare (and possibly untested) scenario, just means the benefit bar has to go up for that change. Anyway, sorry for the distraction - carry on :). |
Yeah for sure; it's always a bit of a judgement call. It's one of the reasons we have things like Crater and nightlies that let us look at the actual downstream impact to see if it's worse than we'd assumed. |
Yeah, I think crater/cargobomb are awesome tools for this space in general. But those are mostly (entirely?) about detecting compile-time breakage? |
Well at least we have the nightly channel, so that way users can report breakage they find themselves. |
Initial benchmarks weren't very promising, so I updated the pull request to grab the initial pointers more like the swap function. I've got a build and benchmark operation running overnight on my PC and I'll report back in a little less than 24 hours. |
Ok, so after the most recent commit these benchmarks look slightly better:
The time I would expect this to impact the most is Average time for removing one at beginning, which was reflected here. This is the program used to get these benchmarks: #![feature(drain_filter)]
use std::time::{Duration, Instant};
fn main() {
let mut total = Duration::new(0, 0);
for _ in 0..100 {
let mut v = Vec::new();
for i in 0..9999999 {
v.push(i);
}
let start = Instant::now();
assert_eq!(v.drain_filter(|e| *e == 0).count(), 1);
total += start.elapsed() / 1000;
}
println!("Average time for removing one at beginning: {:?}", total);
let mut total = Duration::new(0, 0);
for _ in 0..100 {
let mut v = Vec::new();
for i in 0..9999999 {
v.push(i);
}
let start = Instant::now();
assert_eq!(v.drain_filter(|e| *e == 9999998).count(), 1);
total += start.elapsed() / 1000;
}
println!("Average time for removing one at ending: {:?}", total);
let mut total = Duration::new(0, 0);
for _ in 0..100 {
let mut v = Vec::new();
for _ in 0..9999999 {
v.push(0);
}
let start = Instant::now();
assert_eq!(v.drain_filter(|e| *e == 0).count(), 9999999);
total += start.elapsed() / 1000;
}
println!("Average time for removing all elements: {:?}", total);
let mut total = Duration::new(0, 0);
for _ in 0..100 {
let mut v = Vec::new();
for _ in 0..9999999 {
v.push(0);
}
let start = Instant::now();
assert_eq!(v.drain_filter(|e| *e != 0).count(), 0);
total += start.elapsed() / 1000;
}
println!("Average time for removing no elements: {:?}", total);
} Overall assessment: It's a very slight improvement, not really worth investigating for |
I suppose that for the scenario of removing a single element at the front of a very long vector this did boost performance by about 20%. So maybe it would be worth it for |
review ping @aturon ! |
src/liballoc/vec.rs
Outdated
let del = self.del; | ||
let src: *const T = &v[i]; | ||
let dst: *mut T = &mut v[i - del]; | ||
ptr::copy_nonoverlapping(src, dst, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this requires a comment to explain why this is safe (for example in case of unwinding). The crutch is quite literally not local, which is that the Vec had its length set to 0.
Do the bounds checks optimize out here? Is there anything to gain from writing this entirely in raw pointers instead of using indices? We've seen before that the compiler doesn't optimize out indices and lengths into slices (or into slice iterators, not relevant here) entirely fluently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add that comment. The first implementation did use raw pointers but they did not perform favorably in benchmarks. I believe this was because pointer::offset
required me to convert the usize to an isize. In order to prevent this I could just store isize values and instead make all usages of this use raw pointers. I'll try that and see how it goes in benchmarking.
This is a bit anti-optimization, but I really want us to make the best effort on unwinding and not just drop the ball and leak things. This issue is pre-existing and not due to this PR! Since this is sort of a successor to retain, I see the unwinding issue to be of higher importance here than with the regular drain; here we have a closure that the user might cause a panic in. |
Discussing with @frankmcsherry he pointed out that it does more or less handle panic ok -- if futher calls to the closure don't panic. The concerning thing is that if the filter closure panics, it will still be attempted to be called again in DrainFilter's drop. |
Some thoughts (from convo with @bluss):
|
f62d618
to
4de0cf1
Compare
So I added code that just used ptr::offset with isize values and despite this removing the bounds check the bench mark actually performed significantly worst than even the implementation that uses swap, so I removed that commit from this PR. I'll play around with @frankmcsherry's suggestion number 3 tomorrow. These benchmarks are being executed on the following CPU:
|
Side note: I'm not really sure what correct behavior during a panic is. If someone can just tell me what should be dropped when then I'm happy to write it. |
ping @aturon @rust-lang/libs! this needs a review please! |
Due to being unable to use nonoverlapping suggestion 3 has significant performance loss. I'll add the comment @bluss mentioned now. |
Ok this PR itself looks good to me, there's a benchmark showing a nice improvement in some cases, and a concern about panics/drops isn't exacerbated as it's not changing the behavior from what I can tell. Sounds like dealing with panics/drops is best done in a separate issue/PR, so I'm going to... @bors: r+ Thanks for being patient @Xaeroxe! |
📌 Commit 10384ab has been approved by |
Optimize drain_filter This PR cuts out two copies from each iteration of `drain_filter` by exchanging the swap operation for a copy_nonoverlapping function call instead. Since the data being swapped is not needed anymore we can just overwrite it instead.
💔 Test failed - status-travis |
Looks like an unrelated Travis failure? Retry? |
@bors: retry
|
Optimize drain_filter This PR cuts out two copies from each iteration of `drain_filter` by exchanging the swap operation for a copy_nonoverlapping function call instead. Since the data being swapped is not needed anymore we can just overwrite it instead.
Edit: 🤦 |
☀️ Test successful - status-appveyor, status-travis |
Can we also land this for retain? |
@arthurprs this has just been r+ed for retain in #48065 |
Apply optimization from rust-lang#44355 to retain As discussed in rust-lang#44355 this PR applies a similar optimization to `Vec::retain`. For `drain_filter`, a very similar function, this improved performance by up to 20%.
Optimize Vec::retain Use `copy_non_overlapping` instead of `swap` to reduce memory writes, like what we've done in rust-lang#44355 and `String::retain`. rust-lang#48065 already tried to do this optimization but it is reverted in rust-lang#67300 due to bad codegen of `DrainFilter::drop`. This PR re-implement the drop-then-move approach. I did a [benchmark](https://gist.github.com/oxalica/3360eec9376f22533fcecff02798b698) on small-no-drop, small-need-drop, large-no-drop elements with different predicate functions. It turns out that the new implementation is >20% faster in average for almost all cases. Only 2/24 cases are slower by 3% and 5%. See the link above for more detail. I think regression in may-panic cases is due to drop-guard preventing some optimization. If it's permitted to leak elements when predicate function of element's `drop` panic, the new implementation should be almost always faster than current one. I'm not sure if we should leak on panic, since there is indeed an issue (rust-lang#52267) complains about it before.
This PR cuts out two copies from each iteration of
drain_filter
by exchanging the swap operation for a copy_nonoverlapping function call instead. Since the data being swapped is not needed anymore we can just overwrite it instead.