Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize drain_filter #44355

Merged
merged 4 commits into from
Sep 20, 2017
Merged

Optimize drain_filter #44355

merged 4 commits into from
Sep 20, 2017

Conversation

Xaeroxe
Copy link
Contributor

@Xaeroxe Xaeroxe commented Sep 5, 2017

This PR cuts out two copies from each iteration of drain_filter by exchanging the swap operation for a copy_nonoverlapping function call instead. Since the data being swapped is not needed anymore we can just overwrite it instead.

@rust-highfive
Copy link
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @aturon (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@sfackler
Copy link
Member

sfackler commented Sep 5, 2017

Do you have any benchmark numbers to indicate what kind of performance impact this has?

@Xaeroxe
Copy link
Contributor Author

Xaeroxe commented Sep 5, 2017

No, but I'm happy to make some.

@Xaeroxe
Copy link
Contributor Author

Xaeroxe commented Sep 5, 2017

Note: A similar optimization could have been made to Vec::retain but at this point that would technically be a breaking change, as it would change the execution order from

Execute predicate on all elements -> Drop elements it was false for

into

Execute predicate on element -> Drop element immediately if false -> Execute predicate on next element -> ...

@sfackler
Copy link
Member

sfackler commented Sep 5, 2017

Drop order isn't something I'd be particularly worried about changing IMO.

@vitalyd
Copy link

vitalyd commented Sep 5, 2017

@sfackler, wouldn't that be visible to users though if they catch panics? Suppose the predicate panics on the last element. Or is that not a typical concern for things like this? It's probably unspecified how this behaves, but I think changing behavior nevertheless means the perf increase has to be more significant.

@Xaeroxe
Copy link
Contributor Author

Xaeroxe commented Sep 5, 2017

It could technically be a breaking change if somehow the predicate had different behavior based on whether or not elements in the vector had been dropped yet or not. I'm having a hard time thinking of a use case for such a construct though, not to mention it's kind of bad design in the first place as it relies on side effects that aren't guaranteed by any written documentation.

@sfackler
Copy link
Member

sfackler commented Sep 5, 2017

@vitalyd "breaking change" is explicitly smaller in scope than "any change that is physically possible to observe". Otherwise we couldn't do reasonable seeming things like changing the size of a type or adding more information to a Debug implementation. With behavior changes like this, it generally depends on if anyone would "reasonably" be depending on it, which I don't think is the case here.

@sfackler
Copy link
Member

sfackler commented Sep 5, 2017

(Obligatory https://xkcd.com/1172/)

@vitalyd
Copy link

vitalyd commented Sep 5, 2017

@sfackler, FWIW I agree - it's a healthy approach that will make Rust better over the long haul (for an example of a language/ecosystem that's somewhat petrified of making changes, we can look at Java :)).

At the same time, it's important to not alienate/frustrate users with "willy nilly" changes that break something but don't add any real value otherwise. Not saying that's proposed here, just a general comment. So a change that may break code at runtime, such as different drop order in a rare (and possibly untested) scenario, just means the benefit bar has to go up for that change.

Anyway, sorry for the distraction - carry on :).

@sfackler
Copy link
Member

sfackler commented Sep 5, 2017

Yeah for sure; it's always a bit of a judgement call. It's one of the reasons we have things like Crater and nightlies that let us look at the actual downstream impact to see if it's worse than we'd assumed.

@vitalyd
Copy link

vitalyd commented Sep 6, 2017

Yeah, I think crater/cargobomb are awesome tools for this space in general. But those are mostly (entirely?) about detecting compile-time breakage?

@Xaeroxe
Copy link
Contributor Author

Xaeroxe commented Sep 6, 2017

Well at least we have the nightly channel, so that way users can report breakage they find themselves.

@Xaeroxe
Copy link
Contributor Author

Xaeroxe commented Sep 6, 2017

Initial benchmarks weren't very promising, so I updated the pull request to grab the initial pointers more like the swap function. I've got a build and benchmark operation running overnight on my PC and I'll report back in a little less than 24 hours.

@Xaeroxe
Copy link
Contributor Author

Xaeroxe commented Sep 6, 2017

Ok, so after the most recent commit these benchmarks look slightly better:

# Current nightly as of 2017-09-05
Average time for removing one at beginning: Duration { secs: 0, nanos: 62956957 }
Average time for removing one at ending:    Duration { secs: 0, nanos: 42066468 }
Average time for removing all elements:     Duration { secs: 0, nanos: 66166225 }
Average time for removing no elements:      Duration { secs: 0, nanos: 41859578 }
info: default toolchain set to 'optimization'
# This PR
Average time for removing one at beginning: Duration { secs: 0, nanos: 52273292 }
Average time for removing one at ending:    Duration { secs: 0, nanos: 41589495 }
Average time for removing all elements:     Duration { secs: 0, nanos: 65309229 }
Average time for removing no elements:      Duration { secs: 0, nanos: 42410923 }

The time I would expect this to impact the most is Average time for removing one at beginning, which was reflected here. This is the program used to get these benchmarks:

#![feature(drain_filter)]

use std::time::{Duration, Instant};

fn main() {
    let mut total = Duration::new(0, 0);
    for _ in 0..100 {
        let mut v = Vec::new();
        for i in 0..9999999 {
            v.push(i);
        }
        let start = Instant::now();
        assert_eq!(v.drain_filter(|e| *e == 0).count(), 1);
        total += start.elapsed() / 1000;
    }
    println!("Average time for removing one at beginning: {:?}", total);
    let mut total = Duration::new(0, 0);
    for _ in 0..100 {
        let mut v = Vec::new();
        for i in 0..9999999 {
            v.push(i);
        }
        let start = Instant::now();
        assert_eq!(v.drain_filter(|e| *e == 9999998).count(), 1);
        total += start.elapsed() / 1000;
    }
    println!("Average time for removing one at ending:    {:?}", total);
    let mut total = Duration::new(0, 0);
    for _ in 0..100 {
        let mut v = Vec::new();
        for _ in 0..9999999 {
            v.push(0);
        }
        let start = Instant::now();
        assert_eq!(v.drain_filter(|e| *e == 0).count(), 9999999);
        total += start.elapsed() / 1000;
    }
    println!("Average time for removing all elements:     {:?}", total);
    let mut total = Duration::new(0, 0);
    for _ in 0..100 {
        let mut v = Vec::new();
        for _ in 0..9999999 {
            v.push(0);
        }
        let start = Instant::now();
        assert_eq!(v.drain_filter(|e| *e != 0).count(), 0);
        total += start.elapsed() / 1000;
    }
    println!("Average time for removing no elements:      {:?}", total);
}

Overall assessment: It's a very slight improvement, not really worth investigating for Vec::retain but eh, it's already written for this so there's no harm in merging it.

@Xaeroxe
Copy link
Contributor Author

Xaeroxe commented Sep 6, 2017

I suppose that for the scenario of removing a single element at the front of a very long vector this did boost performance by about 20%. So maybe it would be worth it for Vec::retain?

@aidanhs aidanhs added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Sep 7, 2017
@carols10cents
Copy link
Member

review ping @aturon !

let del = self.del;
let src: *const T = &v[i];
let dst: *mut T = &mut v[i - del];
ptr::copy_nonoverlapping(src, dst, 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this requires a comment to explain why this is safe (for example in case of unwinding). The crutch is quite literally not local, which is that the Vec had its length set to 0.

Do the bounds checks optimize out here? Is there anything to gain from writing this entirely in raw pointers instead of using indices? We've seen before that the compiler doesn't optimize out indices and lengths into slices (or into slice iterators, not relevant here) entirely fluently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add that comment. The first implementation did use raw pointers but they did not perform favorably in benchmarks. I believe this was because pointer::offset required me to convert the usize to an isize. In order to prevent this I could just store isize values and instead make all usages of this use raw pointers. I'll try that and see how it goes in benchmarking.

@bluss
Copy link
Member

bluss commented Sep 11, 2017

This is a bit anti-optimization, but I really want us to make the best effort on unwinding and not just drop the ball and leak things. This issue is pre-existing and not due to this PR!

Since this is sort of a successor to retain, I see the unwinding issue to be of higher importance here than with the regular drain; here we have a closure that the user might cause a panic in.

@bluss
Copy link
Member

bluss commented Sep 11, 2017

Discussing with @frankmcsherry he pointed out that it does more or less handle panic ok -- if futher calls to the closure don't panic. The concerning thing is that if the filter closure panics, it will still be attempted to be called again in DrainFilter's drop.

@frankmcsherry
Copy link
Contributor

frankmcsherry commented Sep 11, 2017

Some thoughts (from convo with @bluss):

  1. It looks like DrainFilter's drop method drains the iterator. This seems like it might be prone to finding more panics, leading to an abort. Might it be better to check if panicking (std::thread::panicking) and just drop the remaining region in that case. This would deviate from the intended behavior of drain_filter, but I'm not sure what the intent of drop in a panic is (e.g. get out as cleanly as possible; don't worry about maintaining application level invariants?).

  2. I'm not sure if you have one at hand, but the performance could plausibly be more significant on a multicore set-up. It can be hard to saturate your memory bandwidth with just one core, and one of the big changes is removing the need to write back the swapped memory, which eases memory bandwidth.

  3. I can't tell if the "remove first" case got faster because it is better, or because LLVM can see that this happens only once and optimizes lots of small memory moves into larger ones. If you really wanted to dork around with the performance, you could wait until the predicate succeeds to move, allowing you to shift blocks of ignored memory rather than single elements (but preventing copy_nonoverlapping).

@Xaeroxe
Copy link
Contributor Author

Xaeroxe commented Sep 12, 2017

So I added code that just used ptr::offset with isize values and despite this removing the bounds check the bench mark actually performed significantly worst than even the implementation that uses swap, so I removed that commit from this PR. I'll play around with @frankmcsherry's suggestion number 3 tomorrow. These benchmarks are being executed on the following CPU:

CPU:       Quad core Intel Core i7-6700HQ (-HT-MCP-) cache: 6144 KB
           flags: (lm nx sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx) bmips: 20735
           clock speeds: min/max: 800/3500 MHz 1: 844 MHz 2: 825 MHz 3: 799 MHz 4: 862 MHz 5: 800 MHz
6: 800 MHz 7: 800 MHz 8: 822 MHz

@Xaeroxe
Copy link
Contributor Author

Xaeroxe commented Sep 12, 2017

Side note: I'm not really sure what correct behavior during a panic is. If someone can just tell me what should be dropped when then I'm happy to write it.

@alexcrichton
Copy link
Member

ping @aturon for a review!

Also @Xaeroxe want to add the comment that @bluss mentioned?

@carols10cents
Copy link
Member

ping @aturon @rust-lang/libs! this needs a review please!

@Xaeroxe
Copy link
Contributor Author

Xaeroxe commented Sep 18, 2017

Due to being unable to use nonoverlapping suggestion 3 has significant performance loss. I'll add the comment @bluss mentioned now.

@alexcrichton
Copy link
Member

Ok this PR itself looks good to me, there's a benchmark showing a nice improvement in some cases, and a concern about panics/drops isn't exacerbated as it's not changing the behavior from what I can tell. Sounds like dealing with panics/drops is best done in a separate issue/PR, so I'm going to...

@bors: r+

Thanks for being patient @Xaeroxe!

@bors
Copy link
Contributor

bors commented Sep 19, 2017

📌 Commit 10384ab has been approved by alexcrichton

@bors
Copy link
Contributor

bors commented Sep 19, 2017

⌛ Testing commit 10384ab with merge 8bc544a...

bors added a commit that referenced this pull request Sep 19, 2017
Optimize drain_filter

This PR cuts out two copies from each iteration of `drain_filter` by exchanging the swap operation for a copy_nonoverlapping function call instead.  Since the data being swapped is not needed anymore we can just overwrite it instead.
@bors
Copy link
Contributor

bors commented Sep 19, 2017

💔 Test failed - status-travis

@Xaeroxe
Copy link
Contributor Author

Xaeroxe commented Sep 19, 2017

Looks like an unrelated Travis failure? Retry?

@alexcrichton
Copy link
Member

@bors: retry

  • yeah weird travis scheduling...

@bors
Copy link
Contributor

bors commented Sep 20, 2017

⌛ Testing commit 10384ab with merge 94a82ad...

bors added a commit that referenced this pull request Sep 20, 2017
Optimize drain_filter

This PR cuts out two copies from each iteration of `drain_filter` by exchanging the swap operation for a copy_nonoverlapping function call instead.  Since the data being swapped is not needed anymore we can just overwrite it instead.
@kennytm
Copy link
Member

kennytm commented Sep 20, 2017

dist-mips-linux (3 restarts) and x86_64-gnu-incremental (2 restarts) hit travis-ci/travis-ci#8429, expect one more hour before merging 🤦‍.

Edit: 🤦‍

@bors
Copy link
Contributor

bors commented Sep 20, 2017

☀️ Test successful - status-appveyor, status-travis
Approved by: alexcrichton
Pushing 94a82ad to master...

@bors bors merged commit 10384ab into rust-lang:master Sep 20, 2017
@arthurprs
Copy link
Contributor

Can we also land this for retain?

@Xaeroxe Xaeroxe deleted the optimize_drain_filter branch February 7, 2018 19:44
Xaeroxe added a commit to Xaeroxe/rust that referenced this pull request Feb 8, 2018
@Xaeroxe
Copy link
Contributor Author

Xaeroxe commented Feb 13, 2018

@arthurprs this has just been r+ed for retain in #48065

kennytm added a commit to kennytm/rust that referenced this pull request Feb 14, 2018
Apply optimization from rust-lang#44355 to retain

As discussed in rust-lang#44355 this PR applies a similar optimization to `Vec::retain`.  For `drain_filter`, a very similar function, this improved performance by up to 20%.
@oxalica oxalica mentioned this pull request Jan 17, 2021
bors added a commit to rust-lang-ci/rust that referenced this pull request Feb 11, 2021
Optimize Vec::retain

Use `copy_non_overlapping` instead of `swap` to reduce memory writes, like what we've done in rust-lang#44355 and `String::retain`.
rust-lang#48065 already tried to do this optimization but it is reverted in rust-lang#67300 due to bad codegen of `DrainFilter::drop`.

This PR re-implement the drop-then-move approach. I did a [benchmark](https://gist.github.com/oxalica/3360eec9376f22533fcecff02798b698) on small-no-drop, small-need-drop, large-no-drop elements with different predicate functions. It turns out that the new implementation is >20% faster in average for almost all cases. Only 2/24 cases are slower by 3% and 5%. See the link above for more detail.

I think regression in may-panic cases is due to drop-guard preventing some optimization. If it's permitted to leak elements when predicate function of element's `drop` panic, the new implementation should be almost always faster than current one.
I'm not sure if we should leak on panic, since there is indeed an issue (rust-lang#52267) complains about it before.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties.
Projects
None yet
Development

Successfully merging this pull request may close these issues.