
Tweak the default `PartialOrd::{lt,le,gt,ge}` #106065

Closed

Conversation

scottmcm (Member) commented Dec 22, 2022

r? @saethlin
who noticed that #105840 was having trouble because of these default implementations.

That got me inspired to give this a shot, to see whether tweaking those defaults might actually improve things -- and hopefully make that PR easier to land. (And maybe even easier to test, since this adds a codegen test that that PR would not want to regress.)

Specifically, I noticed in https://rust.godbolt.org/z/3fbve7eW7 that

```rust
new_cmp(x, y) < Ordering::Equal
```

did optimize as desired, whereas

```rust
new_cmp(x, y) == Ordering::Less
```

didn't. So this PR bases all the `Ordering` methods around comparisons against 0, rather than trying to match specific variants.

Let's see what perf says 🤞


EDIT: Also, credit to @joboet in #105840 (comment) who first pointed out that matching the variants directly isn't necessarily better.

rustbot (Collaborator) commented Dec 22, 2022

Failed to set assignee to saethlin: invalid assignee

Note: Only org members, users with write permissions, or people who have commented on the PR may be assigned.

rustbot added the S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties) and T-libs (Relevant to the library team, which will review and decide on the PR/issue) labels on Dec 22, 2022

scottmcm (Member, Author) commented:

@bors try @rust-timer queue


rustbot added the S-waiting-on-perf (Status: Waiting on a perf run to be completed) label on Dec 22, 2022
bors (Contributor) commented Dec 22, 2022

⌛ Trying commit 6e1c3f0 with merge b6f32e9a3b254c2d1a3431d90ed5169aca532ea6...

Review thread on the added codegen test:

```rust
use std::cmp::Ordering;

#[derive(PartialOrd, PartialEq)]
pub struct Foo(u16);
```
scottmcm (Member, Author) commented:

Hopefully this test will ensure that the problem you saw with `BytePos` won't happen again, and will make any accidental regression easier to catch.
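(A sketch of what such a codegen test can assert; the `CHECK` lines below are illustrative assumptions, not the PR's exact test. FileCheck scans the optimized LLVM IR for a single branch-free `icmp`.)

```rust
// compile-flags: -O

#[derive(PartialOrd, PartialEq)]
pub struct Foo(u16);

// CHECK-LABEL: @derived_lt
// CHECK: icmp ult i16
// CHECK-NOT: br
#[no_mangle]
pub fn derived_lt(a: Foo, b: Foo) -> bool {
    a < b
}
```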

```diff
@@ -1161,7 +1175,11 @@ pub trait PartialOrd<Rhs: ?Sized = Self>: PartialEq<Rhs> {
     #[must_use]
     #[stable(feature = "rust1", since = "1.0.0")]
     fn gt(&self, other: &Rhs) -> bool {
-        matches!(self.partial_cmp(other), Some(Greater))
+        if let Some(ordering) = self.partial_cmp(other) {
+            ordering.is_gt()
+        } else {
+            false
+        }
     }
```
scottmcm (Member, Author) commented:

This is now conceptually two checks rather than just one, so it's possible it's not always better. `None` is currently 2 here, so the old code was hypothetically just `c == 1`, and now it's `c != 2 && c > 0`. (Of course `lt` ends up being `c != 2 && c < 0`, which obviously folds to `c < 0`, so that one's probably not impacted.)
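(To make the `c == 2` encoding concrete, a small check of the current niche layout; this is observed compiler behavior, not a guaranteed layout.)

```rust
use std::cmp::Ordering;

// Option<Ordering> fits in one byte: Less/Equal/Greater keep their
// i8 values (-1, 0, 1) and None takes the spare niche value 2.
fn as_byte(o: Option<Ordering>) -> i8 {
    unsafe { std::mem::transmute(o) }
}

fn main() {
    assert_eq!(as_byte(Some(Ordering::Less)), -1);
    assert_eq!(as_byte(Some(Ordering::Equal)), 0);
    assert_eq!(as_byte(Some(Ordering::Greater)), 1);
    assert_eq!(as_byte(None), 2); // the "c == 2" case discussed above
}
```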

My hope is that this is still better in practice:

  1. I would bet that most `partial_cmp`s are actually `cmp`s, and thus the optimizer will easily notice that the result is never `None` -- like happens in the codegen test.
  2. For things that can actually return `None`, hopefully jump threading will usually notice that the `None` becomes `false` and will again bypass actually running this check at runtime.

I'll see if I can prove that out in a codegen test...

scottmcm (Member, Author) commented:

Well, I didn't manage to make a great codegen test for this, but I did in passing find two other things:

bors (Contributor) commented Dec 23, 2022

☀️ Try build successful - checks-actions
Build commit: b6f32e9a3b254c2d1a3431d90ed5169aca532ea6



rust-timer (Collaborator) commented:

Finished benchmarking commit (b6f32e9a3b254c2d1a3431d90ed5169aca532ea6): comparison URL.

Overall result: no relevant changes - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | 2.0% | [2.0%, 2.0%] | 1 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | - | - | 0 |

Cycles

This benchmark run did not return any relevant results for this metric.

rustbot removed the S-waiting-on-perf (Status: Waiting on a perf run to be completed) label on Dec 23, 2022
scottmcm (Member, Author) commented Dec 23, 2022

Well that's a whole lot of nothing in perf 😅

I saw your thumb, so how do you feel about this change given that it's perf-neutral?

r? @compiler-errors

I could also cut this back to just the codegen test (it already passes), if the test is useful but we don't want the core changes.

(Apparently highfive didn't like my previously-proposed reviewer.)

saethlin (Member) commented:

Yeah, I'm only t-miri, I can't approve anything in this repo.

I like this work, but I'm extremely wary of checking in subtle changes like this that aren't backed up by any kind of test. I'm very curious to know what an LLVM expert thinks of that issue. If this is another "oh we're missing a fold for that" situation, that would be awesome. But I kind of doubt it.

scottmcm (Member, Author) commented Dec 23, 2022

@saethlin I went to try making an assembly test checking that `is_le` gives `setle` and such, but LLVM does very different things for the different cases, so I opened llvm/llvm-project#59668 to see whether those make sense -- I wouldn't want to add a super-flaky test that would break on improved LLVM.

I'm hoping that the answer really is that there's just some fold or range logic missing. Alive2 proves that it's allowed to do it, so it's a matter of how/where to recognize it.

saethlin (Member) commented:

Wow that's very minimized. You're getting my hopes up...

scottmcm (Member, Author) commented:

That new 59668 one is too minimized to help the original, though -- it's about the backend, and in IR (where the optimizations that #105840 cares about would happen) they're all just `icmp`s, so a change to the x64 codegen for comparisons against 0 wouldn't help.

scottmcm (Member, Author) commented:

I'm going to close this since the lts for non-Ord aren't always better, so it's not obvious that this should happen without other stuff -- maybe #105840 or llvm/llvm-project#59666.

I've submitted the codegen test as #106100.

scottmcm closed this Dec 23, 2022
compiler-errors (Member) commented:

Sorry, didn't comment on this before it was closed, but I agree that given the lack of improvement these changes are not worth.

scottmcm (Member, Author) commented:

@compiler-errors No worries! Thanks for commenting.

Given that it's the end of year I have no expectations that people would be looking at things for a while.

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this pull request Dec 24, 2022

Codegen test for derived `<` on trivial newtype [TEST ONLY]

I originally wrote this for rust-lang#106065, but the libcore changes there aren't necessarily a win.

So I pulled out this test to be its own PR since it's important (see rust-lang#105840 (comment)) and well-intentioned changes to core or the derive could accidentally break it without that being obvious (other than by massive unexplained perf changes).
bors added a commit to rust-lang-ci/rust that referenced this pull request Feb 14, 2024
Micro-optimize Ord::cmp for primitives

I originally started looking into this because in MIR, `PartialOrd::cmp` is _huge_ and even for trivial types like `u32` which are theoretically a single statement to compare, the `PartialOrd::cmp` impl doesn't inline. A significant contributor to the size of the implementation is that it has two comparisons. And this actually follows through to the final x86_64 codegen too, which is... strange. We don't need two `cmp` instructions in order to do a single Rust-level comparison. So I started tweaking the implementation, and came up with the same thing as rust-lang#64082 (which I didn't know about at the time), I ran `llvm-mca` on it per the issue which was linked in the code to establish that it looked better, and submitted it for a benchmark run.
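(The "same thing as rust-lang#64082" is roughly this single-subtraction shape; a sketch, not necessarily the branch's exact code.)

```rust
use std::cmp::Ordering;

// (a > b) - (a < b) yields -1, 0, or 1, matching Ordering's #[repr(i8)]
// values, so one subtraction replaces two separate branchy comparisons.
fn cmp_u32(a: u32, b: u32) -> Ordering {
    match (a > b) as i8 - (a < b) as i8 {
        -1 => Ordering::Less,
        0 => Ordering::Equal,
        _ => Ordering::Greater,
    }
}
```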

The initial benchmark run regresses basically everything. By looking through the cachegrind diffs in the perf report then the `perf annotate` for regressed functions, I was able to identify one source of the regression: `Ord::min` and `Ord::max` no longer optimize well. Tweaking them to bypass `Ord::cmp` removed some regressions, but not much.

Diving back into the cachegrind diffs and disassembly, I found one huge widespread issue was that the codegen for `Span`'s `hash_stable` regressed because `span_data_to_lines_and_cols` no longer inlined into it, because that function does a lot of `Range<BytePos>::contains`. The implementation of `Range::contains` uses `PartialOrd` multiple times, and we had massively regressed the codegen of `Range::contains`. The root problem here seems to be that `PartialOrd` is derived on `BytePos`, which is a simple wrapper around a `u32`. So for `BytePos`, `PartialOrd::{le, lt, ge, gt}` use the default impls, which go through `PartialOrd::cmp`, and LLVM fails to optimize these combinations of methods with the new `Ord::cmp` implementation. At a guess, the new implementation makes LLVM totally lose track of the fact that `<Ord for u32>::cmp` is an elaborate way to compare two integers.
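(Why `Range::contains` is exposed to this: it bottoms out in the `PartialOrd` operators, roughly like this simplified model, so a regressed default `le`/`lt` hits it directly.)

```rust
use std::ops::Range;

// Simplified model of Range::<u32>::contains: both comparisons route
// through PartialOrd, i.e. the default le/lt that call partial_cmp.
fn contains(range: &Range<u32>, item: u32) -> bool {
    range.start <= item && item < range.end
}
```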

So I have low hopes for this overall, because my strategy (which is working) to recover the regressions is to avoid the "faster" implementation that this PR is based around. If we have to settle for an implementation of `Ord::cmp` which is on its own sub-optimal but is optimized better in combination with functions that use its return value in specific ways, so be it. However, one of the runs had an improvement in `coercions`. I don't know if that is jitter or relevant. But I'm still finding threads to pull here, so I'm going to keep at it.

For the moment I am hacking up the implementations on `BytePos` instead of modifying the code that `derive(PartialOrd, Ord)` expands to because that would be hard, and it would also mean that we would just expand to more code, perhaps regressing compile time for that reason, even if the generated assembly is more efficient.
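("Hacking up the implementations on `BytePos`" means overriding the provided `PartialOrd` methods directly on the newtype, along these lines; a sketch of the approach, not the branch's exact diff.)

```rust
use std::cmp::Ordering;

#[derive(PartialEq, Eq)]
pub struct BytePos(pub u32);

// Override the provided methods so lt/le/gt/ge compare the inner u32
// directly instead of routing through partial_cmp's Option<Ordering>.
impl PartialOrd for BytePos {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.0.cmp(&other.0))
    }
    fn lt(&self, other: &Self) -> bool { self.0 < other.0 }
    fn le(&self, other: &Self) -> bool { self.0 <= other.0 }
    fn gt(&self, other: &Self) -> bool { self.0 > other.0 }
    fn ge(&self, other: &Self) -> bool { self.0 >= other.0 }
}
```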

---

Hacking up the remainder of the `PartialOrd`/`Ord` methods on `BytePos` took us down to 3 regressions and 6 improvements, which is interesting. All the improvements are in `coercions`, so I'm sure this improved _something_ but whether it matters... hard to say. Based on the findings of `@joboet`, I'm going to cherry-pick rust-lang#106065 onto this branch, because that strategy seems to improve `PartialOrd::lt` and `PartialOrd::ge` back to the original codegen, even when they are using our new `Ord::cmp` impl. If the remaining perf regressions are due to de-optimizing a `PartialOrd::lt` not on `BytePos`, this might be a further improvement.

---

Okay, that cherry-pick brought us down to 2 regressions but that might be noise. We still have the same 6 improvements, all on `coercions`.

I think the next thing to try here is modifying the implementation of `derive(PartialOrd)` to automatically emit the modifications that I made to `BytePos` (directly implementing all the methods for newtypes). But even if that works, I think the effect of this change is so mixed that it's probably not worth merging with current LLVM. What I'm afraid of is that this change currently pessimizes matching on `Ordering`, and that is the most natural thing to do with an enum. So I'm not closing this yet, but I think without a change from LLVM, I have other priorities at the moment.

r? `@ghost`