
Micro-optimize Ord::cmp for primitives #105840

Closed

Conversation

saethlin
Member

@saethlin saethlin commented Dec 17, 2022

I originally started looking into this because in MIR, `PartialOrd::cmp` is _huge_ and even for trivial types like `u32`, which are theoretically a single statement to compare, the `PartialOrd::cmp` impl doesn't inline. A significant contributor to the size of the implementation is that it has two comparisons. And this actually follows through to the final x86_64 codegen too, which is... strange. We don't need two `cmp` instructions in order to do a single Rust-level comparison. So I started tweaking the implementation and came up with the same thing as #64082 (which I didn't know about at the time). I ran `llvm-mca` on it, per the issue linked in the code, to establish that it looked better, and submitted it for a benchmark run.

The initial benchmark run regresses basically everything. By looking through the cachegrind diffs in the perf report then the `perf annotate` for regressed functions, I was able to identify one source of the regression: `Ord::min` and `Ord::max` no longer optimize well. Tweaking them to bypass `Ord::cmp` removed some regressions, but not much.
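
Illustratively (these helpers are hypothetical, not the libcore implementations), "bypassing `Ord::cmp`" for `min` means comparing the values directly instead of matching on the `Ordering` that `cmp` returns:

```rust
use std::cmp::Ordering;

// Going through cmp forces the optimizer to reason about the intermediate Ordering.
fn min_via_cmp<T: Ord>(a: T, b: T) -> T {
    match a.cmp(&b) {
        Ordering::Greater => b,
        _ => a,
    }
}

// Bypassing cmp compares the values directly.
fn min_direct<T: Ord>(a: T, b: T) -> T {
    if a <= b { a } else { b }
}
```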

Diving back into the cachegrind diffs and disassembly, I found one huge, widespread issue: the codegen for `Span`'s `hash_stable` regressed because `span_data_to_lines_and_cols` no longer inlined into it, because that function does a lot of `Range<BytePos>::contains`. The implementation of `Range::contains` uses `PartialOrd` multiple times, and we had massively regressed the codegen of `Range::contains`. The root problem here seems to be that `PartialOrd` is derived on `BytePos`, which is a simple wrapper around a `u32`. So for `BytePos`, `PartialOrd::{le, lt, ge, gt}` use the default impls, which go through `PartialOrd::cmp`, and LLVM fails to optimize these combinations of methods with the new `Ord::cmp` implementation. At a guess, the new implementation makes LLVM totally lose track of the fact that `<Ord for u32>::cmp` is an elaborate way to compare two integers.
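
To make the `BytePos` situation concrete, here is a sketch on a hypothetical stand-in newtype (`Pos` is not the rustc type): with a plain derive, the comparison operators fall back to the trait's provided defaults, which route through `partial_cmp` and therefore through `cmp`.

```rust
use std::cmp::Ordering;

// Hypothetical stand-in for rustc's BytePos, for illustration only.
#[derive(PartialEq, Eq, PartialOrd, Ord, Clone, Copy)]
struct Pos(u32);

// With the derive alone, `a < b` on Pos uses PartialOrd's provided `lt`,
// which behaves roughly like this: every operator funnels through the
// intermediate Ordering, so codegen quality depends on LLVM seeing through it.
fn lt_via_default(a: Pos, b: Pos) -> bool {
    matches!(a.partial_cmp(&b), Some(Ordering::Less))
}
```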

So I have low hopes for this overall, because my strategy (which is working) to recover the regressions is to avoid the "faster" implementation that this PR is based around. If we have to settle for an implementation of `Ord::cmp` which is on its own sub-optimal but is optimized better in combination with functions that use its return value in specific ways, so be it. However, one of the runs had an improvement in `coercions`. I don't know if that is jitter or relevant. But I'm still finding threads to pull here, so I'm going to keep at it.

For the moment I am hacking up the implementations on `BytePos` instead of modifying the code that `derive(PartialOrd, Ord)` expands to, because that would be hard, and it would also mean that we would just expand to more code, perhaps regressing compile time for that reason, even if the generated assembly is more efficient.
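
As a sketch of what "hacking up the implementations on `BytePos`" looks like (again on the hypothetical `Pos(u32)` stand-in, not the actual rustc change): drop the derived `PartialOrd` and implement the operator methods directly on the wrapped integer so they never go through `cmp`.

```rust
use std::cmp::Ordering;

#[derive(PartialEq, Eq, Clone, Copy)]
struct Pos(u32);

impl Ord for Pos {
    #[inline]
    fn cmp(&self, other: &Self) -> Ordering {
        self.0.cmp(&other.0)
    }
}

impl PartialOrd for Pos {
    #[inline]
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
    // Override the provided methods so `<`, `<=`, `>`, `>=` compare the
    // underlying u32 directly instead of inspecting an Ordering value.
    #[inline]
    fn lt(&self, other: &Self) -> bool { self.0 < other.0 }
    #[inline]
    fn le(&self, other: &Self) -> bool { self.0 <= other.0 }
    #[inline]
    fn gt(&self, other: &Self) -> bool { self.0 > other.0 }
    #[inline]
    fn ge(&self, other: &Self) -> bool { self.0 >= other.0 }
}
```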


Hacking up the remainder of the `PartialOrd`/`Ord` methods on `BytePos` took us down to 3 regressions and 6 improvements, which is interesting. All the improvements are in `coercions`, so I'm sure this improved _something_ but whether it matters... hard to say. Based on the findings of @joboet, I'm going to cherry-pick #106065 onto this branch, because that strategy seems to improve `PartialOrd::lt` and `PartialOrd::ge` back to the original codegen, even when they are using our new `Ord::cmp` impl. If the remaining perf regressions are due to de-optimizing a `PartialOrd::lt` not on `BytePos`, this might be a further improvement.


Okay, that cherry-pick brought us down to 2 regressions, but that might be noise. We still have the same 6 improvements, all on `coercions`.

I think the next thing to try here is modifying the implementation of `derive(PartialOrd)` to automatically emit the modifications that I made to `BytePos` (directly implementing all the methods for newtypes). But even if that works, I think the effect of this change is so mixed that it's probably not worth merging with current LLVM. What I'm afraid of is that this change currently pessimizes matching on `Ordering`, and that is the most natural thing to do with an enum. So I'm not closing this yet, but I think without a change from LLVM, I have other priorities at the moment.

r? @ghost

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Dec 17, 2022
@saethlin
Member Author

@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Dec 17, 2022
@bors
Contributor

bors commented Dec 17, 2022

⌛ Trying commit d409475 with merge 11078f1908ce4dce2ea59766cda464ec2ecae0b7...

@bors
Contributor

bors commented Dec 17, 2022

☀️ Try build successful - checks-actions
Build commit: 11078f1908ce4dce2ea59766cda464ec2ecae0b7 (11078f1908ce4dce2ea59766cda464ec2ecae0b7)

@rust-timer

This comment has been minimized.

@rust-timer
Collaborator

Finished benchmarking commit (11078f1908ce4dce2ea59766cda464ec2ecae0b7): comparison URL.

Overall result: ❌ regressions - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 1.0% | [0.3%, 2.9%] | 182 |
| Regressions ❌ (secondary) | 1.1% | [0.2%, 3.6%] | 126 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 1.0% | [0.3%, 2.9%] | 182 |

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 3.1% | [0.8%, 4.4%] | 9 |
| Regressions ❌ (secondary) | 2.4% | [0.7%, 4.7%] | 87 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -1.5% | [-1.5%, -1.4%] | 2 |
| All ❌✅ (primary) | 3.1% | [0.8%, 4.4%] | 9 |

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 1.9% | [1.6%, 2.3%] | 2 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -2.2% | [-2.2%, -2.2%] | 1 |
| All ❌✅ (primary) | 1.9% | [1.6%, 2.3%] | 2 |

@rustbot rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Dec 18, 2022
@scottmcm
Member

FWIW, I tried to do this back in #64082. Something might have changed in the past 3 years, though.

@saethlin
Member Author

Thanks for the context. I'm running down the source of the regression. Currently it looks like I might be able to fix it, but now that this has attracted attention I suppose I should properly report my findings 🙈

@rust-log-analyzer

This comment has been minimized.

```rust
let mut res = 0i8;
res -= (*self < *other) as i8;
res += (*self > *other) as i8;
// SAFETY: The discriminants of Ord were chosen to permit this
```
Contributor

I believe this should say "Ordering" rather than "Ord".

Should there be a comment near the discriminant values mentioning that safety of some impls depends on the values not being changed? Or a static assertion here that the discriminant values are the expected ones?

Member

The discriminants are visible on stable and it's explicitly `#[repr(i8)]`, so we probably couldn't change them anyway. But noting that in the enum's documentation -- either `///` or `//` -- seems quite reasonable.
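
For context, a self-contained sketch of the kind of `cmp` implementation being reviewed here, based on the snippet above (shown for `u32`; the safety argument relies on `Ordering` being `#[repr(i8)]` with `Less = -1`, `Equal = 0`, `Greater = 1`):

```rust
use std::cmp::Ordering;
use std::mem;

fn cmp_u32(a: &u32, b: &u32) -> Ordering {
    let mut res = 0i8;
    res -= (*a < *b) as i8; // -1 if a < b
    res += (*a > *b) as i8; // +1 if a > b
    // SAFETY: res is -1, 0, or 1, which are exactly the discriminants of the
    // #[repr(i8)] Ordering enum, so the transmute yields a valid value.
    unsafe { mem::transmute(res) }
}
```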

@joboet
Contributor

joboet commented Dec 19, 2022

While this improves the codegen for cmp, LLVM is unfortunately not smart enough to still optimize comparisons like is_lt():
https://godbolt.org/z/6dca8Y7jn

@joboet
Contributor

joboet commented Dec 19, 2022

Found a fix! 😄

`Ordering::is_lt()` and friends are implemented by matching against the variants of `Ordering`. This is inefficient, however, as the value has to be compared against a non-zero value (which often needs another register). If we change the `Ordering::is_lt()` method to

(*self as i8) < 0

the method is cleaner and LLVM can optimize the code in this PR:
https://godbolt.org/z/TbqTsKG57

Edit: This unfortunately only works for `is_lt` and `is_ge`. It doesn't fix the issue for `is_le` and `is_gt`, as LLVM is probably not smart enough to realize that only one of the comparisons can return a non-zero value...
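
A sketch of the two formulations being discussed, written as free functions standing in for the inherent `Ordering` methods:

```rust
use std::cmp::Ordering;

// Roughly the existing formulation: match on the variant, i.e. compare the
// discriminant against -1, which often needs an extra register.
fn is_lt_match(o: Ordering) -> bool {
    matches!(o, Ordering::Less)
}

// The proposed formulation: Less is the only variant with a negative
// discriminant, so a single signed comparison against zero suffices.
fn is_lt_sign(o: Ordering) -> bool {
    (o as i8) < 0
}
```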

@saethlin
Member Author

@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Dec 21, 2022
@rust-log-analyzer

This comment has been minimized.

@bors
Contributor

bors commented Dec 21, 2022

⌛ Trying commit 2f7cea4 with merge a3273402284853a92209f8cb8bfadfbdd9a76ba6...

@bors
Contributor

bors commented Dec 21, 2022

☀️ Try build successful - checks-actions
Build commit: a3273402284853a92209f8cb8bfadfbdd9a76ba6 (a3273402284853a92209f8cb8bfadfbdd9a76ba6)

@rust-timer

This comment has been minimized.

@rust-timer
Collaborator

Finished benchmarking commit (a3273402284853a92209f8cb8bfadfbdd9a76ba6): comparison URL.

Overall result: ❌ regressions - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 1.1% | [0.3%, 2.9%] | 157 |
| Regressions ❌ (secondary) | 1.2% | [0.2%, 4.0%] | 122 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 1.1% | [0.3%, 2.9%] | 157 |

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -4.8% | [-4.8%, -4.8%] | 1 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | -4.8% | [-4.8%, -4.8%] | 1 |

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 2.2% | [2.0%, 2.4%] | 3 |
| Regressions ❌ (secondary) | 3.7% | [3.6%, 3.8%] | 2 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 2.2% | [2.0%, 2.4%] | 3 |

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Dec 21, 2022
@saethlin
Member Author

@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

bors added a commit to rust-lang-ci/rust that referenced this pull request Nov 26, 2023
Add `Ord::cmp` for primitives as a `BinOp` in MIR

There are dozens of reasonable ways to implement `Ord::cmp` for integers using comparison, bit-ops, and branches.  Those differences are irrelevant at the rust level, however, so we can make things better by adding `BinOp::Cmp` at the MIR level:

1. Exactly how to implement it is left up to the backends, so LLVM can use whatever pattern its optimizer best recognizes and cranelift can use whichever pattern codegens the fastest.
2. By not inlining those details for every use of `cmp`, we drastically reduce the amount of MIR generated for `derive`d `PartialOrd`, while also making it more amenable to MIR-level optimizations.

Having extremely careful `if` ordering to μoptimize resource usage on broadwell (rust-lang#63767) is great, but it really feels to me like libcore is the wrong place to put that logic.  Similarly, using subtraction [tricks](https://graphics.stanford.edu/~seander/bithacks.html#CopyIntegerSign) (rust-lang#105840) is arguably even nicer, but depends on the optimizer understanding it (llvm/llvm-project#73417) to be practical.  Or maybe [bitor is better than add](https://discourse.llvm.org/t/representing-in-ir/67369/2?u=scottmcm)?  But maybe only on a future version that [has `or disjoint` support](https://discourse.llvm.org/t/rfc-add-or-disjoint-flag/75036?u=scottmcm)?  And just because one of those forms happens to be good for LLVM, there's no guarantee that it'd be the same form that GCC or Cranelift would rather see -- especially given their very different optimizers.  Not to mention that if LLVM gets a spaceship intrinsic -- [which it should](https://rust-lang.zulipchat.com/#narrow/stream/131828-t-compiler/topic/Suboptimal.20inlining.20in.20std.20function.20.60binary_search.60/near/404250586) -- we'll need at least a rustc intrinsic to be able to call it.

As for simplifying it in Rust, we now regularly inline `{integer}::partial_cmp`, but it's quite a large amount of IR.  The best way to see that is with rust-lang@8811efa#diff-d134c32d028fbe2bf835fef2df9aca9d13332dd82284ff21ee7ebf717bfa4765R113 -- I added a new pre-codegen MIR test for a simple 3-tuple struct, and this PR changes it from 36 locals and 26 basic blocks down to 24 locals and 8 basic blocks.  Even better, as soon as the construct-`Some`-then-match-it-in-same-BB noise is cleaned up, this'll expose the `Cmp == 0` branches clearly in MIR, so that an InstCombine (rust-lang#105808) can simplify that to just a `BinOp::Eq` and thus fix some of our generated code perf issues.  (Tracking that through today's `if a < b { Less } else if a == b { Equal } else { Greater }` would be *much* harder.)

---

r? `@ghost`
But first I should check that perf is ok with this
~~...and my true nemesis, tidy.~~
bors added a commit to rust-lang-ci/rust that referenced this pull request Nov 27, 2023
Add `Ord::cmp` for primitives as a `BinOp` in MIR

@rust-log-analyzer

This comment has been minimized.

@scottmcm
Member

cc llvm/llvm-project#73417 that might still be blocking this.

But also hopefully someone will do https://discourse.llvm.org/t/llvm-add-3-way-comparison-intrinsics/76807?u=scottmcm for GSoC that will let us stop worrying about all this stuff and make it LLVM's problem 🤞

@saethlin
Member Author

Thanks for the update! I figured if the upstream blockers were all addressed you'd be visibly celebrating, but also I'm doing some git gardening and all the other links are now completed, so I was getting my hopes up.

@scottmcm
Member

I did get excited in hopes that you knew something I didn't :)

Also, it's probably worth running perf on this anyway, just to see -- it might not optimize everywhere we want, but neither does the previous version, so as long as it's not worse...

@saethlin
Member Author

You asked me to do this one so it doesn't count toward "Am I hogging the perf queue?"
@bors try @rust-timer queue

@rust-timer

This comment has been minimized.

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 14, 2024
@bors
Contributor

bors commented Feb 14, 2024

⌛ Trying commit dcea8b1 with merge b373f97...

bors added a commit to rust-lang-ci/rust that referenced this pull request Feb 14, 2024
Micro-optimize Ord::cmp for primitives

@bors
Contributor

bors commented Feb 15, 2024

☀️ Try build successful - checks-actions
Build commit: b373f97 (b373f9721677b0fcc5781966e65def7fd20fbe3e)

@rust-timer

This comment has been minimized.

@rust-timer
Collaborator

Finished benchmarking commit (b373f97): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 1.0% | [1.0%, 1.0%] | 1 |
| Regressions ❌ (secondary) | 2.0% | [0.3%, 3.0%] | 8 |
| Improvements ✅ (primary) | -0.7% | [-1.0%, -0.3%] | 3 |
| Improvements ✅ (secondary) | -1.3% | [-1.3%, -1.3%] | 1 |
| All ❌✅ (primary) | -0.3% | [-1.0%, 1.0%] | 4 |

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 3.9% | [1.7%, 8.1%] | 3 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -4.9% | [-9.6%, -2.8%] | 4 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | -1.1% | [-9.6%, 8.1%] | 7 |

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -2.1% | [-2.2%, -2.1%] | 2 |
| All ❌✅ (primary) | - | - | 0 |

Binary size

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 0.1% | [0.0%, 1.1%] | 44 |
| Regressions ❌ (secondary) | 0.1% | [0.0%, 2.3%] | 37 |
| Improvements ✅ (primary) | -0.2% | [-0.5%, -0.1%] | 17 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 0.0% | [-0.5%, 1.1%] | 61 |

Bootstrap: 636.575s -> 637.528s (0.15%)
Artifact size: 306.16 MiB -> 308.15 MiB (0.65%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 15, 2024
@Voultapher
Contributor

@saethlin given the long-winded history of this issue and the adjacent LLVM changes, I'd really appreciate it if you could answer some questions about the current state of the proposed optimizations.

  1. How does the change impact code that only considers one of the comparison results, e.g. `a.lt(b)`? From what I've seen this is a very common pattern that should, in my opinion, not regress.
  2. How prevalent is the use of three-way branching on the outcome of `cmp`? https://grep.app might help in answering that.
  3. How does the suggested change affect debug performance? While the run-time of optimized builds is generally more important than that of debug builds, debug run-time too can have a significant impact on users. Many CI systems run tests for debug builds, and a 10x regression, for example, in a foundational function like `cmp` or a component like `slice::sort_unstable` could significantly impact such scenarios.
  4. What makes the compile-time regression suite a representative impact analysis of this optimization?

For some background, I've spent a lot of time the past two years evaluating and implementing sort implementations.

@saethlin
Member Author

What makes the compile-time regression suite a representative impact analysis of this optimization?

It isn't. The compile-time test suite has never been a good assessment for this, and that point is raised ad nauseam. That's why the perf suite now includes runtime benchmarks.

It is not worth my time to evaluate the run-time impacts of this change unless/until it provides a compile-time improvement, so I haven't.

@Voultapher
Contributor

I see, thanks for clarifying. Based on the initial description I was under the impression that the goal of this PR was to improve the run-time of `Ord::cmp`. It might help avoid confusion to communicate the intent of this optimization more clearly.

bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 24, 2024
Add `Ord::cmp` for primitives as a `BinOp` in MIR

bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 24, 2024
Add `Ord::cmp` for primitives as a `BinOp` in MIR

bors added a commit to rust-lang-ci/rust that referenced this pull request Apr 2, 2024
Add `Ord::cmp` for primitives as a `BinOp` in MIR

Update: most of this OP was written months ago.  See rust-lang#118310 (comment) below for where we got to recently that made it ready for review.

bjorn3 pushed a commit to bjorn3/rust that referenced this pull request Apr 5, 2024
Add `Ord::cmp` for primitives as a `BinOp` in MIR

GuillaumeGomez pushed a commit to GuillaumeGomez/rust that referenced this pull request Jul 10, 2024
Add `Ord::cmp` for primitives as a `BinOp` in MIR

@saethlin
Member Author

I think this was obviated by #118310

@scottmcm please correct me if I'm wrong.

@saethlin saethlin closed this Aug 10, 2024
@scottmcm
Member

I might say more by https://discourse.llvm.org/t/rfc-add-3-way-comparison-intrinsics/76685?u=scottmcm , but either way we'll hopefully not need to pick a form for this in the standard library any more, yup.

@saethlin saethlin deleted the ord-cmp branch August 10, 2024 04:02
Labels
perf-regression Performance regression. S-blocked Status: Marked as blocked ❌ on something else such as an RFC or other implementation work. T-libs Relevant to the library team, which will review and decide on the PR/issue.