use try_fold instead of try_for_each to reduce compile time #64885
@bors try @rust-timer queue
Awaiting bors try build completion
⌛ Trying commit fc87c00c5b527660779dbcea0fe4291177100616 with merge 4a55e7b6a6a7beddaf5a2f71ee4d06f3a829524e...
☀️ Try build successful - checks-azure
Queued 4a55e7b6a6a7beddaf5a2f71ee4d06f3a829524e with parent 488381c, future comparison URL.
Finished benchmarking try commit 4a55e7b6a6a7beddaf5a2f71ee4d06f3a829524e, comparison URL.
removes two functions to inline by combining the check functions and extra call to try_for_each
@nnethercote rebased and removed the second commit
Thanks! @bors try @rust-timer queue
Awaiting bors try build completion
⌛ Trying commit 8737061 with merge 40a3c41fdfde051926f256564c247e2ce94a667e...
Assuming we are using try_fold etc everywhere, we can still manually desugar to structs implementing FnMut instead of using closures. Not the best abstraction level, but doesn't it look like we could save one generic item per iterator method then? Where we currently have the check functions.
☀️ Try build successful - checks-azure
        if f(x) { LoopState::Continue(()) }
        else { LoopState::Break(()) }
    }
}

-self.try_for_each(check(f)) == LoopState::Continue(())
+self.try_fold((), check(f)) == LoopState::Continue(())
Thoughts on equality check vs pattern matching here, can it have an effect or none at all?
I made a quick diff in godbolt: with pattern matching there is less code to inline, so that is something I can do.
ZN72$LT$example..LoopState$LT$C$C$B$GT$$u20$as$u20$core..cmp..PartialEq$GT$2eq17h37dbcaf2df999e09E is a lot to inline.
I would hope it has no effect, since `LoopState<(),()>` is an `i1` in LLVM...
...and it is in `-O`, but very different in debug: https://rust.godbolt.org/z/LKOpZ7
Looks like the `PartialEq::eq` that gets generated is pretty bad, and it's still bad removing the generics: https://rust.godbolt.org/z/o6Nuaw Could there be a "this is a field-less enum so just compare the discriminants" path in the derive? It looks, unfortunately, like `as u8 == 1` is the shortest-emitted-IR way to do these checks. And we're avoiding the derives in other places too, like
Lines 632 to 638 in 702b45e
#[stable(feature = "rust1", since = "1.0.0")]
impl Ord for Ordering {
    #[inline]
    fn cmp(&self, other: &Ordering) -> Ordering {
        (*self as i32).cmp(&(*other as i32))
    }
}
Oh interesting. So we could improve here just by implementing PartialEq manually, or even adding a separate method for just discriminant comparison. But then pattern matching works well too. Like, just a method for ".is_continue()"
But the pattern match avoids having a function to inline at all. As long as a function is used, the LLVM IR will contain both a call and a function.
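For readers following this thread, here is a minimal sketch of the three ways of checking the result that were discussed. The `LoopState` below is a hypothetical local stand-in for the private type in `core`, and `Plain`, `is_continue_eq`, `is_continue_match` and `plain_is_break` are names made up for illustration, not anything from the PR.

```rust
/// Hypothetical stand-in for the private `LoopState` used inside `core::iter`.
#[derive(PartialEq)]
pub enum LoopState<C, B> {
    Continue(C),
    Break(B),
}

/// Equality check: goes through the derived, generic `PartialEq::eq`,
/// which is a separate function in unoptimized builds.
pub fn is_continue_eq(s: &LoopState<(), ()>) -> bool {
    *s == LoopState::Continue(())
}

/// Pattern match: no helper function is involved, only a branch on the variant.
pub fn is_continue_match(s: &LoopState<(), ()>) -> bool {
    match s {
        LoopState::Continue(()) => true,
        LoopState::Break(()) => false,
    }
}

/// The `as u8 == 1` form only applies to a field-less enum, so it is shown
/// on a separate hypothetical type.
#[derive(Clone, Copy)]
pub enum Plain {
    Continue = 0,
    Break = 1,
}

pub fn plain_is_break(p: Plain) -> bool {
    p as u8 == 1
}
```

All three give the same answer; the difference discussed above is only in how much code the compiler has to generate and inline before optimization.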
My rust-timer command above didn't work. Let's try doing it a different way: @rust-timer build 40a3c41fdfde051926f256564c247e2ce94a667e
Queued 40a3c41fdfde051926f256564c247e2ce94a667e with parent 42ec683, future comparison URL.
I do not understand; can you show an example?
@andjo403 I have an example of a before-after change like that, that I made as PoC. Rust -Zprint-mono-items=lazy tells me this uses 1 less generic function (Before we use Code here https://gist.github.com/b94c565bc5ba37206112c150b8b1cc20 It doesn't look great - maybe a macro could improve that? In fact the code looks so bad, I'm unsure we'd want to do that. 🙂
Thanks @bluss for the example, and yes, that code was hard to understand.
It is equivalent to desugaring the original closure, without the "check(f) hack", but also without capturing extraneous type parameters. So a regular closure would be the same, when #46477 is fixed.
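A rough sketch of the "check(f) hack" mentioned here, for context. It uses the public `std::ops::ControlFlow` in place of the private `LoopState`, and `all_with_check` is a hypothetical free function standing in for `Iterator::all`. The point is that the closure is built in a helper that is generic only over `T` and `F`, so it does not pick up the iterator type as an extra type parameter.

```rust
use std::ops::ControlFlow;

/// Helper in the spirit of the `check` functions in the diff: the closure
/// created here depends only on `T` and `F`, not on any iterator type.
fn check<T, F>(mut f: F) -> impl FnMut((), T) -> ControlFlow<()>
where
    F: FnMut(T) -> bool,
{
    move |(), x| {
        if f(x) {
            ControlFlow::Continue(())
        } else {
            ControlFlow::Break(())
        }
    }
}

/// Hypothetical stand-in for `Iterator::all`, written as a free function.
pub fn all_with_check<I, F>(iter: &mut I, f: F) -> bool
where
    I: Iterator,
    F: FnMut(I::Item) -> bool,
{
    // Pattern match on the result instead of comparing with `==`.
    matches!(iter.try_fold((), check(f)), ControlFlow::Continue(()))
}
```

Had the closure been written inline in `all_with_check`, it would also be generic over `I`, which is the extraneous capture that #46477 is about.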
Finished benchmarking try commit 40a3c41fdfde051926f256564c247e2ce94a667e, comparison URL.
Crazy bots! I think I know what's wrong though, will try and fix in a bit, and silence bot for now.
Finished benchmarking try commit 40a3c41fdfde051926f256564c247e2ce94a667e, comparison URL.
The results are good: up to 7.5% win for
I'm happy with this as-is (we can explore other things like #64885 (comment) in a follow-up PR), so @bors r+
📌 Commit 8737061 has been approved by
Rollup of 11 pull requests

Successful merges:
- #64649 (Avoid ICE on return outside of fn with literal array)
- #64722 (Make all alt builders produce parallel-enabled compilers)
- #64801 (Avoid `chain()` in `find_constraint_paths_between_regions()`.)
- #64805 (Still more `ObligationForest` improvements.)
- #64840 (SelfProfiler API refactoring and part one of event review)
- #64885 (use try_fold instead of try_for_each to reduce compile time)
- #64942 (Fix clippy warnings)
- #64952 (Update cargo.)
- #64974 (Fix zebra-striping in generic dataflow visualization)
- #64978 (Fully clear `HandlerInner` in `Handler::reset_err_count`)
- #64979 (Update books)

Failed merges:
- #64959 (syntax: improve parameter without type suggestions)

r? @ghost
As stated in #64572, the biggest gain came from generating less code, so I tried to reduce the number of functions to inline by calling try_fold directly instead of calling try_for_each, which in turn calls try_fold.
Since using try_fold directly gives some gains, this may be a way forward.
When I compiled the clap-rs benchmark, I got time gains of only a few % compared with #64572.
There are more functions, e.g. fold, that call try_fold and could also be changed, but the question is how much "duplication" is tolerated in std in exchange for faster compile times.
Can someone start a perf run?
cc @nnethercote @scottmcm @bluss
r? @ghost
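To illustrate the change for anyone skimming, a small sketch of the same short-circuiting check written both ways. It uses the public `std::ops::ControlFlow` rather than the internal `LoopState`, and `all_via_try_for_each` and `all_via_try_fold` are hypothetical free functions, not the actual std code.

```rust
use std::ops::ControlFlow;

/// Goes through `try_for_each`, whose default implementation forwards to
/// `try_fold`, so an extra generic layer is instantiated per call site.
pub fn all_via_try_for_each<I, F>(iter: &mut I, mut f: F) -> bool
where
    I: Iterator,
    F: FnMut(I::Item) -> bool,
{
    let r = iter.try_for_each(|x| {
        if f(x) { ControlFlow::Continue(()) } else { ControlFlow::Break(()) }
    });
    match r {
        ControlFlow::Continue(()) => true,
        ControlFlow::Break(()) => false,
    }
}

/// Calls `try_fold` directly with a unit accumulator, skipping that layer.
pub fn all_via_try_fold<I, F>(iter: &mut I, mut f: F) -> bool
where
    I: Iterator,
    F: FnMut(I::Item) -> bool,
{
    let r = iter.try_fold((), |(), x| {
        if f(x) { ControlFlow::Continue(()) } else { ControlFlow::Break(()) }
    });
    match r {
        ControlFlow::Continue(()) => true,
        ControlFlow::Break(()) => false,
    }
}
```

Both versions stop at the first element that fails the predicate; the only intended difference is how many generic functions the compiler has to instantiate and inline.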