flt2dec: replace for loop by iter_mut #144205
Conversation
rustbot has assigned @workingjubilee.
Performance related, so isolating it for the usual reasons, though it's unclear how much it matters. @bors r+ rollup=never
library/core/src/num/flt2dec/mod.rs
Outdated
- for j in i + 1..d.len() {
-     d[j] = b'0';
- }
+ d.iter_mut().skip(i + 1).for_each(|c| *c = b'0');
I don’t think this is clearly better or worse than before. But how about d[i+1..].fill(b'0')?
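For context, here is a sketch (my reconstruction, not the actual library code) of the three candidate spellings side by side. All of them zero the tail of a digit buffer after a carry; the names and the standalone-function framing are illustrative:

```rust
/// Zero the digits after position `i` -- the original `for` loop.
fn tail_for(d: &mut [u8], i: usize) {
    for j in i + 1..d.len() {
        d[j] = b'0';
    }
}

/// The same operation via the iterator adapter proposed in the PR.
fn tail_iter(d: &mut [u8], i: usize) {
    d.iter_mut().skip(i + 1).for_each(|c| *c = b'0');
}

/// The same operation via `slice::fill`, as suggested in the review.
fn tail_fill(d: &mut [u8], i: usize) {
    d[i + 1..].fill(b'0');
}
```

All three have identical behavior; `fill` is arguably the clearest, since it names the intent directly and performs the slice bounds check once, up front.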
That would be even better. I added it as an option to my bench crate, and performance on my dev machines looks like this:
For aarch64:
test bench_round_up_fill ... bench: 974,458.62 ns/iter (+/- 17,236.78)
test bench_round_up_for ... bench: 1,055,622.70 ns/iter (+/- 2,128.75)
test bench_round_up_iter ... bench: 855,721.80 ns/iter (+/- 2,159.27)
For x86_64:
test bench_round_up_fill ... bench: 730,473.60 ns/iter (+/- 393.90)
test bench_round_up_for ... bench: 730,497.57 ns/iter (+/- 894.02)
test bench_round_up_iter ... bench: 740,172.60 ns/iter (+/- 954.45)
On aarch64 the (+/-) is always strangely large.
I suspect your benchmarks are reading tea leaves, not nailing down meaningful differences, because at least the iter_mut().skip() version should also be readily identifiable as equivalent to memset. Consider checking whether one variant calls memset and the other doesn't. If memset vs. an inline loop makes a real difference, then any changes here will be extremely fragile, as they depend on loop-idiom recognition working or not working for this loop (which may also differ between an isolated benchmark of this one function and a benchmark of the whole flt2dec rabbit hole).
Oh. I looked at the codegen in your godbolt link and both variants you tried already get turned into memsets. The reason they benchmark differently is that your benchmark runs on a 100k-character buffer filled with '9's, so the cost is dominated by the initial "scan backwards for the first non-9" part instead. I don't know why the phrasing of the memset makes a difference for the codegen in that part of the function, but now I really don't trust the numbers, nor do I have faith that this will reproduce at all in the context of the flt2dec routines. The actual buffer size is orders of magnitude smaller, effects on the other arms of this match aren't benchmarked at all, and if it's weird spooky action at a distance that affects the codegen for the rposition loop, then inlining it into the callers may have similarly unpredictable effects.
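To make the "scan backwards for the first non-9, then fill" structure concrete, here is a hedged sketch of the rounding pattern under discussion. This is my reconstruction of the shape of the code, not the actual flt2dec routine; the name, signature, and return value are assumptions:

```rust
/// Round a buffer of decimal digits up by one ulp (sketch only).
/// Returns true if the carry propagated past every digit.
fn round_up(d: &mut [u8]) -> bool {
    // Scan backwards for the last digit that isn't '9'.
    match d.iter().rposition(|&c| c != b'9') {
        // Found one: bump it and zero everything after it.
        Some(i) => {
            d[i] += 1;
            d[i + 1..].fill(b'0');
            false
        }
        // All digits were '9': the result is a leading '1' followed by
        // zeros, and the caller must account for the shifted magnitude.
        None => {
            if let Some((first, rest)) = d.split_first_mut() {
                *first = b'1';
                rest.fill(b'0');
            }
            true
        }
    }
}
```

A benchmark input of all '9's makes the rposition scan, not the fill, the hot part, which is the distortion described above.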
All of these concerns could be avoided by benchmarking a full trip through the formatting machinery. I appreciate that it's difficult to find an input that hits the path you're interested in and makes that piece of the code hot enough to give a measurable signal. But the "easier" alternative of reducing to the smallest possible benchmark and putting it under a microscope can waste your time in other ways!
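One way to benchmark the full trip (a sketch under my own assumptions, not the benchmark used in this thread) is to time `write!` on floats whose fixed-precision expansions force digits to round up, so the whole formatting pipeline, not just the helper, is on the clock:

```rust
use std::fmt::Write;
use std::time::Instant;

fn main() {
    // Inputs chosen (an assumption) so that fixed-precision formatting
    // has to round digits up, exercising the path under discussion.
    let inputs = [1.9999_f64, 0.29999_f64, 123.999_f64];
    let mut sink = String::new();
    let start = Instant::now();
    for _ in 0..100_000 {
        for &x in &inputs {
            write!(sink, "{x:.2}").unwrap();
            sink.clear();
        }
    }
    println!("elapsed: {:?} for 300_000 formats", start.elapsed());
}
```

This crude `Instant` timer is only for illustration; a real run would use criterion or the nightly `#[bench]` harness to get variance estimates.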
If I reduce the length to 10, then I get:
For aarch64:
test bench_round_up_fill ... bench: 14.86 ns/iter (+/- 0.01)
test bench_round_up_for ... bench: 16.07 ns/iter (+/- 0.01)
test bench_round_up_iter ... bench: 16.07 ns/iter (+/- 0.06)
For x86_64:
test bench_round_up_fill ... bench: 10.47 ns/iter (+/- 0.01)
test bench_round_up_for ... bench: 12.85 ns/iter (+/- 0.02)
test bench_round_up_iter ... bench: 9.84 ns/iter (+/- 0.07)
I still don't know if those numbers are at all meaningful, but at least they no longer give a reason not to use the fill version for the readability win, I guess?
Fill it is!
@bors r-
Force-pushed from 67b272a to f147716
assuming this doesn't magically fail tidy @bors r+
flt2dec: replace for loop by iter_mut

Perf is explored in #144118, which initially showed small losses, but then also showed significant gains. Both are real, but given the smallness of the losses, this seems a good change.
I think the queue is very borked
@bors retry r- (manual status refresh, maybe GitHub outage yesterday?)
@bors r=workingjubilee
☀️ Test successful - checks-actions
What is this? This is an experimental post-merge analysis report that shows differences in test outcomes between the merged PR and its parent PR.

Comparing 9748d87 (parent) -> c0b282f (this PR)

Test differences: 3 doctest diffs were found. These are ignored, as they are noisy.

Test dashboard: run

cargo run --manifest-path src/ci/citool/Cargo.toml -- \
    test-dashboard c0b282f0ccdab7523cdb8dfa41b23bed5573da76 --output-dir test-dashboard

and then open the generated dashboard.

Job duration changes: job durations can vary a lot, based on the actual runner instance
Finished benchmarking commit (c0b282f): comparison URL.

Overall result: no relevant changes - no action needed

@rustbot label: -perf-regression

Instruction count: this benchmark run did not return any relevant results for this metric.

Max RSS (memory usage): results (secondary -1.0%). A less reliable metric. May be of interest, but not used to determine the overall result above.

Cycles: results (secondary -2.5%). A less reliable metric. May be of interest, but not used to determine the overall result above.

Binary size: this benchmark run did not return any relevant results for this metric.

Bootstrap: 465.536s -> 463.984s (-0.33%)
Please remember to update PR descriptions and titles when the PR contents change. Too late for this one, though; that's now permanently recorded in the git history.