Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[experiment/perf] Disable jemalloc's time-delayed purging, for extra determinism. #77162

Closed
wants to merge 1 commit into from

Conversation

eddyb
Copy link
Member

@eddyb eddyb commented Sep 24, 2020

When ASLR is disabled1, the default glibc allocator appears to behave deterministically2, but jemalloc does not, and its own non-determinism leaks into the addresses being allocated, effectively becoming a form of ASLR3.
1 either system-wide or by running rustc under setarch -R ...
2 at least in my limited testing, on libcore, check-only i.e. without any LLVM threads, though I should try single-threaded LLVM
3 any variation in addresses causes even more non-determinism because of how rustc uses (Fx)HashMaps keyed on types that hash "by address" (such as interned references)

(To be clear, this "non-determinism" doesn't affect compilation behavior/outputs, at least I'm not aware of a way that it does, but it does affect measurements such as instructions:u measured by perf.rust-lang.org)

Turns out that this behavior is caused by jemalloc defaulting to a 10 second delay before "purging"4 memory, and so for rustc executions taking longer than 10 seconds, purging will start taking place, in unpredictable5 patterns.
See jemalloc documentation: http://jemalloc.net/jemalloc.3.html#opt.dirty_decay_ms.
4 using madvise to inform the kernel that it can reuse the physical backing memory - AFAIK, if userspace tries to use that memory again, the kernel will allocate zeroed physical memory to it, having lost the original data
5 the timer will not expire in the exact same place on every execution, so different parts of memory will be purged first, which cascades into variations in allocation/deallocation behavior in jemalloc

This PR sets both dirty_decay_ms and muzzy_decay_ms to 0, which is documented as removing the reliance on timers:

opt.dirty_decay_ms:
... A decay time of 0 causes all unused dirty pages to be purged immediately upon creation. ...
opt.muzzy_decay_ms:
... A decay time of 0 causes all unused muzzy pages to be purged immediately upon creation. ...

And I've confirmed that, at least for my testcase, the resulting jemalloc behavior is just as deterministic as the glibc allocator.

However, I don't know the full performance implications of these settings, hence opening this PR to test it. I also want to see whether we can observe a reduction in variation between identical runs on perf.rust-lang.org.

cc @Mark-Simulacrum

@rust-highfive
Copy link
Collaborator

r? @lcnr

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Sep 24, 2020
@eddyb eddyb marked this pull request as draft September 24, 2020 18:36
@eddyb
Copy link
Member Author

eddyb commented Sep 24, 2020

@bors try (do not start perf run yet, @Mark-Simulacrum probably wants to tweak some settings before then)

@bors
Copy link
Contributor

bors commented Sep 24, 2020

⌛ Trying commit 8534cd7 with merge 138021fc09ebe36d0582611f85c5e3d45e75abad...

@jyn514 jyn514 added I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Sep 24, 2020
@bors
Copy link
Contributor

bors commented Sep 24, 2020

☀️ Try build successful - checks-actions, checks-azure
Build commit: 138021fc09ebe36d0582611f85c5e3d45e75abad (138021fc09ebe36d0582611f85c5e3d45e75abad)

@Mark-Simulacrum
Copy link
Member

@rust-timer build 138021fc09ebe36d0582611f85c5e3d45e75abad exclude=style-servo,cargo,webrender runs=6

This should give us 5 clean runs, 1 with self-profile, on a subset of the faster-running benchmarks. Hopefully.

@Mark-Simulacrum
Copy link
Member

Fixing some bugs...

@rust-timer build 138021fc09ebe36d0582611f85c5e3d45e75abad exclude=style-servo,cargo,webrender runs=6

@Mark-Simulacrum
Copy link
Member

Fixing some bugs...

@rust-timer build 138021f exclude=style-servo,cargo,webrender runs=6

@rust-timer
Copy link
Collaborator

Queued 138021fc09ebe36d0582611f85c5e3d45e75abad with parent 87d262a, future comparison URL.

@rust-timer

This comment has been minimized.

1 similar comment
@rust-timer

This comment has been minimized.

@Mark-Simulacrum
Copy link
Member

Mark-Simulacrum commented Sep 25, 2020

Ignore those messages -- had to fix some bugs -- but we should be collecting now.

I'm not quite sure how best to extract useful data from the results, though. We have raw values for the instructions stat for each of the runs (5 for the try build, previous runs usually have just 1-2, depending on benchmark).

My initial attempt is to do a max-min and compare that delta across parent/child here, which is relatively easy to do but doesn't appear to yield any significant change from this PR.

commit crate min delta to max delta % of min
old await-call-tree 142226064 138 0.0000970286
new await-call-tree 143875916 2018 0.0014025975
old clap-rs 13934991007 1735050 0.0124510306
new clap-rs 14005787982 588 0.0000041983
old coercions 4876220869 275 0.0000056396
new coercions 4886407139 81 0.0000016577
old cranelift-codegen 20134952163 3069804 0.0152461450
new cranelift-codegen 20263501485 1585 0.0000078219
old ctfe-stress-4 35409754465 66 0.0000001864
new ctfe-stress-4 35412566516 2752 0.0000077713
old deeply-nested 194857599 18521 0.0095048898
new deeply-nested 196423661 288 0.0001466218
old deeply-nested-async 7484462308 2474 0.0000330551
new deeply-nested-async 7486501536 63 0.0000008415
old deeply-nested-closures 97122332 409 0.0004211184
new deeply-nested-closures 98433662 147 0.0001493392
old deep-vector 5755149839 206155 0.0035820961
new deep-vector 5779922370 369 0.0000063842
old encoding 1485841091 51 0.0000034324
new encoding 1494587310 475 0.0000317813
old futures 2215714649 173 0.0000078079
new futures 2224494357 1391 0.0000625311
old helloworld 62976005 42 0.0000666921
new helloworld 64090195 4813 0.0075097291
old html5ever 7483052566 4825 0.0000644790
new html5ever 7517789564 621 0.0000082604
old hyper-2 7296201518 230383 0.0031575745
new hyper-2 7327350735 8320 0.0001135472
old inflate 6798213931 680 0.0000100026
new inflate 6816658326 23557 0.0003455799
old issue-46449 141639367 214 0.0001510879
new issue-46449 143367176 601 0.0004192033
old issue-58319 632634756 17 0.0000026872
new issue-58319 639875060 69 0.0000107834
old keccak 35373981332 32627 0.0000922345
new keccak 35691033769 327 0.0000009162
old many-assoc-items 6062124744 536 0.0000088418
new many-assoc-items 6078896032 495 0.0000081429
old match-stress-enum 16898597039 292 0.0000017280
new match-stress-enum 16908516530 75 0.0000004436
old packed-simd 47496035810 87 0.0000001832
new packed-simd 47602369498 537 0.0000011281
old piston-image 5636821378 0 0.0000000000
new piston-image 5662003514 536 0.0000094666
old regex 2829025614 2583 0.0000913035
new regex 2844242855 617 0.0000216929
old regression-31157 736005121 268 0.0000364128
new regression-31157 740246235 62 0.0000083756
old ripgrep 2363452423 6179 0.0002614396
new ripgrep 2376346690 1105 0.0000464999
old serde 17099290970 6975 0.0000407912
new serde 17152711877 1361 0.0000079346
old syn 2370544852 43763 0.0018461157
new syn 2383213637 798 0.0000334842
old token-stream-stress 267629267 684 0.0002555774
new token-stream-stress 268662975 11745 0.0043716482
old tokio-webpush-simple 935575101 11937 0.0012758997
new tokio-webpush-simple 941410628 441 0.0000468446
old tuple-stress 15547019473 0 0.0000000000
new tuple-stress 15603101109 922 0.0000059091
old ucd 30474409144 0 0.0000000000
new ucd 30618873892 574 0.0000018747
old unicode_normalization 7295186743 31589 0.0004330115
new unicode_normalization 7338088171 1630 0.0000222129
old unify-linearly 136852169 42 0.0000306901
new unify-linearly 138273997 18064 0.0130639169
old unused-warnings 1882155103 109 0.0000057912
new unused-warnings 1897278948 158 0.0000083277
old wf-projection-stress-65510 2722213258 1974 0.0000725145
new wf-projection-stress-65510 2726025260 193 0.0000070799
old wg-grammar 8509060628 425868 0.0050048768
new wg-grammar 8534777127 582 0.0000068192

And this is for "check", non-incremental builds. Will update with more data once it finishes. (Edit: collection done)

Collected with select case when aid = 5907 then 'new' else 'old' end as "commit", crate, min(value), max(value) - min(value) as "delta to max", round(cast((max(value) - min(value)) / min(value) as numeric) * 100, 10) as "delta % of min" from pstat join pstat_series on series = id where aid = any(select id from artifact where name = '138021fc09ebe36d0582611f85c5e3d45e75abad' or name = '87d262acb50200d767baa7115f30c650a13672ee') and statistic = 'instructions:u' and (select count(cid) from self_profile_query where self_profile_query.cid = pstat.cid) = 0 and cache = 'full' and crate = any(select distinct crate from pstat as pp join pstat_series on pp.series = id where pp.aid = 5907) and profile = 'check' group by (aid, crate) order by crate;

@rust-timer
Copy link
Collaborator

Finished benchmarking try commit (138021fc09ebe36d0582611f85c5e3d45e75abad): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never

@Mark-Simulacrum
Copy link
Member

I also added setarch -R, so maybe worth rerunning another run now to get that onto master as well. But not sure.

@eddyb
Copy link
Member Author

eddyb commented Sep 25, 2020

I'd split benchmarks where the variance decreases from those where the range increases.
Did you get around to disabling SpecLockMap on the collection server? Some of the remaining noise may be from that.

EDIT: oh and, well, we need to know which take over 10 seconds, I think there shouldn't be any benefits for anything shorter, as (supposedly) the delayed purging mechanism shouldn't be kicking in before then anyway.

@eddyb
Copy link
Member Author

eddyb commented Sep 25, 2020

I just checked and we only have two benchmarks that go over 10 seconds: packed-simd and style-servo.
But packed-simd only does so for incremental builds, so there's only one left (in terms of the table above).

This workaround is (unsurprisingly) also a performance degradation so we probably don't want to turn on by default anyway.

Let's see if this does anything:
@rust-timer build 138021f include=style-servo runs=6

@eddyb
Copy link
Member Author

eddyb commented Sep 25, 2020

Okay that does nothing, but I won't go through the trouble of doing another try build, especially with SpecLockMap still undisabled on the perf collection server (i.e. the Python script from https://github.com/mozilla/rr/wiki/Zen needs to be run).

@Mark-Simulacrum
Copy link
Member

The rr script has now been run on the collection server fwiw.

@Mark-Simulacrum Mark-Simulacrum added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Sep 30, 2020
@Mark-Simulacrum
Copy link
Member

@eddyb where you thinking that we should merge this PR? I forget what the status is...

bors added a commit to rust-lang-ci/rust that referenced this pull request Jun 14, 2022
Integrate measureme's hardware performance counter support.

*Note: this is a companion to rust-lang/measureme#143, and duplicates some information with it for convenience*

**(much later) EDIT**: take any numbers with a grain of salt, they may have changed since initial PR open.

## Credits

I'd like to start by thanking `@alyssais,` `@cuviper,` `@edef1c,` `@glandium,` `@jix,` `@Mark-Simulacrum,` `@m-ou-se,` `@mystor,` `@nagisa,` `@puckipedia,` and `@yorickvP,` for all of their help with testing, and valuable insight and suggestions.
Getting here wouldn't have been possible without you!

(If I've forgotten anyone please let me know, I'm going off memory here, plus some discussion logs)

## Summary

This PR adds support to `-Z self-profile` for counting hardware events such as "instructions retired" (as opposed to being limited to time measurements), using the `rdpmc` instruction on `x86_64` Linux.

While other OSes may eventually be supported, preliminary research suggests some kind of kernel extension/driver is required to enable this, whereas on Linux any user can profile (at least) their own threads.

Supporting Linux on architectures other than x86_64 should be much easier (provided the hardware supports such performance counters), and was mostly not done due to a lack of readily available test hardware.
That said, 32-bit `x86` (aka `i686`) would be almost trivial to add and test once we land the initial `x86_64` version (as all the CPU detection code can be reused).

A new flag `-Z self-profile-counter` was added, to control which of the named `measureme` counters is used, and which defaults to `wall-time`, in order to keep `-Z self-profile`'s current functionality unchanged (at least for now).

The named counters so far are:
* `wall-time`: the existing time measurement
    * name chosen for consistency with `perf.rust-lang.org`
    * continues to use `std::time::Instant` for a nanosecond-precision "monotonic clock"
* `instructions:u`: the hardware performance counter usually referred to as "Instructions retired"
    * here "retired" (roughly) means "fully executed"
    * the `:u` suffix is from the Linux `perf` tool and indicates the counter only runs while userspace code is executing, and therefore counts no kernel instructions
        * *see [Caveats/Subtracting IRQs](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Subtracting-IRQs) for why this isn't entirely true and why `instructions-minus-irqs:u` should be preferred instead*
* `instructions-minus-irqs:u`: same as `instructions:u`, except the count of hardware interrupts ("IRQs" here for brevity) is subtracted
    * *see [Caveats/Subtracting IRQs](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Subtracting-IRQs) for why this should be preferred over `instructions:u`*
* `instructions-minus-r0420:u`: experimental counter, same as `instructions-minus-irqs:u` but subtracting an undocumented counter (`r0420:u`) instead of IRQs
    * the `rXXXX` notation is again from Linux `perf`, and indicates a "raw" counter, with a hex representation of the low-level counter configuration - this was picked because we still don't *really* know what it is
    * this only exists for (future) testing and isn't included/used in any comparisons/data we've put together so far
    * *see [Challenges/Zen's undocumented 420 counter](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Epilogue-Zen’s-undocumented-420-counter) for details on how this counter was found and what it does*

---

There are also some additional commits:
* ~~see [Challenges/Rebasing *shouldn't* affect the results, right?](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Rebasing-*shouldn’t*-affect-the-results,-right) for details on the changes to `rustc_parse` and `rustc_trait_section` (the latter far more dubious, and probably shouldn't be merged, or not as-is)~~
  *  **EDIT**: the effects of these are no long quantifiable, the PR includes reverts for them
* ~~see [Challenges/`jemalloc`: purging will commence in ten seconds](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#jemalloc-purging-will-commence-in-ten-seconds) for details on the `jemalloc` change~~
  * this is also separately found in rust-lang#77162, and we probably want to avoid doing it by default, ideally we'd use the runtime control API `jemalloc` offers (assuming that can stop the timer that's already running, which I'm not sure about)
  * **EDIT**: until we can do this based on `-Z` flags, this commit has also been reverted
* the `proc_macro` change was to avoid randomized hashing and therefore ASLR-like effects

---

**(much later) EDIT**: take any numbers with a grain of salt, they may have changed since initial PR open.

#### Write-up / report

Because of how extensive the full report ended up being, I've kept most of it [on `hackmd.io`](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view), but for convenient access, here are all the sections (with individual links):
<sup>(someone suggested I'd make a backup, so [here it is on the wayback machine](http://web.archive.org/web/20201127164748/https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view) - I'll need to remember to update that if I have to edit the write-up)</sup>

* [**Motivation**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Motivation)

* [**Results**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Results)
    * [**Overhead**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Overhead)
    *Preview (see the report itself for more details):*

    |Counter|Total<br>`instructions-minus-irqs:u`|Overhead from "Baseline"<br>(for all 1903881<br>counter reads)|Overhead from "Baseline"<br>(per each counter read)|
    |-|-|-|-|
    |Baseline|63637621286 ±6||
    |`instructions:u`|63658815885 ±2|&nbsp;&nbsp;+21194599 ±8|&nbsp;&nbsp;+11|
    |`instructions-minus-irqs:u`|63680307361 ±13|&nbsp;&nbsp;+42686075 ±19|&nbsp;&nbsp;+22|
    |`wall-time`|63951958376 ±10275|+314337090 ±10281|+165|

    * [**"Macro" noise (self time)**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#“Macro”-noise-(self-time))
    *Preview (see the report itself for more details):*

    || `wall-time` (ns) | `instructions:u` | `instructions-minus-irqs:u`
    -: | -: | -: | -:
    `typeck` | 5478261360 ±283933373 (±~5.2%) | 17350144522 ±6392 (±~0.00004%) | 17351035832.5 ±4.5 (±~0.00000003%)
    `expand_crate` | 2342096719 ±110465856 (±~4.7%) | 8263777916 ±2937 (±~0.00004%) | 8263708389 ±0 (±~0%)
    `mir_borrowck` | 2216149671 ±119458444 (±~5.4%) | 8340920100 ±2794 (±~0.00003%) | 8341613983.5 ±2.5 (±~0.00000003%)
    `mir_built` | 1269059734 ±91514604 (±~7.2%) | 4454959122 ±1618 (±~0.00004%) | 4455303811 ±1 (±~0.00000002%)
    `resolve_crate` | 942154987.5 ±53068423.5 (±~5.6%) | 3951197709 ±39 (±~0.000001%) | 3951196865 ±0 (±~0%)

    * [**"Micro" noise (individual sampling intervals)**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#“Micro”-noise-(individual-sampling-intervals))

* [**Caveats**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Caveats)
    * [**Disabling ASLR**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Disabling-ASLR)
    * [**Non-deterministic proc macros**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Non-deterministic-proc-macros)
    * [**Subtracting IRQs**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Subtracting-IRQs)
    * [**Lack of support for multiple threads**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Lack-of-support-for-multiple-threads)

* [**Challenges**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Challenges)
    * [**How do we even read hardware performance counters?**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#How-do-we-even-read-hardware-performance-counters)
    * [**ASLR: it's free entropy**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#ASLR-it’s-free-entropy)
    * [**The serializing instruction**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#The-serializing-instruction)
    * [**Getting constantly interrupted**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Getting-constantly-interrupted)
    * [**AMD patented time-travel and dubbed it `SpecLockMap`<br><sup>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;or: "how we accidentally unlocked `rr` on AMD Zen"</sup>**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#AMD-patented-time-travel-and-dubbed-it-SpecLockMapnbspnbspnbspnbspnbspnbspnbspnbspor-“how-we-accidentally-unlocked-rr-on-AMD-Zen”)
    * [**`jemalloc`: purging will commence in ten seconds**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#jemalloc-purging-will-commence-in-ten-seconds)
    * [**Rebasing *shouldn't* affect the results, right?**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Rebasing-*shouldn’t*-affect-the-results,-right)
    * [**Epilogue: Zen's undocumented 420 counter**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Epilogue-Zen’s-undocumented-420-counter)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I-compiletime Issue: Problems and improvements with respect to compile times. S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants