[experiment/perf] Disable jemalloc's time-delayed purging, for extra determinism. #77162

eddyb · 2020-09-24T18:36:08Z

When ASLR is disabled¹, the default glibc allocator appears to behave deterministically², but jemalloc does not, and its own non-determinism leaks into the addresses being allocated, effectively becoming a form of ASLR³.
¹ either system-wide or by running rustc under setarch -R ...
² at least in my limited testing, on libcore, check-only i.e. without any LLVM threads, though I should try single-threaded LLVM
³ any variation in addresses causes even more non-determinism because of how rustc uses (Fx)HashMaps keyed on types that hash "by address" (such as interned references)

(To be clear, this "non-determinism" doesn't affect compilation behavior/outputs, at least I'm not aware of a way that it does, but it does affect measurements such as instructions:u measured by perf.rust-lang.org)

Turns out that this behavior is caused by jemalloc defaulting to a 10 second delay before "purging"⁴ memory, and so for rustc executions taking longer than 10 seconds, purging will start taking place, in unpredictable⁵ patterns.
See jemalloc documentation: http://jemalloc.net/jemalloc.3.html#opt.dirty_decay_ms.
⁴ using madvise to inform the kernel that it can reuse the physical backing memory - AFAIK, if userspace tries to use that memory again, the kernel will allocate zeroed physical memory to it, having lost the original data
⁵ the timer will not expire in the exact same place on every execution, so different parts of memory will be purged first, which cascades into variations in allocation/deallocation behavior in jemalloc

This PR sets both dirty_decay_ms and muzzy_decay_ms to 0, which is documented as removing the reliance on timers:

opt.dirty_decay_ms:
... A decay time of 0 causes all unused dirty pages to be purged immediately upon creation. ...
opt.muzzy_decay_ms:
... A decay time of 0 causes all unused muzzy pages to be purged immediately upon creation. ...

And I've confirmed that, at least for my testcase, the resulting jemalloc behavior is just as deterministic as the glibc allocator.

However, I don't know the full performance implications of these settings, hence opening this PR to test it. I also want to see whether we can observe a reduction in variation between identical runs on perf.rust-lang.org.

cc @Mark-Simulacrum

…determinism.

rust-highfive · 2020-09-24T18:36:12Z

r? @lcnr

(rust_highfive has picked a reviewer for you, use r? to override)

eddyb · 2020-09-24T18:37:08Z

@bors try (do not start perf run yet, @Mark-Simulacrum probably wants to tweak some settings before then)

bors · 2020-09-24T18:37:19Z

⌛ Trying commit 8534cd7 with merge 138021fc09ebe36d0582611f85c5e3d45e75abad...

bors · 2020-09-24T19:26:11Z

☀️ Try build successful - checks-actions, checks-azure
Build commit: 138021fc09ebe36d0582611f85c5e3d45e75abad (138021fc09ebe36d0582611f85c5e3d45e75abad)

Mark-Simulacrum · 2020-09-24T20:14:12Z

@rust-timer build 138021fc09ebe36d0582611f85c5e3d45e75abad exclude=style-servo,cargo,webrender runs=6

This should give us 5 clean runs, 1 with self-profile, on a subset of the faster-running benchmarks. Hopefully.

Mark-Simulacrum · 2020-09-24T20:39:23Z

Fixing some bugs...

@rust-timer build 138021fc09ebe36d0582611f85c5e3d45e75abad exclude=style-servo,cargo,webrender runs=6

Mark-Simulacrum · 2020-09-24T21:24:40Z

Fixing some bugs...

@rust-timer build 138021f exclude=style-servo,cargo,webrender runs=6

rust-timer · 2020-09-24T21:24:42Z

Queued 138021fc09ebe36d0582611f85c5e3d45e75abad with parent 87d262a, future comparison URL.

Mark-Simulacrum · 2020-09-25T00:52:58Z

Ignore those messages -- had to fix some bugs -- but we should be collecting now.

I'm not quite sure how best to extract useful data from the results, though. We have raw values for the instructions stat for each of the runs (5 for the try build, previous runs usually have just 1-2, depending on benchmark).

My initial attempt is to do a max-min and compare that delta across parent/child here, which is relatively easy to do but doesn't appear to yield any significant change from this PR.

commit	crate	min	delta to max	delta % of min
old	await-call-tree	142226064	138	0.0000970286
new	await-call-tree	143875916	2018	0.0014025975
old	clap-rs	13934991007	1735050	0.0124510306
new	clap-rs	14005787982	588	0.0000041983
old	coercions	4876220869	275	0.0000056396
new	coercions	4886407139	81	0.0000016577
old	cranelift-codegen	20134952163	3069804	0.0152461450
new	cranelift-codegen	20263501485	1585	0.0000078219
old	ctfe-stress-4	35409754465	66	0.0000001864
new	ctfe-stress-4	35412566516	2752	0.0000077713
old	deeply-nested	194857599	18521	0.0095048898
new	deeply-nested	196423661	288	0.0001466218
old	deeply-nested-async	7484462308	2474	0.0000330551
new	deeply-nested-async	7486501536	63	0.0000008415
old	deeply-nested-closures	97122332	409	0.0004211184
new	deeply-nested-closures	98433662	147	0.0001493392
old	deep-vector	5755149839	206155	0.0035820961
new	deep-vector	5779922370	369	0.0000063842
old	encoding	1485841091	51	0.0000034324
new	encoding	1494587310	475	0.0000317813
old	futures	2215714649	173	0.0000078079
new	futures	2224494357	1391	0.0000625311
old	helloworld	62976005	42	0.0000666921
new	helloworld	64090195	4813	0.0075097291
old	html5ever	7483052566	4825	0.0000644790
new	html5ever	7517789564	621	0.0000082604
old	hyper-2	7296201518	230383	0.0031575745
new	hyper-2	7327350735	8320	0.0001135472
old	inflate	6798213931	680	0.0000100026
new	inflate	6816658326	23557	0.0003455799
old	issue-46449	141639367	214	0.0001510879
new	issue-46449	143367176	601	0.0004192033
old	issue-58319	632634756	17	0.0000026872
new	issue-58319	639875060	69	0.0000107834
old	keccak	35373981332	32627	0.0000922345
new	keccak	35691033769	327	0.0000009162
old	many-assoc-items	6062124744	536	0.0000088418
new	many-assoc-items	6078896032	495	0.0000081429
old	match-stress-enum	16898597039	292	0.0000017280
new	match-stress-enum	16908516530	75	0.0000004436
old	packed-simd	47496035810	87	0.0000001832
new	packed-simd	47602369498	537	0.0000011281
old	piston-image	5636821378	0	0.0000000000
new	piston-image	5662003514	536	0.0000094666
old	regex	2829025614	2583	0.0000913035
new	regex	2844242855	617	0.0000216929
old	regression-31157	736005121	268	0.0000364128
new	regression-31157	740246235	62	0.0000083756
old	ripgrep	2363452423	6179	0.0002614396
new	ripgrep	2376346690	1105	0.0000464999
old	serde	17099290970	6975	0.0000407912
new	serde	17152711877	1361	0.0000079346
old	syn	2370544852	43763	0.0018461157
new	syn	2383213637	798	0.0000334842
old	token-stream-stress	267629267	684	0.0002555774
new	token-stream-stress	268662975	11745	0.0043716482
old	tokio-webpush-simple	935575101	11937	0.0012758997
new	tokio-webpush-simple	941410628	441	0.0000468446
old	tuple-stress	15547019473	0	0.0000000000
new	tuple-stress	15603101109	922	0.0000059091
old	ucd	30474409144	0	0.0000000000
new	ucd	30618873892	574	0.0000018747
old	unicode_normalization	7295186743	31589	0.0004330115
new	unicode_normalization	7338088171	1630	0.0000222129
old	unify-linearly	136852169	42	0.0000306901
new	unify-linearly	138273997	18064	0.0130639169
old	unused-warnings	1882155103	109	0.0000057912
new	unused-warnings	1897278948	158	0.0000083277
old	wf-projection-stress-65510	2722213258	1974	0.0000725145
new	wf-projection-stress-65510	2726025260	193	0.0000070799
old	wg-grammar	8509060628	425868	0.0050048768
new	wg-grammar	8534777127	582	0.0000068192

And this is for "check", non-incremental builds. Will update with more data once it finishes. (Edit: collection done)

Collected with select case when aid = 5907 then 'new' else 'old' end as "commit", crate, min(value), max(value) - min(value) as "delta to max", round(cast((max(value) - min(value)) / min(value) as numeric) * 100, 10) as "delta % of min" from pstat join pstat_series on series = id where aid = any(select id from artifact where name = '138021fc09ebe36d0582611f85c5e3d45e75abad' or name = '87d262acb50200d767baa7115f30c650a13672ee') and statistic = 'instructions:u' and (select count(cid) from self_profile_query where self_profile_query.cid = pstat.cid) = 0 and cache = 'full' and crate = any(select distinct crate from pstat as pp join pstat_series on pp.series = id where pp.aid = 5907) and profile = 'check' group by (aid, crate) order by crate;

rust-timer · 2020-09-25T02:07:32Z

Finished benchmarking try commit (138021fc09ebe36d0582611f85c5e3d45e75abad): comparison url.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying rollup- to bors.

Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up.

@bors rollup=never

Mark-Simulacrum · 2020-09-25T02:10:40Z

I also added setarch -R, so maybe worth rerunning another run now to get that onto master as well. But not sure.

eddyb · 2020-09-25T03:04:34Z

I'd split benchmarks where the variance decreases from those where the range increases.
Did you get around to disabling SpecLockMap on the collection server? Some of the remaining noise may be from that.

EDIT: oh and, well, we need to know which take over 10 seconds, I think there shouldn't be any benefits for anything shorter, as (supposedly) the delayed purging mechanism shouldn't be kicking in before then anyway.

eddyb · 2020-09-25T03:39:37Z

I just checked and we only have two benchmarks that go over 10 seconds: packed-simd and style-servo.
But packed-simd only does so for incremental builds, so there's only one left (in terms of the table above).

This workaround is (unsurprisingly) also a performance degradation so we probably don't want to turn on by default anyway.

Let's see if this does anything:
@rust-timer build 138021f include=style-servo runs=6

eddyb · 2020-09-25T03:48:41Z

Okay that does nothing, but I won't go through the trouble of doing another try build, especially with SpecLockMap still undisabled on the perf collection server (i.e. the Python script from https://github.com/mozilla/rr/wiki/Zen needs to be run).

Mark-Simulacrum · 2020-09-28T20:11:26Z

The rr script has now been run on the collection server fwiw.

Mark-Simulacrum · 2020-11-22T22:03:11Z

@eddyb where you thinking that we should merge this PR? I forget what the status is...

Integrate measureme's hardware performance counter support. *Note: this is a companion to rust-lang/measureme#143, and duplicates some information with it for convenience* **(much later) EDIT**: take any numbers with a grain of salt, they may have changed since initial PR open. ## Credits I'd like to start by thanking `@alyssais,` `@cuviper,` `@edef1c,` `@glandium,` `@jix,` `@Mark-Simulacrum,` `@m-ou-se,` `@mystor,` `@nagisa,` `@puckipedia,` and `@yorickvP,` for all of their help with testing, and valuable insight and suggestions. Getting here wouldn't have been possible without you! (If I've forgotten anyone please let me know, I'm going off memory here, plus some discussion logs) ## Summary This PR adds support to `-Z self-profile` for counting hardware events such as "instructions retired" (as opposed to being limited to time measurements), using the `rdpmc` instruction on `x86_64` Linux. While other OSes may eventually be supported, preliminary research suggests some kind of kernel extension/driver is required to enable this, whereas on Linux any user can profile (at least) their own threads. Supporting Linux on architectures other than x86_64 should be much easier (provided the hardware supports such performance counters), and was mostly not done due to a lack of readily available test hardware. That said, 32-bit `x86` (aka `i686`) would be almost trivial to add and test once we land the initial `x86_64` version (as all the CPU detection code can be reused). A new flag `-Z self-profile-counter` was added, to control which of the named `measureme` counters is used, and which defaults to `wall-time`, in order to keep `-Z self-profile`'s current functionality unchanged (at least for now). The named counters so far are: * `wall-time`: the existing time measurement * name chosen for consistency with `perf.rust-lang.org` * continues to use `std::time::Instant` for a nanosecond-precision "monotonic clock" * `instructions:u`: the hardware performance counter usually referred to as "Instructions retired" * here "retired" (roughly) means "fully executed" * the `:u` suffix is from the Linux `perf` tool and indicates the counter only runs while userspace code is executing, and therefore counts no kernel instructions * *see [Caveats/Subtracting IRQs](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Subtracting-IRQs) for why this isn't entirely true and why `instructions-minus-irqs:u` should be preferred instead* * `instructions-minus-irqs:u`: same as `instructions:u`, except the count of hardware interrupts ("IRQs" here for brevity) is subtracted * *see [Caveats/Subtracting IRQs](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Subtracting-IRQs) for why this should be preferred over `instructions:u`* * `instructions-minus-r0420:u`: experimental counter, same as `instructions-minus-irqs:u` but subtracting an undocumented counter (`r0420:u`) instead of IRQs * the `rXXXX` notation is again from Linux `perf`, and indicates a "raw" counter, with a hex representation of the low-level counter configuration - this was picked because we still don't *really* know what it is * this only exists for (future) testing and isn't included/used in any comparisons/data we've put together so far * *see [Challenges/Zen's undocumented 420 counter](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Epilogue-Zen’s-undocumented-420-counter) for details on how this counter was found and what it does* --- There are also some additional commits: * ~~see [Challenges/Rebasing *shouldn't* affect the results, right?](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Rebasing-*shouldn’t*-affect-the-results,-right) for details on the changes to `rustc_parse` and `rustc_trait_section` (the latter far more dubious, and probably shouldn't be merged, or not as-is)~~ * **EDIT**: the effects of these are no long quantifiable, the PR includes reverts for them * ~~see [Challenges/`jemalloc`: purging will commence in ten seconds](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#jemalloc-purging-will-commence-in-ten-seconds) for details on the `jemalloc` change~~ * this is also separately found in rust-lang#77162, and we probably want to avoid doing it by default, ideally we'd use the runtime control API `jemalloc` offers (assuming that can stop the timer that's already running, which I'm not sure about) * **EDIT**: until we can do this based on `-Z` flags, this commit has also been reverted * the `proc_macro` change was to avoid randomized hashing and therefore ASLR-like effects --- **(much later) EDIT**: take any numbers with a grain of salt, they may have changed since initial PR open. #### Write-up / report Because of how extensive the full report ended up being, I've kept most of it [on `hackmd.io`](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view), but for convenient access, here are all the sections (with individual links): (someone suggested I'd make a backup, so [here it is on the wayback machine](http://web.archive.org/web/20201127164748/https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view) - I'll need to remember to update that if I have to edit the write-up) * [**Motivation**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Motivation) * [**Results**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Results) * [**Overhead**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Overhead) *Preview (see the report itself for more details):* |Counter|Total `instructions-minus-irqs:u`|Overhead from "Baseline" (for all 1903881 counter reads)|Overhead from "Baseline" (per each counter read)| |-|-|-|-| |Baseline|63637621286 ±6|| |`instructions:u`|63658815885 ±2|  +21194599 ±8|  +11| |`instructions-minus-irqs:u`|63680307361 ±13|  +42686075 ±19|  +22| |`wall-time`|63951958376 ±10275|+314337090 ±10281|+165| * [**"Macro" noise (self time)**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#“Macro”-noise-(self-time)) *Preview (see the report itself for more details):* || `wall-time` (ns) | `instructions:u` | `instructions-minus-irqs:u` -: | -: | -: | -: `typeck` | 5478261360 ±283933373 (±~5.2%) | 17350144522 ±6392 (±~0.00004%) | 17351035832.5 ±4.5 (±~0.00000003%) `expand_crate` | 2342096719 ±110465856 (±~4.7%) | 8263777916 ±2937 (±~0.00004%) | 8263708389 ±0 (±~0%) `mir_borrowck` | 2216149671 ±119458444 (±~5.4%) | 8340920100 ±2794 (±~0.00003%) | 8341613983.5 ±2.5 (±~0.00000003%) `mir_built` | 1269059734 ±91514604 (±~7.2%) | 4454959122 ±1618 (±~0.00004%) | 4455303811 ±1 (±~0.00000002%) `resolve_crate` | 942154987.5 ±53068423.5 (±~5.6%) | 3951197709 ±39 (±~0.000001%) | 3951196865 ±0 (±~0%) * [**"Micro" noise (individual sampling intervals)**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#“Micro”-noise-(individual-sampling-intervals)) * [**Caveats**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Caveats) * [**Disabling ASLR**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Disabling-ASLR) * [**Non-deterministic proc macros**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Non-deterministic-proc-macros) * [**Subtracting IRQs**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Subtracting-IRQs) * [**Lack of support for multiple threads**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Lack-of-support-for-multiple-threads) * [**Challenges**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Challenges) * [**How do we even read hardware performance counters?**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#How-do-we-even-read-hardware-performance-counters) * [**ASLR: it's free entropy**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#ASLR-it’s-free-entropy) * [**The serializing instruction**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#The-serializing-instruction) * [**Getting constantly interrupted**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Getting-constantly-interrupted) * [**AMD patented time-travel and dubbed it `SpecLockMap`         or: "how we accidentally unlocked `rr` on AMD Zen"**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#AMD-patented-time-travel-and-dubbed-it-SpecLockMapnbspnbspnbspnbspnbspnbspnbspnbspor-“how-we-accidentally-unlocked-rr-on-AMD-Zen”) * [**`jemalloc`: purging will commence in ten seconds**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#jemalloc-purging-will-commence-in-ten-seconds) * [**Rebasing *shouldn't* affect the results, right?**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Rebasing-*shouldn’t*-affect-the-results,-right) * [**Epilogue: Zen's undocumented 420 counter**](https://hackmd.io/sH315lO2RuicY-SEt7ynGA?view#Epilogue-Zen’s-undocumented-420-counter)

[experiment/perf] Disable jemalloc's time-delayed purging, for extra …

8534cd7

…determinism.

rust-highfive assigned lcnr Sep 24, 2020

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Sep 24, 2020

eddyb marked this pull request as draft September 24, 2020 18:36

jyn514 added I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Sep 24, 2020

Mark-Simulacrum assigned Mark-Simulacrum and unassigned lcnr Sep 24, 2020

This comment has been minimized.

Sign in to view

Mark-Simulacrum added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Sep 30, 2020

eddyb mentioned this pull request Nov 5, 2020

Integrate measureme's hardware performance counter support. #78781

Merged

Mark-Simulacrum closed this Feb 28, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[experiment/perf] Disable jemalloc's time-delayed purging, for extra determinism. #77162

[experiment/perf] Disable jemalloc's time-delayed purging, for extra determinism. #77162

eddyb commented Sep 24, 2020 •

edited

Loading

rust-highfive commented Sep 24, 2020

eddyb commented Sep 24, 2020

bors commented Sep 24, 2020

bors commented Sep 24, 2020

Mark-Simulacrum commented Sep 24, 2020

Mark-Simulacrum commented Sep 24, 2020

Mark-Simulacrum commented Sep 24, 2020

rust-timer commented Sep 24, 2020

This comment has been minimized.

This comment has been minimized.

Mark-Simulacrum commented Sep 25, 2020 •

edited

Loading

rust-timer commented Sep 25, 2020

Mark-Simulacrum commented Sep 25, 2020

eddyb commented Sep 25, 2020 •

edited

Loading

eddyb commented Sep 25, 2020

eddyb commented Sep 25, 2020

Mark-Simulacrum commented Sep 28, 2020

Mark-Simulacrum commented Nov 22, 2020

[experiment/perf] Disable jemalloc's time-delayed purging, for extra determinism. #77162

[experiment/perf] Disable jemalloc's time-delayed purging, for extra determinism. #77162

Conversation

eddyb commented Sep 24, 2020 • edited Loading

rust-highfive commented Sep 24, 2020

eddyb commented Sep 24, 2020

bors commented Sep 24, 2020

bors commented Sep 24, 2020

Mark-Simulacrum commented Sep 24, 2020

Mark-Simulacrum commented Sep 24, 2020

Mark-Simulacrum commented Sep 24, 2020

rust-timer commented Sep 24, 2020

This comment has been minimized.

This comment has been minimized.

Mark-Simulacrum commented Sep 25, 2020 • edited Loading

rust-timer commented Sep 25, 2020

Mark-Simulacrum commented Sep 25, 2020

eddyb commented Sep 25, 2020 • edited Loading

eddyb commented Sep 25, 2020

eddyb commented Sep 25, 2020

Mark-Simulacrum commented Sep 28, 2020

Mark-Simulacrum commented Nov 22, 2020

eddyb commented Sep 24, 2020 •

edited

Loading

Mark-Simulacrum commented Sep 25, 2020 •

edited

Loading

eddyb commented Sep 25, 2020 •

edited

Loading