Introduce OnceVec<T> primitive and use it for AllocId caches #136105
Conversation
This significantly reduces contention when running miri under -Zthreads, allowing us to scale to 30ish cores (from ~7-8 without this patch).
Some changes occurred to the CTFE / Miri interpreter cc @rust-lang/miri, @rust-lang/wg-const-eval
@bors try @rust-timer queue
Introduce OnceVec<T> primitive and use it for AllocId caches

This significantly reduces contention when running miri under -Zthreads, allowing us to scale to 30ish cores (from ~7-8 without this patch).

This primitive can likely also replace the AppendOnlyVec (which has a single spinlock for writes) and AppendOnlyIndexVec (rwlock) structures in [sync/vec.rs](https://github.com/rust-lang/rust/blob/master/compiler/rustc_data_structures/src/sync/vec.rs), with concurrent performance that is at least as good. It might also be an improvement on the current locking for Symbol interning; we should be able to make Symbol::as_str() lock-free with this structure.

r? ghost (for now, until we have perf results)

cc https://rust-lang.zulipchat.com/#narrow/channel/187679-t-compiler.2Fwg-parallel-rustc/topic/Miri.20not.20getting.20as.20much.20parallelism.20as.20expected
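For readers skimming this thread, here is a minimal, self-contained sketch of the kind of structure being proposed. It is not the code from this PR (the real implementation packs per-slot initialization state into atomics rather than using a OnceLock per slot, and all names here are illustrative): the key idea is bucketed storage with doubling bucket sizes, so a slot's address never changes once its bucket exists and reads can be lock-free.

```rust
use std::sync::OnceLock;

/// Sketch of a write-once, densely indexed vector. Bucket `i` holds
/// `BASE << i` slots, so slot addresses are stable and reads take no lock.
pub struct OnceVecSketch<T> {
    buckets: [OnceLock<Box<[OnceLock<T>]>>; 32],
}

/// Slots in the first bucket (an arbitrary choice for this sketch).
const BASE: usize = 16;

impl<T> OnceVecSketch<T> {
    pub fn new() -> Self {
        Self { buckets: std::array::from_fn(|_| OnceLock::new()) }
    }

    /// Map a flat index to (bucket, slot-within-bucket).
    fn locate(index: usize) -> (usize, usize) {
        let n = index / BASE + 1;
        let bucket = (usize::BITS - 1 - n.leading_zeros()) as usize; // floor(log2(n))
        let bucket_start = BASE * ((1 << bucket) - 1);
        (bucket, index - bucket_start)
    }

    /// Concurrent read: only atomic "is this initialized?" checks, no locking.
    pub fn get(&self, index: usize) -> Option<&T> {
        let (bucket, slot) = Self::locate(index);
        self.buckets[bucket].get()?.get(slot)?.get()
    }

    /// Write-once insert; fails if the slot was already set.
    pub fn insert(&self, index: usize, value: T) -> Result<(), T> {
        let (bucket, slot) = Self::locate(index);
        let slots = self.buckets[bucket]
            .get_or_init(|| (0..(BASE << bucket)).map(|_| OnceLock::new()).collect());
        slots[slot].set(value)
    }
}

fn main() {
    let v: OnceVecSketch<&str> = OnceVecSketch::new();
    assert!(v.insert(0, "first bucket").is_ok());
    assert!(v.insert(40, "second bucket").is_ok());
    assert_eq!(v.get(40), Some(&"second bucket"));
    assert_eq!(v.get(7), None);
}
```

Doubling bucket sizes keep the number of buckets small (a few dozen pointers cover tens of billions of slots) while the bucket for a given key is computable with a couple of bit operations, so readers never contend with writers.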
/// This provides amortized, concurrent O(1) access to &T, expecting a densely numbered key space
/// (all value slots are allocated up to the highest key inserted).
pub struct OnceVec<T> {
    // Provide storage for up to 2^35 elements, which we expect to be enough in practice -- but can
There can't be more than about 2^29 elements without proc macros, as you only get 2^32 bytes for all source code combined and you need on average multiple characters for each identifier. In other words, support for 2^35 elements is plenty.
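(For concreteness, the arithmetic behind that bound, assuming for illustration an average of at least 8 bytes of source text per identifier:

$$
\frac{2^{32}\ \text{bytes of source}}{2^{3}\ \text{bytes per identifier}} = 2^{29}\ \text{identifiers} \ll 2^{35}\ \text{slots}
$$
)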
I think that's probably not true for e.g. AllocIds, right? At least with miri, you can presumably run a program that runs indefinitely and keeps allocating, filling this up. You'd probably run out of RAM, etc., so I'm not actually particularly worried though :)
Right
In Miri (and CTFE), you can keep allocating and freeing memory and that way use up AllocIds without using up more and more RAM.
... but if you do that, those allocations don't end up in alloc_map; only the final value of a const/static is put there.
☀️ Try build successful - checks-actions
);
next
fn reserve(&self) -> AllocId {
    let next_id = self.next_id.fetch_add(1, std::sync::atomic::Ordering::Relaxed);
Suggested change:
let next_id = self.next_id.fetch_add(1, std::sync::atomic::Ordering::Relaxed);
// Technically there is a window here where we overflow and then another thread
// increments `next_id` *again* and uses it before we panic and tear down the entire session.
// We consider this fine since such overflows cannot realistically occur.
The alternative would be to use a CAS loop to avoid ever storing 0 into next_id.
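A minimal sketch of that alternative, with illustrative names and a plain AtomicU64 standing in for the real counter (not the code from this PR): the overflow check happens before the store, so a wrapped value, and in particular 0, is never published.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Reserve the next id with a CAS loop. `AllocId` in rustc wraps a non-zero
/// integer, which is why a wrapped-to-zero counter value must never be stored.
fn reserve(next_id: &AtomicU64) -> u64 {
    let mut current = next_id.load(Ordering::Relaxed);
    loop {
        // Detect overflow *before* publishing the new value.
        let next = current.checked_add(1).expect("exhausted AllocId counter");
        match next_id.compare_exchange_weak(current, next, Ordering::Relaxed, Ordering::Relaxed) {
            Ok(_) => return next,
            Err(observed) => current = observed, // raced with another thread; retry
        }
    }
}

fn main() {
    let counter = AtomicU64::new(0);
    assert_eq!(reserve(&counter), 1);
    assert_eq!(reserve(&counter), 2);
}
```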
@@ -389,35 +390,32 @@ pub const CTFE_ALLOC_SALT: usize = 0;

pub(crate) struct AllocMap<'tcx> {
    /// Maps `AllocId`s to their corresponding allocations.
-   alloc_map: FxHashMap<AllocId, GlobalAlloc<'tcx>>,
+   alloc_map: rustc_data_structures::sync::OnceVec<GlobalAlloc<'tcx>>,
Note that this map is very sparse in practice. Most AllocIds don't end up in the final value of a const/static and hence they are never put in this map. A Vec seems like a poor representation choice for that.
Hm, so we call reserve and then don't actually insert it? I thought (I guess incorrectly) that we always did end up inserting.
If so, then it might be sufficient to just replace the reservation with a fetch_add + shard the map.
I guess we'll see how bad the max-rss regression is, which should help gauge that.
Or maybe even not sharding the map - if it's rare enough to actually access it, just speeding up reservation might be good enough. Could even shard the counter then.
Hm, so we call reserve and then don't actually insert it?
Correct. We always reserve because it may be the case that we need to insert the allocation later, but most allocations are never inserted. (Well, in Miri we actually know it will never be inserted. Not sure if that is worth exploiting somehow.)
if it's rare enough to actually access it,
Reading it is more common, but Miri only accesses the map the first time a given interpreter session accesses some const/static. At that point the allocation is copied into interpreter-local state and then the global state is never accessed again.
So yeah I think just making the counter lock-free will give the same speedup.
Looking at density on the sysroot crates (libcore -> libtest) -- see numbers below -- I get pretty high densities in practice.
I guess those crates do not contain complicated consts with many intermediate allocations.
In a Miri run, this will be extremely sparse... or at least, at some point the counter will keep increasing but nothing new is added to the map. However, due to queries this can be interleaved I think, so if Miri first creates 10k allocations and then evaluates a const, there'll be a gap of 10k IDs in the map.
Hm, so I tried just the lock-free counter, and it performs about the same as without this patch entirely (8, maybe 9 cores).
Odd, I am not sure what keeps accessing that map after the allocations have been copied into local interpreter memory.
https://github.com/rust-lang/rust/blob/master/compiler/rustc_const_eval/src/interpret/memory.rs#L832
rust/src/tools/miri/src/borrow_tracker/mod.rs, line 218 (at 5545959):
pub fn remove_unreachable_allocs(&mut self, allocs: &LiveAllocs<'_, '_>) {
I think this is called for all root AllocIds on provenance GC? It's not clear to me whether that means "all" in practice for the workloads in question.
Ah... yes that is called a lot. We should probably query the self.memory maps before querying the global map.
I did that in #136166.
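For illustration, here is a hypothetical, self-contained sketch of that reordering (the real change is rust-lang#136166; the types and field names below are stand-ins, not the actual interpreter data structures): the cheap, interpreter-local maps are consulted first, and the shared global map, which requires synchronization, only as a last resort.

```rust
use std::collections::HashMap;

/// Stand-in for the interpreter-local allocation maps (no shared lock needed).
struct LocalMemory<A> {
    alloc_map: HashMap<u64, A>,
    extra_fn_ptr_map: HashMap<u64, ()>,
}

fn is_alloc_live<A>(
    local: &LocalMemory<A>,
    global_contains: impl Fn(u64) -> bool, // stands in for the locked global alloc_map
    id: u64,
) -> bool {
    // Uncontended local lookups first ...
    local.alloc_map.contains_key(&id)
        || local.extra_fn_ptr_map.contains_key(&id)
        // ... and only then the shared global map, so provenance GC rarely takes its lock.
        || global_contains(id)
}

fn main() {
    let local = LocalMemory {
        alloc_map: HashMap::from([(1u64, "alloc")]),
        extra_fn_ptr_map: HashMap::new(),
    };
    assert!(is_alloc_live(&local, |_| false, 1));
    assert!(!is_alloc_live(&local, |_| false, 2));
}
```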
// We just reserved, so should always be unique.
assert!(
    self.alloc_map
        .alloc_map
Where does the double alloc_map come from?
Finished benchmarking commit (d457f92): comparison URL.

Overall result: ✅ improvements - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never

Instruction count: This is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.

Max RSS (memory usage): Results (primary -0.9%, secondary -4.2%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: Results (secondary -7.9%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Binary size: This benchmark run did not return any relevant results for this metric.

Bootstrap: 773.224s -> 773.199s (-0.00%)
Shard AllocMap Lock

This improves performance on many-seed parallel (-Zthreads=32) miri executions from managing to use ~8 cores to using 27-28 cores, which is about the same as what I see with the data structure proposed in rust-lang#136105 - I haven't analyzed but I suspect the sharding might actually work out better if we commonly insert "densely", since sharding would split the cache lines and the OnceVec packs locks close together. Of course, we could do something similar with the bitset lock too. Either way, this seems like a very reasonable starting point that solves the problem ~equally well on what I can test locally.

r? `@RalfJung`
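As a rough illustration of the sharding approach (names are made up for this sketch; rustc has its own Sharded helper in rustc_data_structures, and this is not the code from that PR): one contended lock is replaced by N independently locked shards, chosen by key, so threads reserving or looking up different ids usually take different locks.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

const SHARDS: usize = 32;

/// A map split across independently locked shards, picked by key.
struct ShardedMap<V> {
    shards: Vec<Mutex<HashMap<u64, V>>>,
}

impl<V> ShardedMap<V> {
    fn new() -> Self {
        Self { shards: (0..SHARDS).map(|_| Mutex::new(HashMap::new())).collect() }
    }

    /// Dense integer keys can simply be reduced modulo the shard count.
    fn shard_for(&self, key: u64) -> &Mutex<HashMap<u64, V>> {
        &self.shards[(key as usize) % SHARDS]
    }

    fn insert(&self, key: u64, value: V) {
        self.shard_for(key).lock().unwrap().insert(key, value);
    }

    fn get_cloned(&self, key: u64) -> Option<V>
    where
        V: Clone,
    {
        self.shard_for(key).lock().unwrap().get(&key).cloned()
    }
}

fn main() {
    let map = ShardedMap::new();
    map.insert(7, "seven");
    assert_eq!(map.get_cloned(7), Some("seven"));
}
```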
Rollup merge of rust-lang#136166 - RalfJung:interpet-is-alloc-live, r=compiler-errors

interpret: is_alloc_live: check global allocs last

See rust-lang#136105 (comment). (A perf run makes no sense as this is only used by Miri.)
This is superseded by #136115, I think?