
Introduce OnceVec<T> primitive and use it for AllocId caches #136105

Closed
wants to merge 1 commit

Conversation

Mark-Simulacrum
Member

This significantly reduces contention when running miri under -Zthreads, allowing us to scale to 30ish cores (from ~7-8 without this patch).

This primitive can likely also replace the AppendOnlyVec (which uses a single spinlock for writes) and AppendOnlyIndexVec (rwlock-based) structures in sync/vec.rs, with at least equal concurrent performance. It might also be an improvement on the current locking for Symbol interning; we should be able to make Symbol::as_str() lock-free with this structure.

r? ghost (for now, until we have perf results)

cc https://rust-lang.zulipchat.com/#narrow/channel/187679-t-compiler.2Fwg-parallel-rustc/topic/Miri.20not.20getting.20as.20much.20parallelism.20as.20expected
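Since the full OnceVec source is not quoted on this page, here is a minimal sketch of the general shape of such a primitive: lazily allocated buckets of doubling size holding write-once slots, so reads are lock-free and writers only contend on the first touch of a bucket. The names, bucket sizing, and use of std's OnceLock per slot are illustrative assumptions rather than the PR's actual code (which packs its initialization state more compactly), and for brevity the sketch never frees its buckets.

    use std::sync::OnceLock;
    use std::sync::atomic::{AtomicPtr, Ordering};

    pub struct OnceVec<T> {
        // Bucket k holds 2^k slots, so 32 buckets cover roughly 2^32 indices.
        // NOTE: no Drop impl in this sketch, so buckets are leaked on drop.
        buckets: [AtomicPtr<OnceLock<T>>; 32],
    }

    impl<T> OnceVec<T> {
        pub fn new() -> Self {
            OnceVec { buckets: std::array::from_fn(|_| AtomicPtr::new(std::ptr::null_mut())) }
        }

        // Map a dense index to (bucket, offset): bucket k covers indices [2^k - 1, 2^(k+1) - 1).
        fn locate(index: usize) -> (usize, usize) {
            let bucket = (usize::BITS - (index + 1).leading_zeros()) as usize - 1;
            (bucket, index + 1 - (1 << bucket))
        }

        // Return the bucket, allocating it lazily; losers of the allocation race free their copy.
        fn bucket(&self, bucket: usize) -> *mut OnceLock<T> {
            let ptr = self.buckets[bucket].load(Ordering::Acquire);
            if !ptr.is_null() {
                return ptr;
            }
            let len = 1usize << bucket;
            let fresh: Box<[OnceLock<T>]> = (0..len).map(|_| OnceLock::new()).collect();
            let fresh = Box::into_raw(fresh) as *mut OnceLock<T>;
            match self.buckets[bucket].compare_exchange(
                std::ptr::null_mut(),
                fresh,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => fresh,
                Err(winner) => {
                    // Another thread installed the bucket first; drop our unused allocation.
                    unsafe { drop(Box::from_raw(std::ptr::slice_from_raw_parts_mut(fresh, len))) };
                    winner
                }
            }
        }

        /// Write-once insert at `index`; panics if the slot was already initialized.
        pub fn insert(&self, index: usize, value: T) {
            let (b, i) = Self::locate(index);
            let slot = unsafe { &*self.bucket(b).add(i) };
            assert!(slot.set(value).is_ok(), "slot {index} initialized twice");
        }

        /// Lock-free read; `None` if the slot (or its bucket) has not been written yet.
        pub fn get(&self, index: usize) -> Option<&T> {
            let (b, i) = Self::locate(index);
            let ptr = self.buckets[b].load(Ordering::Acquire);
            if ptr.is_null() {
                return None;
            }
            unsafe { (*ptr.add(i)).get() }
        }
    }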

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Jan 26, 2025
@rustbot
Collaborator

rustbot commented Jan 26, 2025

Some changes occurred to the CTFE / Miri interpreter

cc @rust-lang/miri, @rust-lang/wg-const-eval

@Mark-Simulacrum
Member Author

@bors try @rust-timer queue


@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jan 26, 2025
bors added a commit to rust-lang-ci/rust that referenced this pull request Jan 26, 2025
Introduce OnceVec<T> primitive and use it for AllocId caches

@bors
Contributor

bors commented Jan 26, 2025

⌛ Trying commit ae51601 with merge d457f92...

@rust-log-analyzer
Collaborator

The job mingw-check-tidy failed! Check out the build log: (web) (plain)

Click to see the possible cause of the failure (guessed by this bot)
info: removing rustup binaries
info: rustup is uninstalled
##[group]Image checksum input
mingw-check-tidy
# We use the ghcr base image because ghcr doesn't have a rate limit
# and the mingw-check-tidy job doesn't cache docker images in CI.
FROM ghcr.io/rust-lang/ubuntu:22.04
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y --no-install-recommends \
  g++ \
  make \
---

COPY host-x86_64/mingw-check/validate-toolstate.sh /scripts/
COPY host-x86_64/mingw-check/validate-error-codes.sh /scripts/

# NOTE: intentionally uses python2 for x.py so we can test it still works.
# validate-toolstate only runs in our CI, so it's ok for it to only support python3.
ENV SCRIPT TIDY_PRINT_DIFF=1 python2.7 ../x.py test \
           --stage 0 src/tools/tidy tidyselftest --extra-checks=py,cpp
# This file is autogenerated by pip-compile with Python 3.10
# by the following command:
#
#    pip-compile --allow-unsafe --generate-hashes reuse-requirements.in
---
#12 3.049 Building wheels for collected packages: reuse
#12 3.051   Building wheel for reuse (pyproject.toml): started
#12 3.261   Building wheel for reuse (pyproject.toml): finished with status 'done'
#12 3.262   Created wheel for reuse: filename=reuse-4.0.3-cp310-cp310-manylinux_2_35_x86_64.whl size=132719 sha256=be6760d5849de4a58bbe52b85ca57a55f2b32b518b17029a5ad2e530db0d4303
#12 3.263   Stored in directory: /tmp/pip-ephem-wheel-cache-tfjhingx/wheels/3d/8d/0a/e0fc6aba4494b28a967ab5eaf951c121d9c677958714e34532
#12 3.265 Installing collected packages: boolean-py, binaryornot, tomlkit, reuse, python-debian, markupsafe, license-expression, jinja2, chardet, attrs
#12 3.665 Successfully installed attrs-23.2.0 binaryornot-0.4.4 boolean-py-4.0 chardet-5.2.0 jinja2-3.1.4 license-expression-30.3.0 markupsafe-2.1.5 python-debian-0.1.49 reuse-4.0.3 tomlkit-0.13.0
#12 3.666 WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
#12 4.202 Collecting virtualenv
#12 4.265   Downloading virtualenv-20.29.1-py3-none-any.whl (4.3 MB)
#12 4.465      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.3/4.3 MB 21.6 MB/s eta 0:00:00
#12 4.531 Collecting platformdirs<5,>=3.9.1
#12 4.539   Downloading platformdirs-4.3.6-py3-none-any.whl (18 kB)
#12 4.563 Collecting distlib<1,>=0.3.7
#12 4.585      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 469.0/469.0 KB 43.0 MB/s eta 0:00:00
#12 4.625 Collecting filelock<4,>=3.12.2
#12 4.639   Downloading filelock-3.17.0-py3-none-any.whl (16 kB)
#12 4.720 Installing collected packages: distlib, platformdirs, filelock, virtualenv
#12 4.906 Successfully installed distlib-0.3.9 filelock-3.17.0 platformdirs-4.3.6 virtualenv-20.29.1
#12 DONE 5.0s

#13 [7/8] COPY host-x86_64/mingw-check/validate-toolstate.sh /scripts/
#13 DONE 0.0s
---
DirectMap4k:      124864 kB
DirectMap2M:     7215104 kB
DirectMap1G:    11534336 kB
##[endgroup]
Executing TIDY_PRINT_DIFF=1 python2.7 ../x.py test            --stage 0 src/tools/tidy tidyselftest --extra-checks=py,cpp
+ TIDY_PRINT_DIFF=1 python2.7 ../x.py test --stage 0 src/tools/tidy tidyselftest --extra-checks=py,cpp
    Finished `dev` profile [unoptimized] target(s) in 0.05s
##[endgroup]
WARN: currently no CI rustc builds have rustc debug assertions enabled. Please either set `rust.debug-assertions` to `false` if you want to use download CI rustc or set `rust.download-rustc` to `false`.
downloading https://static.rust-lang.org/dist/2025-01-08/rustfmt-nightly-x86_64-unknown-linux-gnu.tar.xz
---
##[endgroup]
fmt check
fmt: checked 5814 files
tidy check
tidy error: `/checkout/compiler/rustc_data_structures/src/sync/once_vec/test.rs:3` contains `#[test]`; unit tests and benchmarks must be placed into separate files or directories named `tests.rs`, `benches.rs`, `tests` or `benches`
removing old virtual environment
creating virtual environment at '/checkout/obj/build/venv' using 'python3.10' and 'venv'
creating virtual environment at '/checkout/obj/build/venv' using 'python3.10' and 'virtualenv'
Requirement already satisfied: pip in ./build/venv/lib/python3.10/site-packages (24.3.1)
  Downloading pip-25.0-py3-none-any.whl.metadata (3.7 kB)
Downloading pip-25.0-py3-none-any.whl (1.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 67.0 MB/s eta 0:00:00
Installing collected packages: pip
---
All checks passed!
checking python file formatting
28 files already formatted
checking C++ file formatting
some tidy checks failed
Command has failed. Rerun with -v to see more details.
  local time: Sun Jan 26 21:59:05 UTC 2025
  network time: Sun, 26 Jan 2025 21:59:05 GMT
##[error]Process completed with exit code 1.
Post job cleanup.

/// This provides amortized, concurrent O(1) access to &T, expecting a densely numbered key space
/// (all value slots are allocated up to the highest key inserted).
pub struct OnceVec<T> {
// Provide storage for up to 2^35 elements, which we expect to be enough in practice -- but can
Member

There can't be more than roughly 2^29 elements without proc macros, since you only get 2^32 bytes for all source code combined and each identifier averages multiple characters. In other words, support for 2^35 elements is plenty.

Member Author

I think that's probably not true for e.g. AllocIds, right? At least with Miri, you can presumably run a program indefinitely, so it keeps allocating and filling this table up. You'd probably run out of RAM first, etc., so I'm not actually particularly worried though :)

Member

Right

Member

In Miri (and CTFE), you can keep allocating and freeing memory, and that way use up AllocIds without using up more and more RAM.

Member

... but if you do that, those allocations don't end up in alloc_map; only the final value of a const/static is put there.

@bors
Contributor

bors commented Jan 26, 2025

☀️ Try build successful - checks-actions
Build commit: d457f92 (d457f9267fb98874ccfc8bcd906545e5979149e6)


fn reserve(&self) -> AllocId {
    let next_id = self.next_id.fetch_add(1, std::sync::atomic::Ordering::Relaxed);
Member

Suggested change

    let next_id = self.next_id.fetch_add(1, std::sync::atomic::Ordering::Relaxed);
    // Technically there is a window here where we overflow and then another thread
    // increments `next_id` *again* and uses it before we panic and tear down the entire session.
    // We consider this fine since such overflows cannot realistically occur.

The alternative would be to use a CAS loop to avoid ever storing 0 into next_id.
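For concreteness, a standalone sketch of that CAS-loop alternative, assuming the counter is an AtomicU64 and the id must end up as a NonZeroU64 (the helper name is made up for illustration, not the PR's code):

    use std::num::NonZeroU64;
    use std::sync::atomic::{AtomicU64, Ordering};

    // Bump the counter with compare_exchange so that on overflow we panic *before* the
    // wrapped-around value (0) is ever published, instead of after a fetch_add.
    fn reserve_id(next_id: &AtomicU64) -> NonZeroU64 {
        let mut current = next_id.load(Ordering::Relaxed);
        loop {
            let next = current.checked_add(1).expect("exhausted AllocIds");
            match next_id.compare_exchange_weak(current, next, Ordering::Relaxed, Ordering::Relaxed) {
                // `next` is at least 1 here, so the NonZeroU64 construction cannot fail.
                Ok(_) => return NonZeroU64::new(next).unwrap(),
                Err(observed) => current = observed,
            }
        }
    }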

@@ -389,35 +390,32 @@ pub const CTFE_ALLOC_SALT: usize = 0;

 pub(crate) struct AllocMap<'tcx> {
     /// Maps `AllocId`s to their corresponding allocations.
-    alloc_map: FxHashMap<AllocId, GlobalAlloc<'tcx>>,
+    alloc_map: rustc_data_structures::sync::OnceVec<GlobalAlloc<'tcx>>,
Member

Note that this map is very sparse in practice. Most AllocId don't end up in the final value of a const/static and hence they are never put in this map. A Vec seems like a poor representation choice for that.

Member Author

Hm, so we call reserve and then don't actually insert it? I thought (I guess incorrectly) that we always did end up inserting.

If so, then it might be sufficient to just replace the reservation with a fetch_add + shard the map.

I guess we'll see how bad the max-rss regression is to help gauge.

Member Author

Or maybe even not sharding the map - if it's rare enough to actually access it, just speeding up reservation might be good enough. Could even shard the counter then.

Member

Hm, so we call reserve and then don't actually insert it?

Correct. We always reserve because it may be the case that we need to insert the allocation later, but most allocations are never inserted. (Well, in Miri we actually know it will never be inserted. Not sure if that is worth exploiting somehow.)

Member

if it's rare enough to actually access it,

Reading it is more common, but Miri only accesses the map the first time a given interpreter session accesses some const/static. At that point the allocation is copied into interpreter-local state and then the global state is never accessed again.

So yeah I think just making the counter lock-free will give the same speedup.

Member

Looking at density on the sysroot crates (libcore -> libtest) -- see numbers below -- I get pretty high densities in practice.

I guess those crates do not contain complicated consts with many intermediate allocations.

In a Miri run, this will be extremely sparse... or at least, at some point the counter will keep increasing but nothing new is added to the map. However, due to queries this can be interleaved I think, so if Miri first creates 10k allocations and then evaluates a const, there'll be a gap of 10k IDs in the map.

Member

Hm, so I tried just the lock-free counter, and performance is about the same as without this patch at all (8, maybe 9 cores).

Odd, I am not sure what keeps accessing that map after the allocations have been copied into local interpreter memory.

Member Author

https://github.com/rust-lang/rust/blob/master/compiler/rustc_const_eval/src/interpret/memory.rs#L832

pub fn remove_unreachable_allocs(&mut self, allocs: &LiveAllocs<'_, '_>) {

I think this is called for all root AllocIds on provenance GC? It's not clear to me whether that means "all" in practice for the workloads in question.

Member

Ah... yes that is called a lot. We should probably query the self.memory maps before querying the global map.

Member

I did that in #136166.

// We just reserved, so should always be unique.
assert!(
self.alloc_map
.alloc_map
Member

Where does the double-alloc_map come from?

@rust-timer
Collaborator

Finished benchmarking commit (d457f92): comparison URL.

Overall result: ✅ improvements - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.

                            mean    range            count
Regressions ❌ (primary)    -       -                0
Regressions ❌ (secondary)  -       -                0
Improvements ✅ (primary)   -0.3%   [-0.4%, -0.3%]   2
Improvements ✅ (secondary) -1.0%   [-1.8%, -0.3%]   24
All ❌✅ (primary)           -0.3%   [-0.4%, -0.3%]   2

Max RSS (memory usage)

Results (primary -0.9%, secondary -4.2%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

                            mean    range            count
Regressions ❌ (primary)    -       -                0
Regressions ❌ (secondary)  2.5%    [2.5%, 2.5%]     1
Improvements ✅ (primary)   -0.9%   [-0.9%, -0.9%]   1
Improvements ✅ (secondary) -4.5%   [-8.2%, -2.3%]   19
All ❌✅ (primary)           -0.9%   [-0.9%, -0.9%]   1

Cycles

Results (secondary -7.9%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

                            mean    range            count
Regressions ❌ (primary)    -       -                0
Regressions ❌ (secondary)  -       -                0
Improvements ✅ (primary)   -       -                0
Improvements ✅ (secondary) -7.9%   [-9.8%, -5.8%]   15
All ❌✅ (primary)           -       -                0

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 773.224s -> 773.199s (-0.00%)
Artifact size: 328.21 MiB -> 328.37 MiB (0.05%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jan 27, 2025
bors added a commit to rust-lang-ci/rust that referenced this pull request Jan 27, 2025
Shard AllocMap Lock

This improves performance on many-seed parallel (-Zthreads=32) miri executions from managing to use ~8 cores to using 27-28 cores, which is about the same as what I see with the data structure proposed in rust-lang#136105 - I haven't analyzed but I suspect the sharding might actually work out better if we commonly insert "densely" since sharding would split the cache lines and the OnceVec packs locks close together. Of course, we could do something similar with the bitset lock too.

Either way, this seems like a very reasonable starting point that solves the problem ~equally well on what I can test locally.

r? `@RalfJung`
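As a hedged, self-contained illustration of the sharding idea described above (std types and made-up names, not the actual rustc_data_structures code): the single map lock is split into several independently locked shards and an id selects its shard, so concurrent operations on different ids rarely contend on the same lock or cache line.

    use std::collections::HashMap;
    use std::sync::Mutex;

    const SHARDS: usize = 32;

    struct ShardedAllocMap<V> {
        shards: Vec<Mutex<HashMap<u64, V>>>,
    }

    impl<V> ShardedAllocMap<V> {
        fn new() -> Self {
            ShardedAllocMap { shards: (0..SHARDS).map(|_| Mutex::new(HashMap::new())).collect() }
        }

        // Pick the shard for an id; different ids mostly hit different locks.
        fn shard(&self, id: u64) -> &Mutex<HashMap<u64, V>> {
            &self.shards[(id as usize) % SHARDS]
        }

        fn insert(&self, id: u64, value: V) {
            self.shard(id).lock().unwrap().insert(id, value);
        }

        fn get_cloned(&self, id: u64) -> Option<V>
        where
            V: Clone,
        {
            self.shard(id).lock().unwrap().get(&id).cloned()
        }
    }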
fmease added a commit to fmease/rust that referenced this pull request Jan 29, 2025
…=compiler-errors

interpret: is_alloc_live: check global allocs last

See rust-lang#136105 (comment).

(A perf run makes no sense as this is only used by Miri.)
rust-timer added a commit to rust-lang-ci/rust that referenced this pull request Jan 29, 2025
Rollup merge of rust-lang#136166 - RalfJung:interpet-is-alloc-live, r=compiler-errors

github-actions bot pushed a commit to rust-lang/miri that referenced this pull request Jan 30, 2025
interpret: is_alloc_live: check global allocs last
@RalfJung
Member

This is superseded by #136115, I think?

bors added a commit to rust-lang-ci/rust that referenced this pull request Feb 1, 2025
Shard AllocMap Lock
bors added a commit to rust-lang-ci/rust that referenced this pull request Feb 2, 2025
Shard AllocMap Lock
bors added a commit to rust-lang-ci/rust that referenced this pull request Feb 4, 2025
Shard AllocMap Lock
RalfJung pushed a commit to RalfJung/miri that referenced this pull request Feb 5, 2025
Shard AllocMap Lock

github-actions bot pushed a commit to rust-lang/rustc-dev-guide that referenced this pull request Feb 6, 2025
Shard AllocMap Lock
