Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(perf): switch global allocator to mimalloc #14691

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

x-hgg-x
Copy link
Contributor

@x-hgg-x x-hgg-x commented Oct 15, 2024

What does this PR try to resolve?

Follow-up of Eh2406/pubgrub-crates-benchmark#6 (comment).

This PR uses mimalloc as the global allocator for cargo. This improves performance of the resolver by reducing allocation and deallocation cost. The performance gains are the same as with jemalloc, but mimalloc is compatible with Windows.

Comparison of performance using solana-core = "=1.0.5" in Cargo.toml:

allocator duration
system 66.3s
jemalloc 57.7s
mimalloc 58s

r? Eh2406

@rustbot rustbot added A-cli Area: Command-line interface, option parsing, etc. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 15, 2024
Cargo.toml Outdated Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reminds me that since Cargo is no longer a workspace member in rust-lang/rust, we may have missed some optimization they've done in the past. We may also want to check what we should pick up again in Cargo's [profile].

Cargo.toml Show resolved Hide resolved
@Eh2406
Copy link
Contributor

Eh2406 commented Oct 15, 2024

I don't know have context about how different parts of rust releases are supposed to link to jemallocator. @ehuss how should we link and who do we need to talk to?

@bors r? ehuss

@rustbot rustbot assigned ehuss and unassigned Eh2406 Oct 15, 2024
@weihanglo
Copy link
Member

@x-hgg-x
Copy link
Contributor Author

x-hgg-x commented Oct 16, 2024

Comparison on Linux between system, jemalloc and mimalloc allocators with the last improvements on master, using solana-core = "=1.0.5" in Cargo.toml:
profiles.tar.gz

allocator duration
system 66.3s
jemalloc 57.7s
mimalloc 58s

Output of perf diff between system and jemalloc:

# Baseline  Delta Abs  Shared Object  Symbol                                                                                                                                                                                                                                                                                                                               
    10.33%     -3.56%  cargo          [.] cargo::core::resolver::RemainingCandidates::next
     3.28%     -3.28%  libc.so.6      [.] malloc
     3.10%     -3.10%  libc.so.6      [.] _int_malloc
     2.37%     -2.37%  libc.so.6      [.] _int_free
     5.60%     +1.84%  libc.so.6      [.] __memmove_avx_unaligned_erms
     2.67%     +1.80%  cargo          [.] core::hash::Hash::hash_slice
     3.10%     +1.20%  cargo          [.] alloc::rc::Rc<T,A>::make_mut
     7.71%     +1.06%  cargo          [.] <cargo::core::dependency::Dependency as core::hash::Hash>::hash
               +0.99%  cargo          [.] _rjem_sdallocx
     0.80%     -0.79%  libc.so.6      [.] _int_free_merge_chunk
     4.45%     +0.63%  cargo          [.] <alloc::rc::Rc<T,A> as core::ops::drop::Drop>::drop
               +0.57%  cargo          [.] _rjem_malloc
     3.59%     +0.47%  cargo          [.] <core::hash::sip::Hasher<S> as core::hash::Hasher>::write

Output of perf diff between system and mimalloc:

# Baseline  Delta Abs  Shared Object  Symbol                                                                                                                                                                                                                                                                                                                               
    10.33%     -3.76%  cargo          [.] cargo::core::resolver::RemainingCandidates::next
     3.28%     -3.27%  libc.so.6      [.] malloc
     3.10%     -3.10%  libc.so.6      [.] _int_malloc
     2.37%     -2.37%  libc.so.6      [.] _int_free
     2.67%     +1.84%  cargo          [.] core::hash::Hash::hash_slice
     5.60%     +1.64%  libc.so.6      [.] __memmove_avx_unaligned_erms
               +1.05%  cargo          [.] mi_free
     4.45%     +1.03%  cargo          [.] <alloc::rc::Rc<T,A> as core::ops::drop::Drop>::drop
     0.98%     -0.98%  libc.so.6      [.] unlink_chunk.isra.0
     3.10%     +0.98%  cargo          [.] alloc::rc::Rc<T,A>::make_mut
     7.71%     +0.72%  cargo          [.] <cargo::core::dependency::Dependency as core::hash::Hash>::hash
     6.21%     +0.66%  cargo          [.] <bitmaps::bitmap::Iter<Size> as core::iter::traits::iterator::Iterator>::next
               +0.62%  cargo          [.] _mi_malloc_generic
     0.59%     -0.59%  libc.so.6      [.] cfree@GLIBC_2.2.5

The main gain seems to be the memory access to the interned PackageId in the resolver hot loop.
Since jemalloc and mimalloc have about the same performance profiles, we can enable mimalloc for Windows and use either one for the other platforms.

@epage
Copy link
Contributor

epage commented Oct 16, 2024

Since jemalloc and mimalloc have about the same performance profiles, we can enable mimalloc for Windows and use either one for the other platforms.

If there is little performance gain to specialize allocators for a given platform, I would lean towards keeping things simple and using a single allocator across all of them.

@x-hgg-x
Copy link
Contributor Author

x-hgg-x commented Oct 16, 2024

Here is the comparison of maximum RSS between allocators, when running cargo generate-lockfile:

allocator maximum RSS
system 5171 Mo
jemalloc 5064 Mo
mimalloc 4948 Mo

@cuviper
Copy link
Member

cuviper commented Oct 16, 2024

What version of glibc was used for the system tests?

Also, could this be set as a feature so third parties like distros can more easily opt out? I think for integration, I'd rather keep using the system allocator.

@weihanglo
Copy link
Member

Also, could this be set as a feature so third parties like distros can more easily opt out? I think for integration, I'd rather keep using the system allocator.

If we're going to do it, we will, just like openssl and libgit2.

@x-hgg-x
Copy link
Contributor Author

x-hgg-x commented Oct 16, 2024

What version of glibc was used for the system tests?

I have glibc 2.39 from Ubuntu 24.04 LTS.

@x-hgg-x x-hgg-x force-pushed the resolver-perf-4 branch 2 times, most recently from adab712 to e40f1c5 Compare October 22, 2024 19:17
@x-hgg-x x-hgg-x changed the title fix(perf): switch global allocator to jemalloc fix(perf): switch global allocator to mimalloc Oct 22, 2024
Cargo.toml Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-cli Area: Command-line interface, option parsing, etc. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants