Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of allocation profiling #287

Merged
merged 1 commit into from
Jul 26, 2023

Conversation

r1viollet
Copy link
Collaborator

@r1viollet r1viollet commented Jul 17, 2023

What does this PR do?

  • Adjust the pseudo random generator to be thread local
  • Change the generator to be smaller
  • Remove the lock from the allocation profiling path

Credit for this idea goes to @nsavoire

Motivation

Improve performance of allocation profiling. This really brings down the allocation profiling to numbers that are comparable.

Benchmark                                  Time             CPU   Iterations
----------------------------------------------------------------------------
BM_ThreadedAllocations_NoTracking     343085 ns       172913 ns         4525
BM_ThreadedAllocations_Tracking       483263 ns       244706 ns         2886

The numbers were 4x worse prior to this change.

After this change I will need to focus on the deallocation code path.

Additional Notes

This needs extensive testing.
Risks are:

  • Accuracy
  • Thread safety

How to test the change?

The benchmark only tests parts of this. We will need to be careful when deploying this change.

- Adjust the pseudo random generator to be thread local
- Change the generator to be smaller
- Remove the lock from the allocation profiling path
@r1viollet r1viollet marked this pull request as ready for review July 17, 2023 09:00
@@ -194,6 +190,10 @@ void AllocationTracker::track_allocation(uintptr_t addr, size_t size,
free_on_consecutive_failures(success);

if (success && _state.track_deallocations) {
// \fixme{r1viollet} adjust set to be lock free

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could probably be replaced with a rather simple sparse bitset over the least significant 48 bits of the pointers, which would allow CAS operations on pages of the bitset rather than a lock.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this, especially as the impact of collisions are easy to manage. I'll probably make it smaller than that.

@@ -39,7 +44,7 @@ void perform_memory_operations(bool track_allocations,
std::mt19937 gen(rd());

for (auto _ : state) {
state.PauseTiming();
// state.PauseTiming();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does this need to be commented out?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially I was measuring the cost of deallocation. In this PR I was optimizing the allocation code path.
I think the correct thing to do is have different benchmarks. I am continuing this work and I'll revisit this.

Copy link

@richardstartin richardstartin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change makes sense to me

@r1viollet r1viollet merged commit 7ff7238 into main Jul 26, 2023
@r1viollet r1viollet deleted the r1viollet/perf_alloc_profiling_v1 branch July 26, 2023 08:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants