
Performance live heap profiling - deallocation code path #298

Merged (21 commits into main) Oct 12, 2023

Conversation

@r1viollet (Collaborator) commented Aug 28, 2023

What does this PR do?

Ensure the structure that keeps track of allocation addresses is lock-free.
The proposal is to replace the hash map with a bitset built on atomic integers, plus some logic to map each address to a position in the bitset.
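The idea above can be sketched as follows. This is a minimal illustration, not the PR's exact implementation: the class and member names are assumptions, and the hash mirrors the address-folding discussed later in this review.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Sketch: a fixed-size bitset of atomic 64-bit words. An address is hashed
// to a bit position; set/unset are single atomic read-modify-write
// operations, so no lock is needed on the allocation/deallocation paths.
class AddressBitset {
public:
  explicit AddressBitset(uint32_t nb_bits) // nb_bits must be a power of two
      : _mask(nb_bits - 1), _words(nb_bits / 64) {}

  // Returns true if the bit was previously clear (this address is now tracked).
  bool set(uintptr_t addr) {
    uint32_t pos = hash(addr);
    uint64_t bit = uint64_t(1) << (pos & 63);
    uint64_t prev = _words[pos >> 6].fetch_or(bit, std::memory_order_acq_rel);
    return (prev & bit) == 0;
  }

  // Returns true if the bit was previously set (the allocation was tracked).
  bool unset(uintptr_t addr) {
    uint32_t pos = hash(addr);
    uint64_t bit = uint64_t(1) << (pos & 63);
    uint64_t prev = _words[pos >> 6].fetch_and(~bit, std::memory_order_acq_rel);
    return (prev & bit) != 0;
  }

private:
  // Drop alignment bits (mostly zero), then fold the two 32-bit halves.
  uint32_t hash(uintptr_t addr) const {
    uintptr_t h = addr >> 4;
    return static_cast<uint32_t>((h >> 32) ^ h) & _mask;
  }

  uint32_t _mask;
  std::vector<std::atomic<uint64_t>> _words;
};
```

Because `fetch_or`/`fetch_and` return the previous word, a caller can tell in one atomic operation whether it won or lost the race for a given address.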

Motivation

Performance:
I get significantly better numbers in the benchmark when focusing on the allocation code paths.

# New numbers are in the same ballpark
BM_ShortLived_NoTracking/process_time/real_time     440168 ns       998071 ns         1349
BM_ShortLived_Tracking/process_time/real_time       490663 ns      1160473 ns         1505
BM_LongLived_NoTracking/process_time                345291 ns       218756 ns         3223
BM_LongLived_Tracking/process_time                  342495 ns       358709 ns         1840

# Previous numbers (with same config)
BM_ShortLived_NoTracking/process_time/real_time     448681 ns       946333 ns         1602
BM_ShortLived_Tracking/process_time/real_time      1860407 ns      5558747 ns          377
BM_LongLived_NoTracking/process_time                339814 ns       197875 ns         3370
BM_LongLived_Tracking/process_time                 1316344 ns      4319851 ns          142

Additional Notes

The bitset has a limited number of elements, which affects the probability of capturing an allocation event.
This only matters when live allocation tracking is active.

How to test the change?

The performance improvement can be observed through benchmarks.
The unit tests cover the new data structure.
The existing tests already exercise live heap profiling.

Thanks to @richardstartin for mentioning this data structure.

@r1viollet r1viollet force-pushed the r1viollet/perf_dealloc_code_path branch 3 times, most recently from c182378 to 7e7682a Compare August 30, 2023 14:44
@r1viollet r1viollet marked this pull request as ready for review August 30, 2023 14:55
@r1viollet r1viollet force-pushed the r1viollet/perf_dealloc_code_path branch from 5a618aa to db98b47 Compare September 4, 2023 12:32
@r1viollet r1viollet requested a review from nsavoire September 6, 2023 08:16
@r1viollet r1viollet force-pushed the r1viollet/perf_dealloc_code_path branch from 167264f to 69a4a8f Compare September 8, 2023 08:29
Comment on lines 53 to 67
uint32_t remove_lower_bits(uintptr_t h1) {
  // Drop the alignment bits, then fold the two 32-bit halves together
  uintptr_t intermediate = h1 >> _lower_bits_ignored;
  uint32_t high = static_cast<uint32_t>(intermediate >> 32);
  uint32_t low = static_cast<uint32_t>(intermediate);
  return (high ^ low) & _nb_bits_mask;
}


I think you could get away with exactness here:

  • you can strip the top 16 bits out anyway, so you only need to actually store 48 bits, on top of the up to 4 you can remove for alignment.
  • if you only need to store 48 - alignment bits, you can consider a paginated structure, where you have up to 2^(32 - alignment) pages of 16 bit bitsets. This means you have an enormous array of null pointers, occasionally pointing to an 8KiB bitset whenever an address in the range [i, i + 2^16) has been encountered.
  • 2^(32 - alignment) is going to be too large (at least 2GiB with the max alignment), but allocation patterns will be sparse, especially in the high order bits, so what about a 3 level structure with up to 2^16 pages for the top 16 bits, each (there will be very few of these populated) indexing paginated bitsets of 2^(16 - alignment) pages of 16 bit bitsets. This pushes the alignment saving into the middle 12-16 bits which are likely to have higher cardinalities than the top 16.
    • The top level costs 512 KiB, most of it will be null pointers
    • Each populated range in the top level will cost 32-512 KiB according to alignment (32 for 16 byte alignment, 512 for no alignment)
    • Each range of 2^16 addresses after alignment costs 8KiB for the leaf level bitset.
    • E.g. for an 8GiB address space and 8 byte alignment, you end up with a fixed 512KiB, 2 * 64 KiB for the 2 populated middle pages, and 8KiB for any range of 2^16 consecutive addresses where you sample a pointer.
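The three-level structure described above can be sketched as follows. This is a single-threaded illustration under the stated assumptions (48-bit addresses, 8-byte alignment, 16/13/16 bit split); the class and names are hypothetical, and a concurrent version would need atomic words and CAS when installing pages.

```cpp
#include <array>
#include <cstdint>
#include <memory>

// Exact (collision-free) paginated address set: 48 significant address bits,
// minus 3 alignment bits, split as 16 (top) / 13 (middle) / 16 (leaf).
// Pages are allocated lazily, so sparse address patterns stay cheap.
class PaginatedAddressSet {
  static constexpr int kAlign = 3; // log2 of assumed 8-byte alignment
  using Leaf = std::array<uint64_t, (1 << 16) / 64>;                 // 8 KiB bitset
  using Mid = std::array<std::unique_ptr<Leaf>, 1 << (16 - kAlign)>; // 64 KiB of pointers
  std::array<std::unique_ptr<Mid>, 1 << 16> _top{};                  // 512 KiB of pointers

public:
  // Returns true if the address was not present before (no collisions).
  bool set(uintptr_t addr) {
    // Keep 48 bits, drop alignment bits -> 45 bits: top 16, middle 13, leaf 16.
    uintptr_t a = (addr & ((uintptr_t(1) << 48) - 1)) >> kAlign;
    auto &mid = _top[a >> 29];
    if (!mid) mid = std::make_unique<Mid>();
    auto &leaf = (*mid)[(a >> 16) & ((1 << (16 - kAlign)) - 1)];
    if (!leaf) leaf = std::make_unique<Leaf>(); // value-initialized to zero
    uint64_t &word = (*leaf)[(a & 0xFFFF) >> 6];
    uint64_t bit = uint64_t(1) << (a & 63);
    bool was_absent = (word & bit) == 0;
    word |= bit;
    return was_absent;
  }
};
```

The fixed cost is the 512 KiB top level; each populated middle page costs 64 KiB and each populated leaf 8 KiB, matching the figures in the comment above.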


And you can trim the 512KiB a lot if you know the max address

@r1viollet (Collaborator, Author) replied:


Thanks for exposing the math here.

Exactness can have value, especially when a user sets a low sampling interval and just wants to debug their program. I will delay trying this out, as it would take me down a new cycle of experiments and measurements.
On average the current approach is good at reducing memory impact; we slightly skew the sampling intervals when a collision occurs.

@r1viollet (Collaborator, Author) replied:


Added a ticket to internal board for now: https://datadoghq.atlassian.net/browse/PROF-8222

@r1viollet (Collaborator, Author) replied:


A first draft around this topic: https://github.com/DataDog/ddprof/pull/307/files
Perhaps we need some quick access path; it looks like the current performance is not yet good enough.

@r1viollet (Collaborator, Author) commented:

I checked that CI is fine (a deploy job could not start https://gitlab.ddbuild.io/DataDog/apm-reliability/ddprof-build/-/jobs/325597595)

@r1viollet r1viollet force-pushed the r1viollet/perf_dealloc_code_path branch from cea730b to 29f932e Compare September 13, 2023 08:30
@r1viollet r1viollet requested a review from nsavoire September 14, 2023 12:13
@r1viollet r1viollet force-pushed the r1viollet/perf_dealloc_code_path branch from b0ee8ce to e557dd1 Compare September 16, 2023 15:50
@r1viollet r1viollet marked this pull request as draft September 18, 2023 08:16
- Fix the missing TLS storage initializations
- Removal of the lock
- Minor fixes around the allocation code paths
- Adjust the usage of the address bitset
- Adjust the hashing logic
- Remove the dependency between bitset and sampling period
- Simplify the hashing logic
- Re-test the collision rate of the hash we use
- Improve the bitset set and unset flow to use a single atomic operation
- Enforce power of two on the size of the bitset
- Change the log_once API to take into account the place it is called from
- Naming updates
- Improve the unit test for the LOG_ONCE API
- Fix a regression on how indexes were computed, introduced while refactoring
- Rename a local variable to avoid a conflict with a function name
- Ensure an allocation sample is pushed even if we encounter a collision in the bitset
- Fix the logic around removal of bitset elements
- Add a stat on unmatched allocations, which could be a source of increased CPU consumption
@r1viollet r1viollet force-pushed the r1viollet/perf_dealloc_code_path branch from a52c80c to aa4e705 Compare October 9, 2023 11:26
@r1viollet r1viollet marked this pull request as ready for review October 9, 2023 11:27
@pr-commenter

pr-commenter bot commented Oct 10, 2023

Benchmarks

Parameters

config             baseline                          candidate
profiler-version   ddprof 0.15.0+b7ebbb98.21262748   ddprof 0.15.0+4a0dabcb.21284940

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 1 metric, 0 unstable metrics.

See unchanged results
scenario Δ mean execution_time
scenario:ddprof -S bench-bad-boggle-solver BadBoggleSolver_run work 1000 same

@pr-commenter

pr-commenter bot commented Oct 10, 2023

Benchmark results for collatz

Parameters

config             baseline                          candidate
profiler-version   ddprof 0.15.0+d078da20.21538818   ddprof 0.15.0+a2d602a3.21547478

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 1 metric, 0 unstable metrics.

See unchanged results
scenario Δ mean execution_time
scenario:ddprof -S bench-collatz --preset cpu_only collatz_runner.sh same

@pr-commenter

pr-commenter bot commented Oct 10, 2023

Benchmark results for BadBoggleSolver_run

Parameters

config             baseline                          candidate
profiler-version   ddprof 0.15.0+d078da20.21538818   ddprof 0.15.0+a2d602a3.21547478

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 1 metric, 0 unstable metrics.

See unchanged results
scenario Δ mean execution_time
scenario:ddprof -S bench-bad-boggle-solver BadBoggleSolver_run work 1000 same

@nsavoire (Collaborator) left a comment:


👍

@r1viollet r1viollet merged commit 53fa554 into main Oct 12, 2023
@r1viollet r1viollet deleted the r1viollet/perf_dealloc_code_path branch October 12, 2023 16:42