
Performance live heap profiling - deallocation code path #298

Merged (21 commits into main) Oct 12, 2023

Conversation

@r1viollet (Collaborator) commented Aug 28, 2023

What does this PR do?

Ensure the structure that keeps track of allocation addresses is lock-free.
The proposal is to replace the hash map with a bitset built on atomic integers, plus some logic to map each address to a position in the bitset.
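The idea above can be sketched as follows. This is a minimal illustration, not the PR's exact implementation: the class and member names are assumptions, and the hash mirrors the address-folding discussed later in this review.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Sketch: a fixed-size bitset of atomic 64-bit words. An address is hashed
// to a bit position; set/unset are single atomic read-modify-write
// operations, so no lock is needed on the allocation/deallocation paths.
class AddressBitset {
public:
  explicit AddressBitset(uint32_t nb_bits) // nb_bits must be a power of two
      : _mask(nb_bits - 1), _words(nb_bits / 64) {}

  // Returns true if the bit was previously clear (this address is now tracked).
  bool set(uintptr_t addr) {
    uint32_t pos = hash(addr);
    uint64_t bit = uint64_t(1) << (pos & 63);
    uint64_t prev = _words[pos >> 6].fetch_or(bit, std::memory_order_acq_rel);
    return (prev & bit) == 0;
  }

  // Returns true if the bit was previously set (the allocation was tracked).
  bool unset(uintptr_t addr) {
    uint32_t pos = hash(addr);
    uint64_t bit = uint64_t(1) << (pos & 63);
    uint64_t prev = _words[pos >> 6].fetch_and(~bit, std::memory_order_acq_rel);
    return (prev & bit) != 0;
  }

private:
  // Drop alignment bits (mostly zero), then fold the two 32-bit halves.
  uint32_t hash(uintptr_t addr) const {
    uintptr_t h = addr >> 4;
    return static_cast<uint32_t>((h >> 32) ^ h) & _mask;
  }

  uint32_t _mask;
  std::vector<std::atomic<uint64_t>> _words;
};
```

Because `fetch_or`/`fetch_and` return the previous word, a caller can tell in one atomic operation whether it won or lost the race for a given address.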

Motivation

Performance:
I get significantly better numbers in the benchmark when focusing on the allocation code paths.

# New numbers are in the same ballpark
BM_ShortLived_NoTracking/process_time/real_time     440168 ns       998071 ns         1349
BM_ShortLived_Tracking/process_time/real_time       490663 ns      1160473 ns         1505
BM_LongLived_NoTracking/process_time                345291 ns       218756 ns         3223
BM_LongLived_Tracking/process_time                  342495 ns       358709 ns         1840

# Previous numbers (with same config)
BM_ShortLived_NoTracking/process_time/real_time     448681 ns       946333 ns         1602
BM_ShortLived_Tracking/process_time/real_time      1860407 ns      5558747 ns          377
BM_LongLived_NoTracking/process_time                339814 ns       197875 ns         3370
BM_LongLived_Tracking/process_time                 1316344 ns      4319851 ns          142

Additional Notes

The bitset has a limited number of elements, which affects the probability of capturing an allocation event.
This only matters when live allocation tracking is active.

How to test the change?

The performance improvement can be observed through benchmarks.
The unit tests cover the new data structure.
The existing tests already exercise live heap profiling.

Thanks to @richardstartin for mentioning this data structure.

@r1viollet r1viollet force-pushed the r1viollet/perf_dealloc_code_path branch 3 times, most recently from c182378 to 7e7682a Compare August 30, 2023 14:44
@r1viollet r1viollet marked this pull request as ready for review August 30, 2023 14:55
@r1viollet r1viollet force-pushed the r1viollet/perf_dealloc_code_path branch from 5a618aa to db98b47 Compare September 4, 2023 12:32
@r1viollet r1viollet requested a review from nsavoire September 6, 2023 08:16
@r1viollet r1viollet force-pushed the r1viollet/perf_dealloc_code_path branch from 167264f to 69a4a8f Compare September 8, 2023 08:29
Comment on lines 53 to 67
uint32_t remove_lower_bits(uintptr_t h1) {
  // Drop the alignment bits, then fold the two 32-bit halves together
  uintptr_t intermediate = h1 >> _lower_bits_ignored;
  uint32_t high = static_cast<uint32_t>(intermediate >> 32);
  uint32_t low = static_cast<uint32_t>(intermediate);
  return (high ^ low) & _nb_bits_mask;
}


I think you could get away with exactness here:

  • you can strip the top 16 bits out anyway, so you only need to actually store 48 bits, on top of the up to 4 you can remove for alignment.
  • if you only need to store 48 - alignment bits, you can consider a paginated structure, where you have up to 2^(32 - alignment) pages of 16 bit bitsets. This means you have an enormous array of null pointers, occasionally pointing to an 8KiB bitset whenever an address in the range [i, i + 2^16) has been encountered.
  • 2^(32 - alignment) is going to be too large (at least 2GiB with the max alignment), but allocation patterns will be sparse, especially in the high order bits, so what about a 3 level structure with up to 2^16 pages for the top 16 bits, each (there will be very few of these populated) indexing paginated bitsets of 2^(16 - alignment) pages of 16 bit bitsets. This pushes the alignment saving into the middle 12-16 bits which are likely to have higher cardinalities than the top 16.
    • The top level costs 512 KiB, most of it will be null pointers
    • Each populated range in the top level will cost 32-512 KiB according to alignment (32 for 16 byte alignment, 512 for no alignment)
    • Each range of 2^16 addresses after alignment costs 8KiB for the leaf level bitset.
    • E.g. for an 8GiB address space and 8 byte alignment, you end up with a fixed 512KiB, 2 * 64 KiB for the 2 populated middle pages, and 8KiB for any range of 2^16 consecutive addresses where you sample a pointer.
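The three-level structure described above can be sketched as follows. This is a single-threaded illustration under the stated assumptions (48-bit addresses, 8-byte alignment, 16/13/16 bit split); the class and names are hypothetical, and a concurrent version would need atomic words and CAS when installing pages.

```cpp
#include <array>
#include <cstdint>
#include <memory>

// Exact (collision-free) paginated address set: 48 significant address bits,
// minus 3 alignment bits, split as 16 (top) / 13 (middle) / 16 (leaf).
// Pages are allocated lazily, so sparse address patterns stay cheap.
class PaginatedAddressSet {
  static constexpr int kAlign = 3; // log2 of assumed 8-byte alignment
  using Leaf = std::array<uint64_t, (1 << 16) / 64>;                 // 8 KiB bitset
  using Mid = std::array<std::unique_ptr<Leaf>, 1 << (16 - kAlign)>; // 64 KiB of pointers
  std::array<std::unique_ptr<Mid>, 1 << 16> _top{};                  // 512 KiB of pointers

public:
  // Returns true if the address was not present before (no collisions).
  bool set(uintptr_t addr) {
    // Keep 48 bits, drop alignment bits -> 45 bits: top 16, middle 13, leaf 16.
    uintptr_t a = (addr & ((uintptr_t(1) << 48) - 1)) >> kAlign;
    auto &mid = _top[a >> 29];
    if (!mid) mid = std::make_unique<Mid>();
    auto &leaf = (*mid)[(a >> 16) & ((1 << (16 - kAlign)) - 1)];
    if (!leaf) leaf = std::make_unique<Leaf>(); // value-initialized to zero
    uint64_t &word = (*leaf)[(a & 0xFFFF) >> 6];
    uint64_t bit = uint64_t(1) << (a & 63);
    bool was_absent = (word & bit) == 0;
    word |= bit;
    return was_absent;
  }
};
```

The fixed cost is the 512 KiB top level; each populated middle page costs 64 KiB and each populated leaf 8 KiB, matching the figures in the comment above.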


And you can trim the 512KiB a lot if you know the max address

@r1viollet (Collaborator, Author) replied:


Thanks for exposing the math here.

Exactness can have value, especially when a user sets a low sampling interval and just wants to debug their program. I will delay trying this out, as it would take me down a new cycle of experiments and measurements.
On average the current approach is good at reducing memory impact; we slightly skew the sampling intervals when a collision occurs.

@r1viollet (Collaborator, Author) replied:


Added a ticket to internal board for now: https://datadoghq.atlassian.net/browse/PROF-8222

@r1viollet (Collaborator, Author) replied:


A first draft around this topic: https://github.com/DataDog/ddprof/pull/307/files
Perhaps we need some quick access path; it looks like the current performance is not yet good enough.

@r1viollet (Collaborator, Author) commented:

I checked that CI is fine (a deploy job could not start https://gitlab.ddbuild.io/DataDog/apm-reliability/ddprof-build/-/jobs/325597595)

@r1viollet r1viollet force-pushed the r1viollet/perf_dealloc_code_path branch from cea730b to 29f932e Compare September 13, 2023 08:30
@r1viollet r1viollet requested a review from nsavoire September 14, 2023 12:13
@r1viollet r1viollet force-pushed the r1viollet/perf_dealloc_code_path branch from b0ee8ce to e557dd1 Compare September 16, 2023 15:50
@r1viollet r1viollet marked this pull request as draft September 18, 2023 08:16
- Fix the missing TLS storage initializations
- Removal of the lock
- Minor fixes around the allocation code paths
- Adjust the usage of the address bitset
- Adjust the hashing logic
- Remove the dependency between bitset and sampling period
- Simplify the hashing logic
- Re-test the collision rate of the hash we use
- Improve the bitset set and unset flow to use a single atomic operation
- Enforce power of two on the size of the bitset
- Change the log_once API to take into account the place it is called from
- Naming updates
- Improve the unit test for the LOG_ONCE API
- Fix a regression on how indexes were computed, introduced while refactoring
- Rename a local variable to avoid a conflict with a function name
- Ensure an allocation sample is pushed even if we encounter a collision in the bitset
- Fix the logic around removal of bitset elements
- Add a stat on unmatched allocations, which could be a source of increased CPU consumption
@r1viollet r1viollet force-pushed the r1viollet/perf_dealloc_code_path branch from a52c80c to aa4e705 Compare October 9, 2023 11:26
@r1viollet r1viollet marked this pull request as ready for review October 9, 2023 11:27
@pr-commenter

pr-commenter bot commented Oct 10, 2023

Benchmarks

Parameters

config             baseline                          candidate
profiler-version   ddprof 0.15.0+b7ebbb98.21262748   ddprof 0.15.0+4a0dabcb.21284940

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 1 metric, 0 unstable metrics.

See unchanged results
scenario Δ mean execution_time
scenario:ddprof -S bench-bad-boggle-solver BadBoggleSolver_run work 1000 same

@pr-commenter

pr-commenter bot commented Oct 10, 2023

Benchmark results for collatz

Parameters

config             baseline                          candidate
profiler-version   ddprof 0.15.0+d078da20.21538818   ddprof 0.15.0+a2d602a3.21547478

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 1 metric, 0 unstable metrics.

See unchanged results
scenario Δ mean execution_time
scenario:ddprof -S bench-collatz --preset cpu_only collatz_runner.sh same

@pr-commenter

pr-commenter bot commented Oct 10, 2023

Benchmark results for BadBoggleSolver_run

Parameters

config             baseline                          candidate
profiler-version   ddprof 0.15.0+d078da20.21538818   ddprof 0.15.0+a2d602a3.21547478

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 1 metric, 0 unstable metrics.

See unchanged results
scenario Δ mean execution_time
scenario:ddprof -S bench-bad-boggle-solver BadBoggleSolver_run work 1000 same

@nsavoire (Collaborator) left a comment:


👍

@r1viollet r1viollet merged commit 53fa554 into main Oct 12, 2023
@r1viollet r1viollet deleted the r1viollet/perf_dealloc_code_path branch October 12, 2023 16:42