Performance live heap profiling - deallocation code path #298
Conversation
include/lib/address_bitset.hpp (Outdated)

```cpp
// Fold a 64-bit address into a bitset index: drop the low bits (which carry
// little entropy because of allocator alignment), XOR the high and low 32-bit
// halves together, then mask down to the bitset's power-of-two size.
uint32_t remove_lower_bits(uintptr_t h1) {
  uint64_t intermediate = h1 >> _lower_bits_ignored;
  uint32_t high = (uint32_t)(intermediate >> 32);
  uint32_t low = (uint32_t)intermediate;
  uint32_t res = high ^ low;
  return res & _nb_bits_mask;
}
```
I think you could get away with exactness here:
- You can strip the top 16 bits out anyway, so you only need to actually store 48 bits, on top of the up to 4 bits you can remove for alignment.
- If you only need to store 48 - alignment bits, you can consider a paginated structure, where you have up to 2^(32 - alignment) pages of 2^16-bit bitsets. This means you have an enormous array of null pointers, occasionally pointing to an 8 KiB bitset whenever an address in the range [i, i + 2^16) has been encountered. 2^(32 - alignment) is going to be too large (at least 2 GiB with the max alignment), but allocation patterns will be sparse, especially in the high-order bits, so what about a 3-level structure: up to 2^16 pages for the top 16 bits, each (there will be very few of these populated) indexing paginated bitsets of 2^(16 - alignment) pages of 2^16-bit bitsets. This pushes the alignment saving into the middle 12-16 bits, which are likely to have higher cardinalities than the top 16.
- The top level costs 512 KiB, most of it null pointers.
- Each populated range in the top level costs 32-512 KiB according to alignment (32 for 16-byte alignment, 512 for no alignment).
- Each range of 2^16 addresses after alignment costs 8 KiB for the leaf-level bitset.
- E.g. for an 8 GiB address space and 8-byte alignment, you end up with a fixed 512 KiB, 2 * 64 KiB for the 2 populated middle pages, and 8 KiB for any range of 2^16 consecutive addresses where you sample a pointer.
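To make the layout concrete, here is a rough sketch of that 3-level structure (illustrative only: the names are made up, the alignment shift is a template parameter, and the lazy page allocation is not thread-safe as written; a real version would publish pages with an atomic compare-and-swap):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <memory>

template <unsigned kAlignShift = 3> // 3 => 8-byte alignment
class PagedAddressSet {
  static constexpr unsigned kLeafBits = 16;              // bottom 16 bits -> 8 KiB bitset
  static constexpr unsigned kMidBits  = 16 - kAlignShift;
  static constexpr size_t kTopSize    = size_t{1} << 16; // top 16 of the 48 stored bits
  static constexpr size_t kMidSize    = size_t{1} << kMidBits;
  static constexpr size_t kLeafWords  = (size_t{1} << kLeafBits) / 64;

  struct Leaf { std::atomic<uint64_t> words[kLeafWords] = {}; };
  struct Mid  { std::unique_ptr<Leaf> leaves[kMidSize]; };
  std::unique_ptr<Mid> _top[kTopSize]; // 512 KiB, mostly null pointers

public:
  // Returns true when the address was not present before (exact, no collisions).
  bool insert(uintptr_t addr) {
    uintptr_t a = addr >> kAlignShift; // alignment bits carry no information
    size_t leaf_idx = a & ((size_t{1} << kLeafBits) - 1);
    size_t mid_idx  = (a >> kLeafBits) & (kMidSize - 1);
    size_t top_idx  = (a >> (kLeafBits + kMidBits)) & (kTopSize - 1);

    std::unique_ptr<Mid> &mid = _top[top_idx];
    if (!mid) mid = std::make_unique<Mid>();    // NOT thread-safe as written
    std::unique_ptr<Leaf> &leaf = mid->leaves[mid_idx];
    if (!leaf) leaf = std::make_unique<Leaf>(); // idem

    uint64_t bit = uint64_t{1} << (leaf_idx % 64);
    uint64_t prev =
        leaf->words[leaf_idx / 64].fetch_or(bit, std::memory_order_relaxed);
    return (prev & bit) == 0;
  }
};
```

With 8-byte alignment this reproduces the numbers above: a fixed 512 KiB top level, 64 KiB per populated middle page, and 8 KiB per populated range of 2^16 consecutive aligned addresses.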
And you can trim the 512 KiB a lot if you know the max address.
Thanks for exposing the math here.
Exactness can have value, especially for cases where the user sets a lower sampling interval and just wants to debug their program. I will delay trying this out, as it would take me down a new cycle of experiments and measurements.
On average, the current approach is good at reducing memory impact. We do slightly skew the sampling intervals when a collision occurs.
Added a ticket to internal board for now: https://datadoghq.atlassian.net/browse/PROF-8222
A first draft around this topic: https://github.com/DataDog/ddprof/pull/307/files
Perhaps we need some quick access path; it looks like the current performance is not yet good enough.
I checked that CI is fine (a deploy job could not start: https://gitlab.ddbuild.io/DataDog/apm-reliability/ddprof-build/-/jobs/325597595).
Fix the missing TLS storage initializations
- Removal of the lock
- Minor fixes around the allocation code paths

Adjust the usage of the address bitset
- Adjust the hashing logic

Remove the dependency between the bitset and the sampling period. Simplify the hashing logic.

Re-test the collision rate of the hash we use.

Improve the bitset set and unset flow to have a single atomic operation
- Enforce a power-of-two size for the bitset
- Change the log_once API to take into account the place it is called from (see the sketch after this list)
- Naming updates

Improve the unit test for the LOG_ONCE API.

Fix a regression I introduced while refactoring in how indexes were computed.

Rename a local variable to avoid a conflict with a function name.

Ensure an allocation sample is pushed even if we encounter a collision in the bitset.

Fix the logic around the removal of a bitset element.

Add a stat on unmatched allocations, which could be a source of increased CPU consumption.
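For the log_once change, a hypothetical sketch of a call-site-keyed log-once macro (the names and shape are assumptions, not the actual LOG_ONCE API in this codebase): a static atomic flag per macro expansion site makes each call site log independently.

```cpp
#include <atomic>
#include <cstdio>

// One static flag per expansion site: each place LOG_ONCE appears in the
// source logs at most once, independently of other call sites.
// (GNU-style ##__VA_ARGS__ swallows the comma when no arguments are given.)
#define LOG_ONCE(fmt, ...)                                                     \
  do {                                                                         \
    static std::atomic<bool> _logged{false};                                   \
    if (!_logged.exchange(true, std::memory_order_relaxed)) {                  \
      fprintf(stderr, "[%s:%d] " fmt "\n", __FILE__, __LINE__, ##__VA_ARGS__); \
    }                                                                          \
  } while (0)
```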
Benchmarks
Summary: Found 0 performance improvements and 0 performance regressions. Performance is the same for 1 metric, 0 unstable metrics.

Benchmark results for collatz
Summary: Found 0 performance improvements and 0 performance regressions. Performance is the same for 1 metric, 0 unstable metrics.

Benchmark results for BadBoggleSolver_run
Summary: Found 0 performance improvements and 0 performance regressions. Performance is the same for 1 metric, 0 unstable metrics.
👍
What does this PR do?
Ensure the structure that keeps track of allocation addresses is lock-free.
The proposal is to use a bitset (built on atomic integers) instead of a hash map, with some logic to map each address to a position in the bitset, roughly as sketched below.
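A minimal sketch of the idea (assumed names, loosely modeled on include/lib/address_bitset.hpp; the actual implementation may differ): the size is a power of two so a mask replaces a modulo, and fetch_or / fetch_and both flip the bit and report whether it was already taken, in a single atomic operation.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

class AddressBitset {
  unsigned _nb_bits;                         // power of two (enforced)
  uint32_t _nb_bits_mask;                    // nb_bits - 1, valid mask only then
  std::vector<std::atomic<uint64_t>> _words; // the bitset, 64 bits per word
  static constexpr unsigned _lower_bits_ignored = 4; // low bits carry little entropy

  // Fold the 64-bit address into the bitset's index range
  // (same scheme as the remove_lower_bits snippet reviewed above).
  uint32_t hash(uintptr_t addr) const {
    uint64_t h = addr >> _lower_bits_ignored;
    return (static_cast<uint32_t>(h >> 32) ^ static_cast<uint32_t>(h)) & _nb_bits_mask;
  }

public:
  explicit AddressBitset(unsigned nb_bits) // nb_bits must be a power of two
      : _nb_bits(nb_bits), _nb_bits_mask(nb_bits - 1), _words(nb_bits / 64) {}

  // Single atomic op: fetch_or returns the previous word, so we learn whether
  // the slot was free (true) or a collision occurred (false).
  bool add(uintptr_t addr) {
    uint32_t pos = hash(addr);
    uint64_t bit = uint64_t{1} << (pos & 63);
    return (_words[pos >> 6].fetch_or(bit, std::memory_order_relaxed) & bit) == 0;
  }

  // Symmetric single-op removal; returns false if the address was not tracked.
  bool remove(uintptr_t addr) {
    uint32_t pos = hash(addr);
    uint64_t bit = uint64_t{1} << (pos & 63);
    return (_words[pos >> 6].fetch_and(~bit, std::memory_order_relaxed) & bit) != 0;
  }
};
```

Returning the collision information from the same fetch_or keeps the set path to one atomic operation and lets the caller still push the allocation sample on a collision (it just will not be matched by a deallocation later).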
Motivation
Performance:
I get significantly better numbers in the benchmarks when focusing on the allocation code paths.
Additional Notes
The bitset has a limited number of slots, which reduces the probability of capturing any given allocation event (a collision drops the sample).
This only applies when live allocation tracking is triggered.
How to test the change?
The improvement in performance can be observed through benchmarks.
The unit tests cover the new data structure.
The existing tests already exercise live heap profiling.
Thanks to @richardstartin for mentioning this data structure.