-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
reduced allocation v04 span representation #598
Conversation
93c4665
to
c7e268b
Compare
BenchmarksComparisonBenchmark execution time: 2024-09-19 15:25:19 Comparing candidate commit 7fc6dcf in PR branch Found 8 performance improvements and 34 performance regressions! Performance is the same for 9 metrics, 2 unstable metrics. scenario:benching deserializing traces from msgpack to their internal representation
scenario:credit_card/is_card_number/
scenario:credit_card/is_card_number/ 3782-8224-6310-005
scenario:credit_card/is_card_number/ 378282246310005
scenario:credit_card/is_card_number/37828224631
scenario:credit_card/is_card_number/378282246310005
scenario:credit_card/is_card_number/37828224631000521389798
scenario:credit_card/is_card_number/x371413321323331
scenario:credit_card/is_card_number_no_luhn/
scenario:credit_card/is_card_number_no_luhn/ 3782-8224-6310-005
scenario:credit_card/is_card_number_no_luhn/ 378282246310005
scenario:credit_card/is_card_number_no_luhn/37828224631
scenario:credit_card/is_card_number_no_luhn/378282246310005
scenario:credit_card/is_card_number_no_luhn/37828224631000521389798
scenario:credit_card/is_card_number_no_luhn/x371413321323331
scenario:normalization/normalize_name/normalize_name/bad-name
scenario:normalization/normalize_name/normalize_name/good
scenario:normalization/normalize_service/normalize_service/A0000000000000000000000000000000000000000000000000...
scenario:normalization/normalize_service/normalize_service/Test Conversion 0f Weird !@#$%^&**() Characters
scenario:normalization/normalize_service/normalize_service/[empty string]
scenario:redis/obfuscate_redis_string
scenario:sql/obfuscate_sql_string
scenario:tags/replace_trace_tags
CandidateCandidate benchmark detailsGroup 1
Group 2
Group 3
Group 4
Group 5
Group 6
Group 7
Group 8
Group 9
Group 10
Group 11
Group 12
BaselineOmitted due to size. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #598 +/- ##
==========================================
+ Coverage 73.05% 73.19% +0.13%
==========================================
Files 252 254 +2
Lines 36072 36312 +240
==========================================
+ Hits 26352 26578 +226
- Misses 9720 9734 +14
|
b3b1fdc
to
2626b37
Compare
I wonder whether we should put |
a3e38c1
to
f48e785
Compare
@bwoebi - That's a good idea. The NoAllocString is sufficiently coupled to tinybytes at the moment that it doesn't make sense to separate them. |
f48e785
to
bd5d7bb
Compare
bd5d7bb
to
3caff9e
Compare
1a18a73
to
1de65da
Compare
@ekump (and reviewers) Please consider 386a684 first. It is quite a refactor on the ByteWrapper approach, which hopefully increases soundness and is probably also a bit more straightforward. (see also commit description) To view the full diff of this PR plus my commit, please look at https://github.com/DataDog/libdatadog/compare/bob/refactor-bytes-wrapper-away. |
There is a reason why |
@bantonsson What actually needs to be immutable is the underlying content, which it still is. This commit just allows manipulating the range into the underlying slice. Which in itself doesn't violate Sync/Send constraints (as you still need a |
@bwoebi You're indeed right about |
Okay, brought my commit to this branch after talking to Edmund :-) |
@@ -10,8 +10,7 @@ use std::{ | |||
/// Immutable bytes type with zero copy cloning and slicing. | |||
#[derive(Clone)] | |||
pub struct Bytes { | |||
ptr: *const u8, | |||
len: usize, | |||
slice: &'static [u8], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the {ptr, len} is better, since the slice is not really static.
I think it'd be pretty easy to misuse the definition.
On a sidenote, NonNull<u8>
is better than *const since the type can then benefit from non null pointer optimization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do prefer (ptr, len), but it doesn't really work with as_mut_slice()
. But I think 'static
is fine - it's ultimately no different to a bare pointer in terms of lifetime, whenever you reassemble it into a slice.
pub const fn len(&self) -> usize { | ||
self.len | ||
self.slice.len() | ||
} | ||
|
||
/// Returns `true` if the `Bytes` is empty. | ||
#[inline] | ||
pub const fn is_empty(&self) -> bool { | ||
self.len == 0 | ||
self.slice.is_empty() | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit but theses functions don't actually need to be implemented on Bytes since already present the type from impl Deref<[u8]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Except they are not const through deref, so that would be one reason to keep them (doubt that it helps here, this is merely an educational mention).
|
||
let mut trace: Vec<Span> = Default::default(); | ||
(0..trace_count).try_fold( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really a small nit, but I think the for loops was clearer than the nested try_fold
pub unsafe fn as_mut_slice(&mut self) -> &mut &'static [u8] { | ||
&mut self.slice | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you added this function because you want to pass en &mut &[u8] to the rpm decode functions, but really there is a simpler and safer way to do that.
rpm reads from types implementing RmpRead, which is implemented by default on all types implementing io::Read.
So IMO the correct way to use Bytes with rpm is to just impl Read on it and advance the base pointer from how many bytes were read.
impl io::Read for Bytes {
fn read(&mut self, &mut buff) -> Result<u8> {
let bytes_read = min(buff.len, self.len);
std::slice::copy_from_slice(&self[..bytes_read], buff[..bytes_read]);
self.ptr += bytes_read;
self.len -= bytes_read;
Ok(bytes_read)
}
}
Then you could just do rmp::decode::read_array_len(&mut data)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not just as simple as that, In particular with the string reading:
read_string_ref_nomut(unsafe { buf.as_mut_slice() }).map(|(str, newbuf)| {
let string = BytesString::from_bytes_slice(buf, str);
*unsafe { buf.as_mut_slice() } = newbuf;
string
})
there's a dance around the ordering of executions: first, read the string. Then fetch the string slice. Then adjust the size.
Because Bytes
asserts that any read from it is within its bounds. So we'd then have to craft an unsafe method for that instead.
Also, your Read
impl is actually copying every single byte. Passing a slice directly does not incur that overhead. It's maybe a bit more elegant to use Read
, but actually not as performant. And this PR here is supposed to solve ... performance problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just been profiling the whole patch. There's currently still one copy happening (the whole data buffer at once) in SidecarServer::send_trace_v04
. That single copy accounts for 35% of the whole overhead. So copying here is most likely unacceptable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah for sure we don't want to copy the strings we read, but for that you can make cleaner abstractions than using read_string_ref_nomut
.
We could very well rewrite it to use the following primitives to encapsulate the unsafe manipulation in the Bytes implementation
impl Bytes {
// make the bytes instance point to original_buffer[n..]
// returns None if the buffer is smaller than n
fn advance((&mut self, n: usize) -> Option<()> {
if n > self.len { return None }
self.ptr += n;
self.len -= n;
Ok(())
}
// make the bytes instance point to original_buffer[..n]
// returns None if the buffer is smaller than n
fn shrink((&mut self, n: usize) -> Option<()> {
if n > self.len { return None }
self.len = n;
Ok(())
}
// Yield the first n bytes and advance the offset from as much
fn split_front_ref(&mut self, n: usize) -> Option<&[u8]> {
let start = self.ptr;
self.advance(n)?
// Safe because the lifetime of the slice is smaller than the Bytes
// and n smaller than length
Ok(unsafe { slice:: from_raw_parts(start, n) })
}
fn split_front(&mut self, n: usize) -> Option<Bytes> {
if n > self.len { return None }
let mut front = self.clone();
self.advance(n)?;
front.shrink(n)?;
Ok(front)
}
}
fn read_string_ref(bytes: &'a mut Bytes) -> Result<&'a str, DecodeError> {
let len = rpm::decode::read_str_len(bytes)?;
let str_ref = self.split_front_ref(len).map_err(|| DecodeStringError::BufferSizeTooSmall(len))?;
str::from_utf8(str_ref).map_err(|_| DecodeStringError::InvalidUtf8())
}
fn read_string_bytes(bytes: &'a mut Bytes) -> Result<&'a str, DecodeError> {
let len = rpm::decode::read_str_len(bytes)?;
let str_bytes = self.split_front(len).map_err(|| DecodeStringError::BufferSizeTooSmall(len))?;
BytesString::from_bytes(str_bytes).map_err(...)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, I'm curious were we spend time other than the serialization, can your share the profile?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bwoebi and @paullegranddc - Is this conversation still ongoing? Any changes either of you require here? I don't have a strong opinion, so happy to implement whatever the group agrees on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel free to implement @paullegranddc suggestion with io::Read and the additional methods to Bytes. Should probably also fix the miri failure I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On my end, I don't have a strong opinion. The code currently works, so I think we can merge it, but it'd be nice if you have time to refactor it. Either in this PR or latter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@paullegranddc @bwoebi - I do like the impl io::Read
solution as it seems to contain less sharp edges, but benchmarks wind up being around 30% slower compared to the as_mut_slice
implementation. It's possible I misunderstood the suggestion and implemented something incorrectly here. But if neither of you see any glaring errors I will revert to the as_mut_slice
implementation for now and create a ticket to follow up on this in another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not surprise me too much, at least it would need much more optimization by the compiler. So, if it's actually 30% slower, then please revert it.
7df2b5b
to
ec21732
Compare
c814f9b
to
7ca377c
Compare
Bytes itself is already a structure with a slice into some underlying bytes. Thus, we are adding a cheap way to directly modify the slice of the Bytes struct. To facilitate this, we now store the tuple (ptr, len) as a &[] slice within the Bytes struct. The representation of these is exactly identical to before, but it allows trivial manipulation of the slice via direct assignment or &mut &[u8] reference. The way the underlying field was exposed on the BytesWrapper struct was also unsound. Access to it was not wrapped within an unsafe method. The decoder code can now trivially clone() the given Bytes struct and carry it around, directly shrinking the slice size of the underlying Bytes instance as the data is being processed. Additionally, there now exist two helper methods for numbers and strings (read_string_bytes and read_number_bytes) which handle a bit of the boilerplate. Sadly it is necessary to always write "unsafe { buf.as_mut_slice() }", but this is necessary for soundness, we only can actually guarantee safety when calling the actual decode functions, as long as we want to carry Bytes around. Signed-off-by: Bob Weinand <bob.weinand@datadoghq.com>
Signed-off-by: Bob Weinand <bob.weinand@datadoghq.com>
a9211b7
to
e0f127e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good now, thanks a lot for the effort in optimizing this!
What does this PR do?
Deserialize V.04 Spans into a new internal representation that attempts to minimize new String allocations. Using the existing protobuf definitions leads to a level of performance that tracer libraries are/will be uncomfortable with.
Introduce a
BytesString
that borrows from a tinybytes::Bytes buffer slice and aBufferWrapper
for keeping track of where in the bytes buffer we are slicing from.This will impact only v04 Spans at the moment.
Motivation
The PHP Language Platform team observed an unacceptable level of increased memory footprint when testing the use of the Sidecar for sending traces to the agent.
Additional Notes
The use of non-static
&str
orCow
types was not possible as once spans are decoded they may need to be passed to async code paths inSendData
on the way to the agent and this caused several lifetime issues. The use oftinybytes
is necessary asBytes
doesn't support&[u8]
input, which is necessary when we can't own the data (i.e. shared memory from the tracer using IPC).How to test the change?
Unit and integration tests have been updated.