Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

buffer: improve read reservations to efficiently handle multiple slices #14054

Merged
merged 59 commits into from
Feb 1, 2021

Conversation

ggreenway
Copy link
Contributor

@ggreenway ggreenway commented Nov 17, 2020

Signed-off-by: Greg Greenway ggreenway@apple.com

Commit Message: Enable reading larger chunks from sockets in a single call without drastically increasing memory waste by implementing a system where reservations of multiple slices are made, and unused slices (after the read operation) are put into a small cache for re-use by the next read operation.

The largest read operation changed from 16k to 128k.

Watermark buffer limits are still enforced; large reads only happen if buffer limits allow and space is available.

This improves performance in some high-throughput use cases.

Additional Description:
Risk Level: Medium (bugs could result in more memory used than configuration should allow)
Testing: Added tests; all existing tests pass
Docs Changes: None; internal only change
Release Notes: added
Platform Specific Features: None
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Deprecated:]

Signed-off-by: Greg Greenway <ggreenway@apple.com>
@ggreenway
Copy link
Contributor Author

This still needs tests written, but I wanted to get some feedback on the design first.

Copy link
Contributor

@antoniovicente antoniovicente left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some top level comments. Let me know how I can help further improve the buffer API and its use in transports/etc.

source/common/network/raw_buffer_socket.cc Outdated Show resolved Hide resolved
source/common/network/raw_buffer_socket.cc Show resolved Hide resolved
source/extensions/transport_sockets/tls/ssl_socket.cc Outdated Show resolved Hide resolved
include/envoy/buffer/buffer.h Show resolved Hide resolved
@ggreenway
Copy link
Contributor Author

@antoniovicente PTAL. This isn't tested yet, but I think this will keep us only going over the high watermark by up to 16k (same as before). Does this seem like the right approach?

Signed-off-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Greg Greenway <ggreenway@apple.com>
Copy link
Contributor

@antoniovicente antoniovicente left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for not looking at this earlier. I think you're hitting some weird edge cases that are not fully covered by existing e2e or performance tests.

source/common/buffer/watermark_buffer.h Outdated Show resolved Hide resolved
source/common/buffer/watermark_buffer.cc Show resolved Hide resolved
source/common/buffer/watermark_buffer.cc Outdated Show resolved Hide resolved
source/extensions/transport_sockets/tls/ssl_socket.cc Outdated Show resolved Hide resolved
Signed-off-by: Greg Greenway <ggreenway@apple.com>
limit it

Signed-off-by: Greg Greenway <ggreenway@apple.com>
@ggreenway
Copy link
Contributor Author

@antoniovicente I'm working on merging in the changes around Slice layout into this PR. One oddity is what to do with Reservation::owned_slices_. There's no longer a type in buffer.h to put in here. Given that the type isn't known in this context, should it just be a void*, plus function to properly free it? Or declare a base type (with no members) in buffer.h to have Slice inherit from, just to put something better than void* here? In either case, actually free'ing the held pointer is delegated to something setup in reserve().

@antoniovicente
Copy link
Contributor

antoniovicente commented Dec 10, 2020

@antoniovicente I'm working on merging in the changes around Slice layout into this PR. One oddity is what to do with Reservation::owned_slices_. There's no longer a type in buffer.h to put in here. Given that the type isn't known in this context, should it just be a void*, plus function to properly free it? Or declare a base type (with no members) in buffer.h to have Slice inherit from, just to put something better than void* here? In either case, actually free'ing the held pointer is delegated to something setup in reserve().

I think that the replacement for SliceDataPtr in the original PR is std::unique_ptr<Slice>

@ggreenway
Copy link
Contributor Author

I think that the replacement for SliceDataPtr in the original PR is std::unique_ptr<Slice>

The problem is that Slice is defined in buffer_impl.h, but Reservation is defined in buffer.h. Reservation is part of the interface, and Slice is now strictly part of the implementation.

@antoniovicente
Copy link
Contributor

I think that the replacement for SliceDataPtr in the original PR is std::unique_ptr<Slice>

The problem is that Slice is defined in buffer_impl.h, but Reservation is defined in buffer.h. Reservation is part of the interface, and Slice is now strictly part of the implementation.

Hmmm. Ok, will think about it and get back to you.

@ggreenway
Copy link
Contributor Author

It could hold a std::unique_ptr<SliceData>, but then I'd have to allocate a SliceDataImpl for each slice. I'd prefer to avoid the extra temporary allocation.

Signed-off-by: Greg Greenway <ggreenway@apple.com>
@antoniovicente
Copy link
Contributor

There are few options that come to mind:

  1. Have the buffer own the slices in the reservation instead of having the reservation own them. This is similar to what the old reserve implementation does.
  2. Have reserve return ReservationPtr and allow OwnedImpl to implement its own override of the Reservation mechanism
  3. Have the Reservation store a pointer to a cleanup function that should be run if the Reservation is not committed to the buffer.
  4. Have reservations hold a parallel array of std::unique_ptr<uint8_t> for the storage associated with the raw slices. Commit would need to know how to create OwnedImpl::Slice from a uint8_t[], data size and capacity.
  5. Similar to the previous option, but have the reservation be single region only.

I can see option 4 being attractive. Option 1 would also be efficient and not compromise much on ease of use.

It may be useful to have a buffer AP method
void Buffer::add(std::unique_ptr<uint8_t> data, uint64_t data_start, uint64_t data_end, uint64_t capacity) that allows the buffer to take ownership of externally allocated memory regions.

Signed-off-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Greg Greenway <ggreenway@apple.com>
instead of 1 block per slice

Signed-off-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Greg Greenway <ggreenway@apple.com>
@ggreenway
Copy link
Contributor Author

Moving freelist conversation here (from #14054 (comment)) because github buries the current one everytime I load the page.

Current benchmarks:
with freelist

----------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations
----------------------------------------------------------------------------
bufferReserveCommit/131072               799 ns          799 ns     52535503
bufferReserveCommitPartial/131072        289 ns          289 ns    145588343

without freelist

----------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations
----------------------------------------------------------------------------
bufferReserveCommit/131072               750 ns          750 ns     55794513
bufferReserveCommitPartial/131072        407 ns          407 ns    103863598

@ggreenway
Copy link
Contributor Author

I have mixed feelings about the freelist. In an ideal world, it would be better to let malloc take care of this, and it shows a slight perf degradation when all the slices get used by the read, but it is quite a bit faster for small reads.

Signed-off-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Greg Greenway <ggreenway@apple.com>
Copy link
Contributor

@antoniovicente antoniovicente left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving freelist conversation here (from #14054 (comment)) because github buries the current one everytime I load the page.

Current benchmarks:
with freelist

----------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations
----------------------------------------------------------------------------
bufferReserveCommit/131072               799 ns          799 ns     52535503
bufferReserveCommitPartial/131072        289 ns          289 ns    145588343

without freelist

----------------------------------------------------------------------------
Benchmark                                  Time             CPU   Iterations
----------------------------------------------------------------------------
bufferReserveCommit/131072               750 ns          750 ns     55794513
bufferReserveCommitPartial/131072        407 ns          407 ns    103863598

I think these results show a clear benefit from freelisting. Thanks for bearing along until we got there. I think that the slight extra cost in the case where the freelist is fully drained each time is fine given that it provides a benefit in the more common partial commit case.

Give me a bit to look over the changes in 9f3a717

reservation_slices.push_back(slice.reserve(size));
slices_owner->owned_slices_.emplace_back(std::move(slice));
bytes_remaining -= std::min<uint64_t>(reservation_slices.back().len_, bytes_remaining);
reserved += reservation_slices.back().len_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't reservation_slices.back().len_ == size in the previous 2 statements? accessing the slice size via .back().len_ probably isn't free.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think right now they are equal, but the API leaves room for Slice::reserve() to return a different size. But I'll stop getting via back().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me. I think that in this case len_ == size since size is passed in to the constructor in the previous line.

Signed-off-by: Greg Greenway <ggreenway@apple.com>
@antoniovicente
Copy link
Contributor

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #14054 (comment) was created by @antoniovicente.

see: more, trace.

antoniovicente
antoniovicente previously approved these changes Jan 27, 2021
Copy link
Contributor

@antoniovicente antoniovicente left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for this optimization.

test/common/buffer/owned_impl_test.cc Show resolved Hide resolved
slices[i].len_ = 0;
}
buf.commit(slices, allocated_slices);
{ auto reservation = buf.reserveSingleSlice(1280); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reservation.commit(0); may be interesting, although I know is a no-op.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this is a regression test, I don't want to change it more than I have to. But I added a commit(0) test earlier (line 916 of this file).

reservation_slices.push_back(slice.reserve(size));
slices_owner->owned_slices_.emplace_back(std::move(slice));
bytes_remaining -= std::min<uint64_t>(reservation_slices.back().len_, bytes_remaining);
reserved += reservation_slices.back().len_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me. I think that in this case len_ == size since size is passed in to the constructor in the previous line.

antoniovicente
antoniovicente previously approved these changes Jan 27, 2021
This test passed with default TCMalloc, but only because of
implementation details of how TCMalloc worked, and it failed on ASAN
and Windows.

The reserve-single API no longer uses the free-list (unlike previous
revisions of this PR), and unused slices are no longer stored in
OwnedImpl as empty slices (unlike the code before this PR).

Therefore, there are no guarantees about which specific slice memory
is used between un-committed reservations; the result is determined by
the malloc implementation, and there is no good reason to write a test
for what that behavior may be.

Signed-off-by: Greg Greenway <ggreenway@apple.com>
@ggreenway
Copy link
Contributor Author

@antoniovicente there was a real test failure on both ASAN and Windows. Slightly different failure on the two, but both for the same reason. I deleted the offending test; explanation in commit message.

@ggreenway
Copy link
Contributor Author

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #14054 (comment) was created by @ggreenway.

see: more, trace.

Copy link
Contributor

@antoniovicente antoniovicente left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for removing test that attempted to exercise undefined behavior

@ggreenway
Copy link
Contributor Author

Mac CI passed all tests; some cleanup task of the CI job appears to have timed out.

@ggreenway ggreenway merged commit 241a955 into envoyproxy:main Feb 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants