Share System.IO.Pipelines.BufferSegments (make BufferSegmentStack static) across instances of Pipe #49259

Open
cobalthex opened this issue Mar 6, 2021 · 23 comments

cobalthex commented Mar 6, 2021

One of the benefits of System.IO.Pipelines.Pipe is its ability to be re-used and otherwise conserve memory through usage of ArrayPools internally.

Pipes create BufferSegments internally to store the memory segments they use, which are further stored in a ReadOnlySequence.
While the buffer segments are re-used through BufferSegmentStack, this is a unique instance on every Pipe and cannot be shared.

This creates a potential issue when using a pool of Pipes: each pipe's stack can grow ("boundlessly") and may end up holding many unused or under-used segments in a long-lived pool.

I would propose making the buffer segments responsible for owning/renting the contained array/memory, with the Pipe only responsible for renting segments.
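Roughly, the proposal would look something like this (a minimal sketch only; BufferSegment and BufferSegmentStack are internal to System.IO.Pipelines, so SharedSegment and SharedSegmentPool below are hypothetical names, not an existing API):

using System.Buffers;
using System.Collections.Concurrent;

// Hypothetical: segments own (rent/return) their backing memory, and a
// process-wide pool of segment objects is shared by all Pipe instances,
// instead of each Pipe keeping its own BufferSegmentStack.
sealed class SharedSegment
{
    public byte[]? Array { get; private set; }

    public void RentMemory(int minimumSize) =>
        Array = ArrayPool<byte>.Shared.Rent(minimumSize);

    public void ReturnMemory()
    {
        if (Array is not null)
        {
            ArrayPool<byte>.Shared.Return(Array);
            Array = null;
        }
    }
}

static class SharedSegmentPool
{
    private const int MaxPooled = 50_000; // some global cap would be needed
    private static readonly ConcurrentQueue<SharedSegment> s_segments = new();

    public static SharedSegment Rent() =>
        s_segments.TryDequeue(out var segment) ? segment : new SharedSegment();

    public static void Return(SharedSegment segment)
    {
        segment.ReturnMemory();
        if (s_segments.Count < MaxPooled)
            s_segments.Enqueue(segment);
    }
}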

@cobalthex cobalthex added the tenet-performance Performance related issue label Mar 6, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.IO.Pipelines untriaged New issue has not been triaged by the area owner labels Mar 6, 2021
@cobalthex cobalthex changed the title Share System.IO.Pipelines.BufferSegments (make BufferSegmentStack static) across instances Share System.IO.Pipelines.BufferSegments (make BufferSegmentStack static) across instances of Pipe Mar 6, 2021
@davidfowl (Member)

Each pipe's stack can grow ("boundlessly") and may end up holding many unused or under-used segments in a long-lived pool

How? There's a maximum number of segments that are pooled.

Can you show how you're using the pipe and also how you're measuring the memory usage?

cobalthex (Author) commented Mar 6, 2021

Hmm, I missed the part about restricted growth.

As for an example, it would be for use in re-usable buffers for handling socket receives:

var pipe = pipePool.Rent(); // pipePool: a hypothetical pool of Pipe instances
while (socketIsOpen)
{
  // rent writable memory backed by the pipe's internal buffer segments
  var memory = pipe.Writer.GetMemory();
  int bytesReceived = await socket.ReceiveAsync(memory, SocketFlags.None);
  pipe.Writer.Advance(bytesReceived);
  if (isEndOfMessage)
  {
    await pipe.Writer.FlushAsync();
    var reader = pipe.Reader;
    // do work here, either with reader.ReadAsync() or reader.AsStream()
    ReadResult result = await reader.ReadAsync();
    // ... parse the message from result.Buffer ...
    reader.AdvanceTo(result.Buffer.End);
  }
}
pipePool.Return(pipe); // pool resets the pipe (Pipe.Reset()) before reuse

@davidfowl (Member)

You reset the pipe during the use of the socket? Why? I'm confused on the usage pattern and savings here. What are you seeing and what are you trying to reduce?

@cobalthex (Author)

Sorry, that was pseudocode; I will update the example.

cobalthex (Author) commented Mar 6, 2021

Basically, incoming messages can vary in size, so using a pipe to take advantage of its segmenting is useful. However, when re-using these pipes as sockets open and close, it would be nice if they could share internal buffer segments so that, after running for a while, all of these re-used pipes don't each end up holding onto an internal stack of buffers, especially if most messages only need one segment.
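For context, the reuse pattern being described is roughly the following (a minimal sketch assuming a trivial hand-rolled pool; PipePool is an illustrative name, not an existing type):

using System.Collections.Concurrent;
using System.IO.Pipelines;

// Illustrative only: a trivial pool of Pipe instances shared across sockets.
sealed class PipePool
{
    private readonly ConcurrentBag<Pipe> _pipes = new();

    public Pipe Rent() => _pipes.TryTake(out var pipe) ? pipe : new Pipe();

    // Caller must have completed both Reader and Writer before returning,
    // since Pipe.Reset() requires a completed pipe.
    public void Return(Pipe pipe)
    {
        pipe.Reset();
        _pipes.Add(pipe);
    }
}

Whether this saves anything meaningful over allocating a new Pipe per socket is the question discussed in the rest of the thread.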

@davidfowl (Member)

I still don't understand what you'd be saving. Do you have some back of the napkin math? Do you have a memory profile? I'd like to understand how much you think you'd save if the segments were shared across pipes.

Presumably, if each pipe rented an array from the array pool, it wouldn't change how much memory was allocated unless your workload was constantly dropping and adding connections. If you had lots of concurrent connections, there'd be no savings...

cobalthex (Author) commented Mar 6, 2021

So this would be the case of many connections constantly being dropped/added. I cannot provide specifics unfortunately, but I was thinking that if there are (tens/hundreds of) thousands of connections coming and going, it might end up with 75% of the segments unused most of the time.

This may be a case of me wanting to over-optimize, idk

@davidfowl (Member)

I'd like to replicate the scenario so I can measure and play with changes. The more data you can provide, the better (and the better the likelihood of success). How many pipes are you pooling? I'm guessing you don't pool tens of thousands of pipes (that would be counterproductive, I think)?

I've thought about doing this in the past but haven't seen a scenario that has convinced me we should optimize for this. I could be convinced with the right evidence.

cobalthex (Author) commented Mar 6, 2021

Unfortunately I cannot provide too much detail, but there are 10k+ connections at any one time, some short-lived, some long. Each socket receive would vary from one segment to multiple (but likely usually one segment).

Maybe pooling all of those pipes is not a good idea? I am not sure; the goal is to avoid re-allocating them if they will likely be needed again (which is obviously dependent on the number of incoming sockets).

@davidfowl (Member)

Maybe pooling all of those pipes is not a good idea? I am not sure; the goal is to avoid re-allocating them if they will likely be needed again (which is obviously dependent on the number of incoming sockets).

Pooling pipes per request is fine. We pool pipes for HTTP/2 requests (we reuse the request object and reset the pipe on it). We don't pool pipes at the connection level because the benefits are less clear. You'd need to have a good idea of the socket churn, as that's the only time you'll get reuse. The pool becomes less effective if there are lots of long-lived connections, so you only benefit from avoiding allocations if your pool size matches the churn you expect.

@cobalthex (Author)

In my case, as these are continuous sockets, not one-off (HTTP) requests, it makes more sense (imo) to hold the pipe for the duration of the socket and return it to the pool after the socket closes. I agree that holding onto a very large pool might not make sense, but for a ~10k object pool the overhead seems pretty minor.
These sockets will probably live a few minutes on average; I'm not sure if you consider that long-lived, in which case maybe a pool doesn't make sense, though I'd prefer holding a pool of pointers rather than having to alloc on demand.

@davidfowl (Member)

The Pipe today is around 368 bytes * 2 = 736 bytes per socket (assuming you're using a duplex pipe and a Pipe implementation for each side of the socket). BufferSegments are each about 80 bytes and one segment maps to 4K of memory (~2% overhead). Memory is returned before the Pipe is reset so the thing you're focused on in this issue is the BufferSegment itself (that 2% overhead).

The amount of space used by the BufferSegments at any one time should be about 2% of the buffered data stored in the Pipe; that roughly depends on how you're using the pipe higher up the stack to parse messages. This currently doesn't take the minimum segment size into account (it assumes 4K, and that's a gap), but by default I'd be surprised if you saved much here.
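As a rough worked example using those numbers (back-of-the-napkin, not measured): 10,000 pooled connections each holding one segment would be about 10,000 × 80 bytes ≈ 800 KB of BufferSegment objects, sitting in front of roughly 40 MB of 4K buffers, i.e. the ~2% overhead mentioned above.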

If you can, take a memory dump from your running application and spelunk around the object graph to see what kinds of numbers you see in your scenario. That'll give me some more data to work off of.

PS: I'm currently looking at shrinking the size of the Pipe because it's massive.

cobalthex (Author) commented Mar 8, 2021

For me, it is more about the perf cost + memory churn/traffic of creating those segments.
As you say, because they are small, just holding on to a lot of them is probably not a huge deal; it just seems inefficient (in my use case at least).

@BrennanConroy (Member)

If you can, take a memory dump from your running application and spelunk around the object graph to see what kinds of numbers you see in your scenario. That'll give me some more data to work off of.

Have you had time to try this?

@cobalthex (Author)

I have not yet, I might have time today or tomorrow

cobalthex (Author) commented Mar 11, 2021

So I need to do some further testing, but some quick numbers showed roughly 1.5x (on the high end) as many allocated buffers as in-use ones. For a test where I had 1000 active connections, I had about 1200-1500 allocated buffers.

For comparison, I am actually currently working with a custom solution that shares a pool of segments across all sockets, and fetches/links new segments to a sequence per socket as needed.

In my use case (unfortunately I cannot give more details at the moment), I had an overhead of about 5-10 buffers for 1000 connections. The number of times an extra buffer is needed per concurrent receive is very low, even though all of the clients will perform the same operation, requiring multiple segments.

@davidfowl (Member)

Can you give some concrete numbers (MB sizes) or pictures from the dumps of the Pipe types?

@davidfowl (Member)

I'll look into using a per PipeOptions pool for segments

@davidfowl (Member)

I did a prototype of this and I'm not sure about it yet. The tricky parts about sharing segments across all pipes by default are the following:

  • How do we determine the maximum amount of memory this shared pool should use? I chose 5MB based on the size of a segment; that's approximately 50K segments across all pipe instances.
  • With the default 4K buffer size, it's possible for pipes to buffer a large amount of content per pipe and blow through the global pool. For example, consider a connection that explicitly buffers 1MB messages before parsing. That'll eat 256 segments. If lots of connections do this you may still end up allocating.
  • Do we share this pool across Pipe, StreamPipeReader, and StreamPipeWriter? Or do they get their own pools?

I ran a couple of tests:

  • 90K writes every 10 seconds to 10,000 pipes resulted in ~25,000 BufferSegments at steady state.
  • 4K writes every 10 seconds, 10,000 pipes resulted in ~800 BufferSegments at steady state.

In my changes, I also currently fall back to a per-Pipe stack, which gets used if you exceed 256 segments in an individual pipe (or if the global pool is empty). I added this because of the scenario above where a single pipe tries to buffer lots of segments (this number could be configurable).

Of course we don't need to be perfect here, but it's really hard to pick sizes for global pools like this because they depend so much on usage, and also because shrinking them is hard. We could also let you configure the pool size, but that's also tricky, especially with a static, so this needs more thought.
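A simplified sketch of that kind of arrangement (not the actual prototype; the types, limits, and policy below are illustrative only):

using System.Collections.Concurrent;
using System.Collections.Generic;

// Hypothetical policy: rent from a bounded global pool first, keep a small
// per-pipe stack as the fallback, and allocate when both are exhausted.
sealed class SegmentPool<T> where T : class, new()
{
    private const int GlobalLimit = 50_000;  // ~5MB of ~100-byte segment objects
    private const int PerPipeLimit = 256;    // cap on segments pooled per pipe

    private static readonly ConcurrentQueue<T> s_global = new();
    private readonly Stack<T> _perPipe = new();

    public T Rent()
    {
        if (s_global.TryDequeue(out var shared)) return shared;
        if (_perPipe.TryPop(out var local)) return local;
        return new T(); // both pools empty: allocate
    }

    public void Return(T segment)
    {
        // Prefer the per-pipe stack once the global pool is full (or, in the
        // prototype described above, once a single pipe exceeds its limit).
        if (s_global.Count < GlobalLimit) s_global.Enqueue(segment);
        else if (_perPipe.Count < PerPipeLimit) _perPipe.Push(segment);
        // else: drop the segment and let the GC reclaim it
    }
}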

@cobalthex In your scenario, what's the maximum number of concurrent connections you expect, and how much do they have buffered by default? How big are your message sizes?

@davidfowl (Member)

cc @AArnott

cobalthex (Author) commented Mar 11, 2021

I am expecting at least 10k concurrent connections, with each holding onto only one buffer. Message sizes vary; currently buffers are set to 4K, though that may change in the future (likely to shrink). There is currently a maximum receive size of 1MB, but that is likely to shrink dramatically (needs more profiling).

@cobalthex (Author)

For pooling: if it is shared, could it not behave similarly to how ArrayPool.Shared works? (I have not dug into that code) Perhaps growing as much as necessary, and performing trim if memory pressure gets high (or no longer pooling)

If the pipe/reader/writer had separate pools, wouldn't that just mean that after writing to a pipe, it dumps those segments into the reader pool when you read from the pipe (maybe I'm misunderstanding)?

Perhaps this could be like ArrayPool where you can default to Shared or pass in your own via PipeOptions as Brennan suggested? This would allow you to tune the buffer sizes, max growth/other behaviors
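For reference, PipeOptions already accepts a MemoryPool<byte> for the memory backing each segment (that governs the buffers themselves, not the BufferSegment objects discussed here); a segment-pool option, if added, would presumably sit alongside it:

using System.Buffers;
using System.IO.Pipelines;

// Existing API: the pool passed here controls where the buffers come from.
var options = new PipeOptions(
    pool: MemoryPool<byte>.Shared,
    minimumSegmentSize: 4096);

var pipe = new Pipe(options);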

@davidfowl (Member)

For pooling: if it is shared, could it not behave similarly to how ArrayPool.Shared works? (I have not dug into that code) Perhaps growing as much as necessary, and performing trim if memory pressure gets high (or no longer pooling)

It could, but it's unclear how much you're saving. The array pool is pooling big buffers, and those usually matter. This pool would only be saving roughly 100 bytes of overhead per segment. I'm not sure they are comparable...

Perhaps this could be like ArrayPool where you can default to Shared or pass in your own via PipeOptions as Brennan suggested? This would allow you to tune the buffer sizes, max growth/other behaviors

I think it could be done somewhere for sure but I'm not convinced that the benefit is there for the complexity introduced.

@BrennanConroy BrennanConroy removed the untriaged New issue has not been triaged by the area owner label Mar 26, 2021
@davidfowl davidfowl added this to the Future milestone May 10, 2021