Reduce memory allocations for seriesChunksSet.series and its chunks slices #3739
Conversation
I've tested this PR in a dev cluster. How to read the results:

Scenario 1: few large requests (about 50-100k series per request)

As a counter proof, here is a profiling comparison for the whole test range. Look at the total memory allocated by

Scenario 2: many very small requests (about 10 series per request)

Conclusion

Not every metric has the same weight. The test cluster is deployed in read-write mode, so the backend deployment runs other components too (be aware the compactor was not running during the test). I think the most reliable metrics here are allocations/sec and the profile data (even if the profile is sampled, so it's not a 100% representation of reality). My conclusion is that, as expected, this optimization can reduce memory allocations (bytes) by about 10% when the store-gateway is under heavy load from a few heavy requests, while it's not significant for small requests.
// If it's releasable then we try to reuse a slice from the pool.
if seriesReleasable {
	if reused := seriesEntrySlicePool.Get(); reused != nil {
		prealloc = *(reused.(*[]seriesEntry))
Micro-optimization, feel free to ignore, but I think it's worth mentioning: we're optimizing allocations here, but we're also throwing away a pointer, and then the reused slice escapes again:

reuse := b.series[:0]
seriesEntrySlicePool.Put(&reuse)

What if we keep the reused pointer in seriesChunkSet and put it back to the pool later, after overwriting the value, like:

// here:
b.releasableSeries = reused.(*[]seriesEntry)
prealloc = *(b.releasableSeries)

// ... later in release():
*b.releasableSeries = b.series[:0] // TODO: unless the pointer is nil
seriesEntrySlicePool.Put(b.releasableSeries)
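For clarity, a self-contained sketch of the pattern being suggested here, using a plain sync.Pool and stub types (the real pool type and the seriesEntry fields are assumptions):

```go
package example

import "sync"

// seriesEntry is a stand-in for the real type; fields omitted.
type seriesEntry struct{}

// The pool stores *[]seriesEntry so that Put() doesn't have to heap-allocate a
// new slice header for every returned slice.
var seriesEntrySlicePool sync.Pool

type seriesChunksSet struct {
	series           []seriesEntry
	releasableSeries *[]seriesEntry // pooled pointer kept around for release()
}

func newSeriesChunksSet(seriesCapacity int) *seriesChunksSet {
	s := &seriesChunksSet{}
	if reused, ok := seriesEntrySlicePool.Get().(*[]seriesEntry); ok && cap(*reused) >= seriesCapacity {
		s.releasableSeries = reused
		s.series = (*reused)[:0]
	} else {
		s.series = make([]seriesEntry, 0, seriesCapacity)
	}
	return s
}

func (s *seriesChunksSet) release() {
	if s.releasableSeries == nil {
		// The slice was freshly allocated: wrap it once so it can be pooled from now on.
		reuse := s.series[:0]
		s.releasableSeries = &reuse
	}
	*s.releasableSeries = s.series[:0]
	seriesEntrySlicePool.Put(s.releasableSeries)
	s.series = nil
	s.releasableSeries = nil
}
```

The key point is that the same heap-allocated pointer travels from Get() to Put(), instead of a new one escaping on every release.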
pkg/storegateway/series_chunks.go
Outdated
for c := range b.series[i].chks {
	b.series[i].chks[c].MinTime = 0
	b.series[i].chks[c].MaxTime = 0
	b.series[i].chks[c].Raw = nil
Are we reusing .chks? I can see them being set to nil below, and if so, why bother zeroing them here? Is that something that chunkReleaser does? In that case it would be worth a comment, because from this point of view the code looks useless.
My understanding is that we zero each chunk so that the slice we put back into seriesChunksSlicePool can be used later without problems.

If that's correct, how about we change pool.SlabPool to:

type Resetable interface {
	Reset()
}

type SlabPool[T Resetable] struct {
	delegate Interface
	slabSize int
	slabs    []*[]T
}

and also change SlabPool.Release to:

func (b *SlabPool[T]) Release() {
	for _, slab := range b.slabs {
		for i := range *slab {
			(*slab)[i].Reset()
		}
		// The slab length will be reset to 0 in the Get().
		b.delegate.Put(slab)
	}
	b.slabs = b.slabs[:0]
}

storepb.AggrChunk already has Reset() implemented by protogen.
Although, if we go with my suggestion, using SlabPool instead of BatchBytes will be more difficult, because we don't want to call .Reset() on each byte individually.
Okay, I can see how it's reused in b.seriesChunksPool.Release(), my fault. I would bring this zeroing code closer to the place where it's reused. I like your suggestion too, but I'm not sure we want to force the items to be resettable.
I don't have a big preference. I simplified the reset code in 59937e9 and I think that's just fine. I wouldn't add further indirection.
@@ -176,3 +176,71 @@ func (b *BatchBytes) Get(sz int) ([]byte, error) {
	*slab = (*slab)[:len(*slab)+sz]
	return (*slab)[len(*slab)-sz : len(*slab) : len(*slab)], nil
}

// SlabPool is NOT concurrency safe.
type SlabPool[T any] struct {
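For context, here is a minimal sketch of how a slab-style Get can work. It assumes the slabs []*[]T and slabSize fields from the suggestion above, and it allocates new slabs directly instead of reusing them from a delegate pool; it illustrates the technique, not the actual Mimir implementation:

```go
// Get returns a slice of length size, carved out of the current slab. When the
// current slab doesn't have enough spare capacity, a new one is started.
func (b *SlabPool[T]) Get(size int) []T {
	var slab *[]T
	if len(b.slabs) > 0 {
		slab = b.slabs[len(b.slabs)-1]
	}

	// Start a new slab if there's no room left in the current one.
	if slab == nil || cap(*slab)-len(*slab) < size {
		slabSize := b.slabSize
		if size > slabSize {
			slabSize = size
		}
		newSlab := make([]T, 0, slabSize)
		slab = &newSlab
		b.slabs = append(b.slabs, slab)
	}

	// Extend the slab and return its tail, with the capacity capped so callers
	// can't append into the region handed out by the next Get().
	*slab = (*slab)[:len(*slab)+size]
	return (*slab)[len(*slab)-size : len(*slab) : len(*slab)]
}
```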
Can we write a comment explaining what SlabPool does? You can ask ChatGPT if you want :D
This was a very difficult request. I attempted it in a114772.
That's pretty cool :) I really like SlabPool. I left some suggestions and questions, but overall LGTM
// The capacity MUST be guaranteed. If it's smaller, then we forget it and it will be
// reallocated.
if cap(prealloc) < seriesCapacity {
	prealloc = nil
Should we return it to the pool instead of throwing it away?
I thought about it and I have mixed feelings. Putting it back to the pool means the next time we get it from the pool it may still be smaller than seriesCapacity, so we're losing an opportunity to reuse another slice which could be big enough.

For the specific use case of this PR, this should never happen because our batch size is currently fixed.
The batch size for seriesChunksSet may change when we do memory quotas. Hope we don't forget about this.
> hope we don't forget about this

Agree. When we add quotas, we'll have to revisit all pooling. My suggestion is to use bucketed batch sizes (e.g. batch sizes rounded to a fixed and low number of buckets, with a pool per bucket), but we can discuss it once we're there.
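To make the bucketed-batch-sizes idea a bit more concrete, a rough sketch (the bucket boundaries and the plain sync.Pool usage are illustrative assumptions, not an agreed design):

```go
package example

import "sync"

// seriesEntry is a stand-in for the real type.
type seriesEntry struct{}

// bucketSizes are hypothetical capacities that requested batch sizes get rounded
// up to. Keeping the number of buckets small keeps pooled slices interchangeable.
var bucketSizes = []int{1_000, 5_000, 10_000, 50_000}

// bucketedPools holds one pool per bucket size.
var bucketedPools = func() []*sync.Pool {
	pools := make([]*sync.Pool, len(bucketSizes))
	for i := range pools {
		pools[i] = &sync.Pool{}
	}
	return pools
}()

// getSeriesSlice returns a slice with capacity >= size from the smallest bucket
// that fits, together with the bucket index needed to return it later.
func getSeriesSlice(size int) ([]seriesEntry, int) {
	for i, bucketSize := range bucketSizes {
		if size <= bucketSize {
			if reused, ok := bucketedPools[i].Get().(*[]seriesEntry); ok {
				return (*reused)[:0], i
			}
			return make([]seriesEntry, 0, bucketSize), i
		}
	}
	// Larger than the biggest bucket: allocate without pooling.
	return make([]seriesEntry, 0, size), -1
}

// putSeriesSlice returns the slice to the bucket it was taken from.
func putSeriesSlice(s []seriesEntry, bucket int) {
	if bucket < 0 {
		return
	}
	s = s[:0]
	bucketedPools[bucket].Put(&s)
}
```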
// If it's releasable then we try to reuse a slice from the pool.
if seriesReleasable {
	if reused := seriesEntrySlicePool.Get(); reused != nil {
A thought more than a suggestion: I liked how in other places we inject the pool, like the chunk bytes pool for the chunks' Raw data. It would make tests less dependent on each other (if you write one bad test you can fail a lot of them), slightly more consistent, and a bit clearer that we're managing our memory.

I don't have experience with memory management, so if it's generally better to make this pooling as invisible as possible, then your approach is better.
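A rough sketch of what that injection could look like (the interface and constructor names here are hypothetical, not the actual code):

```go
// A hypothetical shape for injecting the pool: a *sync.Pool already satisfies
// this interface, while tests can pass a tracking implementation instead.
type seriesEntrySlicePoolInterface interface {
	Get() any
	Put(x any)
}

type seriesEntry struct{} // stand-in for the real type

type seriesChunksSet struct {
	series []seriesEntry
	pool   seriesEntrySlicePoolInterface
}

// newSeriesChunksSet receives the pool instead of reaching for a package-level variable.
func newSeriesChunksSet(seriesCapacity int, pool seriesEntrySlicePoolInterface) *seriesChunksSet {
	var prealloc []seriesEntry
	if reused, ok := pool.Get().(*[]seriesEntry); ok && cap(*reused) >= seriesCapacity {
		prealloc = (*reused)[:0]
	} else {
		prealloc = make([]seriesEntry, 0, seriesCapacity)
	}
	return &seriesChunksSet{series: prealloc, pool: pool}
}
```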
seriesChunksSet uses two pools, and passing them around is a bit annoying. I would do it only if there's a real justification. So far I've been able to test pooling in a clean way anyway.
No performance difference:

name                                                  old time/op    new time/op    delta
LoadingSeriesChunksSetIterator/batch_size:_10-12         819µs ± 2%     802µs ± 1%    ~    (p=0.221 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_100-12       4.16ms ± 1%    4.10ms ± 3%    ~    (p=0.392 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_1000-12      37.6ms ± 2%    37.3ms ± 3%    ~    (p=0.769 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_10000-12      372ms ± 2%     371ms ± 5%    ~    (p=0.880 n=3+3)

name                                                  old alloc/op   new alloc/op   delta
LoadingSeriesChunksSetIterator/batch_size:_10-12         160kB ± 0%     160kB ± 0%    ~    (p=0.693 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_100-12        160kB ± 0%     160kB ± 0%    ~    (p=0.699 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_1000-12       208kB ± 2%     206kB ± 0%    ~    (p=0.362 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_10000-12     2.79MB ± 0%    2.79MB ± 0%    ~    (p=0.473 n=3+3)

name                                                  old allocs/op  new allocs/op  delta
LoadingSeriesChunksSetIterator/batch_size:_10-12         2.40k ± 0%     2.40k ± 0%    ~    (zero variance)
LoadingSeriesChunksSetIterator/batch_size:_100-12        2.40k ± 0%     2.40k ± 0%    ~    (zero variance)
LoadingSeriesChunksSetIterator/batch_size:_1000-12       2.81k ± 0%     2.81k ± 0%    ~    (p=0.423 n=3+3)
LoadingSeriesChunksSetIterator/batch_size:_10000-12      3.18k ± 0%     3.17k ± 0%    ~    (p=0.426 n=3+3)
LGTM, thank you!
What this PR does
Looking at streaming store-gateway profiles, I can see about 10% of memory allocated by loadingSeriesChunksSetIterator.Next(). There are two main memory allocations there:

- make([]seriesEntry, nextUnloaded.len())
- make([]storepb.AggrChunk, len(s.chunks))
In this PR I propose to use a memory pool for both of them (separate pools). To increase the chances that the slice capacity is large enough to be reused, I've adopted the following strategies (illustrated in the sketch below):

- []seriesEntry: allocate it based on the batch size (all real batches will have that size, except the last one)
- []storepb.AggrChunk: use a slab (here the number of chunks is not predictable, so we can't pick a good default)

I've introduced pool.SlabPool, which is like pool.BatchBytes but works with generics and can never return an error (so it's easier to use). If this PR is accepted, and once we feel comfortable with pool.SlabPool, I will replace pool.BatchBytes with pool.SlabPool.
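To show how the two strategies fit together, a small self-contained sketch (the types and the tiny slab below are simplified stand-ins, not the actual pool.SlabPool API):

```go
package main

import "fmt"

// AggrChunk and seriesEntry are simplified stand-ins for the real types.
type AggrChunk struct{ MinTime, MaxTime int64 }

type seriesEntry struct{ chks []AggrChunk }

// chunkSlab is a tiny slab allocator: per-series chunk slices are carved out of
// a few large backing arrays instead of doing one make() per series.
type chunkSlab struct {
	slabSize int
	slabs    [][]AggrChunk
}

func (p *chunkSlab) get(n int) []AggrChunk {
	// Start a new slab if there is none yet or the last one doesn't have room.
	if len(p.slabs) == 0 || cap(p.slabs[len(p.slabs)-1])-len(p.slabs[len(p.slabs)-1]) < n {
		size := p.slabSize
		if n > size {
			size = n
		}
		p.slabs = append(p.slabs, make([]AggrChunk, 0, size))
	}

	// Extend the last slab and hand out its tail.
	slab := &p.slabs[len(p.slabs)-1]
	*slab = (*slab)[:len(*slab)+n]
	return (*slab)[len(*slab)-n : len(*slab) : len(*slab)]
}

func main() {
	const batchSize = 4
	chunkCounts := []int{3, 1, 5, 2} // number of chunks per series in this batch

	// Strategy 1: the series slice is sized from the (fixed) batch size.
	series := make([]seriesEntry, 0, batchSize)

	// Strategy 2: per-series chunk slices come from a slab.
	slab := &chunkSlab{slabSize: 8}
	for _, n := range chunkCounts {
		series = append(series, seriesEntry{chks: slab.get(n)})
	}

	fmt.Println(len(series), len(slab.slabs)) // prints: 4 2
}
```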
.Benchmarks:
Notes:

I don't fully trust the worse performance reported by BenchmarkBucket_Series. Why? Because pooling is not effective if GC triggers continuously, and BenchmarkBucket_Series is even worse than the production scenario: it runs both the store-gateway client and server, so a lot of memory is used to unmarshal the response and keep it in memory for assertions (which a real store-gateway server doesn't do). For this reason, I will test it in a real Mimir cluster too.

Which issue(s) this PR fixes or relates to
N/A
Checklist

- CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]