
Add experimental support to force compact TSDB head when ingester in-memory series are above a threshold #5371

Merged
pracucci merged 25 commits into main from compact-tsdb-head-before-reaching-max-series-limit on Jun 30, 2023

Conversation

@pracucci pracucci (Collaborator) commented Jun 28, 2023

What this PR does

In this PR I propose adding experimental support to force-compact the per-tenant TSDB Head when the ingester's in-memory series are above a configured threshold and the ingester estimates that compacting the head up until "now - active series idle timeout" can significantly reduce the number of in-memory series.

How it works:

  • When enabled, the ingester checks whether its in-memory series are above the configured threshold every -blocks-storage.tsdb.head-compaction-interval.
  • When above the threshold, for each per-tenant TSDB the ingester compares the in-memory series against the active series. If the percentage of inactive series (relative to the total in-memory series in the ingester) is above a configured threshold, it forces the compaction of that per-tenant TSDB (see the sketch after this list).
  • While forcing the compaction, the ingester keeps accepting requests containing samples newer than "now - active series idle timeout", while it rejects requests containing older samples. Once the compaction is completed, the TSDB Head "min time" is set to "now - active series idle timeout", so the in-order head will only accept samples newer than that. If OOO ingestion is enabled, older samples can still be ingested via the OOO Head.
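
To make the trigger logic above concrete, here is a minimal Go sketch of the per-tenant decision. It is illustrative only, not the actual Mimir implementation: the tenantTSDB type, the tsdbsToForceCompact and forcedCompactionMaxTime functions, and the seriesThreshold / minReductionPercentage parameters are hypothetical names standing in for the real types and configuration options.

```go
package ingestersketch

import "time"

// tenantTSDB is a hypothetical, simplified view of a per-tenant TSDB.
type tenantTSDB struct {
	inMemorySeries uint64 // series currently held in the TSDB head
	activeSeries   uint64 // series that received samples within the idle timeout
}

// tsdbsToForceCompact returns the tenants whose head should be force-compacted,
// following the logic described above: only act when the ingester as a whole is
// above the series threshold, and only pick tenants whose estimated reduction is
// significant relative to the ingester-wide total.
func tsdbsToForceCompact(
	tsdbs map[string]tenantTSDB,
	totalInMemorySeries uint64, // sum over all tenants in this ingester
	seriesThreshold uint64, // hypothetical in-memory series threshold
	minReductionPercentage float64, // the configured minimum estimated reduction %
) []string {
	if totalInMemorySeries < seriesThreshold {
		return nil // The ingester is below the threshold: nothing to do.
	}

	var out []string
	for tenantID, db := range tsdbs {
		if db.activeSeries >= db.inMemorySeries {
			continue // No inactive series to drop.
		}
		// Estimated reduction: series in memory that are no longer active.
		estimatedReduction := db.inMemorySeries - db.activeSeries

		// Compare against the ingester-wide total, so compaction is only forced
		// when the overall benefit is significant.
		if float64(estimatedReduction)/float64(totalInMemorySeries)*100 >= minReductionPercentage {
			out = append(out, tenantID)
		}
	}
	return out
}

// forcedCompactionMaxTime returns the upper bound of the forced compaction:
// everything older than "now - active series idle timeout" gets compacted,
// so recently-active series stay in the head.
func forcedCompactionMaxTime(idleTimeout time.Duration) int64 {
	return time.Now().Add(-idleTimeout).UnixMilli()
}
```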

Notes about forced compaction:

  • I haven't added a configuration option to cap the maximum frequency of forced compactions because, in practice, it should be limited by:
    • -blocks-storage.tsdb.head-compaction-interval
    • -blocks-storage.tsdb.forced-head-compaction-min-estimated-series-reduction-percentage

Notes about TSDB:

  • Talking to @codesome, he mentioned that he didn't remember whether appending to a chunk while it's being compacted is safe (assuming we append a sample with a timestamp greater than the head compaction range). I checked the compaction logic and, from the head chunks reading perspective, it shouldn't be different from any other query (at least I haven't found any proof of the opposite yet).

Which issue(s) this PR fixes or relates to

N/A

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@pracucci pracucci force-pushed the compact-tsdb-head-before-reaching-max-series-limit branch 3 times, most recently from 0184c4e to 953dce6 on June 28, 2023 12:13
@pstibrany pstibrany (Member) left a comment

I don't see any major blockers in the PR, changes make sense to me, and it's worth giving it a try.

pkg/ingester/user_tsdb.go (outdated, resolved)
return u.db.CompactOOOHead()
}

// nextForcedHeadCompactionRange computes the next TSDB head range to compact when a forced compaction
// is triggered. If the returned isValid is false, then the returned range should not be compacted.
func nextForcedHeadCompactionRange(blockDuration, headMinTime, headMaxTime, forcedMaxTime int64) (minTime, maxTime int64, isValid, isLast bool) {
Member

nit: isLast bool looks like an unnecessary optimization to save one call to the function, and it can be wrong (if forcedMaxTime is aligned to a block boundary). I'd drop it.

Collaborator Author

It's not an optimization. The problem is that when compacting a partial head (which is the case in this PR), without that flag we may end up in an infinite loop.

Member

Since no pushes are allowed that would modify maxTime, why would we end up in an infinite loop? Are you talking about the case when minTime == maxTime?
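
For illustration, here is a purely hypothetical sketch of block-aligned range splitting with a last-range flag (this is not the actual nextForcedHeadCompactionRange implementation); it only shows how the final range can be capped at forcedMaxTime and flagged so a caller knows when to stop iterating, even when forcedMaxTime is not aligned to a block boundary.

```go
// Illustrative only: split the head into block-duration-aligned ranges, capping
// the final range at forcedMaxTime. The isLast flag lets the caller stop
// iterating after the capped range. Assumes positive millisecond timestamps.
func nextRangeSketch(blockDuration, headMinTime, forcedMaxTime int64) (minTime, maxTime int64, isValid, isLast bool) {
	if headMinTime >= forcedMaxTime {
		return 0, 0, false, true // Nothing left to compact up to forcedMaxTime.
	}
	minTime = headMinTime
	maxTime = (minTime/blockDuration + 1) * blockDuration // next block boundary after minTime
	if maxTime >= forcedMaxTime {
		maxTime = forcedMaxTime
		isLast = true
	}
	return minTime, maxTime, true, isLast
}
```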

pkg/ingester/ingester.go (resolved)
@pracucci pracucci force-pushed the compact-tsdb-head-before-reaching-max-series-limit branch from 953dce6 to 692dc45 on June 29, 2023 08:51
@pracucci (Collaborator Author)

No good. Just got a panic in CI. Will investigate:

panic: corruption in head chunk file /tmp/TestIngester_compactBlocksToReduceInMemorySeries_ConcurrencyRun_31016170653/001/1/chunks_head/000000: head chunk file index 0 does not exist on disk

goroutine 1224 [running]:
github.com/prometheus/prometheus/tsdb.(*memSeries).chunk(0xc0021ba0d0, 0x0, {0x3d2cb40, 0xc0000da2d0}, 0xc000fd4c02?)
	/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/head_read.go:393 +0x4f0
github.com/prometheus/prometheus/tsdb.(*headChunkReader).chunk(0xc0039ac020, {0x2e000000, {0x0, 0x0}, 0x189065c0425, 0x189065c049c, 0x0, 0x0, 0x0}, 0x0)
	/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/head_read.go:329 +0x1c5
github.com/prometheus/prometheus/tsdb.(*headChunkReader).Chunk(0xc00102c520?, {0x2e000000, {0x0, 0x0}, 0x189065c0425, 0x189065c049c, 0x0, 0x0, 0x0})
	/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/head_read.go:306 +0x105
github.com/prometheus/prometheus/tsdb.(*populateWithDelGenericSeriesIterator).next(0xc000afeb40, 0x1)
	/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/querier.go:601 +0x778
github.com/prometheus/prometheus/tsdb.(*populateWithDelChunkSeriesIterator).Next(0xc000afeb40)
	/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/querier.go:736 +0x52
github.com/prometheus/prometheus/tsdb.DefaultBlockPopulator.PopulateBlock({}, {0x3d1eb48, 0xc000601450}, 0xc000419a40, {0x3d087a0, 0xc000601360}, {0x3d12b80, 0xc003e89200}, 0xc003f3a030?, {0x1, ...}, ...)
	/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/compact.go:1095 +0x2166
github.com/prometheus/prometheus/tsdb.(*LeveledCompactor).write(0xc003e89680, {0xc000102360, 0x56}, {0xc001db7360?, 0x1, 0x1}, {0x3d0e200, 0x4c1a018}, {0xc00029b870, 0x1, ...})
	/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/compact.go:817 +0x1083
github.com/prometheus/prometheus/tsdb.(*LeveledCompactor).Write(0xc003e89680, {0xc000102360, 0x56}, {0x3d21880?, 0xc001cb6520}, 0x189065c0457, 0x189065c046f, 0x0)
	/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/compact.go:694 +0x4e7
github.com/prometheus/prometheus/tsdb.(*DB).compactHead(0xc0006b9100, 0xc001cb6520)
	/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/db.go:1290 +0x115
github.com/prometheus/prometheus/tsdb.(*DB).CompactHead(0xc0006b9100, 0xc001cb6520)
	/__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/db.go:1181 +0xc5
github.com/grafana/mimir/pkg/ingester.(*userTSDB).compactHead(0xc0018be2a0, 0x6ddd00, 0x189065c046e)
	/__w/mimir/mimir/pkg/ingester/user_tsdb.go:220 +0x45f
github.com/grafana/mimir/pkg/ingester.(*Ingester).compactBlocks.func1({0x0?, 0x0?}, {0x3cfb6e0, 0x1})
	/__w/mimir/mimir/pkg/ingester/ingester.go:2470 +0x317
github.com/grafana/dskit/concurrency.ForEachUser.func1()
	/__w/mimir/mimir/vendor/github.com/grafana/dskit/concurrency/runner.go:45 +0x176
created by github.com/grafana/dskit/concurrency.ForEachUser
	/__w/mimir/mimir/vendor/github.com/grafana/dskit/concurrency/runner.go:36 +0x1e5

@pracucci (Collaborator Author)

Another panic observed locally, while investigating the previous one:

panic: expected newly cut file to have sequence:offset 1:8, got 2:8

goroutine 1295 [running]:
github.com/prometheus/prometheus/tsdb.handleChunkWriteError({0x2e20ea0?, 0xc000e698f0?})
	/Users/marco/workspace/src/github.com/grafana/mimir/vendor/github.com/prometheus/prometheus/tsdb/head_append.go:1518 +0x76
github.com/prometheus/prometheus/tsdb/chunks.(*chunkWriteQueue).processJob(0xc000b706e0, {0x1, 0x7, 0x18906b0f1ec, 0x18906b0f263, {0x2e3ca68, 0xc002ed4f00}, 0x100000008, 0x0, 0x2a4e6f0})
	/Users/marco/workspace/src/github.com/grafana/mimir/vendor/github.com/prometheus/prometheus/tsdb/chunks/chunk_write_queue.go:150 +0x8c
github.com/prometheus/prometheus/tsdb/chunks.(*chunkWriteQueue).start.func1()
	/Users/marco/workspace/src/github.com/grafana/mimir/vendor/github.com/prometheus/prometheus/tsdb/chunks/chunk_write_queue.go:138 +0xc9
created by github.com/prometheus/prometheus/tsdb/chunks.(*chunkWriteQueue).start
	/Users/marco/workspace/src/github.com/grafana/mimir/vendor/github.com/prometheus/prometheus/tsdb/chunks/chunk_write_queue.go:119 +0x6d

@pracucci (Collaborator Author)

panic: expected newly cut file to have sequence:offset 1:8, got 2:8

I did some investigation and this panic shouldn't happen in the real world. It's triggered by an edge case in the unit test, which should be fixed by prometheus/prometheus#12500.

panic: corruption in head chunk file /tmp/TestIngester_compactBlocksToReduceInMemorySeries_ConcurrencyRun_31016170653/001/1/chunks_head/000000: head chunk file index 0 does not exist on disk

This should also be fixed by prometheus/prometheus#12500 and, again, shouldn't happen in practice in the real world, but I want to investigate this panic further tomorrow to be more confident.

@pracucci pracucci force-pushed the compact-tsdb-head-before-reaching-max-series-limit branch from 3a56d65 to 0fea21d on June 30, 2023 03:48
@pracucci pracucci marked this pull request as ready for review June 30, 2023 04:09
@pracucci pracucci requested review from a team as code owners June 30, 2023 04:09
@pstibrany pstibrany (Member) left a comment

Great work! I've left some minor comments, but nothing blocking as far as I can see.

pkg/ingester/user_tsdb.go (outdated, resolved)
pkg/ingester/user_tsdb.go (outdated, resolved)
pkg/storage/tsdb/config.go (outdated, resolved)
pkg/ingester/user_tsdb_test.go (outdated, resolved)
pkg/ingester/user_tsdb_test.go (outdated, resolved)
pkg/ingester/ingester_forced_compaction_test.go (outdated, resolved)
pkg/ingester/ingester_forced_compaction_test.go (outdated, resolved)
pkg/storage/tsdb/config.go (outdated, resolved)
@osg-grafana osg-grafana (Contributor) left a comment

Unblocking with nits

cmd/mimir/config-descriptor.json (outdated, resolved)
cmd/mimir/help-all.txt.tmpl (outdated, resolved)
pkg/storage/tsdb/config.go (outdated, resolved)
pkg/ingester/ingester.go (outdated, resolved)
estimatedSeriesReduction := db.Head().NumSeries() - uint64(db.activeSeries.Active())

// Skip if the estimated series reduction is too low (there would be no big benefit).
if float64(estimatedSeriesReduction)/float64(numInitialMemorySeries)*100 < float64(i.cfg.BlocksStorageConfig.TSDB.EarlyHeadCompactionMinEstimatedSeriesReductionPercentage) {
Contributor

Shouldn't we do this?

Suggested change
if float64(estimatedSeriesReduction)/float64(numInitialMemorySeries)*100 < float64(i.cfg.BlocksStorageConfig.TSDB.EarlyHeadCompactionMinEstimatedSeriesReductionPercentage) {
if float64(estimatedSeriesReduction)/float64(db.Head().NumSeries())*100 < float64(i.cfg.BlocksStorageConfig.TSDB.EarlyHeadCompactionMinEstimatedSeriesReductionPercentage) {

Collaborator Author

The idea of this PR is to check it against the total number of series in the ingester, not against the per-tenant TSDB. Let's say you configure the percentage to 10%, a tenant has 100 series, 20 of which are inactive, while the ingester as a whole holds far more series. If you check against the per-tenant TSDB, it will trigger the compaction, but in practice it will not move the needle. Comparing against the total number of series in the ingester allows us to only trigger the early compaction when there is a significant opportunity.

Contributor

But then this setting will be quite useless in multi-tenant environments. If you have 10 tenants, each saving 50% of their series, you could save a total of 50% by compacting them, but with the current logic you won't trigger the compaction because each tenant only saves 5% of the total. Right?

Collaborator Author

The math is correct, but I think it's all about the percentage you configure. Maybe a lower % makes sense in multi-tenant environments. I think this is something we can refine over time, but I definitely want to keep a way to avoid compacting a TSDB when the overall impact is negligible.

Contributor

I think that having it this way will require tons of tuning, because the setting doesn't relate to the final effect. I have an ingester with a number of in-memory series, and I want to make sure that we compact heads when I can save a given amount of memory. I shouldn't need to update the percentage based on the number of tenants I have in order to achieve that.

IMO it's counter-intuitive.

OTOH, what's the problem with compacting 10 smaller Heads vs compacting one bigger one?

Collaborator Author

Discussed offline. I pushed a commit to include Oleg's suggestion, slightly modified to also compact all TSDBs with a high number of inactive series (even if below the configured %).

There's no strong consensus between the two of us on the right way to decide which TSDBs to compact. The gist is that it's difficult to have good arguments, because we'll learn it the hard way (by running it for some time).
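
For illustration, the compromise described above could look roughly like the following hypothetical sketch; shouldForceCompact and its parameters (in particular minInactiveSeries) are invented names, and this is not the code that was merged in this PR.

```go
// Hypothetical sketch: force-compact a tenant's TSDB either when its estimated
// series reduction is significant relative to the whole ingester, or when the
// tenant has a high absolute number of inactive series. Names and thresholds
// are illustrative only.
func shouldForceCompact(
	tenantInMemorySeries, tenantActiveSeries uint64,
	ingesterInMemorySeries uint64,
	minReductionPercentage float64, // relative threshold (percent of ingester series)
	minInactiveSeries uint64, // hypothetical absolute per-tenant threshold
) bool {
	if tenantActiveSeries >= tenantInMemorySeries {
		return false // No inactive series to drop.
	}
	estimatedReduction := tenantInMemorySeries - tenantActiveSeries

	// Significant relative to the ingester-wide total...
	if float64(estimatedReduction)/float64(ingesterInMemorySeries)*100 >= minReductionPercentage {
		return true
	}
	// ...or a large absolute number of inactive series for this tenant.
	return estimatedReduction >= minInactiveSeries
}
```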

@pracucci pracucci force-pushed the compact-tsdb-head-before-reaching-max-series-limit branch from 30db2d1 to 6ad09d2 on June 30, 2023 11:33
@pracucci pracucci enabled auto-merge (squash) June 30, 2023 14:57
@pracucci pracucci merged commit d855dff into main Jun 30, 2023
@pracucci pracucci deleted the compact-tsdb-head-before-reaching-max-series-limit branch June 30, 2023 15:29