Write and compaction stability #8384
Conversation
This pool was previously a pool.Bytes to avoid repetitive allocations. It was recently switched to a sync.Pool because pool.Bytes held onto very large buffers at times which were never released. sync.Pool is showing up in allocation profiles quite frequently. This switches the pool to a new pool that limits how many buffers are in the pool as well as the max size of each buffer in the pool. This provides better bounds on allocations.
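A minimal sketch of what such a bounded pool can look like, assuming a channel-backed free list with a fixed capacity and a per-buffer size cap (the LimitedBuffer name and its fields are illustrative, not the actual types in this PR):

```go
package pool

// LimitedBuffer is an illustrative bounded []byte pool: at most `capacity`
// buffers are retained, and buffers larger than `maxSize` are dropped so the
// pool never pins very large allocations.
type LimitedBuffer struct {
	buffers chan []byte
	maxSize int
}

// NewLimitedBuffer creates a pool that holds up to capacity buffers, each no
// larger than maxSize bytes once returned.
func NewLimitedBuffer(capacity, maxSize int) *LimitedBuffer {
	return &LimitedBuffer{
		buffers: make(chan []byte, capacity),
		maxSize: maxSize,
	}
}

// Get returns a pooled buffer with at least sz capacity, allocating a new one
// if the pool is empty or the pooled buffer is too small.
func (p *LimitedBuffer) Get(sz int) []byte {
	select {
	case b := <-p.buffers:
		if cap(b) >= sz {
			return b[:sz]
		}
	default:
	}
	return make([]byte, sz)
}

// Put returns a buffer to the pool, discarding it if it exceeds maxSize or if
// the pool is already full.
func (p *LimitedBuffer) Put(b []byte) {
	if cap(b) > p.maxSize {
		return
	}
	select {
	case p.buffers <- b:
	default:
	}
}
```

Get never blocks: if no suitable pooled buffer is available it simply allocates, and Put silently drops oversized buffers or returns when the pool is full, which is what bounds both the count and the size of retained memory.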
models/points.go
Outdated
return buf[:i-1], nil
}
return buf[:i], err
}
The comment says "ignore the error" but it gets passed back. I just want to make sure that's intentional.
Good catch, it should be returning nil here and err should be _.
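For clarity, a hypothetical sketch of the corrected return path being discussed, assuming a helper that reads a line and deliberately discards the inner error (the readLine name and bufio-based reader are stand-ins, not the actual code in models/points.go):

```go
package models

import "bufio"

// readLine is a hypothetical stand-in for the function under review: the
// error from ReadBytes is deliberately ignored (assigned to _), and nil is
// returned instead of passing that ignored error back to the caller.
func readLine(r *bufio.Reader) ([]byte, error) {
	buf, _ := r.ReadBytes('\n')
	if i := len(buf); i > 0 && buf[i-1] == '\n' {
		return buf[:i-1], nil // strip the trailing newline
	}
	return buf, nil
}
```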
Measurement name and field were converted between []byte and string repeatedly, causing lots of garbage. This switches the code to use []byte in the write path.
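A rough illustration of the pattern, assuming a hypothetical point type: keeping the measurement name as []byte end to end avoids []byte → string → []byte round trips, since map reads written as m[string(b)] do not allocate and comparisons can use bytes.Equal (the names below are illustrative, not the models API):

```go
package example

import "bytes"

// point is a hypothetical stand-in for a parsed point after this change:
// Name returns []byte rather than string.
type point struct {
	name []byte
}

func (p *point) Name() []byte { return p.name }

// lookup reads a string-keyed map with a []byte name. The compiler recognizes
// the m[string(b)] pattern for map reads, so no temporary string is allocated.
func lookup(m map[string]int, p *point) (int, bool) {
	v, ok := m[string(p.Name())]
	return v, ok
}

// isInternal compares the name without converting it to a string first.
func isInternal(p *point) bool {
	return bytes.Equal(p.Name(), []byte("_internal"))
}
```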
This changes full compactions within a shard to run sequentially instead of running all the compaction groups in parallel. Normally, there is only 1 full compaction group to run. At times, there could be several, which causes instability if they all run concurrently as they tie up a CPU for long periods of time. Level compactions are also capped to a max of 4 running concurrently for each level in a shard. This prevents sudden spikes in CPU and disk usage due to a large backlog of TSM files at a given level.
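A simplified sketch of the scheduling described above, assuming a counting semaphore built from a buffered channel; the real planner in the tsm1 engine is more involved, and the names here are illustrative:

```go
package example

import "sync"

// compactionGroup is a hypothetical stand-in for a set of TSM files to compact.
type compactionGroup []string

const maxConcurrentPerLevel = 4

// compactLevel runs the groups for one level, allowing at most
// maxConcurrentPerLevel of them to execute at the same time.
func compactLevel(groups []compactionGroup, compact func(compactionGroup)) {
	sem := make(chan struct{}, maxConcurrentPerLevel) // counting semaphore
	var wg sync.WaitGroup
	for _, g := range groups {
		wg.Add(1)
		sem <- struct{}{} // blocks once 4 compactions for this level are running
		go func(g compactionGroup) {
			defer wg.Done()
			defer func() { <-sem }()
			compact(g)
		}(g)
	}
	wg.Wait()
}

// compactFull runs full compaction groups one at a time instead of in
// parallel, since multiple full compactions can tie up CPUs for long periods.
func compactFull(groups []compactionGroup, compact func(compactionGroup)) {
	for _, g := range groups {
		compact(g)
	}
}
```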
@@ -383,20 +387,17 @@ func (l *WAL) writeToLog(entry WALEntry) (int, error) {
// limit how many concurrent encodings can be in flight. Since we can only
This comment block appears redundant now that the limiter is gone
The Wal.limiter field is no longer used, so can we remove that?
Yes. I didn't see these comments earlier before I merged. I'll submit a new PR to remove these lines.
These are some changes to improve stability for writes and compactions. #8302 improved write throughput by ~2x. As a result, snapshot compactions are written much more frequently, which causes TSM files to accumulate much faster. The current compaction planning attempts to kick off as many compactions as possible and allows the Go runtime to manage them.
When there are many shards, this can cause problems: many full or expensive compactions running at once spike CPU and use a lot of disk space. #8348 added a config option to limit the global number of running level and full compactions across shards, but an individual level could still kick off more compactions than the limit. This PR limits concurrent compactions within a level and reduces some allocations to improve stability and reduce jitter.
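The global limit from #8348 can be pictured as a single counting semaphore handed to every shard, in contrast to the per-level cap this PR adds inside a shard. A minimal sketch, with illustrative names rather than the actual tsdb types:

```go
package example

// Fixed is an illustrative global limiter: one instance is shared by every
// shard so the total number of level and full compactions across the store
// stays under a single configured limit.
type Fixed chan struct{}

func NewFixed(limit int) Fixed { return make(Fixed, limit) }

func (f Fixed) Take()    { f <- struct{}{} }
func (f Fixed) Release() { <-f }

// runCompaction shows how each shard might wrap a compaction: it blocks until
// a global slot frees up, regardless of which shard or level wants to run.
func runCompaction(global Fixed, compact func()) {
	global.Take()
	defer global.Release()
	compact()
}
```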
I ran a 2B write load w/ 50 concurrent writers against 1.2.3, master and this PR. Total series is 100k w/ 1 value. There was no query load. This is on a c4.8xlarge. I also ran a larger 20B run against master and this PR to see when many level/full compactions within a shard would be kicked off and what the impact was.

The main changes in this PR:
- Point.Name now returns []byte instead of string. We were converting from []byte to string to []byte in many places, causing lots of garbage.
- Replaced the sync.Pool. The sync.Pool wasn't able to hold onto the allocations so we were allocating many []byte over and over. The pool now used is limited and drops buffers that are too big.

Writes
This is HTTP write throughput and latency. 1.2.3 sustains about 1.1M w/s and latency varies around 210ms-320ms. master and this PR sustain around 2.3M w/s. master latency fluctuates around 100-110ms, and with this change it reduces to 90-100ms. In the master run, you can see the throughput jump around quite a bit and slowly trend down. On larger test runs (20B), this variance can be more dramatic. This variance appears to be due to CPU starvation caused by lots of allocations and compactions competing for CPU.

2B Point Run
20B Point Run
This larger test run shows what happens to writes when too many level or full compactions kick in at once within a shard.
master
In the middle of the graph, the write throughput fluctuations are correlated with about 21 concurrent compactions running (14 full and 7 level 2).
This is the same write load against this PR.
Compactions are limited in this run and larger variations are gone.
Compactions
Watching the TSM file count shows that master compactions are not keeping up with the increased write load and are slowly backing up. This is a 2B point run. The current compaction planning is also spinning up a lot of compactions that are competing with each other and slowing progress.
With this change, file count stays more steady with the increased write load and the number of concurrent compactions stays more consistent.
While I was not measuring it, I noticed CPU util and load were significantly reduced from master to this change as well.