Write and compaction stability #8384
Conversation
This pool was previously a pool.Bytes to avoid repetitive allocations. It was recently switched to a sync.Pool because pool.Bytes held onto very large buffers at times which were never released. sync.Pool is showing up in allocation profiles quite frequently. This switches the pool to a new pool that limits how many buffers are in the pool as well as the max size of each buffer in the pool. This provides better bounds on allocations.
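A minimal sketch of what such a bounded pool can look like, assuming a channel-backed free list with a fixed capacity and a per-buffer size cap (the LimitedBuffer name and its fields are illustrative, not the actual types in this PR):

```go
package pool

// LimitedBuffer is an illustrative bounded []byte pool: at most `capacity`
// buffers are retained, and buffers larger than `maxSize` are dropped so the
// pool never pins very large allocations.
type LimitedBuffer struct {
	buffers chan []byte
	maxSize int
}

// NewLimitedBuffer creates a pool that holds up to capacity buffers, each no
// larger than maxSize bytes once returned.
func NewLimitedBuffer(capacity, maxSize int) *LimitedBuffer {
	return &LimitedBuffer{
		buffers: make(chan []byte, capacity),
		maxSize: maxSize,
	}
}

// Get returns a pooled buffer with at least sz capacity, allocating a new one
// if the pool is empty or the pooled buffer is too small.
func (p *LimitedBuffer) Get(sz int) []byte {
	select {
	case b := <-p.buffers:
		if cap(b) >= sz {
			return b[:sz]
		}
	default:
	}
	return make([]byte, sz)
}

// Put returns a buffer to the pool, discarding it if it exceeds maxSize or if
// the pool is already full.
func (p *LimitedBuffer) Put(b []byte) {
	if cap(b) > p.maxSize {
		return
	}
	select {
	case p.buffers <- b:
	default:
	}
}
```

Get never blocks: if no suitable pooled buffer is available it simply allocates, and Put silently drops oversized buffers or returns when the pool is full, which is what bounds both the count and the size of retained memory.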
models/points.go
Outdated
return buf[:i-1], nil
}
return buf[:i], err
}
The comment says "ignore the error" but it gets passed back. I just want to make sure that's intentional.
Good catch, it should be returning nil here and err should be _.
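For clarity, a hypothetical sketch of the corrected return path being discussed, assuming a helper that reads a line and deliberately discards the inner error (the readLine name and bufio-based reader are stand-ins, not the actual code in models/points.go):

```go
package models

import "bufio"

// readLine is a hypothetical stand-in for the function under review: the
// error from ReadBytes is deliberately ignored (assigned to _), and nil is
// returned instead of passing that ignored error back to the caller.
func readLine(r *bufio.Reader) ([]byte, error) {
	buf, _ := r.ReadBytes('\n')
	if i := len(buf); i > 0 && buf[i-1] == '\n' {
		return buf[:i-1], nil // strip the trailing newline
	}
	return buf, nil
}
```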
Measurement name and field were converted between []byte and string repeatedly, causing lots of garbage. This switches the code to use []byte in the write path.
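A rough illustration of the pattern, assuming a hypothetical point type: keeping the measurement name as []byte end to end avoids []byte → string → []byte round trips, since map reads written as m[string(b)] do not allocate and comparisons can use bytes.Equal (the names below are illustrative, not the models API):

```go
package example

import "bytes"

// point is a hypothetical stand-in for a parsed point after this change:
// Name returns []byte rather than string.
type point struct {
	name []byte
}

func (p *point) Name() []byte { return p.name }

// lookup reads a string-keyed map with a []byte name. The compiler recognizes
// the m[string(b)] pattern for map reads, so no temporary string is allocated.
func lookup(m map[string]int, p *point) (int, bool) {
	v, ok := m[string(p.Name())]
	return v, ok
}

// isInternal compares the name without converting it to a string first.
func isInternal(p *point) bool {
	return bytes.Equal(p.Name(), []byte("_internal"))
}
```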
This changes full compactions within a shard to run sequentially instead of running all the compaction groups in parallel. Normally, there is only 1 full compaction group to run. At times, there could be several, which causes instability if they all run concurrently as they tie up a CPU for long periods of time. Level compactions are also capped to a max of 4 running concurrently for each level in a shard. This prevents sudden spikes in CPU and disk usage due to a large backlog of TSM files at a given level.
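A simplified sketch of the scheduling described above, assuming a counting semaphore built from a buffered channel; the real planner in the tsm1 engine is more involved, and the names here are illustrative:

```go
package example

import "sync"

// compactionGroup is a hypothetical stand-in for a set of TSM files to compact.
type compactionGroup []string

const maxConcurrentPerLevel = 4

// compactLevel runs the groups for one level, allowing at most
// maxConcurrentPerLevel of them to execute at the same time.
func compactLevel(groups []compactionGroup, compact func(compactionGroup)) {
	sem := make(chan struct{}, maxConcurrentPerLevel) // counting semaphore
	var wg sync.WaitGroup
	for _, g := range groups {
		wg.Add(1)
		sem <- struct{}{} // blocks once 4 compactions for this level are running
		go func(g compactionGroup) {
			defer wg.Done()
			defer func() { <-sem }()
			compact(g)
		}(g)
	}
	wg.Wait()
}

// compactFull runs full compaction groups one at a time instead of in
// parallel, since multiple full compactions can tie up CPUs for long periods.
func compactFull(groups []compactionGroup, compact func(compactionGroup)) {
	for _, g := range groups {
		compact(g)
	}
}
```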
@@ -383,20 +387,17 @@ func (l *WAL) writeToLog(entry WALEntry) (int, error) {
// limit how many concurrent encodings can be in flight. Since we can only
This comment block appears redundant now that the limiter is gone
The Wal.limiter field is no longer used, so can we remove that?
Yes. I didn't see these comments earlier before I merged. I'll submit a new PR to remove these lines.
These are some changes to improve stability for writes and compactions. #8302 improved write throughput by ~2x. As a result, snapshot compactions are written much more frequently, which causes TSM files to accumulate much faster. The current compaction planning attempts to kick off as many compactions as possible and allows the Go runtime to manage them.
When there are many shards, this can cause problems: many full or expensive compactions running at once spike CPU and use a lot of disk space. #8348 added a config option to limit the global number of running level and full compactions across shards, but an individual level could still kick off more compactions than the limit. This PR limits concurrent compactions within a level and reduces some allocations to improve stability and reduce jitter.
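The global limit from #8348 can be pictured as a single counting semaphore handed to every shard, in contrast to the per-level cap this PR adds inside a shard. A minimal sketch, with illustrative names rather than the actual tsdb types:

```go
package example

// Fixed is an illustrative global limiter: one instance is shared by every
// shard so the total number of level and full compactions across the store
// stays under a single configured limit.
type Fixed chan struct{}

func NewFixed(limit int) Fixed { return make(Fixed, limit) }

func (f Fixed) Take()    { f <- struct{}{} }
func (f Fixed) Release() { <-f }

// runCompaction shows how each shard might wrap a compaction: it blocks until
// a global slot frees up, regardless of which shard or level wants to run.
func runCompaction(global Fixed, compact func()) {
	global.Take()
	defer global.Release()
	compact()
}
```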
I ran a 2B write load w/ 50 concurrent writers against 1.2.3, master and this PR. Total series is 100k w/ 1 value. There was no query load. This is on a c4.8xlarge. I also ran a larger 20B run against master and this PR to see when many level/full compactions within a shard would be kicked off and what the impact was.

The main changes in this PR:
- Point.Name now returns []byte instead of string. We were converting from []byte to string to []byte in many places, causing lots of garbage.
- Replaced the sync.Pool. The sync.Pool wasn't able to hold onto the allocations so we were allocating many []byte over and over. The pool now used is limited and drops buffers that are too big.

Writes
This is HTTP write throughput and latency. 1.2.3 sustains about 1.1M w/s and latency varies around 210ms-320ms. master and this PR sustain around 2.3M w/s. master latency fluctuates around 100-110ms, and with this change it reduces to 90-100ms. In the master run, you can see the throughput jump around quite a bit and slowly trend down. On larger test runs (20B), this variance can be more dramatic. This variance appears to be due to CPU starvation caused by lots of allocations and compactions competing for CPU.

2B Point Run
20B Point Run
This larger test run shows what happens to writes when too many level or full compactions kick in at once within a shard.
master
In the middle of the graph, the write throughput fluctuations are correlated with about 21 concurrent compactions running (14 full and 7 level 2).
This is the same write load against this PR.
Compactions are limited in this run and larger variations are gone.
Compactions
Watching the TSM file count shows that master compactions are not keeping up with the increased write load and are slowly backing up. This is a 2B point run. The current compaction planning is also spinning up a lot of compactions that are competing with each other and slowing progress.
With this change, file count stays more steady with the increased write load and the number of concurrent compactions stays more consistent.
While I was not measuring it, I noticed CPU util and load were significantly reduced from master to this change as well.