Write throughput/concurrency improvements #8302
Conversation
The lock shows up under write load. It only needs to be assigned once, so a read lock eliminates the contention.
Series and Measurement have their own locks, and we do not need to hold locks on the index while using those types.
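A rough sketch of the locking pattern that commit describes, with illustrative type and field names (this is not the PR's actual code): the field is read under an RLock, and the write lock is only taken for the one-time assignment.

package sketch

import "sync"

type FieldSet struct{}

type Index struct {
	mu       sync.RWMutex
	fieldSet *FieldSet // assumed: assigned once, read many times afterwards
}

func (i *Index) fields() *FieldSet {
	i.mu.RLock()
	fs := i.fieldSet
	i.mu.RUnlock()
	if fs != nil {
		return fs // common case under write load: only a read lock is taken
	}

	i.mu.Lock()
	defer i.mu.Unlock()
	if i.fieldSet == nil { // re-check after upgrading to the write lock
		i.fieldSet = &FieldSet{}
	}
	return i.fieldSet
}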
tsdb/engine/tsm1/pools.go
Outdated
	if x == nil {
		return make([]byte, size)
	}
	buf := x.([]byte)
Correct me if I'm wrong, but putting a []byte (or any slice) into a sync.Pool will allocate as the conversion from slice to interface needs to allocate space for the slice on the heap. This can be avoided by pooling *[]byte instead. For example:
// getBuf returns a buffer with length size from the buffer pool.
func getBuf(size int) *[]byte {
	x := bufPool.Get()
	if x == nil {
		b := make([]byte, size)
		return &b
	}
	buf := x.(*[]byte)
	if cap(*buf) < size {
		b := make([]byte, size)
		return &b
	}
	*buf = (*buf)[:size]
	return buf
}

// putBuf returns a buffer to the pool.
func putBuf(buf *[]byte) {
	bufPool.Put(buf)
}
A quick local benchmark shows that this avoids an allocation.
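For reference, a benchmark along those lines could look like the following sketch (placed in a _test.go file; the names are made up and this is not the benchmark referred to above). The first case allocates on every Put because the slice header must be copied into the interface value; the second stores only a pointer.

package pools_test

import (
	"sync"
	"testing"
)

func BenchmarkPoolPutByteSlice(b *testing.B) {
	var p sync.Pool
	buf := make([]byte, 1024)
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		p.Put(buf) // []byte -> interface{}: the slice header is copied to the heap
		if v := p.Get(); v != nil {
			buf = v.([]byte)
		}
	}
}

func BenchmarkPoolPutByteSlicePtr(b *testing.B) {
	var p sync.Pool
	buf := make([]byte, 1024)
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		p.Put(&buf) // a pointer fits directly in the interface value: no per-Put allocation
		if v := p.Get(); v != nil {
			_ = v.(*[]byte)
		}
	}
}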
@@ -121,7 +121,7 @@ func NewWAL(path string) *WAL {
 	// these options should be overriden by any options in the config
 	SegmentSize: DefaultSegmentSize,
 	closing: make(chan struct{}),
-	syncWaiters: make(chan chan error, 256),
+	syncWaiters: make(chan chan error, 1024),
(nit) Commit message says "This increases the buffer to 4096"
Fixed.
If the sync waiters channel was full, it would block sending to the channel while holding the WAL write lock. The sync goroutine would then be stuck acquiring the write lock and could not drain the channel. This increases the buffer to 1024, which would require a very high write load to fill, and returns an error if the channel is full to prevent the blocking.
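A sketch of the non-blocking send described in that commit message (queueSync and the error value are illustrative names; the WAL type and syncWaiters field appear in the diff above):

package sketch

import "errors"

var ErrSyncQueueFull = errors.New("wal sync waiters queue is full")

type WAL struct {
	syncWaiters chan chan error
}

// queueSync registers a sync waiter without blocking, so a full channel can
// never deadlock against the goroutine that drains it while the write lock is held.
func (w *WAL) queueSync() (chan error, error) {
	ch := make(chan error, 1)
	select {
	case w.syncWaiters <- ch:
		return ch, nil
	default:
		return nil, ErrSyncQueueFull // fail fast instead of blocking
	}
}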
Added
LGTM 👍
Under high write load, the sync goroutine would start up and end very frequently. Starting a new goroutine so frequently adds a small amount of latency, which causes writes to take longer and sometimes time out. This changes the goroutine to loop until there are no more waiters, which reduces the churn and latency.
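Roughly, the looping behavior described there looks like this sketch (illustrative names and a stubbed sync, not the PR's exact code):

package sketch

type WAL struct {
	syncWaiters chan chan error
}

// sync would fsync the active segment; stubbed here.
func (w *WAL) sync() error { return nil }

// serviceSyncWaiters loops until the waiter queue is empty, so one goroutine
// handles a burst of waiters instead of a new goroutine being started per sync.
func (w *WAL) serviceSyncWaiters() {
	for {
		select {
		case ch := <-w.syncWaiters:
			ch <- w.sync()
		default:
			return // nothing left to service; exit until the next burst
		}
	}
}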
The inmem index would call CreateSeriesIfNotExist for each series, which takes and releases an RLock to see if a series exists. Under high write load, the lock shows up in profiles quite a bit. This adds a filtering step that obtains a single RLock, checks all the series, and returns the non-existent series to continue through the slow path.
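A simplified sketch of that filtering step, with stand-in types and field names rather than the actual inmem index code:

package sketch

import "sync"

type Index struct {
	mu     sync.RWMutex
	series map[string]struct{}
}

// filterExisting checks a whole batch of series keys under one RLock and
// returns only the keys that still need to go through the slow create path.
func (i *Index) filterExisting(keys []string) []string {
	i.mu.RLock()
	defer i.mu.RUnlock()

	var missing []string
	for _, k := range keys {
		if _, ok := i.series[k]; !ok {
			missing = append(missing, k)
		}
	}
	return missing
}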
When many TSM files are being compacted, the buffers can add up fairly quickly.
The current bytes.Pool will hold onto byte slices indefinitely. Large writes can cause the pool to hold onto very large buffers over time. Testing with sync.Pool shows similar performance now, so using a sync.Pool will allow these buffers to be GC'd when necessary.
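The trade-off can be illustrated with a simplified sketch (assuming the existing bytes.Pool is a bounded, channel-backed pool; this is not the actual pkg/pool code):

package sketch

import "sync"

// A channel-backed pool keeps every buffer it holds alive for the life of the
// process, so a few very large buffers stay resident forever.
type channelPool struct{ ch chan []byte }

func (p *channelPool) put(b []byte) {
	select {
	case p.ch <- b: // buffer stays pinned until it is handed out again
	default: // pool full: drop
	}
}

// A sync.Pool entry, by contrast, can be discarded by the garbage collector
// between uses, so oversized buffers are eventually reclaimed.
var bufPool sync.Pool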
Under high write load, the check for each series was done sequentially, which caused a lot of CPU time to be spent acquiring and releasing the RLock on LogFile. This switches the code to check multiple series at once under an RLock, similar to the change for inmem.
There was contention on the write lock which only needs to be acquired when checking to see if the log file should be rolled over.
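One way to narrow that critical section, sketched with illustrative names (not the PR's actual change): check the size under a read lock and take the write lock only when a rollover is actually needed.

package sketch

import "sync"

const maxLogFileSize = 5 << 20 // assumed threshold, for illustration only

// LogFile is an illustrative stand-in; only the locking pattern matters here.
type LogFile struct {
	mu   sync.RWMutex
	size int64
}

func (f *LogFile) roll() error { return nil } // stub: swap in a fresh file

// maybeRoll checks the size under a read lock and only takes the write lock
// when a rollover is required, removing contention on the common path.
func (f *LogFile) maybeRoll() error {
	f.mu.RLock()
	over := f.size >= maxLogFileSize
	f.mu.RUnlock()
	if !over {
		return nil
	}

	f.mu.Lock()
	defer f.mu.Unlock()
	if f.size < maxLogFileSize { // another writer may have already rolled it
		return nil
	}
	return f.roll()
}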
This PR has a few performance improvements and bug fixes to the write path which improve overall write throughput when there are many concurrent writers.
The graphs below show the write latency and throughput changes from 1.2.3 to this PR (based off of 1.3) using batches of 5k and 10k and varying concurrent writers from 50-200 in each run. Each of the humps is a write load of 500M values over 100k series with varying concurrency and batch sizes. The top row is 5k batches, the second row is 10k batches. The left column is 1.2 and the right column is this PR. These were run on an m4.16xlarge with 1000 IOPS SSDs using go1.8.1 and the inmem and tsi1 indexes.
For 5k batches, write throughput improved from 1.1M w/s to 2.3M w/s, and latency decreased by ~50% and stayed below 500ms after the change.
For 10k batches, write throughput improved from 1.8M w/s to 2.3M w/s and latency decreased by ~30%.
The majority of the changes focus on reducing lock contention and allocations in the hot path.
The throughput was remarkably similar between inmem and tsi. tsi on master hit write timeouts under higher concurrency immediately when the run started. With this PR, those timeouts were fixed as well.