Improve TSM1 cache performance #7633
Conversation
}

wg.Wait()
close(res)
If this Wait and close are in their own goroutine, you can start looping over res sooner. Because res is buffered, it looks like there's no risk of deadlock.
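A minimal sketch of the suggested pattern, with illustrative names rather than the PR's actual code: moving `wg.Wait()` and `close(res)` into their own goroutine lets the consumer start ranging over the buffered channel as soon as the first result arrives.

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	const workers = 4
	res := make(chan int, workers) // buffered, so workers never block on send

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			res <- n * n // stand-in for the real per-worker result
		}(i)
	}

	// Wait and close in their own goroutine so the consumer below can start
	// draining res immediately; close still happens exactly once, after
	// every worker has sent its result.
	go func() {
		wg.Wait()
		close(res)
	}()

	for v := range res {
		fmt.Println(v)
	}
}
```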
One small nit, but looks great!
// {1, 2, 4, 8, 16, 32, 64, 128, 256}.
//
func newring(n int) (*ring, error) {
	if n <= 0 || n > 256 {
Move 256 to a const since it's repeated a few times in this func?
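A sketch of what that nit might look like; the constant name and the surrounding types are illustrative, not the actual tsm1 code.

```go
package tsm1

import "fmt"

// maxPartitions replaces the repeated literal 256 within newring.
const maxPartitions = 256

type partition struct{}

type ring struct {
	partitions []*partition
}

func newring(n int) (*ring, error) {
	if n <= 0 || n > maxPartitions {
		return nil, fmt.Errorf("partitions must be between 1 and %d, got %d", maxPartitions, n)
	}
	return &ring{partitions: make([]*partition, n)}, nil
}
```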
Also, we should probably test w/ higher partitions under high load to see if that has any benefit.
p.mu.RLock()
for k, e := range p.store {
	if err := f(k, e); err != nil {
When I got a lot of timeouts under high load, this was one of the spots that showed up where many goroutines were busy. The issue was that each entry was being sorted, which was expensive, and there were millions of them, which I think caused CPU starvation for the waiting write HTTP goroutines.
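For context, a simplified sketch (hypothetical types, not the real cache code) of the pattern being discussed: a function applied to every entry while a read lock is held. If that function does heavy per-entry work, such as sorting each entry's values, and there are millions of entries, the loop can dominate the CPUs and starve other goroutines, e.g. the ones serving write HTTP requests.

```go
package tsm1

import "sync"

type entry struct {
	values []float64 // stand-in for the real per-entry values
}

type partition struct {
	mu    sync.RWMutex
	store map[string]*entry
}

// apply calls f for every key/entry pair while holding the read lock.
// Expensive work inside f (such as sorting each entry's values) keeps the
// lock held and the CPUs busy for the whole iteration.
func (p *partition) apply(f func(key string, e *entry) error) error {
	p.mu.RLock()
	defer p.mu.RUnlock()

	for k, e := range p.store {
		if err := f(k, e); err != nil {
			return err
		}
	}
	return nil
}
```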
This is huge, thank you very much for adding this!
Currently, whenever a snapshot occurs the Cache is reset, so many allocations are repeated as the same type of data is re-added to the Cache. This commit allows the stores to keep track of the number of values within an entry, and to use that size as a hint when the same entry needs to be recreated after a snapshot. To avoid hints persisting over a long period of time, they are deleted after every snapshot and rebuilt using only the most recent entries.
This PR optimises the TSM1 cache, improving write performance under load by up to 70% compared to 1.1. It works by using a simple (crude, really) hash ring, which allows any series key to be consistently mapped to a partition in the ring. Each partition contains a map of series keys to TSM1 entries, and manages safe access to that map. Entries are distributed uniformly across a small number of partitions (currently fixed at 128).
Since each partition manages access to the entries that map to it, contention over the entire cache is minimised and is restricted to cases where keys mapping to the same partition need to operate on the cache concurrently.
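A minimal sketch of the partitioning idea, not the PR's actual implementation: hash each series key and use the hash to pick a fixed partition, so the same key always maps to the same partition and keys in different partitions never contend for the same lock. The FNV hash and the type names here are assumptions for illustration only.

```go
package tsm1

import (
	"hash/fnv"
	"sync"
)

const partitions = 128 // the PR currently fixes the partition count at 128

type entry struct{} // stand-in for the real TSM1 entry type

// partition owns one shard of the key space and guards it with its own lock.
type partition struct {
	mu    sync.RWMutex
	store map[string]*entry
}

// ring consistently maps series keys onto its partitions.
type ring struct {
	partitions [partitions]*partition
}

// getPartition hashes the series key and returns the partition it maps to;
// the same key always lands on the same partition.
func (r *ring) getPartition(key []byte) *partition {
	h := fnv.New64a()
	h.Write(key)
	return r.partitions[h.Sum64()%partitions]
}
```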
Further, the cache is currently essentially re-allocated and re-built after a snapshot is taken, which is expensive. This PR adds some logic that keeps track, as hints, of how entries within the cache are currently allocated, and uses those hints to pre-allocate the cache after a snapshot, so that subsequent additions of entries do not result in new allocations within the cache data structure.
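A simplified sketch of the size-hint idea using hypothetical types (this is not the actual cache code): when a snapshot resets the store, remember how many values each key held, and use those counts to pre-size entries when the same keys are written again.

```go
package tsm1

type Value struct{} // stand-in for the real tsm1 value type

type sizedEntry struct {
	values []Value
}

type hintedCache struct {
	store map[string]*sizedEntry
	hints map[string]int // key -> value count observed at snapshot time
}

// snapshot records per-key size hints, then resets the store, mimicking the
// "cache is rebuilt after a snapshot" behaviour described above. Hints are
// rebuilt from scratch each time so stale keys do not linger.
func (c *hintedCache) snapshot() {
	c.hints = make(map[string]int, len(c.store))
	for k, e := range c.store {
		c.hints[k] = len(e.values)
	}
	c.store = make(map[string]*sizedEntry, len(c.hints))
}

// write appends values for a key, using the hint (if any) to pre-allocate
// capacity so that re-adding the same kind of data avoids repeated growth.
func (c *hintedCache) write(key string, vs []Value) {
	e, ok := c.store[key]
	if !ok {
		capacity := c.hints[key]
		if capacity < len(vs) {
			capacity = len(vs)
		}
		e = &sizedEntry{values: make([]Value, 0, capacity)}
		c.store[key] = e
	}
	e.values = append(e.values, vs...)
}
```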
Performance
I've been mainly testing performance on a c4.8xlarge AWS instance, firing a combined 1.8M writes/sec at the instance using influx-stress. Specifically, 600M points are written in total.
This is a typical run, where the points have a single value. The case where points have multiple values shows a similar difference.