Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mvcc: Clone for batch index compaction and shorten lock #9511

Merged
merged 1 commit into from
Apr 18, 2018

Conversation

jcalvert
Copy link
Contributor

This is to address #9506 - When the BTree index grows to millions of entries, index compaction must iterate over all of them and this can take a substantial amount of time while continuing to hold the lock. This lock means that any concurrent read/writes will wait until compaction completes. By breaking compaction into batches of 10000, similar to the backend compaction, this allows relief for contention of the index lock.

We came up with the following test as a way of validating that there is contention that blocks puts to the index while compaction is ongoing. On the master branch this test will fail, but it passes with our provided changes. We felt this test was useful for us in proving the issue but does not fit well into the existing test suite for etcd as far as we can tell.

Coauthored with @cosgroveb

func TestIndexCompactAndRuntime(t *testing.T) {                           
  ti := newTreeIndex()                                                    
  size := 1000000                                                         
  bytesN := 64                                                            
  keys := createBytesSlice(bytesN, size)                                  
  for i := 1; i < size-1; i++ {                                           
    ti.Put(keys[i], revision {main: int64(i), sub: int64(i)})             
  }                                                                       
  go ti.Compact(int64(500000))                                            
  time.Sleep(200000 * time.Nanosecond)                                  
  t1 := time.Now().UnixNano()                                             
  ti.Put(keys[size-1], revision {main: int64(size-1), sub: int64(size-1)})
  t2 := time.Now().UnixNano() - t1                                        
  if t2 > 150000000 {                                                     
    t.Errorf("Run time took too long! %v", t2)                            
  }                                                                       
}                                                                         

@xiang90
Copy link
Contributor

xiang90 commented Mar 30, 2018

The idea looks right. But as @heyitsanthony commented here: #9384 (comment), we probably should have a better abstraction for traversing the tree in a mvcc manner without holding long for a long time.

@cosgroveb
Copy link
Contributor

@xiang90 We're not sure that we fully understand the concerns being discussed in your link to #9384 and aren't super confident that we could be the ones to address the need for a better abstraction here without further guidance.

Are you looking for any particular changes from us or are you asking for us to wait until the problem being discussed in #9384 is settled? If you could provide us with a more concrete example or some psuedo-code we could certainly take a crack at it.

@jcalvert
Copy link
Contributor Author

Based on the general store lock here this doesn't seem to really alleviate our problem fully and we are still investigating how to maintain consistent throughput during index compactions.

@xiang90
Copy link
Contributor

xiang90 commented Apr 12, 2018

@jcalvert

Basically, we want to iterate the tree index without locking the lock for a long time.

We can provide a generic func say: t.fuzzyAscend just like t.tree.Ascend to do it rather than exposing the details of locking in our business logic func.

@jcalvert
Copy link
Contributor Author

@xiang90

During more extensive testing, we discovered that the lock we previously mentioned was in fact preventing updates while the index was being compacted. Even batching in groups of ten thousand is enough to cause noticeable latency. We were finally able to achieve consistent throughput during compactions by pushing the lock on the tree index down into the Ascend function. In order to preserve traverse order, we used the Clone function to produce a copy on write version of the tree so that we do not need to lock for the entire traversal. Please let us know any feedback you have, as we have experienced serious throughput degradation during index compactions.

@xiang90
Copy link
Contributor

xiang90 commented Apr 16, 2018

@jcalvert We cannot really modify the vendored btree. We have to push the change to upstream btree pkg first.

@xiang90
Copy link
Contributor

xiang90 commented Apr 16, 2018

Even batching in groups of ten thousand is enough to cause noticeable latency.

do you call yield to allow a reschedule of the go routine? i previously did some benchmark myself, and found batching of 10,000 should be good enough.

@jcalvert
Copy link
Contributor Author

@xiang90 That change is already in a more recent version of the btree package. Did I not do the vendoring process correctly?

@xiang90
Copy link
Contributor

xiang90 commented Apr 16, 2018

@jcalvert You probably need to first update the vendored pkg and the vendor lock file in another PR first.

@gyuho might help you on that.

@@ -17,7 +17,7 @@ package mvcc
import (
"reflect"
"testing"

"time"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strange. why this is added?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure. Removing.

@jcalvert
Copy link
Contributor Author

do you call yield to allow a reschedule of the go routine? i previously did some benchmark myself, and found batching of 10,000 should be good enough.

No, we did not try that. You mean adding runtime.Gosched()? That would be sufficient to allow others to acquire the lock?

@xiang90
Copy link
Contributor

xiang90 commented Apr 16, 2018

@jcalvert lazily clone is probably better if provided by the btree library already.

@gyuho
Copy link
Contributor

gyuho commented Apr 16, 2018

@jcalvert We should already have the latest btree depedency. Can you try scripts/updatedep.sh?

@xiang90
Copy link
Contributor

xiang90 commented Apr 17, 2018

@jcalvert rebase with current master?

@jcalvert jcalvert force-pushed the index_compaction_breakup branch 6 times, most recently from eb74545 to fd4e132 Compare April 17, 2018 15:41
@jcalvert jcalvert changed the title Batch index compaction mvcc: Clone for batch index compaction and shorten lock Apr 17, 2018
@jcalvert
Copy link
Contributor Author

@xiang90 rebased. Thank you.

mvcc/index.go Outdated

clone.Ascend(func(item btree.Item) bool {
keyi := item.(*keyIndex)
ti.Lock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add comment on why the lock is needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment about why the lock is needed. Let us know if this is not sufficient. Thank you.

func BenchmarkIndexCompact100000(b *testing.B) { benchmarkIndexCompact(b, 100000) }
func BenchmarkIndexCompact1000000(b *testing.B) { benchmarkIndexCompact(b, 1000000) }

func benchmarkIndexCompact(b *testing.B, size int) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you have a result for the benchmark with a comparison with the result before the patch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On local development machine -

master:

go test github.com/coreos/etcd/mvcc -v -run=^$ -bench BenchmarkIndexCompact
goos: linux
goarch: amd64
pkg: github.com/coreos/etcd/mvcc
BenchmarkIndexCompact1-16                1000000              1655 ns/op
BenchmarkIndexCompact100-16               100000             19772 ns/op
BenchmarkIndexCompact10000-16               2000            880883 ns/op
BenchmarkIndexCompact100000-16               200           8725995 ns/op
BenchmarkIndexCompact1000000-16              100         386072323 ns/op
PASS

this branch:

go test github.com/coreos/etcd/mvcc -v -run=^$ -bench BenchmarkIndexCompact  
goos: linux
goarch: amd64
pkg: github.com/coreos/etcd/mvcc
BenchmarkIndexCompact1-16                1000000              1572 ns/op
BenchmarkIndexCompact100-16               100000             22216 ns/op
BenchmarkIndexCompact10000-16               2000           1058001 ns/op
BenchmarkIndexCompact100000-16               100          10484775 ns/op
BenchmarkIndexCompact1000000-16              100         423129293 ns/op

Added time likely due to the additional lock/unlock calls.

mvcc/kvstore.go Outdated
ch := make(chan struct{})
var j = func(ctx context.Context) {
keep := s.kvindex.Compact(rev)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this needs to be changed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The defer of s.mu.Unlock() seems to mean that throughput still stalls during index compaction, although on audit of the code it isn't clear how this is the case. We can look to see if we can create a test to validate that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i mean why do we move the compact func into the schedule j unit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seemed cleaner to put it there rather than remove the deferred unlock in place of explicitly unlocking in the error cases. This could be mistaken.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Transactions do indeed touch this lock so by offloading it to the scheduled action it means that transactions won't time out waiting for the lock. If you'd prefer us to change it so that the code explicitly unlocks we can do that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i do not think this is safe since this func expects to be returned after the index is compacted (the old revisions are not reachable). we still need to keep the index compaction synchronously

mvcc/index.go Outdated
@@ -17,7 +17,6 @@ package mvcc
import (
"sort"
"sync"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert this change.

import (
"testing"

"go.uber.org/zap"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

format? and why do we need this pkg?

@xiang90
Copy link
Contributor

xiang90 commented Apr 17, 2018

lgtm. defer to @gyuho


func benchmarkIndexCompact(b *testing.B, size int) {
log := zap.NewNop()
kvindex := newTreeIndex(log)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you run gofmt -w *.go on this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm. why the tree index takes log as its arg? strange.

Copy link
Contributor

@gyuho gyuho Apr 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xiang90 I recently added structured logger as an option (where the logger object is created at top level and passed downstream). Couldn't find an easier way to pass it around (will try to find a cleaner way). We can just pass nil for testing.

@gyuho
Copy link
Contributor

gyuho commented Apr 17, 2018

@jcalvert Thanks again for hard work!

copy on write version of the tree so that we do not need to lock for the entire traversal

Could you phrase this into a short release note https://github.com/coreos/etcd/blob/master/CHANGELOG-3.4.md#improved with a link to this PR, so people know who worked on this?

Also, do you have reproducible workloads that would benefit from this patch, so that I can cross-check on my side?

@xiang90
Copy link
Contributor

xiang90 commented Apr 17, 2018

It does not make sense to pass log object at func level in my opinion. But it has nothing to do with this PR though.

@gyuho
Copy link
Contributor

gyuho commented Apr 17, 2018

@xiang90 Agree. I will find a cleaner way!

@jcalvert
Copy link
Contributor Author

@gyuho

PUT 10 million keys into server, send compaction request, immediately begin PUT, see timeouts. We discovered this trend by watching the timeouts from application servers in our Grafana dashboard. If you need, we can try to provide a client script to demonstrate. Thank you.

CHANGELOG-3.4.md Outdated
@@ -34,6 +34,7 @@ See [code changes](https://github.com/coreos/etcd/compare/v3.3.0...v3.4.0) and [
- e.g. a node is removed from cluster, or [`raftpb.MsgProp` arrives at current leader while there is an ongoing leadership transfer](https://github.com/coreos/etcd/issues/8975).
- Add [`snapshot`](https://github.com/coreos/etcd/pull/9118) package for easier snapshot workflow (see [`godoc.org/github.com/etcd/snapshot`](https://godoc.org/github.com/coreos/etcd/snapshot) for more).
- Improve [functional tester](https://github.com/coreos/etcd/tree/master/functional) coverage: [proxy layer to run network fault tests in CI](https://github.com/coreos/etcd/pull/9081), [TLS is enabled both for server and client](https://github.com/coreos/etcd/pull/9534), [liveness mode](https://github.com/coreos/etcd/issues/9230), [shuffle test sequence](https://github.com/coreos/etcd/issues/9381), [membership reconfiguration failure cases](https://github.com/coreos/etcd/pull/9564), [disastrous quorum loss and snapshot recover from a seed member](https://github.com/coreos/etcd/pull/9565), [embedded etcd](https://github.com/coreos/etcd/pull/9572).
- Improve [index compaction throughput](https://github.com/coreos/etcd/pull/9511) by using a copy on write clone to avoid holding the lock for the traversal of the entire index.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it wont improve the compaction throughput. it solves the blocking problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@gyuho
Copy link
Contributor

gyuho commented Apr 18, 2018

@jcalvert @xiang90 I resolved some formatting issue with license header. Otherwise looks good. Will merge after CI passes. Thanks a lot!

@xiang90
Copy link
Contributor

xiang90 commented Apr 18, 2018

@gyuho Can we squash the commits into one?

For compaction, clone the original Btree for traversal purposes, so as to
not hold the lock for the duration of compaction. This allows read/write
throughput by not blocking when the index tree is large (> 1M entries).

mvcc: add comment for index compaction lock
mvcc: explicitly unlock store to do index compaction synchronously
mvcc: formatting index bench
mvcc: add release note for index compaction changes
mvcc: add license header
@gyuho
Copy link
Contributor

gyuho commented Apr 18, 2018

@xiang90 Done.

@xiang90
Copy link
Contributor

xiang90 commented Apr 18, 2018

lgtm

@gyuho gyuho merged commit e5c9483 into etcd-io:master Apr 18, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

5 participants