This repository has been archived by the owner on Mar 9, 2019. It is now read-only.

Add transaction batching #285

Merged
merged 1 commit into from
Mar 18, 2015

Conversation

@tv42 (Contributor) commented Jan 17, 2015

Batcher makes it easy to make lots of small transactions with
significantly better performance. Batcher.Update combines multiple
Update calls into a single disk transaction, managing errors smartly.

I'm aware of https://github.com/boltdb/coalescer. I wrote the original version of this before you pushed coalescer; it just took a while for me to circle back to cleaning it up. This code doesn't leave a goroutine around, and is a few percent faster ;) But the real difference from coalescer is this: with this code, callers don't need to write a for loop around ErrRollback. I pull failing updaters out of the transaction and retry the others automatically.

I'd really like to see this as part of bolt, and I feel like having a good batcher mechanism would help significantly with performance complaints like #244. If this looks promising, it could even sit alongside DB.Update as DB.Batch -- or even become DB.Update (though beware the "runs multiple times" behavior difference).

With this in place, I see no reason why any of my apps would ever again bother doing non-batched transactions.

Let me know what you think. If you say no, this'll live as tv42/semiauto (as opposed to bolt action..)

@benbjohnson (Member)

@tv42 I definitely like the idea of pushing it into Bolt core. It's such a common use case that it doesn't make sense to push it into another repo. We can't have it replace DB.Update() because the idempotency piece will cause backwards incompatibility.

I like your implementation better since it does the retries instead of rolling back everything. I don't have time right this moment to give it enough testing but I'll test and merge it in within the next couple of days.

@tv42 (Contributor, Author) commented Jan 18, 2015

Yeah, DB.Batch() would be better than DB.Update().

@tv42 (Contributor, Author) commented Jan 18, 2015

Rebased and edited the commit to become DB.Batch().

@davecgh (Contributor) commented Jan 18, 2015

I would highly recommend this pull request include documentation additions to README.md with example usage, caveats, etc.

@tv42 (Contributor, Author) commented Jan 18, 2015

Needs more test coverage too; that work just isn't useful if the whole approach is unwanted.

@arctica commented Jan 28, 2015

Excellent addition. I hope this can be merged into Bolt core and the documentation expanded as davecgh suggested.

@benbjohnson (Member)

Sorry for the delay but I wanted to make sure I had a couple hours to dedicate to the PR and really understand it better. Here's some feedback after playing with it for a while:

  1. What's the benefit of catching panic() calls and re-panicking? It seems like we'd lose the stack trace of where the panic originated.
  2. The loop/LoadPointer()/CompareAndSwapPointer() in DB.Batch() is really confusing. I understand it conceptually but it feels clever and that worries me. Can we simplify this to kick off a new batch into a separate goroutine and add calls to that? Adding a call once full would cause it to flush and generate a new batch.
  3. Can you rename BatchMaxSize and BatchMaxDelay to MaxBatchSize and MaxBatchDelay? That's just a personal preference.
  4. Can you move MaxBatchSize/MaxBatchDelay to exported properties on DB instead of passing them in through Options? Options exists to set options that are required during Open().

Also, I refactored some of the tests a bit:

benbjohnson@ccee92e

@tv42 (Contributor, Author) commented Jan 31, 2015

  1. Good catch.

I can't just let the panic bubble up like Update or View do, because the func is typically executed in a different goroutine than the one that made the matching Batch call. The error has to be shipped to the right caller (hence the channel return from merge), and the panicked type is just used as a marker that it was a panic, not an error.

Second, the point of re-panicking instead of just returning an error is to keep the panic, well, exceptional. It just felt like the right thing: don't swallow the panic, because we actually have a caller that can deal with it (as opposed to e.g. a net/http Handler).

Maybe the recover bit should capture the stack trace and include it in the error sent to the caller; runtime.Stack makes that easy. (All goroutines or just the current one?) This seems like the right thing to do regardless of whether it's re-panicked or returned as an error.

Let me know what you think.

  2. It's essentially locking, so it's bound to be clever by association. (A full mutex would work too, but this ought to be faster.) I wrote it originally with a goroutine but liked that even less; the atomic swap is the trick that lets a caller opportunistically join an existing batch. The nil->me (or prevFullBatchIJustObserved->me) transition with CompareAndSwapPointer can only happen for one caller at a time, so it's a good signal for "I should become master". Once I added that, the need for the goroutine and channel just naturally went away.

I'm very much open to making the code better, but I wonder if that's achievable with e.g. a nicer description of the state machine.

  3. Sure. I was trying to group them by being related to Batch.

  4. I did it this way to avoid callers changing the values on the fly, both because of races and the weird bugs that could trigger; I'm a little worried about leaving that open. It also let me do error handling up front. I understand that's what Options currently does, but looked at with fresh eyes, nothing about Options itself seems specific to the actual opening event; it's just configuration passed to a constructor function.

I can't vouch for the test changes right now, but I appreciate the motivation; less indentation is good. I was copying your older programming style ;)

@tv42 (Contributor, Author) commented Jan 31, 2015

The test changes look good. Hint for reviewing: make diff ignore whitespace changes.

@tv42 (Contributor, Author) commented Jan 31, 2015

batch-control-flow diagram (Graphviz dot source):

digraph batch_control_flow {
  start [shape=point];
  testCur [shape=diamond, label="is there a batch?"];
  start -> testCur;

  tryJoin [shape=diamond, label="try to join\ncurrent batch"];
  testCur -> tryJoin [label="yes"];
  waitForErr [label="wait for error"];
  tryJoin -> waitForErr [label="joined,\nif full\nnotify master"];
  returnOrPanic [label="return or panic", shape=doublecircle];
  waitForErr -> returnOrPanic;
  tryJoin -> tryMaster [label="join refused\n(already started)"];

  tryMaster [shape=diamond, label="try to\nbecome\nmaster"];
  testCur -> tryMaster [label="no"];
  tryMaster:se -> testCur:n [label="lost race"];
  waitFullOrDelay [label="wait for full or max delay"];
  tryMaster:sw -> waitFullOrDelay [label="we are master"];
  doTransactions [label="call db.Update, handle errors"];
  waitFullOrDelay -> doTransactions [label="full or timer /\nmark as started"];
  doTransactions -> waitForErr;
}

@tv42 force-pushed the batch branch 2 times, most recently from 29c0764 to 6931794 on February 11, 2015 21:14
@tv42 (Contributor, Author) commented Feb 11, 2015

Alright for your reviewing pleasure, https://github.com/tv42/bolt/tree/batch-3 / 29c0764 contains a rewrite using just a mutex for coordination. It got really simple once I realized I can wire a timer to a sync.Once.

For actual merging, I squashed the branch as 6931794.

It could still use more docs and tests, but is the current approach more understandable?

Performance is close enough to the atomics version. I'm seeing a lot of jitter, so any real difference is probably drowned in the noise and would need several repeats to reach anything conclusive; I'd rather get the code out there.

@tv42 (Contributor, Author) commented Feb 11, 2015

Err, I just saw a bug in handling a full batch; consider the details to be in flux. Need those tests ;)

@tv42 (Contributor, Author) commented Feb 11, 2015

The fix was pretty trivial: ee8e42e (but once again, I don't think these intermediates are worth merging).

New squashed merge candidate is 4ce6f86.

Coverage looks pretty good, the only thing missing is panicked.Error, and that's there only to make it fit through chan error.

@tv42 (Contributor, Author) commented Feb 11, 2015

And I should still rip out the checks and errors on bad MaxBatchSize and MaxBatchDelay values and just handle whatever comes: if the user can set the fields on DB directly, there's no validation possible at that time anyway, and bad values degrade nicely in a way that just disables batching.

@j commented Feb 12, 2015

👍 and 😄

@tv42 (Contributor, Author) commented Feb 12, 2015

I'm rebasing the "batch" branch again, new commit is b906f4e.
Here's what's new (from before the rebase, archived in branch batch-6):

commit 63c0780

Move batch item retrying to happen in the original goroutine

This preserves panic stack traces, as we no longer need to communicate
panics as special errors between goroutines.

commit e698383

Internal docs for batch

commit d90b13e

Enforce MaxBatchSize

Previous logic relied on the scheduler to start the batch "soon
enough"; this one will never exceed MaxBatchSize.

Fix broken test.

commit 0fabc63

Saner batch tunables

No need for errors here, and the copying logic was a leftover from
when they were actually in Options.

@tv42 (Contributor, Author) commented Feb 12, 2015

@benbjohnson what do you think of b906f4e / 5261724?

@davecgh See the README addition in 5261724, does this match what you wanted to see? If not, can you talk more about what you would need as a user.

err := db.Batch(func(tx *bolt.Tx) error {
// Find last key in bucket, decode as bigendian uint64, increment
// by one, encode back to []byte, and add new key.
...
A Contributor commented on the diff:

Inconsistent indentation here.

@davecgh (Contributor) commented Feb 12, 2015

@tv42 That addition to the README looks pretty good. I think it might benefit from a couple of extra things:

  • An example that shows it being used concurrently. When called as shown in the README, it doesn't do anything differently (from a caller's perspective) than Update, which could easily confuse a user unfamiliar with the package. This is especially true for someone coming from leveldb batches, which are asynchronous calls that just add data to a batch of operations you explicitly trigger when ready (essentially what bolt already does inside Update).
  • A short discussion of when you wouldn't want to use Batch. Given the previous point, calling it from a single goroutine is semantically equivalent to calling Update. Since a batch involves additional overhead, it might make sense to call out that Update should be used unless you specifically arrange/need to call Batch concurrently.

@tv42 (Contributor, Author) commented Feb 12, 2015

@davecgh Thank you.

The problem with actually showing concurrency in the README is that it just gets too verbose. That would have to be a separate example.

Batch is definitely aimed at the case where the code is e.g. serving incoming network requests with multiple goroutines; anything else cannot fit into the same "returns error" calling pattern, and then you might as well use Update or manual transactions.

663b133 adds "concurrent" in key spots and explicitly says "Batch is only useful when there are multiple goroutines calling it" -- how's that?

@davecgh (Contributor) commented Feb 12, 2015

@tv42 Looks good to me with those additions.

Agreed about the example directly in the README. I was thinking more along the lines of referencing an example in db_test.go, e.g. func ExampleDB_Batch(), similar to the existing Update example.

That also has the benefit of being a testable example to ensure the example code actually remains current.

@tv42 (Contributor, Author) commented Feb 12, 2015

@davecgh I still have trouble coming up with a non-silly example that's brief enough. Usually concurrency is driven by communication with the outside world, but that's not something you can encapsulate in ExampleDB_Batch easily.

@tv42 (Contributor, Author) commented Feb 12, 2015

@davecgh b8ad031 has an example for Batch. I couldn't come up with anything simpler that looked like actual use.

@j commented Feb 16, 2015

Can't wait. Example looks clear enough.

@benbjohnson (Member)

@tv42 Sorry for the delay in re-reviewing this. I'm going to block out some time tonight to go through this PR again. Thanks again for all the work on it. It'll be nice to have batching in Bolt core.

@tv42 (Contributor, Author) commented Feb 17, 2015

Rebased and squashed with all of the above nits handled (as far as I can tell) as commit 1578db9. If there's any more feedback, please let me know.


if failIdx >= 0 {
// take the failing transaction out of the batch. it's
// safe to shorten b.calls here because b.batch no longer
A Member commented on the diff:

b.batch -> db.batch

@benbjohnson (Member)

@tv42 I like the changes a lot. I think it flows really well now and it's easy to read. I like the use of sync.Once too. I made a few minor comments. The only other things I'd like to see are:

  1. Validate the inserted data at the end of the benchmarks to make sure they are correct. We can call b.StopTimer() to exclude the validation time.
  2. Add a testing/quick test to try a lot of different settings and validate that they all work as expected. I can try to add this in the next couple of days (or feel free to add it yourself). It'd be nice to run go test -quickchecks=1000000 and just let it run for a while to make sure there's not any unexpected behavior.

@tv42 (Contributor, Author) commented Feb 18, 2015

I just realized the benchmarks are racy, they share hashing across goroutines. Fixing..

@tv42 (Contributor, Author) commented Feb 18, 2015

Alright fixes for the above are in commits leading to e927eff and I rebased & squashed on top of the new master as 15389f9.

DB.Batch makes it easy to make lots of small transactions with
significantly better performance. Batch combines multiple concurrent
Update calls into a single disk transaction, managing errors smartly.
@tv42 (Contributor, Author) commented Feb 18, 2015

Somewhere along the line I screwed up the commit message, fixed in adbb1a1

@tv42 (Contributor, Author) commented Mar 18, 2015

Ping.

@benbjohnson (Member)

@tv42 Sorry for the delay on this. I keep wanting to add some testing/quick tests but I haven't had the time. Just to confirm, is this PR ready for merge?

@tv42 (Contributor, Author) commented Mar 18, 2015

@benbjohnson Apart from deserving more testing coverage -- which is just about always true -- I feel like I have nothing more to add, and the comments from others have stopped too.

benbjohnson added a commit that referenced this pull request Mar 18, 2015
Add transaction batching
@benbjohnson benbjohnson merged commit 4d30731 into boltdb:master Mar 18, 2015