storage: send ReplicaState in streaming snapshots #10931

Merged

Conversation

@petermattis (Collaborator) commented Nov 22, 2016

This removes the need to read ReplicaState from the batch when applying a snapshot, allowing the use of a write-only RocksDB batch. A write-only RocksDB batch does not index keys on insertion/deletion; that indexing is only necessary if the keys are read back from the batch.

Fixes #10783
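
As a rough, self-contained sketch of the idea (toy types and names, not the actual engine code): mutations accumulate unindexed and are applied atomically on commit, reads fail, and the replica state now arrives with the snapshot instead of being read back out of the batch.

```go
package main

import (
	"errors"
	"fmt"
)

// Toy stand-ins for the real types.
type replicaState struct{ appliedIndex uint64 }

type snapshot struct {
	data  map[string]string
	state replicaState // now shipped by the sender alongside the data
}

// writeOnlyBatch buffers mutations without indexing them; they become
// visible only when Commit applies them atomically.
type writeOnlyBatch struct{ pending map[string]string }

func (b *writeOnlyBatch) Put(k, v string) { b.pending[k] = v }

// Get illustrates the contract: a write-only batch cannot serve reads,
// because it never indexed its own writes.
func (b *writeOnlyBatch) Get(string) (string, error) {
	return "", errors.New("write-only batch does not support reads")
}

func (b *writeOnlyBatch) Commit(store map[string]string) {
	for k, v := range b.pending {
		store[k] = v
	}
}

func applySnapshot(store map[string]string, snap snapshot) replicaState {
	b := &writeOnlyBatch{pending: map[string]string{}}
	for k, v := range snap.data {
		b.Put(k, v)
	}
	b.Commit(store)
	// Before this change, the state had to be loaded by reading keys
	// back out of the batch; now it simply arrives with the snapshot.
	return snap.state
}

func main() {
	store := map[string]string{}
	s := applySnapshot(store, snapshot{
		data:  map[string]string{"k": "v"},
		state: replicaState{appliedIndex: 42},
	})
	fmt.Println(store["k"], s.appliedIndex) // v 42
}
```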


@petermattis (Collaborator Author)

Needs more commenting and some additional performance testing. In my informal testing this speeds up applySnapshot by about 50%.

@bdarnell (Contributor)

Nice!


Reviewed 8 of 8 files at r1.
Review status: all files reviewed at latest revision, 2 unresolved discussions.


pkg/storage/raft.proto, line 132 at r1 (raw file):

  // The replica state at the time the snapshot was generated.
  optional storagebase.ReplicaState state = 5;

ReplicaState contains a RangeDescriptor, which we're already sending in the header. Should we move this into the header and remove the range descriptor field there?


pkg/storage/replica_raftstorage.go, line 639 at r1 (raw file):

		batch = r.store.Engine().NewWriteOnlyBatch()
	} else {
		// TODO(peter): This can only happen if the sender is running code before

The next beta is going to require a stop-the-world upgrade anyway, so if this goes in in time I don't think we need this backwards-compatibility mode at all.



@petermattis (Collaborator Author)

Added a benchmark and a basic test of write-only batches. I think this is good to go now. PTAL.


Review status: 3 of 13 files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.


pkg/storage/raft.proto, line 132 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote… > `ReplicaState` contains a `RangeDescriptor`, which we're already sending in the header. Should we move this into the header and remove the range descriptor field there?
`Header.RangeDescriptor` is the "updated" range descriptor after we've added the new replica. This `state` field is the current state of the replica on the source at the time of the snapshot, which is different. That's a subtle difference which I'd prefer not to touch in this PR. But I can move this field to the `Header`, which makes sense.


@petermattis changed the title from "[WIP] storage: send ReplicaState in streaming snapshots" to "storage: send ReplicaState in streaming snapshots" on Nov 22, 2016
@petermattis (Collaborator Author)

@tamird Can you take a look at the TeamCity test failures? The new benchmarks are failing, but they follow the same pattern as existing benchmarks.

@tamird (Contributor) commented Nov 22, 2016

I think you might need to rebase to pick up #10923, #10927, and #10933.


Review status: 3 of 13 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.



@tamird (Contributor) commented Nov 22, 2016

I haven't looked in detail, but it looks like a hardcoded file name is causing multiple runs of the benchmark under stress to conflict with one another.


Review status: 3 of 13 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.



@petermattis (Collaborator Author)

Why are we trying to run benchmarks under stress?

@tamird (Contributor) commented Nov 22, 2016

Perhaps we shouldn't, but sadly our benchmarks frequently rot, so I included them in the first cut of the runner.


@petermattis (Collaborator Author)

OK. Fixed these benchmarks to use an in-memory RocksDB instance, as they don't actually need to write anything to disk.
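
The pattern, roughly: build the store in memory so the benchmark measures batch application rather than disk I/O, and so concurrent stress runs cannot collide on a shared file name. A generic sketch with a toy store (not the actual engine_test code):

```go
package bench

import (
	"fmt"
	"testing"
)

// inMemStore stands in for an in-memory engine; the real benchmark
// opens an in-memory RocksDB instance instead of one backed by a
// fixed on-disk path.
type inMemStore struct{ data map[string][]byte }

func newInMemStore() *inMemStore {
	return &inMemStore{data: make(map[string][]byte)}
}

func (s *inMemStore) applyBatchRepr(repr map[string][]byte) {
	for k, v := range repr {
		s.data[k] = v
	}
}

func BenchmarkApplyBatchRepr(b *testing.B) {
	// Build the batch representation once, outside the timed loop.
	repr := make(map[string][]byte, 1000)
	for i := 0; i < 1000; i++ {
		repr[fmt.Sprintf("key-%04d", i)] = []byte("value")
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		// A fresh in-memory store per iteration: nothing touches disk,
		// so parallel stress runs cannot conflict with one another.
		newInMemStore().applyBatchRepr(repr)
	}
}
```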

@bdarnell (Contributor)

:lgtm:


Reviewed 7 of 10 files at r2, 3 of 3 files at r3.
Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


pkg/storage/raft.proto, line 113 at r3 (raw file):

    // field which holds the updated descriptor after the new replica
    // has been added while ReplicaState.Desc holds the descriptor
    // before the new replica has been added.

Ah, that's subtle. It also seems incorrect to me: we shouldn't leak out the post-add range descriptor until after the transaction is committed. I looked at all the places where Header.RangeDescriptor is used, and it looks like we only use the start/end key and range ID; nothing ever looks at the replica set there. So it should be safe to just drop the range_descriptor field and use the one in the state instead. (We could do this in a separate PR, but I'd be inclined to group them together just to minimize the number of separate incompatible changes here.)



@bdarnell (Contributor)

LG again, although I'm realizing that we could use some better documentation about write-only and distinct batches. Should write-only imply distinct?


Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


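
For context on the two flavors (illustrative contracts only, not the real engine.Batch API): a distinct batch skips indexing because the caller promises not to read its own writes, while a write-only batch rejects reads outright, so in that sense write-only could reasonably imply distinct.

```go
package main

import (
	"errors"
	"fmt"
)

// distinctBatch keeps no index of its own writes: the caller promises
// never to read them back, so Get consults only the underlying engine.
type distinctBatch struct {
	base map[string]string // underlying engine contents
	log  [][2]string       // unindexed write log
}

func (b *distinctBatch) Put(k, v string) { b.log = append(b.log, [2]string{k, v}) }

func (b *distinctBatch) Get(k string) (string, error) { return b.base[k], nil }

// writeOnlyBatch goes further: reads are not part of the contract at
// all, which also removes any need for an index.
type writeOnlyBatch struct{ log [][2]string }

func (b *writeOnlyBatch) Put(k, v string) { b.log = append(b.log, [2]string{k, v}) }

func (b *writeOnlyBatch) Get(string) (string, error) {
	return "", errors.New("write-only batch: reads not supported")
}

func main() {
	d := &distinctBatch{base: map[string]string{"k": "old"}}
	d.Put("k", "new")
	v, _ := d.Get("k")
	fmt.Println(v) // "old": the batch's own write is invisible to Get

	w := &writeOnlyBatch{}
	w.Put("k", "new")
	if _, err := w.Get("k"); err != nil {
		fmt.Println(err)
	}
}
```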

@petermattis (Collaborator Author)

Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


pkg/storage/raft.proto, line 113 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote… > Ah, that's subtle. It also seems incorrect to me - we shouldn't leak out the post-add range descriptor until after the transaction is committed. I looked at all the places where Header.RangeDescriptor is used and it looks like we only use the start/end key and range ID; nothing ever looks at the replica set there. So it should be safe to just drop the range_descriptor field and use the one in the state instead. (we could do this in a separate PR, but I'd be inclined to group them together just to minimize the number of separate incompatible changes here)
We call `Replica.setDescWithProcessUpdate` at the end of `Replica.applySnapshot` using `Header.RangeDescriptor`. Is it correct to change this code to use `ReplicaState.Desc`? Doing so will mean the replica's RangeDescriptor won't contain itself for a brief period of time. Is that ok with the replica GC queue?


@bdarnell (Contributor)

Review status: 12 of 13 files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


pkg/storage/raft.proto, line 113 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote… > We call `Replica.setDescWithProcessUpdate` at the end of `Replica.applySnapshot` using `Header.RangeDescriptor`. Is it correct to change this code to use `ReplicaState.Desc`? Doing so will mean the replica's RangeDescriptor won't contain itself for a brief period of time. Is that ok with the replica GC queue?
Oh, I didn't pull on that thread far enough. It's incorrect to use Header.RangeDescriptor there and it should be changed to use the one from the state. Leaking the updated range descriptor before it is committed could have some interesting consequences (since raft on the new replica will start using that configuration immediately), although I don't see anything disastrous (yet).


@petermattis (Collaborator Author)

Review status: 7 of 15 files reviewed at latest revision, 2 unresolved discussions.


pkg/storage/raft.proto, line 113 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Oh, I didn't pull on that thread far enough. It's incorrect to use Header.RangeDescriptor there and it should be changed to use the one from the state. Leaking the updated range descriptor before it is committed could have some interesting consequences (since raft on the new replica will start using that configuration immediately), although I don't see anything disastrous (yet).

Ok, I've removed Header.RangeDescriptor and made sure we populate Header.State properly.



@tamird (Contributor) commented Nov 28, 2016

:lgtm:


Reviewed 3 of 8 files at r1, 5 of 10 files at r2, 1 of 3 files at r3, 9 of 9 files at r4.
Review status: all files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


pkg/storage/replica_raftstorage.go, line 394 at r4 (raw file):

	// The Raft log entries for this snapshot.
	LogEntries [][]byte
	State      *storagebase.ReplicaState

is there any reason this is a pointer? looks like it made sense in an earlier version when the proto field was nullable.


pkg/storage/replica_raftstorage.go, line 581 at r4 (raw file):

	s := *inSnap.State
	if s.Desc.RangeID != r.RangeID {
		log.Fatalf(ctx, "unexpected range ID %d", s.Desc.RangeID)

should this include the expected one?


pkg/storage/engine/bench_test.go, line 548 at r4 (raw file):

			ts := makeTS(timeutil.Now().UnixNano(), 0)
			if err := MVCCPut(context.Background(), batch, nil, key, ts, value, nil); err != nil {
				b.Fatalf("failed put: %s", err)

merp this can just be .Fatal(err). just sayin'


pkg/storage/engine/engine.go, line 177 at r4 (raw file):

	// this engine. A write-only batch accumulates all mutations and applies them
	// atomically on a call to Commit(). Read operations return an error.
	NewWriteOnlyBatch() Batch

is it possible to have a new, narrower interface for this thing, rather than a bunch of runtime panics?



@petermattis (Collaborator Author)

Review status: all files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


pkg/storage/replica_raftstorage.go, line 394 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

is there any reason this is a pointer? looks like it made sense in an earlier version when the proto field was nullable.

No, there is no strong reason for it to be a pointer. There is also no strong reason for it not to be a pointer.


pkg/storage/replica_raftstorage.go, line 581 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

should this include the expected one?

The expected range ID is in the context.


pkg/storage/engine/bench_test.go, line 548 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

merp this can just be .Fatal(err). just sayin'

Done.


pkg/storage/engine/engine.go, line 177 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

is it possible to have a new, narrower interface for this thing, rather than a bunch of runtime panics?

Not easily. I just explored that and it requires a whole bunch of untangling. In particular, even defining a WriteBatch interface is irritating due to the Close method currently being defined on Reader.
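
A sketch of the tangle (illustrative names and signatures, not the actual engine package): with Close declared on Reader, a read-free WriteBatch interface cannot embed Reader without reintroducing reads, so Close would have to be redeclared and every implementation revisited.

```go
package engine

// Illustrative declarations only; the real interfaces are larger.

// Reader currently owns Close, so anything closable must also
// promise to serve reads.
type Reader interface {
	Close()
	Get(key []byte) ([]byte, error)
}

type Writer interface {
	Put(key, value []byte) error
}

// Batch gets Close "for free" by embedding Reader.
type Batch interface {
	Reader
	Writer
	Commit() error
}

// A narrower WriteBatch cannot embed Reader without reintroducing
// reads, so Close must be redeclared here; moving Close off Reader
// for real would ripple through every existing implementation.
type WriteBatch interface {
	Writer
	Commit() error
	Close()
}
```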



@tamird (Contributor) commented Nov 28, 2016

Reviewed 1 of 1 files at r5.
Review status: all files reviewed at latest revision, 4 unresolved discussions, some commit checks pending.


pkg/storage/replica_raftstorage.go, line 394 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

No, there is no strong reason for it to be a pointer. There is also no strong reason for it not to be a pointer.

In this case, I think there is - a pointer creates ambiguity because it is possible for it to be nil, and this field should never be nil.


pkg/storage/engine/engine.go, line 177 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Not easily. I just explored that and it requires a whole bunch of untangling. In particular, even defining a WriteBatch interface is irritating due to the Close method currently being defined on Reader.

About what I expected. It'd be good to file an issue, given the number of external contributors we've been enjoying.



@petermattis (Collaborator Author)

Review status: all files reviewed at latest revision, 4 unresolved discussions, some commit checks pending.


pkg/storage/replica_raftstorage.go, line 394 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

In this case, I think there is - a pointer creates ambiguity because it is possible for it to be nil, and this field should never be nil.

Using a value requires copying which is cheap, but not free. It also consumes stack space in this instance. I think we need a better justification than documenting non-nilness.


pkg/storage/engine/engine.go, line 177 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

About what I expected. It'd be good to file an issue, given the number of external contributors we've been enjoying.

This is low-priority but invasive and possibly destabilizing. Not a good combination.



@tamird (Contributor) commented Nov 28, 2016

Review status: 14 of 15 files reviewed at latest revision, 4 unresolved discussions, some commit checks pending.


pkg/storage/replica_raftstorage.go, line 394 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Using a value requires copying which is cheap, but not free. It also consumes stack space in this instance. I think we need a better justification than documenting non-nilness.

Isn't this exactly the opposite of the usual wisdom? Do the thing that is clearest to human readers unless benchmarks justify doing otherwise?



@petermattis (Collaborator Author)

Review status: 14 of 15 files reviewed at latest revision, 4 unresolved discussions, some commit checks pending.


pkg/storage/replica_raftstorage.go, line 394 at r4 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

Isn't this exactly the opposite of the usual wisdom? Do the thing that is clearest to human readers unless benchmarks justify doing otherwise?

#11616 was needed because benchmarks did show that growing the stack was problematic. And the reason our stacks are so large is our pervasive use of values instead of pointers: creating values and passing them on the stack even when a pointer was readily available. This is exactly an instance of that. The caller has a pointer readily available, we should pass it.



@tamird (Contributor) commented Nov 28, 2016 via email

@petermattis (Collaborator Author)

> I'm not saying that growing the stack is never problematic, but without demonstrating that it is problematic in this instance, that argument does not hold water.

Justifying every use of pointer vs value via a benchmark will ensure that we'll almost always use values.

@tamird (Contributor) commented Nov 28, 2016 via email

@petermattis (Collaborator Author)

> either we make evidence-based decisions, or we do not.

I don't think decision making is as black and white as that. We use evidence to make some decisions and then fold that evidence and those decisions into an internal model and heuristics that we use to make other decisions. We certainly don't base all of our decisions on evidence. I'm curious what evidence you would use to justify changing IncomingSnapshot.State from a pointer to a value.

@tamird (Contributor) commented Nov 28, 2016 via email

@petermattis (Collaborator Author)

> My argument is based on the assumption that the "default" is to use a value. At least, it is my recollection that this is the best practice documented in some document authored by the Go maintainers. It's a wise default because it more strongly encodes programmer intent than does a never-nil pointer.

You're probably thinking of https://github.com/golang/go/wiki/CodeReviewComments#pass-values:

> Don't pass pointers as function arguments just to save a few bytes. If a function refers to its argument x only as *x throughout, then the argument shouldn't be a pointer. Common instances of this include passing a pointer to a string (*string) or a pointer to an interface value (*io.Reader). In both cases the value itself is a fixed size and can be passed directly. This advice does not apply to large structs, or even small structs that might grow.

I'd be interested if there is other wisdom from the Go maintainers you're thinking of. Notice that last sentence. It is an important caveat.
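
A contrived illustration of that caveat (toy types, not IncomingSnapshot itself): passing a large struct by value copies the whole thing into the callee's frame, which is the "large structs" case the wiki carves out, while a pointer passes a single word at the cost of possible nilness.

```go
package main

import "fmt"

// bigState stands in for a struct large enough that copying matters.
type bigState struct {
	buf [4096]byte
}

// byValue copies all 4KB onto the callee's stack frame on every call.
func byValue(s bigState) byte { return s.buf[0] }

// byPointer passes a single machine word; no copy, but a stored
// *bigState field could be nil, which is the ambiguity debated above.
func byPointer(s *bigState) byte { return s.buf[0] }

func main() {
	var s bigState
	fmt.Println(byValue(s), byPointer(&s))
}
```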

@tamird (Contributor) commented Nov 28, 2016 via email

@bdarnell (Contributor)

:lgtm:


Reviewed 8 of 9 files at r4, 1 of 1 files at r5, 1 of 1 files at r6.
Review status: all files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


pkg/storage/raft.proto, line 107 at r6 (raw file):

message SnapshotRequest {
  message Header {
    reserved 1; // range_descriptor

Nit: we don't generally leave comments about what reserved tags were formerly used for.



@petermattis (Collaborator Author)

Review status: 11 of 15 files reviewed at latest revision, 5 unresolved discussions, some commit checks pending.


pkg/storage/raft.proto, line 107 at r6 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Nit: we don't generally leave comments about what reserved tags were formerly used for.

Removed.



@tamird (Contributor) commented Nov 30, 2016

Reviewed 1 of 1 files at r6, 4 of 4 files at r7.
Review status: all files reviewed at latest revision, 4 unresolved discussions, some commit checks pending.


pkg/storage/replica_raftstorage.go, line 394 at r7 (raw file):

In any case, if you choose to leave this as-is, the contract (never nil) and the intent (saving stack space) should be clearly documented.



This removes the need to read ReplicaState from the batch when applying
a snapshot, allowing the use of a write-only RocksDB batch. A write-only
RocksDB batch does not index keys on insertion/deletion; that indexing
is only necessary if the keys are read back from the batch.

This change requires a stop-the-world upgrade: new nodes will not be
able to receive snapshots generated by old nodes.

name                            time/op
BatchApplyBatchRepr-8             200ms ± 1%
WriteOnlyBatchApplyBatchRepr-8   99.1ms ± 1%

name                            speed
BatchApplyBatchRepr-8           180MB/s ± 1%
WriteOnlyBatchApplyBatchRepr-8  363MB/s ± 1%

Fixes cockroachdb#10783
@petermattis merged commit b820060 into cockroachdb:master on Nov 30, 2016
@petermattis deleted the pmattis/snapshot-replica-state branch on November 30, 2016 at 14:39
@tamird (Contributor) commented Nov 30, 2016

Reviewed 1 of 1 files at r8.
Review status: all files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


