Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: adr-40: reduce multistore and make it atomic #9355

Merged
merged 27 commits into from
Oct 22, 2021
Merged
Changes from 7 commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
201c73e
adr-40: use prefix store instead of multistore
robert-zaremba May 18, 2021
c432f5e
add note about prefix.Store
robert-zaremba May 18, 2021
7c4e308
Merge branch 'master' into robert/adr-40-update
robert-zaremba May 28, 2021
bbc692f
Update SC and SS setup information and historical versions sepc
robert-zaremba May 28, 2021
0612088
add note about key prefix optimization
robert-zaremba May 28, 2021
c12c5ec
rephrased the changes related to multistore
robert-zaremba Jun 3, 2021
60af1b0
Apply suggestions from code review
robert-zaremba Jun 4, 2021
28bda90
Update docs/architecture/adr-040-storage-and-smt-state-commitments.md
fedekunze Jun 7, 2021
aa1ded8
Update docs/architecture/adr-040-storage-and-smt-state-commitments.md
fedekunze Jun 7, 2021
d14572d
Merge branch 'master' into robert/adr-40-update
robert-zaremba Jun 21, 2021
096526c
Update docs/architecture/adr-040-storage-and-smt-state-commitments.md
robert-zaremba Sep 1, 2021
4351424
Merge branch 'master' into robert/adr-40-update
robert-zaremba Sep 14, 2021
ac7cde7
design update
robert-zaremba Sep 14, 2021
4ee227e
update merkle proofs
robert-zaremba Sep 16, 2021
1781fb5
Apply suggestions from code review
robert-zaremba Sep 22, 2021
eaa80de
reword huffman compression paragraph
robert-zaremba Sep 23, 2021
3f841fd
ADR-40: update on multi-store refactor and IBC proofs (#10191)
roysc Oct 1, 2021
2a7b1aa
review updates
robert-zaremba Oct 6, 2021
50cf72a
Merge remote-tracking branch 'origin/master' into robert/adr-40-update
robert-zaremba Oct 6, 2021
fbe4676
add todo for protobuf message type compression
robert-zaremba Oct 6, 2021
5dad4ef
add link to a discussion
robert-zaremba Oct 6, 2021
f9c77ae
guarantee atomic commit with IBC workaround proposal
robert-zaremba Oct 6, 2021
1fec94c
adding more links to references
robert-zaremba Oct 6, 2021
74193a9
Apply suggestions from code review
robert-zaremba Oct 21, 2021
2390332
reword the module key compression part
robert-zaremba Oct 21, 2021
aa9fd58
Merge branch 'master' into robert/adr-40-update
robert-zaremba Oct 21, 2021
6206ce1
Merge branch 'master' into robert/adr-40-update
robert-zaremba Oct 22, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
69 changes: 59 additions & 10 deletions docs/architecture/adr-040-storage-and-smt-state-commitments.md
Original file line number Diff line number Diff line change
Expand Up @@ -48,7 +48,7 @@ For data access we propose 2 additional KV buckets (namespaces for the key-value
2. B2: `hash(key, value) → key`: a reverse index to get a key from an SMT path. Recall that SMT will store `(k, v)` as `(hash(k), hash(key, value))`. So, we can get an object value by composing `SMT_path → B2 → B1`.
3. we could use more buckets to optimize the app usage if needed.

Above, we propose to use a KV DB. However, for the state machine, we could use an RDBMS, which we discuss below.
We propose to use a KV database for both `SS` and `SC` - each run by its own database instance. This design allows for the separation of `SS` and `SC` into different hardware units, providing support for more complex setup scenarios and improving DB performance (essentially this will create two store shards: one for `SS` and another for `SC`). Moreover, we will be able to configure databases for `SS` and `SC` separately.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this behavior be configurable? i.e. it could either be one database or two databases? It seems like two db's defeats the goal of atomic commits. In a configurable scenario, someone could simply pass in two prefix stores of the same db to use one atomic db.

Copy link
Contributor

@roysc roysc Aug 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a strong case for have two separate databases here. You state improved performance, but how does it do so and is the infrastructure complexity even worth it if that's the case. This smells like pre-optimization to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Is there any known performance problem you're trying to address?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

improved performance would require multi hardware setup.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait isn't using a separate database, one merkelized, one not, for state been shown to be a huge efficiency improvement? Like, I thought it was the main change clients like turbogeth do? Screenshot 2021-09-17 at 9 56 27 AM (Granted turbogeth's entire db

Basically all your state reads during execution don't have to do trie traversals along the merkle trie, which is guaranteed to not be cache efficient due to hashes being on every tree node. Whereas you can instead do your writes in a flat-map, and just at the end update the merkle paths in the merkelized store. You can re-use the same leaf storage for no additional data

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ValarDragon - the question is about physical DB. We are still going to use 2 different stores: SS and SC (which is the main source of optimization).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aaronc if we will split DB into 2 hardware units, then we will have another performance boost. However, you are right - it will not make the atomic updates.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During the call we are deciding to design the software to work with 2 DB instances as well as a single one.

robert-zaremba marked this conversation as resolved.
Show resolved Hide resolved

### Requirements

Expand All @@ -64,6 +64,7 @@ State Commitment requirements:

+ fast updates
+ tree path should be short
+ query historical commitment proofs using ICS-23 standard
+ pruning (garbage collection)

### LazyLedger SMT for State Commitment
Expand All @@ -74,27 +75,27 @@ A Sparse Merkle tree is based on the idea of a complete Merkle tree of an intrac

Below, with simple _snapshot_ we refer to a database snapshot mechanism, not to a _ABCI snapshot sync_. The latter will be referred as _snapshot sync_ (which will directly use DB snapshot as described below).

Database snapshot is a view of DB state at a certain time or transaction. It's not a full copy of a database (it would be too big), usually a snapshot mechanism is based on a _copy on write_ and it allows to efficiently deliver DB state at a certain stage.
Some DB engines support snapshotting. Hence, we propose to reuse that functionality for the state sync and versioning (described below). It will the supported DB engines to ones which efficiently implement snapshots. In a final section we will discuss evaluated DBs.
Database snapshot is a view of DB state at a certain time or transaction. It's not a full copy of a database (it would be too big). Usually a snapshot mechanism is based on a _copy on write_ and it allows DB state to be efficiently delivered at a certain stage.
Some DB engines support snapshotting. Hence, we propose to reuse that functionality for the state sync and versioning (described below). We limit the supported DB engines to ones which efficiently implement snapshots. In a final section we discuss the evaluated DBs.

One of the Stargate core features is a _snapshot sync_ delivered in the `/snapshot` package. It provides a way to trustlessly sync a blockchain without repeating all transactions from the genesis. This feature is implemented in SDK and requires storage support. Currently IAVL is the only supported backend. It works by streaming to a client a snapshot of a `SS` at a certain version together with a header chain.

A new `SS` snapshot will be created in every `EndBlocker` and identified by a block height. The `rootmulti.Store` keeps track of the available snapshots to offer `SS` at a certain version. The `rootmulti.Store` implements the `CommitMultiStore` interface, which encapsulates a `Committer` interface. `Committer` has a `Commit`, `SetPruning`, `GetPruning` functions which will be used for creating and removing snapshots. The `rootStore.Commit` function creates a new snapshot and increments the version on each call, and checks if it needs to remove old versions. We will need to update the SMT interface to implement the `Committer` interface.
A new database snapshot will be created in every `EndBlocker` and identified by a block height. The `rootmulti.Store` keeps track of the available snapshots to offer `SS` at a certain version. The `rootmulti.Store` implements the `CommitMultiStore` interface, which encapsulates a `Committer` interface. `Committer` has a `Commit`, `SetPruning`, `GetPruning` functions which will be used for creating and removing snapshots. The `rootStore.Commit` function creates a new snapshot and increments the version on each call, and checks if it needs to remove old versions. We will need to update the SMT interface to implement the `Committer` interface.
NOTE: `Commit` must be called exactly once per block. Otherwise we risk going out of sync for the version number and block height.
NOTE: For the SDK storage, we may consider splitting that interface into `Committer` and `PruningCommitter` - only the multiroot should implement `PruningCommitter` (cache and prefix store don't need pruning).

Number of historical versions for `abci.Query` and state sync snapshots is part of a node configuration, not a chain configuration (configuration implied by the blockchain consensus). A configuration should allow to specify number of past blocks and number of past blocks modulo some number (eg: 100 past blocks and one snapshot every 100 blocks for past 2000 blocks). Archival nodes can keep all past versions.
Number of historical versions for `abci.RequestQuery` and state sync snapshots is part of a node configuration, not a chain configuration (configuration implied by the blockchain consensus). A configuration should allow to specify number of past blocks and number of past blocks modulo some number (eg: 100 past blocks and one snapshot every 100 blocks for past 2000 blocks). Archival nodes can keep all past versions.

Pruning old snapshots is effectively done by a database. Whenever we update a record in `SC`, SMT won't update nodes - instead it creates new nodes on the update path, without removing the old one. Since we are snapshoting each block, we need to update that mechanism to immediately remove orphaned nodes from the storage. This is a safe operation - snapshots will keep track of the records which should be available for past versions.
Pruning old snapshots is effectively done by a database. Whenever we update a record in `SC`, SMT won't update nodes - instead it creates new nodes on the update path, without removing the old one. Since we are snapshotting each block, we need to change that mechanism to immediately remove orphaned nodes from the database. This is a safe operation - snapshots will keep track of the records and make it available when accessing past versions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the performance implication of this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are using underlying, optimized DB mechanism for managing versions and not pruning manually. We don't have a good solution for pruning, so we don't have a good baseline. The only baseline we have is the current IAVL implementation. We were benchmarking creating snapshots. But we can also benchmark snapshot removal.

Another consequences:

  • version management is effectively shifted down to the DB
  • we take an advantage of the huge community using the underlying DB and more contributors looking into it.
  • less code and problems to maintain.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you think we should further discuss it then I will add a TODO in the ADR - this is not related to this PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TL;DR: both RocksDB an Baggar have more DB experts and ecosystem to iron the snpashot management than we have.


To manage the active snapshots we will either us a DB _max number of snapshots_ option (if available), or will remove snapshots in the `EndBlocker`. The latter option can be done efficiently by identifying snapshots with block height.
To manage the active snapshots we will either use a DB _max number of snapshots_ option (if available), or we will remove DB snapshots in the `EndBlocker`. The latter option can be done efficiently by identifying snapshots with block height and calling a store function to remove past versions.

#### Accessing old state versions

One of the functional requirements is to access old state. This is done through `abci.Query` structure. The version is specified by a block height (so we query for an object by a key `K` at block height `H`). The number of old versions supported for `abci.Query` is configurable. Accessing an old state is done by using available snapshots.
`abci.Query` doesn't need old state of `SC`. So, for efficiency, we should keep `SC` and `SS` in different databases (however using the same DB engine).
One of the functional requirements is to access old state. This is done through `abci.RequestQuery` structure. The version is specified by a block height (so we query for an object by a key `K` at block height `H`). The number of old versions supported for `abci.RequestQuery` is configurable. Accessing an old state is done by using available snapshots.
`abci.RequestQuery` doesn't need old state of `SC` unless the `prove=true` parameter is set. The SMT merkle proof must be included in the `abci.ResponseQuery` only if both `SC` and `SS` have a snapshot for requested version.

Moreover, SDK could provide a way to directly access the state. However, a state machine shouldn't do that - since the number of snapshots is configurable, it would lead to nondeterministic execution.
Moreover, the SDK could provide a way to directly access a historical state. However, a state machine shouldn't do that - since the number of snapshots is configurable, it would lead to nondeterministic execution.

We positively [validated](https://github.com/cosmos/cosmos-sdk/discussions/8297) a versioning and snapshot mechanism for querying old state with regards to the database we evaluated.

Expand All @@ -110,6 +111,52 @@ We need to be able to process transactions and roll-back state updates if a tran

We identified use-cases, where modules will need to save an object commitment without storing an object itself. Sometimes clients are receiving complex objects, and they have no way to prove a correctness of that object without knowing the storage layout. For those use cases it would be easier to commit to the object without storing it directly.


### Remove MultiStore

IAVL based store adds an additional layer in the SDK store construction - the `MultiStore` structure. The multistore exists to support the modularity of the Cosmos SDK - each module is using its own instance of IAVL, but in the current implementation, all instances share the same database.
The latter indicates, however, that the implementation doesn't provide true modularity. Instead it causes problems related to race condition and sync problems (eg: [\#6370](https://github.com/cosmos/cosmos-sdk/issues/6370)).
robert-zaremba marked this conversation as resolved.
Show resolved Hide resolved

We propose to remove the multistore concept from the SDK, and to use a single instance of `SC` and `SS` in a `rootStore` object. To avoid confusion, we should rename the `MultiStore` interface to `RootStore` and make sure that the `rootStore` object implements `RootStore`.
robert-zaremba marked this conversation as resolved.
Show resolved Hide resolved

Moreover, to improve usability, we should extend the `KVStore` interface with _prefix store_. This will allow module developers to bind a store to a namespace for module sub-components:
robert-zaremba marked this conversation as resolved.
Show resolved Hide resolved

```
fedekunze marked this conversation as resolved.
Show resolved Hide resolved
type KVStore interface {
... current KVStore

WithPrefix(prefix string) KVStore
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be on the root KVStore interface itself? Can't this just be a functionality of RootStore?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, mentioned this above #9355 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense for this API to exist on the RootStore as that is how I imagine module's will request access to a prefixed store. That abstraction makes sense to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this WithPrefix method just takes a string and compresses it, it shouldn't be used by modules for prefix stores

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then again maybe that's something we do want modules to have to manage key prefixes 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate a bit on that? I still don't see the advantage of this over a prefix store which externally wraps a KV store. What optimizations would be useful?

I don't think it makes sense to try to do prefix compression within substores, especially not dynamically. That needs to be done on the SS bucket, so each substore would have to refer back to the root store to register the prefix. Also, if this exists on the KVStore interface, it could be used recursively on any derived store.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In SDK, modules usually have key families - objects with a common prefix. For example, in x/bank we have: supply, DenomMetadata, DenomAddress, Balances. In one of the architecture updates to the module we want that the module will receive it's own store. In x/bank, today we have:

ctx.KVStore(storeKey)  // where storeKey = "bank"

This doesn't follow OCAPS, because in x/bank I coud do ctx.KVStore("gov"). Instead we want:

bankStore := ctx.KVStore(goCtx)
// or
bankStore := accessor.KVStore(goCtx)

Then I can write:

balanceStore := bankStore.WithPrefix("BalancesPrefix")

and us it in some DAL responsible for balance management.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If WithPrefix is confusing then we can rename it to Substore or Subnamespace.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But what advantage does this have over:
balanceStore := PrefixStore(bankStore, "BalancesPrefix")

What kind of optimization would warrant adding this to the KVStore interface?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@roysc , you are right, PrefixStore function is better than extending the interface.

}
```

The `WithPrefix` method will create a proxy object for the parent object and will prepend `prefix` to a key parameter of all key-value operations (similar to the `store/prefix.Store` implementation).
The following invariant should be applied

```
for each OP in [Get Has, Set, ...]
store.WithPrefix(prefix).OP(key) == store.OP(prefix + key)
```

### Optimization: compress module keys

We can consider a compression of prefix keys using Huffman Coding. It will require a knowledge of used prefixes (module store keys) a priori. And for best results it will need frequency information for each prefix (how often objects are stored in the store under the same prefix key). With Huffman Coding, the above invariant should have the following form:
fedekunze marked this conversation as resolved.
Show resolved Hide resolved

```
for each OP in [Get Has, Set, ...]
store.WithPrefix(prefix).OP(key) == store.OP(store.Code(prefix) + key)
robert-zaremba marked this conversation as resolved.
Show resolved Hide resolved
```

Where `store.Code(prefix)` is a Huffman Code of `prefix` in the given `store`.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aaronc originally suggested to map a module "verbose" key to a 2 byte sequence prefix for key prefix compression, eg:

bank -> 0x0001
staking -> 0x0002
....

In both mechanisms we will need to assure that the created map will be stable (we will always construct the same mapping pairs).

NOTE: Both Huffman Codes and module key map compress the keys only for the SS. It's not needed for SC because in SC we always operate on a hash of a key.

Copy link
Collaborator Author

@robert-zaremba robert-zaremba May 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the Aaron idea is easier to manage. The limitation is that we bound the number of modules to 65536 (2^16) - which is big enough to not worry about it now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make this prefix model work effectively, I think there should be some stateful prefix map - possibly stored under some special restricted schema prefix (probably 0) on the RootStore. We can use varint encoding and not require a restriction to 2 bytes, but also 2 bytes is probably fine.

In this restricted schema key-space, we would have a mapping from key name -> compressed prefix, which is itself prefixed to allow for other schema use cases.

How does that sound?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How important is it to have this optimization at all?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

String prefixes could potentially be rather long (maybe full proto msg names eventually as discussed in #9156). I think it's relatively important.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to enable variable prefix length, then huffman coding is better, because it takes frequency into account and it is also protects against common prefixes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: we decided to use varint.

To avoid conflicts in the address space, we need to assure that in the set of prefix codes, there are not two elements where one is a prefix of another:

```
for each k1, k2 \in {store.Code(p): p \in StoreModuleKeys}
assert( !k1.hasPrefix(k2) )
```

NOTE: We need to assure that the codes won't change. Huffman Coding depends on the keys and its frequency - so we would need to generate the codes and then fix the mapping in a static variable.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why this couldn't change when stores are added/renamed, since it will only be used in the SS implementation. The current mapping could just be stored in a separate namespace in the DB.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, makes sense.


## Consequences

### Backwards Compatibility
Expand All @@ -123,6 +170,8 @@ We change the storage layout of the state machine, a storage hard fork and netwo
+ Decoupling state from state commitment introduce better engineering opportunities for further optimizations and better storage patterns.
+ Performance improvements.
+ Joining SMT based camp which has wider and proven adoption than IAVL. Example projects which decided on SMT: Ethereum2, Diem (Libra), Trillan, Tezos, LazyLedger.
+ Multistore removal fixes a longstanding issue with the current MultiStore design.


### Negative

Expand Down