Implement state parts in new format #8984

Closed
Tracked by #8899
Longarithm opened this issue Apr 28, 2023 · 6 comments
Assignees
Labels
C-tracking-issue Category: a tracking issue Node Node team T-core Team: issues relevant to the core team T-node Team: issues relevant to the node experience team

Comments

@Longarithm
Member

Longarithm commented Apr 28, 2023

Presentation about implemented state part format as of 23 May 2023: https://jamboard.google.com/d/1aSzP_ujR7u_m-fe7ZFxS6bpoFePhK6OKGMMvg2FO520/


This is the proposal for the state parts format, intended to work well with #8899. Originated from Zulip. Goal: the new format must be cleaner, contain only necessary data, and support flat storage.

General idea

To split the state, we take the list of trie nodes in DFS traversal order and compute prefix sums of node memory usages. If the total memory usage is total_size and we split the state into num_parts, we define the boundary of the X-th part as (total_size + num_parts - 1) / num_parts * X. All nodes between the X-th and (X+1)-th boundaries define the state items in range for part #X. See also https://pagodaplatform.atlassian.net/wiki/spaces/EAP/pages/302088267/State+sync+parts+DRAFT.
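The boundary formula above can be sketched in a few lines of Rust (a minimal illustration; `part_boundary` is a hypothetical helper, not the actual nearcore function):

```rust
// Minimal sketch of the boundary computation described above.
// Given the total memory usage of all trie nodes and the number of parts,
// the X-th boundary is ceil(total_size / num_parts) * X; part #X covers
// all nodes whose DFS prefix sum falls between boundaries X and X+1.
fn part_boundary(total_size: u64, num_parts: u64, x: u64) -> u64 {
    (total_size + num_parts - 1) / num_parts * x
}

fn main() {
    let (total_size, num_parts) = (1000, 4);
    for x in 0..=num_parts {
        // prints boundaries 0, 250, 500, 750, 1000
        println!("boundary of part #{x}: {}", part_boundary(total_size, num_parts, x));
    }
}
```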

Current format

  1. Two paths: from the root to the first node and from the root to the last node of part #X;
  2. The full set of nodes between boundaries #X and #(X+1).

One can prove that this set of nodes is connected. Because it includes the root, a node can verify that these nodes do exist in the state, and can independently verify and apply each part. But reading nodes is very slow, and we want to reduce the amount of data sent.

New format

  1. Data part - all KV pairs (items) in the range for part #X.
  2. Proof part - two boundaries:
    2.1) the path from the root to the first key of part #X, with all left siblings;
    2.2) the path from the root to the first key of part #(X+1).

In this example, all marked nodes form a proof for part 2.
(2.1) is necessary and sufficient to prove the first item of the part; (2.2) does the same for the last item.
I've also considered cutting the path in (2.2) at the first node which does not belong to part #X (at the red node), but it looks a bit harder to verify.
To prove the last part, it is enough to check that the path to the last item doesn't have any right siblings - or that "16" would be the next item in DFS order, which is what the code does.
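A hypothetical sketch of this two-component layout (illustrative names and types, not the actual nearcore definitions):

```rust
/// Data part plus proof part, as described above. Names are illustrative.
#[derive(Debug)]
struct StatePartV2 {
    /// (1) Data part: all key-value items in the range for part #X.
    items: Vec<(Vec<u8>, Vec<u8>)>,
    /// (2.1) Serialized nodes on the path from the root to the first key
    /// of part #X, together with all left siblings.
    left_boundary: Vec<Vec<u8>>,
    /// (2.2) Serialized nodes on the path from the root to the first key
    /// of part #(X+1).
    right_boundary: Vec<Vec<u8>>,
}

impl StatePartV2 {
    /// Total number of proof nodes shipped with the part.
    fn proof_len(&self) -> usize {
        self.left_boundary.len() + self.right_boundary.len()
    }
}

fn main() {
    let part = StatePartV2 {
        items: vec![(b"key".to_vec(), b"value".to_vec())],
        left_boundary: vec![b"root-node".to_vec()],
        right_boundary: vec![b"root-node".to_vec()],
    };
    println!("{} items, {} proof nodes", part.items.len(), part.proof_len());
}
```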

Generation

  1. Search for both boundaries in the Trie, with two different recording modes. Also return the first key for both parts.
  2. Make a range query to flat storage.
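Step 2 amounts to a plain range scan between the two proven boundary keys. A minimal sketch, with a `BTreeMap` standing in for flat storage (`items_in_part` is a hypothetical helper, not the nearcore API):

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

// Collect all KV pairs of part #X: everything from the first key of part #X
// (inclusive) up to the first key of part #(X+1) (exclusive).
fn items_in_part(
    flat: &BTreeMap<Vec<u8>, Vec<u8>>,
    first_key: &[u8],
    next_part_first_key: &[u8],
) -> Vec<(Vec<u8>, Vec<u8>)> {
    flat.range::<[u8], _>((Bound::Included(first_key), Bound::Excluded(next_part_first_key)))
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect()
}

fn main() {
    let mut flat = BTreeMap::new();
    flat.insert(b"aaa".to_vec(), b"1".to_vec());
    flat.insert(b"bbb".to_vec(), b"2".to_vec());
    flat.insert(b"ccc".to_vec(), b"3".to_vec());
    // Boundary keys ["aaa", "ccc") -> the part holds "aaa" and "bbb".
    println!("{:?}", items_in_part(&flat, b"aaa", b"ccc"));
}
```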

Validation & applying

  1. Size check (TBD). There should be some bound on state part size.
  2. (!) Validate state items: key.len() <= 2 KB, value.len() <= 4 MB. Otherwise I'm concerned about slowdowns/increased memory usage.
  3. Restore the first item of part #(X+1) from the given path. Check that the path doesn't have any extra nodes. Add it to the list of state items.
  4. Create trie T1 induced by the state boundary nodes.
  5. Create trie T2 from scratch using the received state items - by starting from an empty root and calling Trie::insert.
  6. Create T3 as the union of nodes from T1 and T2. We will work with T3 from now on.
  7. Check that the left boundary is the path to the first key of #X and doesn't have extra nodes. Descend to it while verifying memory usages, or fail if some node is missing.
  8. Iterate over the Trie until the first key of #(X+1) is found, or fail if some node is missing. Count visited state items and collect visited nodes.
  9. Compare the number of visited items with the number of items in the state part. If they match, validation passes. Add all visited nodes to the DB and add the state items to flat storage.
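Step 2 of the list above can be sketched as a simple per-item check (the limits come from the text; the constant and function names are illustrative, not nearcore's):

```rust
// Per-item sanity limits from validation step 2.
const MAX_KEY_LEN: usize = 2 * 1024;          // 2 KB
const MAX_VALUE_LEN: usize = 4 * 1024 * 1024; // 4 MB

// Reject oversized items up front, before any trie work is done,
// to avoid slowdowns and memory blow-ups on malicious parts.
fn validate_item(key: &[u8], value: &[u8]) -> bool {
    key.len() <= MAX_KEY_LEN && value.len() <= MAX_VALUE_LEN
}

fn main() {
    assert!(validate_item(b"account/alice", b"100"));
    assert!(!validate_item(&vec![0u8; MAX_KEY_LEN + 1], b""));
    println!("size checks passed");
}
```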

Note that some nodes from T1 will not be visited, and that is normal. For example, the state root generated from the state items will not match the real state root from the boundaries and will not be visited.

I claim that this is enough to verify data correctness.
https://jamboard.google.com/d/1aSzP_ujR7u_m-fe7ZFxS6bpoFePhK6OKGMMvg2FO520/ is a visual proof for a slightly modified algorithm, where we include all left siblings in the right boundary.
Note that we don't even have to check that the state items are ordered!

Pros

  • The number of nodes in a state part is greatly reduced, which allows us to do fast state sync.
  • The format is cleaner than the current one. In theory, we could switch to AVL or another tree and would only have to change the proof part; the data part stays the same.
  • It allows us to do optimistic state sync without the proof part. Note, though, that this requires building the trie iteratively: while a part is being applied, the trie must be locked, as we build the state root iteratively. The described process can be parallelised across parts.

Cons

  • Validation is harder than before, because now we in fact have to "guess" nodes for which only hashes were sent to us. Previously, it was enough to send raw nodes and compare visited-node counters. Now the range of items can be arbitrarily corrupted, which requires extra attention.

I'm especially interested in suggestions for simplifying the validation process, because it already looks complicated. Though I don't think our current validation process is articulated very well either, so a thorough and accurate definition is also an improvement.

TBD

@Longarithm Longarithm added T-node Team: issues relevant to the node experience team T-core Team: issues relevant to the core team C-tracking-issue Category: a tracking issue labels Apr 28, 2023
@Longarithm Longarithm self-assigned this Apr 28, 2023
@Longarithm
Member Author

For history: it seems that our current state sync approach was tracked in #1237. The majority of the code was added in #1592, which lacks documentation :P

@walnut-the-cat
Contributor

Can we also mention how this is different from what we have now?

Optimistic part

What does 'optimistic' mean here? Is there a case where some of the part can be missing?

@Longarithm
Member Author

Longarithm commented Apr 28, 2023

Done. I need to add more pictures at some point.
Changed "optimistic" to "data part". It is optimistic in the sense that if a node trusts its peers, they can send it only the data (optimistically), without proofs, which should speed the process up even more.

One more point: to release state sync sooner, we may stick with a slight modification of the current format. It is harder to generate on the sender side and the node has to send more data, but it is much, much easier to validate.

@nikurt
Contributor

nikurt commented Apr 28, 2023

Huge +1, the proposal looks great.
Maximizing the amount of useful data in state parts will make the state sync process more efficient for both senders and receivers.

Storing KV pairs introduces the cost of explicitly storing the keys, but that cost should be smaller than that of the Leaf nodes.

@Longarithm
Member Author

After looking at the existing implementation, I think that the "right boundary path" should include all left siblings as well, because:

  1. We can reuse the same TrieRecordingStorage code;
  2. We can verify the boundaries independently and get proven key boundaries sooner;
  3. To generate a state part, you need to iterate over all left siblings anyway, and they shouldn't consume too much space. Removing them from the state part is extra effort.

near-bulldozer bot pushed a commit that referenced this issue May 15, 2023
First step towards #8984. Here I want to guarantee that state part boundaries always correspond to some key-value pair, except in two corner cases:
* part_id = 0 -> the trie key is empty, which is lower than all keys
* part_id = num_parts -> the trie key is [16], which is larger than all keys
This guarantees that parts cover all nodes in the state.

It solves an inconvenience on the way to #8898. It is useful to assume that boundaries are keys, because that allows restoring all keys in a part with a trivial range query to flat storage. Otherwise you need a hack to convert one last nibble to a byte. This is also necessary if we switch to AVL or another tree some day - AVL should not know about trie nodes, and the interface should be defined in terms of state key-value pairs.
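The two corner cases can be sketched as follows (a hypothetical helper over nibble sequences, not the actual nearcore code):

```rust
// Map a part_id to its boundary key in nibble form. Nibbles range over
// 0..=15, so the sentinel [16] sorts above every real key, and the empty
// sequence sorts below every real key.
fn boundary_nibbles(part_id: u64, num_parts: u64, boundary_key: &[u8]) -> Vec<u8> {
    if part_id == 0 {
        vec![] // empty key: lower than all keys
    } else if part_id == num_parts {
        vec![16] // sentinel: larger than all keys
    } else {
        boundary_key.to_vec() // a real key-value pair's key
    }
}

fn main() {
    assert!(boundary_nibbles(0, 8, b"").is_empty());
    assert_eq!(boundary_nibbles(8, 8, b""), vec![16]);
    println!("corner-case boundaries ok");
}
```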

Some auxiliary work:
* more documentation for state_parts
* removing `visit_nodes_for_size_range_old`, as it was only needed for backwards compatibility with a version deprecated long ago

@nikurt note that after this, existing state parts become incompatible with newly generated ones.

## Testing

Testing is a pain because the current test set is not well organised. I'm adding two tests specifically for the new behaviour:
* `boundary_is_state_key` - checks that the state boundary is a key, by sampling a small trie. Doesn't pass without this change.
* `single_path_trie` - a small sanity check that keys are evenly distributed among state parts.

Also, testing revealed that `run_test_parts_not_huge` doesn't check anything, see [Zulip thread](https://near.zulipchat.com/#narrow/stream/308695-pagoda.2Fprivate/topic/state.20part.20unreasonable.20proof.20size/near/357047125). I refactored the test so that we separately check the proof size and the whole part size. Manually checked that two parts don't fit in the memory limit for that test.
@Longarithm
Member Author

To summarize:

  • I refactored state parts so that they always begin and end on some state item;
  • we considered compressing state parts further, but won't go for it now as it is not crucial.

However, we still may want to implement the full plan at some point because of two inefficiencies:

I will close this exact issue in a week if there are no objections.

near-bulldozer bot pushed a commit that referenced this issue May 25, 2023
Generate the inner part of a state part using flat storage, using the idea presented in #8984.

In short, if the flat storage head corresponds to the state root for which we sync state, it is enough to read only the boundary nodes; the inner trie part can be reconstructed using a range of KV pairs from the state. The main logic for that is contained in `Trie::get_trie_nodes_for_part_with_flat_storage`.

It requires a couple of minor changes:
* we now allow creating "view" `Trie`s with flat storage as well. As before, we want to avoid creating non-view `Trie`s because `TrieCache` accesses may be blocking for chunk processing
* `get_head_hash` and `shard_uid` methods for `FlatStorage`, allowing us to make a correct range query to flat storage
* `FlatStateValue` moved to `primitives` to allow more general access


## TODO
* prometheus metrics
* integration test checking that flat storage is used during normal block processing on client (or wait for #9090)
 
## Testing

https://nayduck.near.org/#/run/3023

Big sanity test `get_trie_nodes_for_part_with_flat_storage` covering all scenarios I could think of:
* results with/without flat storage must match
* result with incorrect flat storage must be an error
* result with flat storage and missing intermediate node should be still okay
nikurt pushed a commit that referenced this issue May 31, 2023
nikurt pushed a commit to nikurt/nearcore that referenced this issue Jun 8, 2023
nikurt pushed a commit to nikurt/nearcore that referenced this issue Jun 8, 2023
nikurt pushed a commit that referenced this issue Jun 13, 2023
@gmilescu gmilescu added the Node Node team label Oct 19, 2023
4 participants