Implement state parts in new format #8984
Comments
Can we also mention how this is different from what we have now?
What does 'optimistic' here mean? Is there a case where some nodes of the part can be missing?
Done. I need to add more pictures at some point. One more point: to release state sync sooner, we may stick with a slight modification of the current format. It is harder to generate on the sender side and the node has to send more data, but it is much, much easier to validate.
Huge +1, the proposal looks great. Storing KV pairs introduces the cost of explicitly storing the keys, but that cost should be smaller than that of the Leaf nodes.
After looking at the existing implementation, I think that the "right boundary path" should include all left siblings as well, because:
First step towards #8984. Here I want to guarantee that state part boundaries always correspond to some key-value pair, except two corner cases:

* part_id = 0 -> trie key is empty, which is lower than all keys
* part_id = num_parts -> trie key is [16], which is larger than all keys

This guarantees that parts cover all nodes in state. It solves an inconvenience on the way to #8898. It is useful to assume that boundaries are keys, because that allows restoring all keys in a part by making a trivial range query to flat storage. Otherwise you need a hack to convert one last nibble to a byte. This is also necessary if we switch to AVL or another tree some day - AVL should not know about trie nodes, and the interface should be defined in terms of state key-value pairs. Some auxiliary work:

* more documentation for state_parts
* removing `visit_nodes_for_size_range_old`, as it was needed for backwards compatibility with a version deprecated long ago

@nikurt note that after this, existing state parts become incompatible with newly generated ones.

## Testing

Testing is a pain because the current test set is not well organised. I'm adding two tests specifically for the new behaviour:

* `boundary_is_state_key` - checks that a state boundary is a key, for a sampled small trie. Doesn't pass without this change.
* `single_path_trie` - small sanity check that keys are evenly distributed among state parts.

Also, testing revealed that `run_test_parts_not_huge` doesn't check anything, see [Zulip thread](https://near.zulipchat.com/#narrow/stream/308695-pagoda.2Fprivate/topic/state.20part.20unreasonable.20proof.20size/near/357047125). I refactored the test in such a way that we separately check proof size and whole part size. Manually checked that two parts don't fit in the memory limit for that test.
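The two corner cases above can be sketched as a tiny helper (the function name is mine, not the actual nearcore API): trie nibbles range over 0..=15, so the sentinel [16] sorts after every real trie key.

```rust
// Hypothetical helper illustrating the boundary corner cases; not the
// actual nearcore API.
fn boundary_nibbles(part_id: u64, num_parts: u64) -> Vec<u8> {
    if part_id == 0 {
        vec![] // empty key: lower than all keys
    } else if part_id == num_parts {
        vec![16] // nibbles are 0..=15, so [16] is larger than all keys
    } else {
        // Inner boundaries come from the size-based split and, per this
        // change, must themselves correspond to real state keys.
        unimplemented!()
    }
}

fn main() {
    assert!(boundary_nibbles(0, 8).is_empty());
    assert_eq!(boundary_nibbles(8, 8), vec![16u8]);
}
```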
To summarize:
However, we still may want to implement the full plan at some point because of two inefficiencies:
I will close this exact issue in a week if there are no objections.
Generate the inner part of a state part using flat storage, using the idea presented in #8984. In short, if the flat storage head corresponds to the state root for which we sync state, it is enough to read only boundary nodes, and the inner trie part can be reconstructed using a range of KV pairs from state. The main logic for that is contained in `Trie::get_trie_nodes_for_part_with_flat_storage`. It requires a couple of minor changes:

* we now allow creating "view" `Trie`s with flat storage as well. As before, we want to avoid creating non-view `Tries` because `TrieCache` accesses may be blocking for chunk processing
* `get_head_hash` and `shard_uid` methods for `FlatStorage`, allowing to make a correct range query to flat storage
* `FlatStateValue` moved to `primitives` to allow more general access

## TODO

* prometheus metrics
* integration test checking that flat storage is used during normal block processing on client (or wait for #9090)

## Testing

https://nayduck.near.org/#/run/3023

Big sanity test `get_trie_nodes_for_part_with_flat_storage` covering all scenarios I could think of:

* results with/without flat storage must match
* result with incorrect flat storage must be an error
* result with flat storage and a missing intermediate node should still be okay
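A hedged sketch of the flat-storage idea above: if flat storage is modeled as an ordered key-value view of state, the inner data of a state part is a single range query between the part's two boundary keys. `inner_part` is a hypothetical simplification, not the actual `get_trie_nodes_for_part_with_flat_storage` signature.

```rust
use std::collections::BTreeMap;

// Stand-in for flat storage: an ordered key-value view of state.
// The inner data of a state part is then just a range query between
// the part's boundary keys.
fn inner_part<'a>(
    flat: &'a BTreeMap<Vec<u8>, Vec<u8>>,
    from: &[u8], // first key of part #X (inclusive)
    to: &[u8],   // first key of part #(X+1) (exclusive)
) -> Vec<(&'a Vec<u8>, &'a Vec<u8>)> {
    flat.range(from.to_vec()..to.to_vec()).collect()
}

fn main() {
    let mut flat = BTreeMap::new();
    for k in [b"aa", b"ab", b"ba", b"bb"] {
        flat.insert(k.to_vec(), vec![0u8]);
    }
    // The part bounded by keys "ab" and "bb" contains "ab" and "ba".
    let part = inner_part(&flat, b"ab", b"bb");
    assert_eq!(part.len(), 2);
}
```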
Presentation about implemented state part format as of 23 May 2023: https://jamboard.google.com/d/1aSzP_ujR7u_m-fe7ZFxS6bpoFePhK6OKGMMvg2FO520/
This is the proposal for the state parts format, designed to work well with #8899. Originated from a Zulip discussion. Goal - the new format must be cleaner, contain only necessary data, and support flat storage.
General idea
To split state, we take the list of trie nodes in DFS traversal order and compute prefix sums of node memory usages. If the total of memory usages is `total_size` and we split state into `num_parts` parts, we define the boundary of the X-th part as `(total_size + num_parts - 1) / num_parts * X`.
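A minimal sketch of this boundary computation (the helper name is mine, not nearcore's):

```rust
// Ceiling-divide total_size by num_parts, then scale by the part
// index; the last boundary is >= total_size, so the final part covers
// the tail of the DFS traversal.
fn part_boundary(total_size: u64, num_parts: u64, part_id: u64) -> u64 {
    (total_size + num_parts - 1) / num_parts * part_id
}

fn main() {
    // With total_size = 10 and num_parts = 3, boundaries fall at
    // prefix-sum offsets 0, 4, 8, 12.
    let boundaries: Vec<u64> = (0..=3).map(|x| part_boundary(10, 3, x)).collect();
    assert_eq!(boundaries, vec![0, 4, 8, 12]);
}
```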
All nodes between the X-th and (X+1)-th boundaries define the state items in range for part #X. See also https://pagodaplatform.atlassian.net/wiki/spaces/EAP/pages/302088267/State+sync+parts+DRAFT.

Current format
One can prove that this set of nodes is connected. Because it includes the root, a node can verify that these nodes do exist in state. A node can independently verify and apply each part. But reading nodes is very slow, and we want to reduce the amount of data sent.
New format
2.1) path from root to the first key for part #X with all left siblings;
2.2) path from root to the first key from part #(X+1).
In this example, all marked nodes are a proof for part 2.
2.1 is necessary and sufficient to prove the first item of the part; 2.2 does the same for the last item.
I’ve also considered cutting the path in (2.2) at the first node which does not belong to part #X (at the red node). But it looks a bit harder to verify.
To prove the last part, it is enough to check that the path to the last item doesn’t have any right siblings. Or that “16” would be the next item in its DFS order, as the code does.
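The no-right-siblings check can be sketched with a simplified branch-node type (a stand-in, not the nearcore node type): the right boundary path proves the last item iff no branch on it has a child to the right of the nibble the path descends through.

```rust
// Simplified stand-in for a trie branch node: 16 child slots.
struct Branch {
    children: [bool; 16], // true if a child exists at that nibble
}

// `path` lists each branch on the right boundary path together with
// the nibble the path descends through.
fn proves_last_item(path: &[(Branch, usize)]) -> bool {
    path.iter()
        .all(|(branch, nibble)| branch.children[nibble + 1..].iter().all(|&c| !c))
}

fn main() {
    let mut children = [false; 16];
    children[3] = true;
    children[5] = true;
    // Descending through nibble 5 leaves no right siblings...
    assert!(proves_last_item(&[(Branch { children }, 5)]));
    // ...but descending through nibble 3 leaves one at nibble 5.
    assert!(!proves_last_item(&[(Branch { children }, 3)]));
}
```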
Generation
Validation & applying
`Trie::insert`. Note that some nodes from T1 will not be visited, and that is normal. For example, the state root generated from state items will not match the real state root from boundaries and will not be visited.
I claim that this is enough to verify data correctness.
https://jamboard.google.com/d/1aSzP_ujR7u_m-fe7ZFxS6bpoFePhK6OKGMMvg2FO520/ is a visual proof for a slightly modified algo, where we include all left siblings into the right boundary.
Note that we don't even have to check that state items are ordered!
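A toy illustration of why ordering comes for free (using an ordered map as a stand-in for the trie): construction is insertion-order independent, so inserting the state items in any order yields the same structure, and hence the same recomputed root.

```rust
use std::collections::BTreeMap;

fn main() {
    let items = vec![(b"b".to_vec(), 2), (b"a".to_vec(), 1), (b"c".to_vec(), 3)];

    // Insert in the given (unsorted) order.
    let unsorted: BTreeMap<_, _> = items.clone().into_iter().collect();

    // Insert in sorted order.
    let mut sorted_items = items;
    sorted_items.sort();
    let sorted: BTreeMap<_, _> = sorted_items.into_iter().collect();

    // Same structure either way, so a validator need not check order.
    assert_eq!(unsorted, sorted);
}
```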
Pros
Cons
I'm especially interested in suggestions on simplifying the validation process, because it already looks complicated. Though I don't think our current validation process is articulated very well, so a thorough and accurate definition is also an improvement.
TBD