Implement state parts in new format #8984

Closed
Tracked by #8899
Longarithm opened this issue Apr 28, 2023 · 6 comments
Assignees
Labels
C-tracking-issue Category: a tracking issue Node Node team T-core Team: issues relevant to the core team T-node Team: issues relevant to the node experience team

Comments

@Longarithm
Member

Longarithm commented Apr 28, 2023

Presentation about implemented state part format as of 23 May 2023: https://jamboard.google.com/d/1aSzP_ujR7u_m-fe7ZFxS6bpoFePhK6OKGMMvg2FO520/


This is the proposal for the state parts format, intended to work well with #8899. Originated from Zulip. Goal: the new format must be cleaner, contain only necessary data, and support flat storage.

General idea

To split the state, we take the list of trie nodes in DFS traversal order and compute prefix sums of node memory usages. If the total memory usage is total_size and we split the state into num_parts, we define the boundary of the X-th part as (total_size + num_parts - 1) / num_parts * X. All nodes between the X-th and (X+1)-th boundaries define the state items in range for part #X. See also https://pagodaplatform.atlassian.net/wiki/spaces/EAP/pages/302088267/State+sync+parts+DRAFT.
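The boundary formula above can be sketched in a few lines of Rust (a minimal illustration; `part_boundary` is a hypothetical helper, not the actual nearcore function):

```rust
// Minimal sketch of the boundary computation described above.
// Given the total memory usage of all trie nodes and the number of parts,
// the X-th boundary is ceil(total_size / num_parts) * X; part #X covers
// all nodes whose DFS prefix sum falls between boundaries X and X+1.
fn part_boundary(total_size: u64, num_parts: u64, x: u64) -> u64 {
    (total_size + num_parts - 1) / num_parts * x
}

fn main() {
    let (total_size, num_parts) = (1000, 4);
    for x in 0..=num_parts {
        // prints boundaries 0, 250, 500, 750, 1000
        println!("boundary of part #{x}: {}", part_boundary(total_size, num_parts, x));
    }
}
```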

Current format

  1. Two paths: from the root to the first node and from the root to the last node of part #X;
  2. The full set of nodes between boundaries #X and #(X+1).

One can prove that this set of nodes is connected. Because it includes the root, a node can verify that these nodes do exist in the state, and can independently verify and apply each part. But reading nodes is very slow, and we want to reduce the amount of data sent.

New format

  1. Data part - all KV pairs (items) in the range for part #X.
  2. Proof part - two boundaries:
    2.1) the path from the root to the first key of part #X, with all left siblings;
    2.2) the path from the root to the first key of part #(X+1).

In this example, all marked nodes form a proof for part 2.
(2.1) is necessary and sufficient to prove the first item of the part; (2.2) does the same for the last item.
I've also considered cutting the path in (2.2) at the first node which does not belong to part #X (at the red node), but it looks a bit harder to verify.
To prove the last part, it is enough to check that the path to the last item doesn't have any right siblings - or that "16" would be the next item in DFS order, which is what the code does.
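A hypothetical sketch of this two-component layout (illustrative names and types, not the actual nearcore definitions):

```rust
/// Data part plus proof part, as described above. Names are illustrative.
#[derive(Debug)]
struct StatePartV2 {
    /// (1) Data part: all key-value items in the range for part #X.
    items: Vec<(Vec<u8>, Vec<u8>)>,
    /// (2.1) Serialized nodes on the path from the root to the first key
    /// of part #X, together with all left siblings.
    left_boundary: Vec<Vec<u8>>,
    /// (2.2) Serialized nodes on the path from the root to the first key
    /// of part #(X+1).
    right_boundary: Vec<Vec<u8>>,
}

impl StatePartV2 {
    /// Total number of proof nodes shipped with the part.
    fn proof_len(&self) -> usize {
        self.left_boundary.len() + self.right_boundary.len()
    }
}

fn main() {
    let part = StatePartV2 {
        items: vec![(b"key".to_vec(), b"value".to_vec())],
        left_boundary: vec![b"root-node".to_vec()],
        right_boundary: vec![b"root-node".to_vec()],
    };
    println!("{} items, {} proof nodes", part.items.len(), part.proof_len());
}
```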

Generation

  1. Search for both boundaries in the Trie, with two different recording modes. Also return the first key for both parts.
  2. Make a range query to flat storage.
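Step 2 amounts to a plain range scan between the two proven boundary keys. A minimal sketch, with a `BTreeMap` standing in for flat storage (`items_in_part` is a hypothetical helper, not the nearcore API):

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

// Collect all KV pairs of part #X: everything from the first key of part #X
// (inclusive) up to the first key of part #(X+1) (exclusive).
fn items_in_part(
    flat: &BTreeMap<Vec<u8>, Vec<u8>>,
    first_key: &[u8],
    next_part_first_key: &[u8],
) -> Vec<(Vec<u8>, Vec<u8>)> {
    flat.range::<[u8], _>((Bound::Included(first_key), Bound::Excluded(next_part_first_key)))
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect()
}

fn main() {
    let mut flat = BTreeMap::new();
    flat.insert(b"aaa".to_vec(), b"1".to_vec());
    flat.insert(b"bbb".to_vec(), b"2".to_vec());
    flat.insert(b"ccc".to_vec(), b"3".to_vec());
    // Boundary keys ["aaa", "ccc") -> the part holds "aaa" and "bbb".
    println!("{:?}", items_in_part(&flat, b"aaa", b"ccc"));
}
```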

Validation & applying

  1. Size check (TBD). There should be some bound on state part size.
  2. (!) Validate state items: key.len() <= 2 KB, value.len() <= 4 MB. Otherwise I'm concerned about slowdowns/increased memory usage.
  3. Restore the first item of part #(X+1) from the given path. Check that the path doesn't have any extra nodes. Add it to the list of state items.
  4. Create trie T1 induced by the state boundary nodes.
  5. Create trie T2 from scratch using the received state items - by starting from an empty root and calling Trie::insert.
  6. Create T3 as the union of nodes from T1 and T2. We will work with T3 from now on.
  7. Check that the left boundary is the path to the first key of #X and doesn't have extra nodes. Descend to it while verifying memory usages, or fail if some node is missing.
  8. Iterate over the Trie until the first key of #(X+1) is found, or fail if some node is missing. Count visited state items and collect visited nodes.
  9. Compare the number of visited items with the number of items in the state part. If they match, validation passes. Add all visited nodes to the DB and add the state items to flat storage.
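Step 2 of the list above can be sketched as a simple per-item check (the limits come from the text; the constant and function names are illustrative, not nearcore's):

```rust
// Per-item sanity limits from validation step 2.
const MAX_KEY_LEN: usize = 2 * 1024;          // 2 KB
const MAX_VALUE_LEN: usize = 4 * 1024 * 1024; // 4 MB

// Reject oversized items up front, before any trie work is done,
// to avoid slowdowns and memory blow-ups on malicious parts.
fn validate_item(key: &[u8], value: &[u8]) -> bool {
    key.len() <= MAX_KEY_LEN && value.len() <= MAX_VALUE_LEN
}

fn main() {
    assert!(validate_item(b"account/alice", b"100"));
    assert!(!validate_item(&vec![0u8; MAX_KEY_LEN + 1], b""));
    println!("size checks passed");
}
```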

Note that some nodes from T1 will not be visited, and that is normal. For example, the state root generated from the state items will not match the real state root from the boundaries and will not be visited.

I claim that this is enough to verify data correctness.
https://jamboard.google.com/d/1aSzP_ujR7u_m-fe7ZFxS6bpoFePhK6OKGMMvg2FO520/ is a visual proof for a slightly modified algorithm, where we include all left siblings in the right boundary.
Note that we don't even have to check that the state items are ordered!

Pros

  • The number of nodes in a state part is greatly reduced, which allows us to do fast state sync.
  • The format is cleaner than the current one. In theory, we could switch to AVL or another tree and would only have to change the proof part; the data part stays the same.
  • It allows us to do optimistic state sync without the proof part. Note, though, that this requires building the trie iteratively: while a part is being applied, the trie must be locked, as we build the state root iteratively. The described process can be parallelised across parts.

Cons

  • Validation is harder than before, because now we in fact have to "guess" nodes for which only hashes were sent to us. Previously, it was enough to send raw nodes and compare visited-node counters. Now the range of items can be arbitrarily corrupted, which requires extra attention.

I'm especially interested in suggestions for simplifying the validation process, because it already looks complicated. Though I don't think our current validation process is articulated very well either, so a thorough and accurate definition is also an improvement.

TBD

@Longarithm Longarithm added T-node Team: issues relevant to the node experience team T-core Team: issues relevant to the core team C-tracking-issue Category: a tracking issue labels Apr 28, 2023
@Longarithm Longarithm self-assigned this Apr 28, 2023
@Longarithm
Member Author

For history: it seems that our current state sync approach was tracked in #1237. The majority of the code was added in #1592, which lacks documentation :P

@walnut-the-cat
Contributor

Can we also mention how this is different from what we have now?

Optimistic part

What does 'optimistic' mean here? Is there a case where some of the part can be missing?

@Longarithm
Member Author

Longarithm commented Apr 28, 2023

Done. I need to add more pictures at some point.
Changed "optimistic" to "data part". It is optimistic in the sense that if a node trusts its peers, they can send it only the data (optimistically), without proofs, which should speed the process up even more.

One more point: to release state sync sooner, we may stick with a slight modification of the current format. It is harder to generate on the sender side and the node has to send more data, but it is much, much easier to validate.

@nikurt
Contributor

nikurt commented Apr 28, 2023

Huge +1, the proposal looks great.
Maximizing the amount of useful data in state parts will make the state sync process more efficient for both senders and receivers.

Storing KV pairs introduces the cost of explicitly storing the keys, but that cost should be smaller than that of the Leaf nodes.

@Longarithm
Member Author

After looking at the existing implementation, I think that the "right boundary path" should include all left siblings as well, because:

  1. We can reuse the same TrieRecordingStorage code;
  2. We can verify the boundaries independently and get proven key boundaries sooner;
  3. To generate a state part, you need to iterate over all left siblings anyway, and they shouldn't consume too much space. Removing them from the state part is extra effort.

near-bulldozer bot pushed a commit that referenced this issue May 15, 2023
First step towards #8984. Here I want to guarantee that state part boundaries always correspond to some key-value pair, except in two corner cases:
* part_id = 0 -> the trie key is empty, which is lower than all keys
* part_id = num_parts -> the trie key is [16], which is larger than all keys
This guarantees that parts cover all nodes in the state.

It solves an inconvenience on the way to #8898. It is useful to assume that boundaries are keys, because that allows restoring all keys in a part with a trivial range query to flat storage. Otherwise you need a hack to convert one last nibble to a byte. This is also necessary if we switch to AVL or another tree some day - AVL should not know about trie nodes, and the interface should be defined in terms of state key-value pairs.
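The two corner cases can be sketched as follows (a hypothetical helper over nibble sequences, not the actual nearcore code):

```rust
// Map a part_id to its boundary key in nibble form. Nibbles range over
// 0..=15, so the sentinel [16] sorts above every real key, and the empty
// sequence sorts below every real key.
fn boundary_nibbles(part_id: u64, num_parts: u64, boundary_key: &[u8]) -> Vec<u8> {
    if part_id == 0 {
        vec![] // empty key: lower than all keys
    } else if part_id == num_parts {
        vec![16] // sentinel: larger than all keys
    } else {
        boundary_key.to_vec() // a real key-value pair's key
    }
}

fn main() {
    assert!(boundary_nibbles(0, 8, b"").is_empty());
    assert_eq!(boundary_nibbles(8, 8, b""), vec![16]);
    println!("corner-case boundaries ok");
}
```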

Some auxiliary work:
* more documentation for state_parts
* removing `visit_nodes_for_size_range_old`, as it was only needed for backwards compatibility with a version deprecated long ago

@nikurt note that after this, existing state parts become incompatible with newly generated ones.

## Testing

Testing is a pain because the current test set is not well organised. I'm adding two tests specifically for the new behaviour:
* `boundary_is_state_key` - checks that the state boundary is a key, by sampling a small trie. Doesn't pass without this change.
* `single_path_trie` - a small sanity check that keys are evenly distributed among state parts.

Also, testing revealed that `run_test_parts_not_huge` doesn't check anything, see [Zulip thread](https://near.zulipchat.com/#narrow/stream/308695-pagoda.2Fprivate/topic/state.20part.20unreasonable.20proof.20size/near/357047125). I refactored the test so that we separately check the proof size and the whole part size. Manually checked that two parts don't fit in the memory limit for that test.
@Longarithm
Member Author

To summarize:

  • I refactored state parts so that they always begin and end on some state item;
  • we considered compressing state parts further, but won't go for it now as it is not crucial.

However, we still may want to implement the full plan at some point because of two inefficiencies:

I will close this exact issue in a week if there are no objections.

near-bulldozer bot pushed a commit that referenced this issue May 25, 2023
Generate the inner part of a state part using flat storage, using the idea presented in #8984.

In short, if the flat storage head corresponds to the state root for which we sync state, it is enough to read only the boundary nodes; the inner trie part can be reconstructed using a range of KV pairs from the state. The main logic for that is contained in `Trie::get_trie_nodes_for_part_with_flat_storage`.

It requires a couple of minor changes:
* we now allow creating "view" `Trie`s with flat storage as well. As before, we want to avoid creating non-view `Trie`s because `TrieCache` accesses may be blocking for chunk processing
* `get_head_hash` and `shard_uid` methods for `FlatStorage`, allowing us to make a correct range query to flat storage
* `FlatStateValue` moved to `primitives` to allow more general access


## TODO
* prometheus metrics
* integration test checking that flat storage is used during normal block processing on client (or wait for #9090)
 
## Testing

https://nayduck.near.org/#/run/3023

Big sanity test `get_trie_nodes_for_part_with_flat_storage` covering all scenarios I could think of:
* results with/without flat storage must match
* result with incorrect flat storage must be an error
* result with flat storage and missing intermediate node should be still okay
nikurt pushed a commit that referenced this issue May 31, 2023
nikurt pushed a commit to nikurt/nearcore that referenced this issue Jun 8, 2023
nikurt pushed a commit to nikurt/nearcore that referenced this issue Jun 8, 2023
nikurt pushed a commit that referenced this issue Jun 13, 2023
@gmilescu gmilescu added the Node Node team label Oct 19, 2023
4 participants