
fix: record nodes for writes in memtrie #10841

Merged
merged 7 commits into from
Mar 25, 2024

Conversation

Longarithm
Member

@Longarithm Longarithm commented Mar 20, 2024

Fixes #10769.

UPD: New motivation

While everything said below is true, the main motivation is to account for nodes impacted by squash_nodes, which can only be touched without much pain at Trie::update.

Old motivation

The root cause of the "node not found" errors on reads from recorded storage is that nodes touched only at Trie::update - the KV updates at the end of chunk application - were never recorded for memtries.
This happens automatically for disk tries. For memtries we retroactively account for trie node accesses, but this wasn't done in a similar way for memtrie updates.
While the infrastructure does need to be improved so these issues stop surprising us, the easiest fix is to account for all accesses the same way move_node_to_mutable does - on memtrie paths like ensure_updated, which move a node out of storage into memory. That means the node has to be recorded so the trie can be validated. Unfortunately, refcount changes are not enough to cover this, so I record trie accesses separately. Also, the memtrie doesn't have access to values, so I have to retrieve the values and record them in Trie::update.
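The shape of the fix can be sketched as follows. This is a toy model with illustrative names, not the real nearcore types: the memtrie update reports which node hashes it touched (with their serialized bytes) and which value hashes it touched (without bytes, since values don't live in the memtrie), and at the end of the update the recorder is fed both, fetching value bytes from storage.

```rust
use std::collections::{HashMap, HashSet};

/// Toy stand-in for a node/value hash; the real code uses CryptoHash.
type Hash = u64;

/// Accesses gathered while applying updates on the in-memory trie.
struct TrieAccesses {
    /// Nodes touched on the update path, with their serialized bytes.
    nodes: HashMap<Hash, Vec<u8>>,
    /// Values are not stored in the memtrie, so only their hashes are known.
    values: HashSet<Hash>,
}

/// Collects everything needed for the recorded partial state.
#[derive(Default)]
struct Recorder {
    recorded: HashMap<Hash, Vec<u8>>,
}

impl Recorder {
    fn record(&mut self, hash: Hash, bytes: Vec<u8>) {
        self.recorded.insert(hash, bytes);
    }
}

/// Sketch of the retroactive step at the end of `Trie::update`:
/// nodes already carry their bytes, values are fetched from storage.
fn record_accesses(
    recorder: &mut Recorder,
    accesses: TrieAccesses,
    storage: &HashMap<Hash, Vec<u8>>,
) {
    for (node_hash, serialized) in accesses.nodes {
        recorder.record(node_hash, serialized);
    }
    for value_hash in accesses.values {
        if let Some(value) = storage.get(&value_hash) {
            recorder.record(value_hash, value.clone());
        }
    }
}
```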

Testing

  • New logic is covered by the trie_recording tests. I also discovered an issue there - destructively_delete_in_memory_state_from_disk was removing all keys because key decoding was incorrect. These tests now fully cover the trie access scenario: first we have get and get_ref accesses, and at the very end we have an update call for some KV pairs. I confirmed that with an incorrect implementation some tests fail with the same "node not found" error, and also that the new sanity check on Trie::update works.
  • test_memtrie_discrepancy is renamed and works now

@Longarithm Longarithm changed the title draft: record trie writes draft: record nodes for writes in memtrie Mar 20, 2024
@Longarithm Longarithm marked this pull request as ready for review March 20, 2024 13:34
@Longarithm Longarithm requested a review from a team as a code owner March 20, 2024 13:34
@Longarithm Longarithm changed the title draft: record nodes for writes in memtrie fix: record nodes for writes in memtrie Mar 20, 2024
@jancionear jancionear self-requested a review March 20, 2024 13:53

codecov bot commented Mar 20, 2024

Codecov Report

Attention: Patch coverage is 97.84946%, with 2 lines in your changes missing coverage. Please review.

Project coverage is 71.42%. Comparing base (b8048da) to head (c01638c).
Report is 4 commits behind head on master.

Files Patch % Lines
core/store/src/trie/mod.rs 88.88% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           master   #10841    +/-   ##
========================================
  Coverage   71.41%   71.42%            
========================================
  Files         763      764     +1     
  Lines      153717   153956   +239     
  Branches   153717   153956   +239     
========================================
+ Hits       109781   109965   +184     
- Misses      38938    38991    +53     
- Partials     4998     5000     +2     
Flag Coverage Δ
backward-compatibility 0.24% <0.00%> (-0.01%) ⬇️
db-migration 0.24% <0.00%> (-0.01%) ⬇️
genesis-check 1.41% <0.00%> (-0.01%) ⬇️
integration-tests 36.82% <65.59%> (+0.03%) ⬆️
linux 70.05% <97.84%> (-0.02%) ⬇️
linux-nightly 70.91% <97.84%> (-0.02%) ⬇️
macos 54.62% <97.84%> (-0.04%) ⬇️
pytests 1.64% <0.00%> (-0.01%) ⬇️
sanity-checks 1.42% <0.00%> (-0.01%) ⬇️
unittests 67.14% <97.84%> (-0.01%) ⬇️
upgradability 0.29% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.


@walnut-the-cat
Contributor

Out of curiosity, how much additional memory do we use with this change?

}

// Retroactively record all accessed trie items to account for
// key-value pairs which were only written but never read, thus
Contributor

There's a gap in my logical understanding: where is the link that goes from "a key is written to but not read from" to "we must have the previous value for that key"?

For the specific trie-restructuring case I knew of (converting a branch with one child into an extension), the child node must be read in order to do the conversion, and at the same time the child node is deleted. So in that case, yes, the child is written to (its refcount was decremented) but not read from (it was never queried), and its previous value was needed (to fulfill the restructuring). But is that in general the only case where the update phase would cause a value to be written to but not read from?

So in other words, is the logic this: "if a node is written to but not read from, then that node must have been replaced during a trie restructuring, and therefore its previous value must be known" (which I need to be convinced of), or is there some other way to draw this conclusion?

Member Author

@Longarithm Longarithm Mar 20, 2024

if a node is written to but not read from, then that node must have been replaced during a trie restructuring, and therefore its previous value must be known

I think "key" got replaced with "node" somewhere in the message.

The logic is: "if a key is written to but not read from" => "all nodes on the trie path to it must be known".
Say you don't know some node on the path to it, and take the lowest such node. Then you can't recompute it after the write, because you don't know the hashes of the key's neighbours, so you can't compute the new state root.

And, yeah, when we descend into node corresponding to key, all refcounts on the way are decremented.

I actually want to propose simpler logic with this PR: "for every key for which any state operation was called (get/has/write/remove), all nodes on the old path must be recorded". That's what the current disk trie logic does, and it's easy to explain and follow.
For memtries, the union of (nodes recorded during chunk application) and (nodes recorded during trie restructuring) should give this set. So I didn't think much about whether this is strictly necessary; it's enough for me that it is verifiable.
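The argument above can be seen in a toy model (a binary Merkle tree with a non-cryptographic stand-in hash, not the real trie): recomputing the root after writing a leaf requires the hash of the sibling at every level of the path, and those sibling hashes live in the path nodes, so every node on the path must be known.

```rust
/// Toy hash combiner standing in for a cryptographic hash.
fn combine(left: u64, right: u64) -> u64 {
    left.wrapping_mul(31).wrapping_add(right).wrapping_mul(131)
}

/// Recompute the root after replacing a leaf, given, for each level from
/// bottom to top, the sibling hash and whether the updated child is the
/// left one. Without the sibling hashes (i.e. without the path nodes),
/// the new root cannot be computed.
fn new_root(mut updated: u64, path: &[(u64, bool)]) -> u64 {
    for &(sibling, updated_is_left) in path {
        updated = if updated_is_left {
            combine(updated, sibling)
        } else {
            combine(sibling, updated)
        };
    }
    updated
}
```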

Contributor

I see, ok, thanks for clarifying your intent in the mathematical form :)

I think my question is a bit more detailed than this.

Suppose we replace a node in the trie. Do we actually need the node that we're replacing? The parent of that node already has the hashes of the sibling nodes that we need to reconstruct the parent. So for example, we have A -> B -> C -> D and we're deleting D. We don't need to have D, because to reconstruct A' -> B' -> C' -> D', we only need A, B, C, and D'. D is not needed. But the logic in this PR seems to require D. That's the part I don't understand.

Member Author

It makes sense.
Then, my argument goes to:

  • that's what the current disk logic does. Every time we descend to a node, we pull it out via move_node_to_mutable or delete_value. By default I would avoid adding "if this is the last node, don't do it" logic in a couple of places, plus a corner case for when this node is a restructured branch.
  • the logic "we take all nodes on the path" is simpler to explain than "we take all nodes on reads, plus all nodes except the last one on writes"
  • the savings from not including the previous value shouldn't be high, because modifying a value without reading it is rare - action_deploy_contract reads the contract before redeploying it, etc. We could store only the ValueRefs of previous values, but I'm not 100% confident I'd get that right the first time :D

Contributor

If we are recording node D in A -> B -> C -> D for on disk tries, let's try to keep consistency and do the same for memtries even if it's not the most efficient.

pub nodes: HashMap<CryptoHash, Arc<[u8]>>,
/// Hashes of accessed values - because values themselves are not
/// necessarily present in memtrie.
pub values: HashSet<CryptoHash>,
Contributor

Should this have an optional value, if the value is present in memtrie?

if let Some(trie_refcount_changes) = self.trie_refcount_changes.as_mut() {
    trie_refcount_changes.subtract(hash, 1);
    if let Some(tracked_node_changes) = self.tracked_trie_changes.as_mut() {
        tracked_node_changes.accesses.values.insert(hash);
Contributor

Similar to the comment below: does subtracting the refcount for the value guarantee that we need the previous value? If so, why?

@robin-near
Contributor

Out of curiosity, how much additional memory do we use with this change?

Should be zero, it doesn't change the contents of the memtries.

Comment on lines 1520 to 1530
// not recorded before.
if let Some(recorder) = &self.recorder {
    for (node_hash, serialized_node) in trie_accesses.nodes {
        recorder.borrow_mut().record(&node_hash, serialized_node);
    }
    for value_hash in trie_accesses.values {
        let value = self.storage.retrieve_raw_bytes(&value_hash)?;
        recorder.borrow_mut().record(&value_hash, value);
    }
}
Ok(trie_changes)
Contributor

It's sad that the changes are recorded retroactively.
For the state witness size limit it would be best to record them as they happen; that would make it possible to stop executing a receipt once it generates more than X MB of PartialState. With retroactive recording we can only measure how much PartialState was generated after the update is applied, which AFAIU happens after applying the whole chunk :/

Actually, doesn't this break #10703? The soft size limit looks at the size of the recorded state after applying each receipt and stops executing new ones when this size gets above the limit. It depends on online recording; it wouldn't work with a retroactive one.
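The soft-limit mechanism being referred to can be sketched like this (illustrative names, not the real nearcore API): the recorder's size is checked after each receipt, and remaining receipts are skipped once the recorded PartialState exceeds the limit. This check only works if recording happens online; with retroactive recording the size is known only after the whole chunk is applied.

```rust
/// Toy stand-in for the recorder; tracks the size of recorded state.
struct SizeRecorder {
    recorded_bytes: usize,
}

impl SizeRecorder {
    fn recorded_size(&self) -> usize {
        self.recorded_bytes
    }
}

/// Returns how many receipts were applied before the soft limit stopped
/// execution. Each receipt's recorded-state footprint is given up front
/// here purely for illustration.
fn apply_receipts(
    recorder: &mut SizeRecorder,
    receipt_sizes: &[usize],
    soft_limit: usize,
) -> usize {
    let mut applied = 0;
    for &size in receipt_sizes {
        // The soft limit check: skip remaining receipts once the recorded
        // PartialState already exceeds the limit.
        if recorder.recorded_size() > soft_limit {
            break;
        }
        recorder.recorded_bytes += size; // stand-in for applying one receipt
        applied += 1;
    }
    applied
}
```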

Contributor

Would it be possible to somehow put the recording logic in TrieUpdate::set?

Contributor

And I guess it works the same way with normal trie writes. That sucks, it really throws a wrench into the witness size limit. This solution makes sense for now, but I think the whole logic of trie updating will need a refactor for the size limit :/

Contributor

Okay actually we don't need to record writes for the state witness, only the reads need to be recorded :)

Contributor

As discussed in the meeting today, it turns out we do want to record the nodes for updates/deletions :(
I believe this update function is only called in trie_update finalize for all practical purposes?

We would need to revisit this and find a better solution for recording on the fly, like in the get function, instead of at trie_update finalize. This is required for both the soft and hard state witness limits.

Contributor

Currently (other than trie restructuring), I guess we just rely on the fact that runtime first calls a get before update/delete for gas cost estimation.

/// Hashes of accessed values - because values themselves are not
/// necessarily present in memtrie.
pub values: HashSet<CryptoHash>,
}
Contributor

I don't fully understand which additional nodes this change records.

AFAIU runtime updates the trie by calling storage_write, which always does a trie read before writing a value:

let evicted_ptr = self.ext.storage_get(&key, StorageGetMode::Trie)?;

This trie read records all nodes that were accessed to reach this value, even in the case of memtries:

if let Some(recorder) = &self.recorder {

Doesn't that record everything that is needed to prove execution of the contract? Why do we need an additional access log?

Contributor

@jancionear jancionear Mar 21, 2024

Ah I guess it doesn't record nodes that are created when adding new values...

Member Author

And this path is only called during contract execution. The gateway to updating the trie outside of contract execution is set<T: BorshSerialize>(state_update: &mut TrieUpdate, key: TrieKey, value: &T).

@shreyan-gupta shreyan-gupta self-requested a review March 21, 2024 17:01
Contributor

@shreyan-gupta shreyan-gupta left a comment

Overall looks good, although I would wait for an approval from Robin.

Contributor

@robin-near robin-near left a comment

We still don't know what to do with witness size limit right?

@Longarithm
Member Author

We still don't know what to do with witness size limit right?

Yeah, that's for a follow-up PR.

@Longarithm Longarithm added this pull request to the merge queue Mar 25, 2024
Merged via the queue into near:master with commit 73b8827 Mar 25, 2024
31 checks passed
@Longarithm Longarithm deleted the record-trie-writes branch March 25, 2024 10:43
@walnut-the-cat walnut-the-cat added the A-stateless-validation Area: stateless validation label Mar 27, 2024
Labels
A-stateless-validation Area: stateless validation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Memtrie] Memtrie trie recording needs to capture read nodes during the update code path
5 participants