
MissingTrieValue error in mocknet running statelessnet protocol #11065

Closed · Tracked by #46 · Fixed by #11071
Labels: A-stateless-validation (Area: stateless validation)

tayfunelmas opened this issue Apr 12, 2024 · 4 comments

tayfunelmas (Contributor) commented Apr 12, 2024:

This is currently blocking further mocknet testing of stateless validation by causing chunk validation to get stuck for shard 0.

Related Zulip thread

Error with stack trace:

thread '<unnamed>' panicked at core/store/src/trie/trie_storage.rs:317:16:
!!!CRASH!!!: MissingTrieValue(TrieMemoryPartialStorage, JAaiThPtra2e14LZZRMpCa8SQwA2BgAp3BiWUpZYewG7)
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: <near_store::trie::trie_storage::TrieMemoryPartialStorage as near_store::trie::trie_storage::TrieStorage>::retrieve_raw_bytes
   4: near_store::trie::accounting_cache::TrieAccountingCache::retrieve_raw_bytes_with_accounting
   5: near_store::trie::Trie::internal_retrieve_trie_node
   6: near_store::trie::Trie::retrieve_raw_node
   7: near_store::trie::Trie::move_node_to_mutable
   8: near_store::trie::insert_delete::<impl near_store::trie::Trie>::fix_extension_node
   9: near_store::trie::insert_delete::<impl near_store::trie::Trie>::delete
  10: near_store::trie::Trie::update
  11: near_store::trie::update::TrieUpdate::finalize
  12: node_runtime::Runtime::apply
  13: <near_chain::runtime::NightshadeRuntime as near_chain::types::RuntimeAdapter>::apply_chunk
  14: near_chain::update_shard::apply_new_chunk
  15: core::ops::function::FnOnce::call_once{{vtable.shim}}
  16: tracing_core::dispatcher::with_default
  17: <rayon_core::job::HeapJob<BODY> as rayon_core::job::Job>::execute
  18: rayon_core::registry::WorkerThread::wait_until_cold
  19: rayon_core::registry::ThreadBuilder::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

A test case manifesting the issue is described here.

tayfunelmas (Author) commented Apr 12, 2024:

@Longarithm can you take a look at this if @robin-near cannot identify the right fix by Monday?

robin-near (Contributor) commented:
The problem arises from the difference between the on-disk trie logic (DT) and the memtrie logic (MT).

In the non-stateless world, it suffices to ensure that

  • When looking up from the trie, DT and MT exhibit the same behavior (i.e. yield the same values, charge the same gas, etc.);
  • When applying a set of key-value changes to the trie, DT and MT yield the same on-disk changes.

For these, we have already done a great job.

In the stateless world, we additionally need to ensure that

  • The DT and MT read the same nodes during lookup;
  • The DT and MT read the same nodes when applying changes.

The latter is currently not true, because the DT logic may unnecessarily read nodes when deleting a non-existent key; and that may not be the only case where the two implementations differ.
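
To make this concrete, here is a minimal, self-contained sketch of the "least access" delete behavior (toy types only, not nearcore's actual API): a missing key should cause reads only along its lookup path, and no restructuring reads at all.

```rust
use std::collections::HashMap;

// Toy trie: u64 ids stand in for node hashes.
enum Node {
    Leaf { key: Vec<u8> },
    Branch { children: HashMap<u8, u64> },
}

struct Trie {
    nodes: HashMap<u64, Node>,
    reads: Vec<u64>, // every node read is recorded, as in a state witness
}

impl Trie {
    fn read(&mut self, id: u64) -> &Node {
        self.reads.push(id);
        &self.nodes[&id]
    }

    /// Walks the lookup path for `key` before a delete. If the key does
    /// not exist, we return before touching anything else; only an actual
    /// deletion justifies the extra reads needed for node restructuring.
    fn key_exists(&mut self, root: u64, key: &[u8]) -> bool {
        let mut id = root;
        for &nibble in key {
            match self.read(id) {
                Node::Leaf { key: stored } => return stored.as_slice() == key,
                Node::Branch { children } => match children.get(&nibble) {
                    Some(&child) => id = child,
                    // Key is absent: stop here, reading nothing else.
                    None => return false,
                },
            }
        }
        false // ran out of key nibbles at a branch: no such leaf
    }
}
```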

Now, here are all the ways we may use the trie (stateless and stateful combined):

  1. Using the DT logic, against disk
  2. Using the flat storage logic
  3. Using the MT logic, against memtrie
  4. Using the DT logic, against recorded partial trie

The last two are used in stateless validation with memtrie enabled. We use the MT logic to produce a state witness, and then that state witness is validated using the DT logic against the partial trie. That causes a discrepancy because the extraneous nodes that the DT logic reads are not read by the MT logic and therefore not recorded.
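
For intuition, here is a minimal sketch of the replay side (simplified, hypothetical types; the real `TrieMemoryPartialStorage` lives in core/store/src/trie/trie_storage.rs): the validator can only retrieve nodes the producer recorded, so any extra DT-only read surfaces exactly as the `MissingTrieValue` panic above.

```rust
use std::collections::HashMap;

type CryptoHash = [u8; 32];

#[derive(Debug)]
struct MissingTrieValue(CryptoHash);

/// Stand-in for the recorded partial trie a chunk validator replays against.
struct PartialStorage {
    recorded: HashMap<CryptoHash, Vec<u8>>,
}

impl PartialStorage {
    // A node the producer's MT pass never read was never recorded, so a
    // DT-only read has nowhere to fall back to and must fail.
    fn retrieve_raw_bytes(&self, hash: &CryptoHash) -> Result<&[u8], MissingTrieValue> {
        self.recorded
            .get(hash)
            .map(Vec::as_slice)
            .ok_or(MissingTrieValue(*hash))
    }
}
```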

So, at a high level there are a couple of strategies to fix the issue:

Strategy 1. Do not use DT at all for stateless validation, so we only have the MT code path.
Strategy 2. Ensure that DT and MT read the same nodes when applying changes.

For strategy 1: the reason the DT logic is used on recorded partial tries is that, in the current implementation, the DT logic accepts a trait called TrieStorage, which can be implemented by on-disk or recorded storage, whereas the MT logic accepts only a concrete memtrie storage. So to use the MT logic everywhere we may need to first introduce a trait; that may or may not be easy -- keeping in mind that during the memtrie implementation we chose not to use the TrieStorage trait for memtries, likely because something makes them fundamentally incompatible.
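
For illustration only (the trait name and shape are invented, not nearcore's API), the kind of abstraction strategy 1 calls for might look something like this on the memtrie side:

```rust
type CryptoHash = [u8; 32];

/// Hypothetical sketch: make the MT traversal generic over where nodes
/// come from, so the same code path can run against the in-memory arena
/// or against a recorded partial trie during validation.
trait MemTrieNodeSource {
    /// Opaque reference to a node in this source.
    type NodeRef: Clone;

    fn root(&self) -> Option<Self::NodeRef>;
    /// Descend to the child at `nibble`, if any.
    fn child(&self, node: &Self::NodeRef, nibble: u8) -> Option<Self::NodeRef>;
    /// The node's hash, for recording reads into a state witness.
    fn node_hash(&self, node: &Self::NodeRef) -> CryptoHash;
}
```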

For strategy 2, the challenge is making sure we've covered all the corner cases. I'm not sure how difficult that is, but at the very least we need to add rigorous testing (a sketch of such a test follows the list below). We can either (Strategy 2.1) change the MT logic to be backwards compatible with DT, i.e. read the extraneous nodes that DT would have read; or (Strategy 2.2) change the DT logic and also audit the MT logic so that both adhere to the principle that a node must only be read if doing so is absolutely needed. I think strategy 2.2 is better because

  • We can feel free to change the DT logic and still avoid doing a protocol upgrade on mainnet, because the update code path in mainnet is not sensitive to which nodes are read;
  • Using the principle of least access makes it well-defined what the correct behavior should be, so that we can test both MT and DT against that behavior, as opposed to testing them against each other.
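
A minimal sketch of the kind of differential test this suggests (the closures stand in for the real trie implementations, which are not shown here):

```rust
use std::collections::BTreeSet;

type CryptoHash = [u8; 32];
/// A batch of trie operations: `Some(value)` inserts, `None` deletes.
type Ops = [(Vec<u8>, Option<Vec<u8>>)];

/// Run the same random operations through both code paths and require
/// that the sets of node hashes they read are identical, rather than
/// only checking that they produce the same resulting trie.
fn assert_same_reads(
    ops: &Ops,
    apply_disk_trie: impl Fn(&Ops) -> BTreeSet<CryptoHash>,
    apply_mem_trie: impl Fn(&Ops) -> BTreeSet<CryptoHash>,
) {
    assert_eq!(
        apply_disk_trie(ops),
        apply_mem_trie(ops),
        "DT and MT must read (and record) exactly the same nodes"
    );
}
```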

It's not clear to me which strategy is better at the moment; I'd probably lean towards strategy 1, but perhaps in the interest of time strategy 2.2 is more feasible.

github-merge-queue bot pushed a commit that referenced this issue on Apr 16, 2024:

Yet another corner-case fix: if the deleted key doesn't exist, read only the nodes required to prove its non-existence, i.e. only the nodes along the lookup path.
The chaos this caused is described in #11065 and discussed on
[Zulip](https://near.zulipchat.com/#narrow/stream/308695-nearone.2Fprivate/topic/Forknet.20.2B.20Stateless.20Validation/near/432971999).
This is not a protocol change, because nodes read during `Trie::update` currently don't impact costs, so we can modify the current disk trie logic to follow the principle that only necessary nodes should be read.

Unfortunately, the disk trie logic, while descending, always subtracts the child's memory usage from the parent's memory usage and requires a "backwards iteration" along the path to recompute the potentially changed memory usages. I want to minimise changes, so I just introduce `key_deleted` to keep computing memory usages while avoiding the reads caused by node restructuring when the key wasn't found. If I simply return early, I get an `attempt to subtract with overflow` in `calc_memory_usage_and_store`.
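
A simplified sketch of that idea (helper names are hypothetical; the real logic lives in core/store/src/trie/insert_delete.rs):

```rust
struct NodeHandle(u64); // stand-in for a reference to a trie node

fn recompute_memory_usage(_node: &NodeHandle) {
    // Always re-derive memory usage from the children on the way back up;
    // skipping this is what triggers `attempt to subtract with overflow`.
}

fn restructure(_node: &NodeHandle) {
    // E.g. collapse a branch with a single remaining child into an
    // extension node. This is the step that may read sibling nodes.
}

/// Unwind the lookup path after a delete attempt. Restructuring (and the
/// extra reads it causes) only happens if a key was actually removed.
fn unwind_path(path: &[NodeHandle], key_deleted: bool) {
    for node in path.iter().rev() {
        recompute_memory_usage(node);
        if key_deleted {
            restructure(node);
        }
    }
}
```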

The memtrie logic is already correct because it computes memory usages after processing all keys, which is cleaner.

I also restructure the trie consistency tests. Now these tests:
* always use memtries by default, as **this** is the stateless validation flow we want to test;
* always include a comparison of the partial storages generated by disk tries, memtries, and recorded storages;
* use random probabilities and arbitrary insert/delete operations on existing/missing keys. Unfortunately, with the previous implementation all tests can still pass, but in roughly 9 out of 10 runs at least some of them fail. These tests are random, however, so I don't target zero flakiness.

Tests with/without flat storage are also better parametrised.

---------

Co-authored-by: Longarithm <the.aleksandr.logunov@gmail.com>
walnut-the-cat (Contributor) commented:

It seems the issue is not fully fixed yet: link

We might have introduced some nondeterminism with the latest memtrie fix.

One node (qtl3) crashed and never recovered.

The same error (with the same hash) appeared on another node (fpkf) soon after restarting neard, but somehow that node was able to 'restart' after the crash.

tayfunelmas (Author) commented:

The latest issue in the stateless mocknet has been identified as being caused by contract code missing from the state witnesses due to caching in the chunk producer. Related discussion.

There is also another issue, but there is no clear evidence that it is related to the changes made in the context of this issue, so I'm closing this issue for now.
