
MissingTrieValue error in mocknet running statelessnet protocol #11065

Closed · Tracked by #46 · Fixed by #11071
Labels: A-stateless-validation (Area: stateless validation)

tayfunelmas opened this issue Apr 12, 2024 · 4 comments

tayfunelmas (Contributor) commented Apr 12, 2024:

This is currently blocking further mocknet testing of stateless validation by causing chunk validation to get stuck for shard 0.

Related Zulip thread

Error with stack trace:

thread '<unnamed>' panicked at core/store/src/trie/trie_storage.rs:317:16:
!!!CRASH!!!: MissingTrieValue(TrieMemoryPartialStorage, JAaiThPtra2e14LZZRMpCa8SQwA2BgAp3BiWUpZYewG7)
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: <near_store::trie::trie_storage::TrieMemoryPartialStorage as near_store::trie::trie_storage::TrieStorage>::retrieve_raw_bytes
   4: near_store::trie::accounting_cache::TrieAccountingCache::retrieve_raw_bytes_with_accounting
   5: near_store::trie::Trie::internal_retrieve_trie_node
   6: near_store::trie::Trie::retrieve_raw_node
   7: near_store::trie::Trie::move_node_to_mutable
   8: near_store::trie::insert_delete::<impl near_store::trie::Trie>::fix_extension_node
   9: near_store::trie::insert_delete::<impl near_store::trie::Trie>::delete
  10: near_store::trie::Trie::update
  11: near_store::trie::update::TrieUpdate::finalize
  12: node_runtime::Runtime::apply
  13: <near_chain::runtime::NightshadeRuntime as near_chain::types::RuntimeAdapter>::apply_chunk
  14: near_chain::update_shard::apply_new_chunk
  15: core::ops::function::FnOnce::call_once{{vtable.shim}}
  16: tracing_core::dispatcher::with_default
  17: <rayon_core::job::HeapJob<BODY> as rayon_core::job::Job>::execute
  18: rayon_core::registry::WorkerThread::wait_until_cold
  19: rayon_core::registry::ThreadBuilder::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

A test case manifesting the issue is described here.

tayfunelmas (Author) commented Apr 12, 2024:

@Longarithm can you take a look at this if @robin-near cannot identify the right fix by Monday?

robin-near (Contributor) commented:
The problem arises from the difference between the on-disk trie logic (DT) and the memtrie logic (MT).

In the non-stateless world, it suffices to ensure that

  • When looking up from the trie, DT and MT exhibit the same behavior (i.e. yield the same values, charge the same gas, etc.);
  • When applying a set of key-value changes to the trie, DT and MT yield the same on-disk changes.

For these, we have already done a great job.

In the stateless world, we additionally need to ensure that

  • The DT and MT read the same nodes during lookup;
  • The DT and MT read the same nodes when applying changes.

The latter is currently not true, because the DT logic may unnecessarily read nodes when deleting a non-existent key; and that may not be the only case where the two implementations differ.
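
To make this concrete, here is a minimal, self-contained sketch of the "least access" delete behavior (toy types only, not nearcore's actual API): a missing key should cause reads only along its lookup path, and no restructuring reads at all.

```rust
use std::collections::HashMap;

// Toy trie: u64 ids stand in for node hashes.
enum Node {
    Leaf { key: Vec<u8> },
    Branch { children: HashMap<u8, u64> },
}

struct Trie {
    nodes: HashMap<u64, Node>,
    reads: Vec<u64>, // every node read is recorded, as in a state witness
}

impl Trie {
    fn read(&mut self, id: u64) -> &Node {
        self.reads.push(id);
        &self.nodes[&id]
    }

    /// Walks the lookup path for `key` before a delete. If the key does
    /// not exist, we return before touching anything else; only an actual
    /// deletion justifies the extra reads needed for node restructuring.
    fn key_exists(&mut self, root: u64, key: &[u8]) -> bool {
        let mut id = root;
        for &nibble in key {
            match self.read(id) {
                Node::Leaf { key: stored } => return stored.as_slice() == key,
                Node::Branch { children } => match children.get(&nibble) {
                    Some(&child) => id = child,
                    // Key is absent: stop here, reading nothing else.
                    None => return false,
                },
            }
        }
        false // ran out of key nibbles at a branch: no such leaf
    }
}
```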

Now, here are all the ways we may use the trie (stateless and stateful combined):

  1. Using the DT logic, against disk
  2. Using the flat storage logic
  3. Using the MT logic, against memtrie
  4. Using the DT logic, against recorded partial trie

The last two are used in stateless validation with memtrie enabled. We use the MT logic to produce a state witness, and then that state witness is validated using the DT logic against the partial trie. That causes a discrepancy because the extraneous nodes that the DT logic reads are not read by the MT logic and therefore not recorded.
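
For intuition, here is a minimal sketch of the replay side (simplified, hypothetical types; the real `TrieMemoryPartialStorage` lives in core/store/src/trie/trie_storage.rs): the validator can only retrieve nodes the producer recorded, so any extra DT-only read surfaces exactly as the `MissingTrieValue` panic above.

```rust
use std::collections::HashMap;

type CryptoHash = [u8; 32];

#[derive(Debug)]
struct MissingTrieValue(CryptoHash);

/// Stand-in for the recorded partial trie a chunk validator replays against.
struct PartialStorage {
    recorded: HashMap<CryptoHash, Vec<u8>>,
}

impl PartialStorage {
    // A node the producer's MT pass never read was never recorded, so a
    // DT-only read has nowhere to fall back to and must fail.
    fn retrieve_raw_bytes(&self, hash: &CryptoHash) -> Result<&[u8], MissingTrieValue> {
        self.recorded
            .get(hash)
            .map(Vec::as_slice)
            .ok_or(MissingTrieValue(*hash))
    }
}
```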

So, at a high level there are a couple of strategies to fix the issue:

Strategy 1. Do not use DT at all for stateless validation, so we only have the MT code path.
Strategy 2. Ensure that DT and MT read the same nodes when applying changes.

For strategy 1: the reason the DT logic is used on recorded partial tries is that, in the current implementation, the DT logic accepts a trait called TrieStorage, which can be implemented by on-disk or recorded storage, whereas the MT logic accepts only a concrete memtrie storage. So to use the MT logic everywhere we may need to first introduce a trait; that may or may not be easy -- keeping in mind that during the memtrie implementation we chose not to use the TrieStorage trait for memtries, likely because something makes them fundamentally incompatible.
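
For illustration only (the trait name and shape are invented, not nearcore's API), the kind of abstraction strategy 1 calls for might look something like this on the memtrie side:

```rust
type CryptoHash = [u8; 32];

/// Hypothetical sketch: make the MT traversal generic over where nodes
/// come from, so the same code path can run against the in-memory arena
/// or against a recorded partial trie during validation.
trait MemTrieNodeSource {
    /// Opaque reference to a node in this source.
    type NodeRef: Clone;

    fn root(&self) -> Option<Self::NodeRef>;
    /// Descend to the child at `nibble`, if any.
    fn child(&self, node: &Self::NodeRef, nibble: u8) -> Option<Self::NodeRef>;
    /// The node's hash, for recording reads into a state witness.
    fn node_hash(&self, node: &Self::NodeRef) -> CryptoHash;
}
```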

For strategy 2, the challenge is making sure we've covered all the corner cases. I'm not sure how difficult that is, but at the very least we need to add rigorous testing (a sketch of such a test follows the list below). We can either (Strategy 2.1) change the MT logic to be backwards compatible with DT, i.e. read the extraneous nodes that DT would have read; or (Strategy 2.2) change the DT logic and also audit the MT logic so that both adhere to the principle that a node must only be read if doing so is absolutely needed. I think strategy 2.2 is better because

  • We can feel free to change the DT logic and still avoid doing a protocol upgrade on mainnet, because the update code path in mainnet is not sensitive to which nodes are read;
  • Using the principle of least access makes it well-defined what the correct behavior should be, so that we can test both MT and DT against that behavior, as opposed to testing them against each other.
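
A minimal sketch of the kind of differential test this suggests (the closures stand in for the real trie implementations, which are not shown here):

```rust
use std::collections::BTreeSet;

type CryptoHash = [u8; 32];
/// A batch of trie operations: `Some(value)` inserts, `None` deletes.
type Ops = [(Vec<u8>, Option<Vec<u8>>)];

/// Run the same random operations through both code paths and require
/// that the sets of node hashes they read are identical, rather than
/// only checking that they produce the same resulting trie.
fn assert_same_reads(
    ops: &Ops,
    apply_disk_trie: impl Fn(&Ops) -> BTreeSet<CryptoHash>,
    apply_mem_trie: impl Fn(&Ops) -> BTreeSet<CryptoHash>,
) {
    assert_eq!(
        apply_disk_trie(ops),
        apply_mem_trie(ops),
        "DT and MT must read (and record) exactly the same nodes"
    );
}
```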

It's not clear to me which strategy is better at the moment; I'd probably lean towards strategy 1, but perhaps in the interest of time strategy 2.2 is more feasible.

github-merge-queue bot pushed a commit that referenced this issue on Apr 16, 2024:

Yet another corner-case fix: if the deleted key doesn't exist, read only the nodes required to prove its non-existence, i.e. only the nodes along the lookup path.
The chaos this caused is described in #11065 and discussed on
[Zulip](https://near.zulipchat.com/#narrow/stream/308695-nearone.2Fprivate/topic/Forknet.20.2B.20Stateless.20Validation/near/432971999).
This is not a protocol change, because nodes read during `Trie::update` currently don't impact costs, so we can modify the current disk trie logic to follow the principle that only necessary nodes should be read.

Unfortunately, the disk trie logic, while descending, always subtracts the child's memory usage from the parent's memory usage and requires a "backwards iteration" along the path to recompute the potentially changed memory usages. I want to minimise changes, so I just introduce `key_deleted` to keep computing memory usages while avoiding the reads caused by node restructuring when the key wasn't found. If I simply return early, I get an `attempt to subtract with overflow` in `calc_memory_usage_and_store`.
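
A simplified sketch of that idea (helper names are hypothetical; the real logic lives in core/store/src/trie/insert_delete.rs):

```rust
struct NodeHandle(u64); // stand-in for a reference to a trie node

fn recompute_memory_usage(_node: &NodeHandle) {
    // Always re-derive memory usage from the children on the way back up;
    // skipping this is what triggers `attempt to subtract with overflow`.
}

fn restructure(_node: &NodeHandle) {
    // E.g. collapse a branch with a single remaining child into an
    // extension node. This is the step that may read sibling nodes.
}

/// Unwind the lookup path after a delete attempt. Restructuring (and the
/// extra reads it causes) only happens if a key was actually removed.
fn unwind_path(path: &[NodeHandle], key_deleted: bool) {
    for node in path.iter().rev() {
        recompute_memory_usage(node);
        if key_deleted {
            restructure(node);
        }
    }
}
```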

The memtrie logic is already correct because it computes memory usages after processing all keys, which is cleaner.

I also restructure the trie consistency tests. Now these tests:
* always use memtries by default, as **this** is the stateless validation flow we want to test;
* always include a comparison of the partial storages generated by disk tries, memtries, and recorded storages;
* use random probabilities and arbitrary insert/delete operations on existing/missing keys. Unfortunately, with the previous implementation all tests can still pass, but in roughly 9 out of 10 runs at least some of them fail. These tests are random, however, so I don't target zero flakiness.

Tests with/without flat storage are also better parametrised.

---------

Co-authored-by: Longarithm <the.aleksandr.logunov@gmail.com>
walnut-the-cat (Contributor) commented:

It seems the issue is not fully fixed yet: link

We might have introduced some nondeterminism with the latest memtrie fix.

One node (qtl3) crashed and never recovered.

The same error (with the same hash) appeared on another node (fpkf) soon after restarting neard, but somehow that node was able to 'restart' after the crash.

tayfunelmas (Author) commented:

The latest issue in the stateless mocknet has been identified as being caused by contract code missing from the state witnesses due to caching in the chunk producer. Related discussion.

There is also another issue, but there is no clear evidence that it is related to the changes made in the context of this issue, so I'm closing this issue for now.
