MissingTrieValue error in mocknet running statelessnet protocol #11065
@Longarithm can you take a look at this if @robin-near cannot identify the right fix by Monday?
The problem arises from the difference between the on-disk trie logic (DT) and the memtrie logic (MT). In the non-stateless world, it suffices to ensure that the two implementations produce the same trie updates and state roots.
For these, we have already done a great job. In the stateless world, we additionally need to ensure that the two implementations read exactly the same set of nodes, so that the recorded state witness matches what validation will later access.
The latter is currently not true, because the DT logic may unnecessarily read nodes when deleting a non-existent key, but that may not be the only case where these implementations differ. Now, here are all the ways we may use the trie (stateless and stateful combined):
The last two are used in stateless validation with memtrie enabled. We use the MT logic to produce a state witness, and then that state witness is validated using the DT logic against the partial trie. That causes a discrepancy because the extraneous nodes that the DT logic reads are not read by the MT logic and therefore not recorded.

So, at a high level, there are a couple of strategies to fix the issue. Strategy 1: do not use DT at all for stateless validation, so we only have the MT code path. Strategy 2: make the DT and MT logics read exactly the same set of nodes.

For strategy 1, the reason the DT logic is used on recorded partial tries is that in the current implementation, the DT logic accepts a trait called TrieStorage, which can be implemented by on-disk or recorded storage, whereas the MT logic accepts only a concrete memtrie storage. So to use the MT logic we may need to first introduce a trait; but that may or may not be easy -- keeping in mind that during the memtrie implementation we did not choose to use the TrieStorage trait for memtries, likely because something makes them just incompatible.

For strategy 2, the challenge is making sure that we've covered all corner cases. I'm not sure how difficult that is, but at the very least we need to add rigorous testing. We can either (Strategy 2.1) change the MT logic to be backwards compatible with DT, i.e. read those extraneous nodes that DT would have read; or (Strategy 2.2) change the DT logic and also audit the MT logic so that they both adhere to the principle that a node must only be read if doing so is absolutely needed. I think strategy 2.2 is better.
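As a sketch of what such a trait might look like (all names below are illustrative assumptions, not nearcore's actual API), one generic lookup path could serve both storages through an associated node-reference type, since a memtrie addresses nodes by arena index rather than by hash:

```rust
// Hypothetical abstraction: one trait for node access, generic over how a
// node is referenced (arena index for a memtrie, hash for recorded storage).
trait TrieNodeAccess {
    type NodeRef: Copy;
    fn root(&self) -> Self::NodeRef;
    fn child(&self, node: Self::NodeRef, byte: u8) -> Option<Self::NodeRef>;
    fn value(&self, node: Self::NodeRef) -> Option<Vec<u8>>;
}

// A single generic lookup path shared by any implementation.
fn get<T: TrieNodeAccess>(trie: &T, key: &[u8]) -> Option<Vec<u8>> {
    let mut node = trie.root();
    for &b in key {
        node = trie.child(node, b)?;
    }
    trie.value(node)
}

// Minimal arena-backed "memtrie": nodes referenced by index, not hash.
struct MemNode {
    children: Vec<(u8, usize)>, // sparse child list: (byte, arena index)
    value: Option<Vec<u8>>,
}

struct MemTrie {
    arena: Vec<MemNode>,
}

impl MemTrie {
    fn new() -> Self {
        MemTrie { arena: vec![MemNode { children: Vec::new(), value: None }] }
    }

    fn insert(&mut self, key: &[u8], value: Vec<u8>) {
        let mut idx = 0;
        for &b in key {
            // Find the existing child first so the immutable borrow ends
            // before we possibly grow the arena.
            let existing = self.arena[idx]
                .children
                .iter()
                .find(|(c, _)| *c == b)
                .map(|&(_, i)| i);
            idx = match existing {
                Some(child) => child,
                None => {
                    let child = self.arena.len();
                    self.arena.push(MemNode { children: Vec::new(), value: None });
                    self.arena[idx].children.push((b, child));
                    child
                }
            };
        }
        self.arena[idx].value = Some(value);
    }
}

impl TrieNodeAccess for MemTrie {
    type NodeRef = usize;
    fn root(&self) -> usize {
        0
    }
    fn child(&self, node: usize, byte: u8) -> Option<usize> {
        self.arena[node].children.iter().find(|(c, _)| *c == byte).map(|&(_, i)| i)
    }
    fn value(&self, node: usize) -> Option<Vec<u8>> {
        self.arena[node].value.clone()
    }
}
```

A recorded partial-trie storage would implement the same trait with `NodeRef` being a node hash, letting witness validation reuse the MT lookup path instead of the DT one.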
It's not clear to me which strategy is better at the moment; I'd probably lean towards strategy 1, but perhaps in the interest of time strategy 2.2 is more feasible.
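The rigorous testing strategy 2.2 calls for can be phrased as a property: for any key, present or missing, a lookup must visit exactly the existing prefixes of that key's path and nothing else. A toy sketch of such a check (simplified types, not nearcore's code):

```rust
use std::collections::{BTreeSet, HashMap};

// Toy trie node: children keyed by a single byte, optional value.
#[derive(Default)]
struct Node {
    children: HashMap<u8, Node>,
    value: Option<Vec<u8>>,
}

fn insert(node: &mut Node, key: &[u8], value: Vec<u8>) {
    match key.split_first() {
        None => node.value = Some(value),
        Some((b, rest)) => insert(node.children.entry(*b).or_default(), rest, value),
    }
}

// Look up `key`, recording the path of every node visited.
fn get_recording(
    node: &Node,
    key: &[u8],
    path: Vec<u8>,
    visited: &mut BTreeSet<Vec<u8>>,
) -> Option<Vec<u8>> {
    visited.insert(path.clone());
    match key.split_first() {
        None => node.value.clone(),
        Some((b, rest)) => {
            let child = node.children.get(b)?;
            let mut p = path;
            p.push(*b);
            get_recording(child, rest, p, visited)
        }
    }
}

// The read set a "only necessary nodes" implementation must produce:
// exactly the prefixes of `key` that exist as nodes in the trie.
fn expected_reads(root: &Node, key: &[u8]) -> BTreeSet<Vec<u8>> {
    let mut out = BTreeSet::new();
    let mut node = root;
    let mut path = Vec::new();
    out.insert(path.clone());
    for &b in key {
        match node.children.get(&b) {
            Some(child) => {
                path.push(b);
                out.insert(path.clone());
                node = child;
            }
            None => break, // non-existence proven; no further reads expected
        }
    }
    out
}
```

Running both the DT and MT code paths through such a recorder and asserting identical read sets, over randomized keys, is the shape of the consistency tests discussed below.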
Yet another corner-case fix: if a deleted key doesn't exist, read only the nodes required to prove its non-existence, i.e. only the nodes along its path. The chaos this caused is described in #11065 and discussed on [Zulip](https://near.zulipchat.com/#narrow/stream/308695-nearone.2Fprivate/topic/Forknet.20.2B.20Stateless.20Validation/near/432971999).

This is not a protocol change, because currently the nodes read on `Trie::update` don't impact costs, so we can modify the current disk trie logic to follow the principle that only necessary nodes should be read.

Unfortunately, the disk trie logic, while descending, always subtracts the child's memory usage from the parent's memory usage and requires "backwards iteration" along the path to recompute potentially changed memory usages. I want to minimise changes, so I just introduce `key_deleted` to still compute memory usages but avoid the reads caused by node restructuring if the key wasn't found. If I simply return early, I get `attempt to subtract with overflow` in `calc_memory_usage_and_store`. The memtrie logic is already correct because it computes memory usages after processing all keys, which is cleaner.

I also restructure the trie consistency tests. Now these tests:

* always use memtries by default, as **this** is the stateless validation flow we want to test;
* always include a comparison of the partial storages generated by disk tries, memtries and recorded storages;
* use random probabilities and arbitrary insert/delete operations on existing/missing keys.

Unfortunately, with the previous implementation all tests can still pass, but in roughly 9 out of 10 runs at least some of them fail. These tests are random, however, so I don't target zero flakiness. Also, the tests with/without flat storage are now better parametrised.

Co-authored-by: Longarithm <the.aleksandr.logunov@gmail.com>
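The path-only deletion plus the `key_deleted` guard can be illustrated on a toy trie (a hedged sketch with simplified types and a unit "memory usage" per key per node; not nearcore's actual code):

```rust
use std::collections::HashMap;

// Toy trie node: children keyed by one byte, optional value, and a
// simplified memory-usage counter like the one disk tries maintain.
#[derive(Default)]
struct Node {
    children: HashMap<u8, Node>,
    value: Option<Vec<u8>>,
    mem_usage: u64,
}

fn insert(node: &mut Node, key: &[u8], value: Vec<u8>) {
    node.mem_usage += 1; // one unit per key passing through this node
    match key.split_first() {
        None => node.value = Some(value),
        Some((b, rest)) => insert(node.children.entry(*b).or_default(), rest, value),
    }
}

// Delete `key`, reading (counting) only the nodes along its path.
// Returns `key_deleted`: whether the key actually existed. If the path
// breaks early, non-existence is proven and no further nodes are read.
fn delete(node: &mut Node, key: &[u8], reads: &mut usize) -> bool {
    *reads += 1;
    let key_deleted = match key.split_first() {
        None => node.value.take().is_some(),
        Some((b, rest)) => match node.children.get_mut(b) {
            None => false, // missing key: stop here, no restructuring reads
            Some(child) => delete(child, rest, reads),
        },
    };
    if key_deleted {
        // Only recompute memory usage when something actually changed;
        // doing this unconditionally on the no-op path is what underflows
        // (the "attempt to subtract with overflow" mentioned above).
        node.mem_usage = node.mem_usage.checked_sub(1).expect("usage underflow");
    }
    key_deleted
}
```

Deleting a missing key like `b"ax"` here touches only the root and the `a` node, leaves every memory usage untouched, and never descends into unrelated subtrees.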
It seems the issue is not fully fixed yet: link. We might have introduced some nondeterminism with the latest memtrie fix. One node (qtl3) crashed and never recovered. The same error (with the same hash) appeared on (fpkf) soon after restarting neard, but somehow this node was able to 'restart' after the crash.
The latest issue in the stateless mocknet has been identified as being caused by contract code missing from the state witnesses due to caching in the chunk producer. Related discussion. There is also another issue, but there is no clear evidence that it is related to the changes in the context of this issue. So closing this issue for now.
This is currently blocking further mocknet testing of stateless validation by causing chunk validation to get stuck for shard 0.
Related Zulip thread
Error with stack trace:
Test case manifesting the issue described here.