
[stateless_validation][witness size soft limit] Count nodes needed for key updates in state witness size limit #10890

Open
Tracked by #46
Longarithm opened this issue Mar 27, 2024 · 32 comments

@Longarithm
Member

When we change state KV pairs, it happens via e.g. TrieUpdate::set, which doesn't access any trie nodes on its own. The nodes required to update the trie are read only at the end, in TrieUpdate::finalize. So while Runtime::apply produces a correct state witness, its "online" size is not computed correctly.

One way to compute a better size is to always call Trie::get on TrieUpdate::set.

Note that TrieUpdate::finalize may still access new nodes due to branch restructuring, but I think the impact of that is small: at most 2 extra nodes per key removal.

Discussion https://near.zulipchat.com/#narrow/stream/295558-core/topic/Trie.3A.3Aget.20on.20TrieUpdate.3A.3Aset/near/428163849

@Longarithm Longarithm added A-storage Area: storage and databases A-stateless-validation Area: stateless validation labels Mar 27, 2024
@jancionear jancionear self-assigned this Mar 27, 2024
@jancionear
Contributor

jancionear commented Apr 6, 2024

To expand a bit on the problem:

When the runtime reads or writes a value in the trie, it calls storage_read/storage_write. These functions perform the operation on the TrieUpdate struct and additionally calculate TTN (Touched Trie Nodes) to charge gas for each trie node that was touched.
With stateless validation there's one more thing that needs to be done - we need to record all nodes from the pre-state trie that are touched during chunk application to construct the storage proof.
So ideally on every read or write we would walk down the Trie, perform the operation, calculate TTN and record all the touched nodes from the original Trie.
After every read or write we would charge gas for the TTN and see how much storage proof was recorded to limit state witness size. Any receipt that uses too much gas or produces too large storage proof would be rejected.
But this isn't what actually happens in the code - when the runtime performs a write using TrieUpdate, it doesn't actually perform the operations on the Trie. It maintains a hashmap from trie keys to values. Each write adds an entry to this hashmap. Each following read reads from this hashmap, and only reads the Trie if the key isn't available in the hashmap. The Trie itself is only modified during finalize(), which happens at the end of chunk application, once all receipts in the chunk have already been executed.
For reads this isn't that big of a deal - every read that has to be recorded won't be available in the hashmap (which contains only changes written in this chunk), so it'll naturally go through the trie, and all nodes required for the read can be recorded there. With writes the situation is worse - a write can be performed as a single hashmap operation, without ever touching the original trie. To deal with this the runtime performs an additional read for every value that is written (it's done here).
Originally this was done to calculate the number of TTN and charge gas accordingly, but now it also serves the purpose of recording the nodes touched during the write.
Flat storage reads do a similar thing - they perform an additional Trie read to record the nodes (which makes flat storage pretty useless x.x).
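
As a rough sketch (with simplified stand-in types, not the actual nearcore `Trie`/`TrieUpdate` API), the overlay behaviour and the extra read on every write look roughly like this:

```rust
use std::collections::HashMap;

// Stand-in for the persisted trie; walking it is what touches nodes.
struct Trie;

impl Trie {
    /// In the real implementation this walk counts TTN for gas and records
    /// the touched pre-state nodes into the storage proof.
    fn get(&self, _key: &[u8]) -> Option<Vec<u8>> {
        None // stand-in
    }
}

struct TrieUpdate {
    trie: Trie,
    // Overlay of values written in this chunk; `None` marks a deletion.
    committed: HashMap<Vec<u8>, Option<Vec<u8>>>,
}

impl TrieUpdate {
    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        match self.committed.get(key) {
            // Value written earlier in this chunk: served from the overlay,
            // no trie nodes are touched.
            Some(v) => v.clone(),
            // Not in the overlay: fall through to the trie, which records
            // every node on the path to the value.
            None => self.trie.get(key),
        }
    }

    fn set(&mut self, key: Vec<u8>, value: Vec<u8>) {
        // The write itself is just a hashmap insert, so an extra read is done
        // here to touch (and record) the nodes on the key's path.
        let _ = self.trie.get(&key);
        self.committed.insert(key, Some(value));
    }
}
```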

This sounds like a workable solution - we record all nodes on the path to all read and written values - but there's one corner case that isn't covered by this. When there's a branch node with two children and we delete one of the children, the branch node should be merged with the surviving child, as it doesn't make sense to have a branch node with only one child.
In such a case we will record the child that is deleted (as we will do a read before writing to it), but we won't record the other child, even though recording the other child is necessary for performing the merge.

[image: diagram of the corner case - deleting one child of a two-child branch forces a merge with the surviving child]

This problem was addressed in #10841 by recording all nodes touched during finalize(). finalize() actually performs the operations on the trie, so it will touch the other child while merging it.
It fixes the problem of nodes missing in the storage proof, but it doesn't help the runtime, as the recording
happens during finalize(), after all receipts have already been executed.

This means that runtime is not aware of those extra "surviving child" nodes, and as a result it's unable to charge gas or enforce storage proof size limit for them.

There are two main questions here:

  1. What is the impact of this? Could a malicious contract abuse this problem to bypass the state witness size limit and generate a storage proof so big that it crashes the system?
  2. How to fix it?

@jancionear
Contributor

Impact

What's the worst that could happen?

A malicious actor could prepare a state which allows them to execute the corner case as many times as possible.
They could create a bunch of branch nodes with two children - one very small, and the other as big as possible. Then in another contract call they could delete all of the small children without touching the big ones. This would allow them to put the big child in the storage proof while paying only for the branch node and the small child.
How bad would that be? I need to read into the exact serialization code, but for now my estimates would be:

  • Branch node with 2 children: (32 bytes for node hash) + (2*33 bytes for children hashes) + (14 * 1 byte for 14 None children) + (1 byte for None value) = 113 bytes
  • Small leaf node = (32 bytes for node hash) + (1 byte for path) + (32 bytes for value hash) = 65 bytes
  • Big leaf node = (32 bytes for node hash) + (2000 bytes for path) + (32 bytes for value hash) = 2064 bytes

The accounted cost of recording the big node (2064 bytes) would be only 113 + 65 = 178 bytes (the branch node plus the small child). This means that a malicious actor could record about 12 times more state than what's accounted for in the storage proof size limit! ( (2064+178)/178 ≈ 12.6 )
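
As a quick sanity check of the arithmetic above (the exact serialized sizes depend on the real node encoding, so these constants are only illustrative):

```rust
fn main() {
    // Rough per-node sizes from the estimates above.
    let branch = 32 + 2 * 33 + 14 + 1; // 113 bytes
    let small_leaf = 32 + 1 + 32;      // 65 bytes
    let big_leaf = 32 + 2000 + 32;     // 2064 bytes

    // What the deletion is accounted for vs. what actually ends up recorded.
    let accounted = branch + small_leaf; // 178 bytes
    let recorded = accounted + big_leaf; // 2242 bytes
    println!(
        "amplification: {:.1}x",
        recorded as f64 / accounted as f64 // ~12.6x
    );
}
```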

That's pretty bad :/ It means that a malicious actor could grow the size of ChunkStateWitness about 10 times, from 8 MB to 80 MB. There are already problems with distributing an 8 MB witness, so an 80 MB one could fail altogether.

IMO it would be good to fix it, although with all the improvements in network distribution it might turn out that it isn't all that necessary - we'll see how it goes.
Still, it's good to consider what we could do to fix it. I spent some time thinking about various ideas and discussing them with Robin, and I'll write them up below in separate comments.

@jancionear
Contributor

Idea 1 - Charge extra to account for the unrecorded children

We could just assume that every time we delete something there's also a big malicious child that'll be recorded and charge 2000 bytes extra. The big problem with that is that it lowers chunk capacity - charging 10x more for an operation means that the number of receipts that fit in a chunk will be a few times lower.
There are already problems with congestion, so limiting throughput sounds like a bad idea.

We could add some clever heuristics (only charge once per branch, count children, trace deletions, etc...).
That's something to think about, but it'd be an imperfect, complex solution.
I'm not a big fan of this idea.

@jancionear
Contributor

jancionear commented Apr 6, 2024

Idea 2 - Immediately perform the operations on a real trie

A big reason why we needed TrieUpdate is that the Trie used to live on disk, which is very slow to read and write from. Keeping the changes in memory allows for faster access and writing the changes to disk in one big batch during finalize().
But that's in the past now, we've transitioned to in-memory tries, so we could apply the operations directly
to this in-memory trie instead of using the hashmap in TrieUpdate. On every read/write we would immediately perform this operation on the memtrie, and immediately record all nodes that are touched during the operation. It'd give the runtime precise information about how much storage proof was generated by each operation.

This solution would be very clean - no strange tricks, we just apply the trie changes like we said we do.
A downside is that it could affect performance. Reading and writing to a hashmap is likely faster than jumping between memtrie nodes. Measuring the performance impact would require some benchmarks.
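
A minimal sketch of what this could look like (the types and methods below are illustrative placeholders, not the real memtrie API):

```rust
// Placeholder types; the real in-memory trie update API looks different.
struct MemTrie;
struct Recorder {
    recorded_bytes: usize,
}

impl MemTrie {
    /// Stand-in for an insert that returns the serialized nodes it touched,
    /// including any nodes touched by branch restructuring.
    fn insert(&mut self, _key: &[u8], _value: &[u8]) -> Vec<Vec<u8>> {
        Vec::new()
    }
}

fn storage_write(trie: &mut MemTrie, recorder: &mut Recorder, key: &[u8], value: &[u8]) {
    // Apply the operation to the in-memory trie immediately...
    for node in trie.insert(key, value) {
        // ...and record every touched node right away, so the runtime knows
        // the exact storage proof size after each operation.
        recorder.recorded_bytes += node.len();
    }
}
```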

@jancionear
Contributor

jancionear commented Apr 6, 2024

Idea 3 - Finalize after every receipt

Currently we finalize after every chunk, but we could do it after every receipt instead. All operations during the receipt's execution would be performed on TrieUpdate just like before, but once the receipt is done we would go and apply the update on a real trie (memtrie). After finalizing we know exactly how much storage proof was generated by this receipt, and we can fail it if it went over the hard per-receipt limit.
It'd allow us to keep the performance benefits of TrieUpdate while still providing precise information about storage proof size.

Postponing node recording could have additional performance benefits - recording on every node access means that we have to perform extra work during every access - maybe serialize the node, maybe hash it. This work is repeated every time we access the node, which is wasteful. Recording only during finalize could allow us to record every node exactly once, which could be more performant.
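
A hedged sketch of the per-receipt flow (again with placeholder types; the real Runtime::apply and finalize() interfaces differ):

```rust
struct Receipt;
struct MemTrie;

// Placeholder overlay holding the changes made while executing one receipt.
struct TrieUpdate;

impl TrieUpdate {
    /// Stand-in: applies the buffered changes to the memtrie and returns how
    /// many bytes of storage proof this receipt's changes recorded.
    fn finalize(self, _trie: &mut MemTrie) -> usize {
        0
    }
}

fn apply_chunk(trie: &mut MemTrie, receipts: Vec<Receipt>, hard_per_receipt_limit: usize) {
    for _receipt in receipts {
        let update = TrieUpdate; // execute the receipt against the overlay as before
        // Finalize after this receipt instead of at the end of the chunk:
        let recorded = update.finalize(trie);
        if recorded > hard_per_receipt_limit {
            // The receipt produced too much storage proof; fail it here.
        }
    }
}
```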

@jancionear
Contributor

jancionear commented Apr 6, 2024

Idea 4 - Modify how the trie works

What if we didn't merge the branch during deletion? We could leave this work to the next person that tries to read the value from the surviving child. With such a modification it'd be enough to record nodes that are on the path to the value that is read/written.

We'd need to think about the implications of such a fundamental change.

  • Is it okay to force another person to do a lot of work?
    • it might be ok, we already do this when we force someone to merge/record the big child
  • Is it okay to have multiple representations of the same trie?
    • For correctness the trie changes have to be deterministic, but the representation might not need to be unique.
    • There could be some trouble with memtrie loading/saving
  • ???

It's a very interesting idea, but it'd change how a fundamental piece of the system works, so it could be a lot of work. OTOH it'd prepare us for Verkle trees later ;)

@Longarithm
Member Author

It makes sense, thank you very much for the overview!
Which one of these ideas would you suggest?
I'm still for (1) because:

  • Users rarely delete data from the blockchain, so deletion is unlikely to lower throughput. If we want to be careful, we can return a boolean flag from Trie like "did the latest branch have only 2 children?" and charge extra only in that case, where restructuring would be needed. We can also measure how many storage_remove ops are called on mainnet.
  • The 12x increase would happen only in the edge case where each deletion reads exactly one small node and causes the read of one extra big node. However, if a random deletion causes ~5 new (branch) node reads and one extra big read, then the overestimate is 113 * 5 = 565 -> 565 + 2000 = 2565 bytes, which is a ~5x increase. Still bad, but not as much.

In the longer term, anything from (2) to (4) looks feasible. Perhaps (4) is simpler and (3) is harder to implement, because it requires serving state reads from MemTrieUpdate, which would work but wasn't designed with that use case in mind.

@jancionear
Contributor

jancionear commented Apr 8, 2024

I'm still for (1) because:
Users rarely delete data from blockchain. So deletion will unlikely lower throughput. If we want to be careful, we can return back from Trie a boolean flag like "did the latest branch have only 2 children?" and charge extra only in this case, where restructuring would be needed. We can also measure how many storage_remove ops are called on mainnet.

Hmm, that's a good point - if it's true that contracts rarely delete things, then charging extra wouldn't hurt normal users that much, while still protecting us from malicious actors.
I dislike (1) because it's an unclean workaround that'll add even more complexity to Trie logic, which is already super complicated, but maybe it's the fastest way to go, even if it adds some tech debt.
I'll try to gather some data to see how things are.

Which one of these ideas would you suggest?

I really like (2), because it's a clean simple solution, but it might be a big effort to get it done, make sure that there are no performance problems, etc. (3) sounds more feasible to me, while still being a proper solution. I need to dive into the code and figure out what's feasible.

requires to serve state reads from MemTrieUpdate which would work but it wasn't designed with that usecase in mind

We could still serve reads from TrieUpdate, but we'd flush the committed updates to MemTrieChanges after every receipt (???? idk)

@Longarithm
Member Author

that'll add even more complexity to Trie logic

Just to clarify, adding 2000 to the size estimation on each storage_remove doesn't look like too much complexity, and it doesn't have to be inside Trie. Or do you mean the additional logic for "checking if there were 2 children"?

@jancionear
Contributor

Just to clarify, adding 2000 to the size estimation on each storage_remove doesn't look like too much complexity, and it doesn't have to be inside Trie. Or do you mean the additional logic for "checking if there were 2 children"?

Eh, IMO it counts as complex - it's one more thing that you need to remember about and be careful to do properly. It's very non-obvious, and it'll be easy to forget to charge the extra gas somewhere. People who work with this code in the future won't be aware of it; it's a trap set for them. It's not the end of the world, but I'd prefer a solution which is intuitive, without such gotchas.

@Longarithm
Member Author

Ah, I see. Actually, I don't think we should necessarily charge more gas. I just mean maintaining a u64 variable which estimates the state witness size - usually size(recorded_storage) - and, say, every time storage_remove is invoked, we add 2000 to it.
We also shouldn't worry about undercharging users there. They already have to pay for storage_remove. This operation, like every storage operation in a contract, has a base cost, which can account for that. Also, we charge using "gas costs" but chunk space is in fact limited by "compute costs", which can already be seen as severe undercharging. Additional undercharging in storage_remove doesn't look too serious to me.

@walnut-the-cat
Contributor

@robin-near, @tayfunelmas, @shreyan-gupta, @bowenwang1996, and I talked about the ideas in the warroom and converged on our own preference. We can talk about the pros/cons of each idea during the stateless validation meeting tomorrow.

@jancionear
Contributor

For now I'll work towards implementing Solution 1 (charge extra). It's a simple solution that could be good enough (depends on the data gathered from #11000).

From other discussion it seems that we've converged on Idea 3 (finalize after every receipt) as a more proper solution, but implementing it would involve more effort. We can consider implementing the proper solution later, after the MVP release.

@walnut-the-cat
Contributor

cc. @robin-near , @tayfunelmas , @shreyan-gupta .

If we go with solution 1, do we plan to implement the hard limit separately? I thought the benefit of solution 3 was being able to tackle the soft-limit corner case and the hard limit at the same time? Separate question: I understand that solution 1 will result in fewer receipts in a chunk, but what if one receipt is already too big (e.g. 50 MB)? Won't it still cause a chunk miss?

@shreyan-gupta
Contributor

Btw, quick question: in case we are going with solution 1, would we have to maintain this extra charge implementation forever, even if we upgrade the protocol later?

@jancionear
Contributor

cc. @robin-near , @tayfunelmas , @shreyan-gupta .

If we go with solution 1, do we plan to implement the hard limit separately? I thought the benefit of solution 3 was being able to tackle the soft-limit corner case and the hard limit at the same time? Separate question: I understand that solution 1 will result in fewer receipts in a chunk, but what if one receipt is already too big (e.g. 50 MB)? Won't it still cause a chunk miss?

Going with (1) doesn't cause any trouble with implementing the hard per-receipt limit. During storage_* we can ask for the adjusted size of the storage proof (with extra charges for removals) and fail the receipt if the adjusted size gets too large. Implementing the hard limit with (3) sounds harder to me, as it would require us to fail a receipt after it has already been executed. It should be easier with (1).
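
Sketched out (the function and parameter names here are illustrative, not the actual implementation), the check inside each storage_* host function would look roughly like this:

```rust
/// Illustrative only: abort the receipt once the adjusted (upper-bound)
/// storage-proof size crosses the hard per-receipt limit.
fn check_storage_proof_size(
    recorded_size: usize,          // bytes of storage proof recorded so far
    removal_count: usize,          // removals performed so far by this receipt
    hard_per_receipt_limit: usize, // the hard limit being enforced
) -> Result<(), String> {
    // Charge 2000 extra bytes per removal, as in Idea 1.
    let adjusted = recorded_size + removal_count * 2000;
    if adjusted > hard_per_receipt_limit {
        Err("receipt exceeded the per-receipt storage proof size limit".to_string())
    } else {
        Ok(())
    }
}
```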

@jancionear
Contributor

Btw, quick question: in case we are going with solution 1, would we have to maintain this extra charge implementation forever, even if we upgrade the protocol later?

I don't know, is it expected that the current binary can apply any block from the past?

@robin-near
Contributor

I don't know, is it expected that the current binary can apply any block from the past?

That's still expected today as far as I understand.

@shreyan-gupta
Contributor

So, to confirm, solution 1 basically says the following, right?

  • During state witness size calculation, in the recorder, if we have a node delete operation, add/account for an extra 2000 bytes in the calculation
  • This is effectively tracked by the recorder class via the trie_update.remove() call
  • We would now need to call this state_witness_size_estimate or state_witness_size_upper_bound
  • Finalize works as expected and calculates the actual state witness and its size

@shreyan-gupta
Contributor

Meanwhile, for the hard limit we've been having some more discussions here; I'll try to give a brief summary.

Our original thought of "finalize after each receipt execution" to get the size of the witness (per receipt) doesn't work, as it's impossible to roll back touched trie nodes. Once touched, they need to be part of the state witness, otherwise validation cannot happen properly.

What we need to remember here is that chunk validators only work on the state witness partial trie and don't have access to the whole trie like the chunk producer does.

Instead, we need to tap into the runtime and stop execution as soon as we reach the hard limit. This would be consistent across chunk producer and validators.

To be noted, initially we were thinking about this as a hard limit per receipt execution, but we may need to change our mindset and think of it as a hard limit for the whole chunk / all receipts instead, as that's the most straightforward way to implement it. (We only keep a running total of state witness size, not per-receipt sizes.)

We would then have to set the limits along the lines of hard_limit = soft_limit + max allowed size per receipt.

@shreyan-gupta
Contributor

In terms of implementation, the logic for the hard limit needs to go via the trie_update calls in runtime/runtime/src/ext.rs

The code flow is:

  • apply in runtime/runtime/src/lib.rs
  • storage_remove, storage_read etc. in runtime/near-vm-runner/src/logic/logic.rs
  • storage_get, storage_remove etc. in runtime/runtime/src/ext.rs
  • trie_update

@jancionear
Contributor

jancionear commented Apr 10, 2024

Our original thought of "finalize after each receipt execution" to get the size of the witness (per receipt) doesn't work, as it's impossible to roll back touched trie nodes. Once touched, they need to be part of the state witness, otherwise validation cannot happen properly.

Oooh, that's a good point. To prove that executing the receipt generated 100 MB of storage proof we would have to include the 100 MB storage proof in the witness x.x. You're right, post-execution checking doesn't really look viable. That kinda kills idea (3) :c
I guess that promotes (2) to the "proper solution" position...

@walnut-the-cat walnut-the-cat changed the title [stateless_validation] Count nodes needed for key updates in state witness size limit [stateless_validation][Statewitess softlimit] Count nodes needed for key updates in state witness size limit Apr 10, 2024
@walnut-the-cat walnut-the-cat changed the title [stateless_validation][Statewitess softlimit] Count nodes needed for key updates in state witness size limit [stateless_validation][witess size soft limit] Count nodes needed for key updates in state witness size limit Apr 10, 2024
@walnut-the-cat walnut-the-cat changed the title [stateless_validation][witess size soft limit] Count nodes needed for key updates in state witness size limit [stateless_validation][witness size soft limit] Count nodes needed for key updates in state witness size limit Apr 10, 2024
@jancionear
Contributor

Btw, quick question: in case we are going with solution 1, would we have to maintain this extra charge implementation forever, even if we upgrade the protocol later?

Eh yeah, this sucks :/

But I'm kinda thinking that after moving to stateless validation we might want to throw some of the old trie infrastructure away anyway. There's a lot of things that don't make much sense with stateless validation - accounting cache, flat storage, etc.
Maybe in the future we would be able to put all of this legacy trie code in some module and make a clean new implementation based on memtries (something like (2))?

github-merge-queue bot pushed a commit that referenced this issue Apr 11, 2024
…ize based on number of trie removals (#11000)

In #10890 we considered adding
extra 2000 bytes to the storage proof size for every removal operation
on the trie (Idea 1). This PR implements the logic of recording the
number of removals and calculating the adjusted storage proof size. It's
not used in the soft witness size limit for now, as we have to evaluate
how viable of a solution it is. There are concerns that charging 2000
bytes extra would cause the throughput to drop.

To evaluate the impact of adjusting the size calculation like this, the
PR adds three new metrics.
* `near_receipt_recorded_size` - a histogram of how much storage proof
is recorded when processing a receipt. Apart from
#10890, it'll help us estimate
what the hard per-receipt storage proof size limit should be.
* `near_receipt_adjusted_recorded_size` - a histogram of adjusted
storage proof size calculated when processing a receipt
* `near_receipt_adjusted_recorded_size_ratio` - ratio of adjusted size
to non-adjusted size. It'll allow us to evaluate how much the adjustment
affects the final size. The hope is that contracts rarely delete things,
so the effect will be small (a few percent), but it might turn out that
this assumption is false and the adjusted size is e.g. 2x higher than the
non-adjusted one. In that case we'd have to reevaluate whether it's a
viable solution.

I'd like to run code from this branch on shadow validation nodes to
gather data from mainnet traffic.
@jancionear
Contributor

jancionear commented Apr 11, 2024

I'll try to gather some data to see how things are.

I added some metrics to evaluate viability of idea 1 (in #11000), and started a shadow validation node to see how the upper-bound size compares to the recorded size.

Here are the grafana dashboards (starting at "Recorded storage size per receipt"):
https://nearinc.grafana.net/d/edbl9ztm5h1q8b/stateless-validation?orgId=1&from=1713182578000&to=1713193378000&var-chain_id=mainnet&var-shard_id=All&var-node_id=ci-b20a9aef-mainnet-rpc-europe-west4-01-84346caf

@jancionear
Contributor

Here are the grafana dashboards (starting at "Recorded storage size per receipt"):
https://nearinc.grafana.net/d/edbl9ztm5h1q8b/stateless-validation?orgId=1&from=1713182578000&to=1713193378000&var-chain_id=mainnet&var-shard_id=All&var-node_id=ci-b20a9aef-mainnet-rpc-europe-west4-01-84346caf

Some observation about the data:

  • 99% of receipts generate very little storage proof (< 50KB)
  • There are some outlier receipts that generate up to 2MB of storage proof
  • The maximum values of the actual per-receipt recorded storage and the upper bound are fairly similar, both are under 2MB
  • 95% of chunks generate small storage proofs (< 1MB)
  • There are some outlier chunks that generate up to ~2.5MB of storage proof
  • The maximum value of per-chunk upper bound is ~2x larger than the actual recorded size. The upper bound size gets as large as ~5MB on outliers
  • For 95% of chunks the upper bound is a pretty accurate estimate, at most 20% larger than the actual recorded size.
  • There are some outlier chunks for which the upper bound estimate is 7x larger than the actual recorded size.

Based on this we could try setting the per-receipt limit to 4MB, and the per-chunk soft limit to 8MB. It leaves us with a bit of safety margin to make sure that the existing contracts won't break, while ensuring that the total size of storage proof stays under 12MB.

The truth is that gas costs already seem to limit the size quite effectively - all of the chunk-level storage proofs are under 2MB (on mainnet traffic). The limit doesn't really need to keep the storage proof around 2MB, as that's already achieved by gas costs; it only has to protect against malicious traffic aiming to generate a huge storage proof.

@walnut-the-cat
Contributor

The maximum values of the actual per-receipt recorded storage and the upper bound are fairly similar, both are under 2MB

How often do these go out of sync?

The maximum value of per-chunk upper bound is ~2x larger than the actual recorded size. The upper bound size gets as large as ~5MB on outliers

I guess this is the other 5% of the cases? (since you mentioned upper bound is accurate for 95% of chunks)

Based on this we could try setting the per-receipt limit to 4MB, and the per-chunk soft limit to 8MB.

Is this pre-compression size?

cc. @saketh-are , @shreyan-gupta

@jancionear
Contributor

jancionear commented Apr 15, 2024

How often do these go out of sync?

I only collected this data for receipts which generate >100KB of storage proof; the ratio could get really high for small receipts, and small receipts don't matter for the hard limit, so I excluded them. I can't say for sure, but I'd estimate that ~5% of receipts have a big difference between the upper bound estimation and the actual value.

I guess this is the other 5% of the cases? (since you mentioned upper bound is accurate for 95% of chunks)

Those really big ones are ~0.2% of all chunks. I guess we could ignore them and lower the soft limit further. The soft limit can be arbitrarily lowered as long as it doesn't affect throughput too much.

Is this pre-compression size?

Yes, although Trie nodes are mostly hashes, which aren't compressible, so it might not matter that much.

@jancionear
Contributor

jancionear commented Apr 15, 2024

Added one more dashboard to see how many chunks are at most X MB big:

[image: dashboard of how many chunks are at most X MB big]

It looks like we could set the soft size limit to as low as 3MB, and 99% of the chunks would still fit in this limit. For the other 1% we would move the receipts to the delayed queue and execute them in another chunk. That would be a ~1% slowdown in chunk throughput, but it'd give us a solid guarantee that no chunk storage proof is larger than 7MB.

@walnut-the-cat
Contributor

walnut-the-cat commented Apr 15, 2024

There are some outlier receipts that generate up to 2MB of storage proof

Can you also check where these outliers are coming from? If they are from major dapps (e.g. HOT), that can be troublesome.

@jancionear
Contributor

Can you also check where these outliers are coming from? If they are from major dapps (e.g. HOT), that can be troublesome.

I added some logs to print out receipts that generate more than 500KB of storage proof (upper bound).
Looking at the past 12 hours there are five accounts which produce such receipts:

  • aurora - generates ~700KB of storage proof
  • nft.w3music.near - generates ~700KB of storage proof
  • claim.sweat - generates ~500KB of storage proof
  • v2_0_2.perp.spin-fi.near - generates 50KB of storage proof, but the upper bound is ~1MB
  • a.mitte-orderbook.near - generates ~200KB of storage proof, but upper bound is ~1.3MB

For v2_0_2.perp.spin-fi.near and a.mitte-orderbook.near the upper bound estimate is very inaccurate, up to 20x larger than the normal recorded size, but it's still under 2MB, so we can live with it.

Here are the logs:
big_receipts.txt

jancionear added a commit to jancionear/nearcore that referenced this issue Apr 16, 2024
…time

To enforce the hard per-receipt limit we need to monitor how much storage
proof has been recorded during execution of the receipt and halt the
execution when the size of generated storage proof goes over the limit.

To achieve this the runtime needs to be able to see how much proof was recorded,
so let's expose this information so that it's available from the runtime.

`recorded_storage_size()` doesn't provide the exact size of storage
proof, as it doesn't cover some corner cases (see near#10890),
so we use the `upper_bound` version to estimate how much storage proof
could've been generated by the receipt. As long as the upper bound is
under the limit we can be sure that the actual value is also
under the limit.
jancionear added a commit to jancionear/nearcore that referenced this issue Apr 16, 2024
…limit

The `recorded_storage_size()` function can sometimes return a value
which is less than the actual recorded size. See near#10890
for details.
This means that a malicious actor could create a workload which would
bypass the soft size limit and e.g. generate 10x more storage proof
than allowed.

To fix this problem let's use the upper bound estimation of the total
recorded size. As long as the upper bound estimation is under the
limit we can be sure that the actual value is also under the limit.
@jancionear
Contributor

I added some logs to print out receipts that generate more than 500KB of storage proof (upper bound).

We've got a new record!
sega.segaai.near generated 2.1MB of storage proof (upper bound estimated at 2.3MB)

Apr 17 03:14:59 ci-b20a9aef-mainnet-rpc-europe-west4-01-84346caf neard[132426]: 2024-04-17T03:14:59.035123Z  INFO apply_chunk{shard_id=4}: runtime: Receipt with large storage proof! v2 recorded_storage_diff=2118418.0 recorded_storage_upper_bound_diff=2320418.0 receipt_id=4Qz3F5ZN4npjvLUdaoR1fBiHRiqQHpPWxFFg82mbXgy1 predecessor_id=AccountId("sega.segaai.near") receiver_id=AccountId("sega.segaai.near")

github-merge-queue bot pushed a commit that referenced this issue Apr 19, 2024
…tion (#11069)

During receipt execution we record all touched nodes from the pre-state
trie. Those recorded nodes form the storage proof that is sent to
validators, and validators use it to execute the receipts and validate
the results.

In #9378 it's stated that in a
worst case scenario a single receipt can generate hundreds of megabytes
of storage proof. That would cause problems, as it'd cause the
`ChunkStateWitness` to also be hundreds of megabytes in size, and there
would be problems with sending this much data over the network.

Because of that we need to limit the size of the storage proof. We plan
to have two limits:
* per-chunk soft limit - once a chunk has more than X MB of storage
proof we stop processing new receipts, and move the remaining ones to
the delayed receipt queue. This has been implemented in
#10703
* per-receipt hard limit - once a receipt generates more than X MB of
storage proof we fail the receipt, similarly to what happens when a
receipt goes over the allowed gas limit. This one is implemented in this
PR.

Most of the hard-limit code is straightforward - we need to track the
size of recorded storage and fail the receipt if it goes over the limit.
But there is one ugly problem:
#10890. Because of the way
current `TrieUpdate` works we don't record all of the storage proof in
real time. There are some corner cases (deleting one of two children of
a branch) in which some nodes are not recorded until we do `finalize()`
at the end of the chunk. This means that we can't really use
`Trie::recorded_storage_size()` to limit the size, as it isn't fully
accurate. If we do that, a malicious actor could prepare receipts which
seem to have only 1MB of storage proof during execution, but actually
record 10MB during `finalize()`.
There is a long discussion in
#10890 along with some possible
solution ideas, please read that if you need more context.

This PR implements Idea 1 from
#10890.
Instead of using `Trie::recorded_storage_size()` we'll use
`Trie::recorded_storage_size_upper_bound()`, which estimates the upper
bound of recorded storage size by assuming that every trie removal
records additional 2000 bytes:
```rust
    /// Size of the recorded state proof plus some additional size added to cover removals.
    /// An upper-bound estimation of the true recorded size after finalization.
    /// See #10890 and #11000 for details.
    pub fn recorded_storage_size_upper_bound(&self) -> usize {
        // Charge 2000 bytes for every removal
        let removals_size = self.removal_counter.saturating_mul(2000);
        self.recorded_storage_size().saturating_add(removals_size)
    }
```
As long as the upper bound is below the limit we can be sure that the
real recorded size is also below the limit.
It's a rough estimation, which often exaggerates the actual recorded
size (even by 20+ times), but it could be a good-enough/MVP solution for
now. Doing it in a better way would require a lot of refactoring in the
Trie code. We're now [moving
fast](https://near.zulipchat.com/#narrow/stream/407237-core.2Fstateless-validation/topic/Faster.20decision.20making),
so I decided to go with this solution for now.

The upper bound calculation has been added in a previous PR along with
the metrics to see if using such a rough estimation is viable:
#11000

I set up a mainnet node with shadow validation to gather some data about
the size distribution with mainnet traffic: [Metrics
link](https://nearinc.grafana.net/d/edbl9ztm5h1q8b/stateless-validation?orgId=1&var-chain_id=mainnet&var-shard_id=All&var-node_id=ci-b20a9aef-mainnet-rpc-europe-west4-01-84346caf&from=1713225600000&to=1713272400000)

![image](https://github.com/near/nearcore/assets/149345204/dc3daa88-5f59-4ae5-aa9e-ab2802f034b8)

![image](https://github.com/near/nearcore/assets/149345204/90602443-7a0f-4503-9bce-8fbce352c0ba)

The metrics show that:
* For all receipts both the recorded size and the upper bound estimate
are below 2MB
* Overwhelming majority of receipts generate < 50KB of storage proof
* For all chunks the upper bound estimate is below 6MB
* For 99% of chunks the upper bound estimate is below 3MB

Based on this I believe that we can:
* Set the hard per-receipt limit to 4MB. All receipts were below 2MB,
but it's good to have a bit of a safety margin here. This is a hard
limit, so it might break existing contracts if they turn out to generate
more storage proof than the limit.
* Set the soft per-chunk limit to 3MB. 99% of chunks will not be
affected by this limit. For the 1% that hit the limit they'll execute
fewer receipts, with the rest of the receipts put into the delayed
receipt queue. This slightly lowers throughput of a single chunk, but
it's not a big slowdown, by ~1%.

Having a 4MB per-receipt hard limit and a 3MB per-chunk soft limit would
give us a hard guarantee that for all chunks the total storage proof
size is below 7MB.

It's worth noting that gas usage already limits the storage proof size
quite effectively. For 98% of chunks the storage proof size is already
below 2MB, so the limit isn't really needed for typical mainnet traffic.
The limit matters mostly for stopping malicious actors that'd try to DoS
the network by generating large storage proofs.

Fixes: #11019
@jancionear
Contributor

#11069 worked around this issue by implementing Idea (1) (upper bound estimation of the recorded size).
The issue is no longer a blocker for stateless validation, but IMO it's still an issue. The upper bound estimation is inaccurate; it'd be much cleaner to fix the recording issue properly.
I think the issue should remain open, but we can remove it from the stateless validation roadmap.

github-merge-queue bot pushed a commit that referenced this issue Jun 6, 2024
…11507)

To limit the amount of storage proof generated during chunk application
we calculate the upper bound estimation of how big the storage proof
will be, and stop executing receipts when this estimated size gets to
big. When estimating we assume that every trie removals generates 2000
bytes of storage proof, because this is the maximum size that a
malicious attacker could generate (#11069, #10890).

This estimation was meant to limit the size of proof generated while
executing receipts, but currently it also applies to other trie removals
performed by the runtime, for example when removing receipts from the
delayed receipt queue. This is really wasteful - removing 1000 receipts
would cause the estimation to jump by 2MB, hitting the soft limit. We
don't really need to charge this much for internal operations performed
by the runtime; they aren't malicious. Let's change it so that only
contracts are charged extra for removals. This will avoid the extra big
estimation caused by normal queue manipulation.

Refs:
https://near.zulipchat.com/#narrow/stream/308695-nearone.2Fprivate/topic/Large.20number.20of.20delayed.20receipts.20in.20statelessnet/near/442878068