Restructure TrieStorage and implement correct semantics for state proofs (near#9350)

## Background
For stateless validation, it must be possible to generate a state proof when applying a chunk, and we must ensure that
 - While generating the state proof, the execution outcome must be identical to what it would be if we weren't generating one
 - When reapplying the same chunk using only the state proof (no full trie or flat storage), the outcome must also be identical.

## Problem
The tricky part is storage gas accounting. 

For background, the trie is stored as two kinds of entries: trie nodes and values. For trie nodes, the key is the hash of the node and the value is the node itself. For values, the key is the hash of the whole value and the value is the value itself.
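
To make that concrete, here's a toy illustration of how both kinds of entries share one hash-keyed store. The types and the hash function are stand-ins, not the actual nearcore storage code:

```rust
use std::collections::HashMap;

type Hash = [u8; 32];

// Stand-in for a real cryptographic hash.
fn fake_hash(data: &[u8]) -> Hash {
    let mut h = [0u8; 32];
    for (i, b) in data.iter().enumerate() {
        h[i % 32] ^= *b;
    }
    h
}

fn main() {
    let mut store: HashMap<Hash, Vec<u8>> = HashMap::new();

    // Trie node: key = hash of the serialized node, value = the node itself.
    let node = b"serialized trie node".to_vec();
    store.insert(fake_hash(&node), node.clone());

    // Value: key = hash of the whole value, value = the value itself.
    let value = b"stored value bytes".to_vec();
    store.insert(fake_hash(&value), value.clone());

    assert_eq!(store.get(&fake_hash(&node)), Some(&node));
}
```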

When looking up either a trie node or a value, there are three cases:
 - If looking up a trie node, the node is returned from flat storage (unless flat storage is disabled for testing).
 - Otherwise, the node or value is looked up first in the chunk cache (a deterministic cache accumulated during chunk application). If present, it is counted as an in-memory access.
 - Otherwise, the node or value is fetched from disk, which is counted as a from-db access. (The actual fetch may nevertheless be served by another cache, the 'shard cache', which is non-deterministic and best-effort.)

**Each of these cases incurs a different storage operation cost.** Therefore, when recording and replaying, we must compute exactly the same cost, which requires correctly determining which of these cases applies to each access.
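
As a rough sketch of this classification (illustrative names only; the real logic is spread across Trie and its storage layers):

```rust
#[derive(Debug, PartialEq)]
enum AccessKind {
    FlatStorage, // trie node served from flat storage
    InMemory,    // hit in the deterministic chunk cache
    FromDb,      // fetched from the underlying storage (possibly via the shard cache)
}

fn classify(served_by_flat_storage: bool, in_chunk_cache: bool) -> AccessKind {
    if served_by_flat_storage {
        AccessKind::FlatStorage
    } else if in_chunk_cache {
        AccessKind::InMemory
    } else {
        AccessKind::FromDb
    }
}

fn main() {
    // Each kind maps to a different gas cost, so recording and replaying
    // must reproduce this classification exactly.
    assert_eq!(classify(false, true), AccessKind::InMemory);
}
```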

Previously, this was difficult, because the chunk cache lives in TrieCachingStorage, while the flat storage layer lives in Trie. (For background, Trie contains a TrieStorage, a base trait implemented by TrieCachingStorage, TrieRecordingStorage, and TriePartialMemoryStorage.) So when recording or replaying, the TrieStorage used is NOT a TrieCachingStorage, but either a TrieRecordingStorage or a TriePartialMemoryStorage. This makes the chunk cache inaccessible, so the other two storage implementations had to replicate its behavior; but that behavior was impossible to replicate correctly, because the flat storage layer is independent:

* Consider a lookup that goes through flat storage. The Trie reads the result from flat storage and then skips calling the underlying TrieStorage. This is a problem when we're recording: even though the result came from flat storage, the access must still be included in the recorded state proof. That suggests calling the underlying TrieStorage to do the lookup anyway, but doing so would increment the in-memory and db access counters, which would be incorrect.

## What this PR does
This PR moves the critical chunk cache logic out of TrieCachingStorage and into a field of Trie, which ensures that no matter which TrieStorage is used, the chunk cache logic is always the same. Additionally, TrieRecordingStorage is removed and its logic is likewise moved into Trie as an optional field. This gives Trie total control over when to cache and when to record:

* In the flat storage case, if we're recording, we additionally look the key up via the trie (so we can produce the state proof), but this lookup goes straight to the TrieStorage layer, which means the cost of these trie accesses is not counted. This ensures that the gas computation is the same whether we're recording or not.
* If we're recording, any access that goes to the TrieStorage layer also goes into the recorded state proof. This is done in a helper method so we're sure not to miss anything.
* If we're replaying, we use TriePartialMemoryStorage instead of TrieCachingStorage. The chunk cache layer remains the same, since it is independent. The one difference is that we don't have flat storage when replaying; the trick there is to do the trie lookups without counting their cost, just as when recording (see the sketch below).
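
The sketch below compresses this control flow into self-contained Rust. All names and helpers are stand-ins, not the real nearcore types; the counted path corresponds to TrieAccountingCache::retrieve_raw_bytes_with_accounting in the new accounting_cache.rs further down.

```rust
use std::collections::HashMap;

struct Trie {
    storage: HashMap<u64, Vec<u8>>,                  // stand-in for dyn TrieStorage
    flat_storage: Option<HashMap<Vec<u8>, Vec<u8>>>, // key -> value
    chunk_cache: HashMap<u64, Vec<u8>>,              // deterministic accounting cache
    recorded: Option<Vec<Vec<u8>>>,                  // Some(_) while recording a proof
    db_reads: u64,
    mem_reads: u64,
}

impl Trie {
    fn get(&mut self, key: &[u8], node_hash: u64) -> Option<Vec<u8>> {
        if let Some(flat) = &self.flat_storage {
            let value = flat.get(key).cloned();
            if let Some(recorded) = &mut self.recorded {
                // Record the node for the state proof, but fetch it straight
                // from storage, bypassing the counters: gas must be identical
                // whether or not we are recording.
                if let Some(node) = self.storage.get(&node_hash) {
                    recorded.push(node.clone());
                }
            }
            return value;
        }
        // No flat storage (e.g. replaying from a state proof):
        // go through the accounting cache and count the access.
        let node = if let Some(node) = self.chunk_cache.get(&node_hash) {
            self.mem_reads += 1;
            node.clone()
        } else {
            self.db_reads += 1;
            let node = self.storage.get(&node_hash)?.clone();
            self.chunk_cache.insert(node_hash, node.clone());
            node
        };
        if let Some(recorded) = &mut self.recorded {
            recorded.push(node.clone());
        }
        Some(node)
    }
}

fn main() {
    let mut trie = Trie {
        storage: HashMap::from([(42, b"node".to_vec())]),
        flat_storage: None,
        chunk_cache: HashMap::new(),
        recorded: Some(Vec::new()),
        db_reads: 0,
        mem_reads: 0,
    };
    trie.get(b"key", 42);
    trie.get(b"key", 42);
    // First access hits storage, second hits the deterministic cache.
    assert_eq!((trie.db_reads, trie.mem_reads), (1, 1));
}
```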

This is still not a perfect design, as there are still quite a few places where we penetrate the abstraction layers, but it's better than before. At least we no longer need a TrieRecordingStorage wrapping a TrieCachingStorage as its backend and reimplementing the chunk caching behavior everywhere.

## Testing
Eight randomized tests have been added (see trie_recording.rs) to check that, in various scenarios (chunk cache enabled or disabled; looking up existing or missing keys; using flat storage or not), the trie access results (values and counters) are exactly the same across the normal trie, the recording trie, and the replaying trie. This should give us confidence that the implementation is suitable for stateless validation.
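
As a toy model of the invariant these tests assert (not the actual test code, which exercises real tries against a real store), a recording pass and a replay pass over the same key sequence must produce identical values and identical counters:

```rust
use std::collections::HashMap;

#[derive(Default)]
struct Counters {
    db_reads: u64,
    mem_reads: u64,
}

// One counted lookup through a deterministic chunk cache backed by `store`.
fn lookup(
    store: &HashMap<u64, Vec<u8>>,
    chunk_cache: &mut HashMap<u64, Vec<u8>>,
    counters: &mut Counters,
    hash: u64,
) -> Option<Vec<u8>> {
    if let Some(v) = chunk_cache.get(&hash) {
        counters.mem_reads += 1;
        return Some(v.clone());
    }
    let v = store.get(&hash)?.clone();
    counters.db_reads += 1;
    chunk_cache.insert(hash, v.clone());
    Some(v)
}

fn main() {
    let full: HashMap<u64, Vec<u8>> = (0..10u64).map(|i| (i, vec![i as u8])).collect();
    let keys = [3u64, 7, 3, 9, 7];

    // "Recording" pass over the full store: collect everything we touch.
    let mut cache = HashMap::new();
    let mut c1 = Counters::default();
    let mut proof: HashMap<u64, Vec<u8>> = HashMap::new();
    let r1: Vec<_> = keys
        .iter()
        .map(|k| {
            let v = lookup(&full, &mut cache, &mut c1, *k);
            if let Some(v) = &v {
                proof.insert(*k, v.clone());
            }
            v
        })
        .collect();

    // "Replaying" pass using only the recorded entries.
    let mut cache2 = HashMap::new();
    let mut c2 = Counters::default();
    let r2: Vec<_> = keys.iter().map(|k| lookup(&proof, &mut cache2, &mut c2, *k)).collect();

    assert_eq!(r1, r2);
    assert_eq!((c1.db_reads, c1.mem_reads), (c2.db_reads, c2.mem_reads));
}
```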

Also, I'm running a mainnet node with this code. It has been running fine for a few hundred blocks, so the implementation should also be consistent with the previous one.

Nayduck: https://nayduck.near.org/#/run/3150
robin-near authored and nikurt committed Aug 24, 2023
1 parent 13d3a58 commit 5bee57a
Showing 29 changed files with 729 additions and 304 deletions.
2 changes: 1 addition & 1 deletion chain/chain/src/flat_storage_creator.rs
@@ -110,7 +110,7 @@ impl FlatStorageShardCreator {
            trie_iter.visit_nodes_interval(&path_begin, &path_end).unwrap()
        {
            if let Some(key) = key {
-               let value = trie.storage.retrieve_raw_bytes(&hash).unwrap();
+               let value = trie.retrieve_value(&hash).unwrap();
                store_helper::set_flat_state_value(
                    &mut store_update,
                    shard_uid,
2 changes: 1 addition & 1 deletion core/primitives/src/types.rs
@@ -960,7 +960,7 @@ pub trait EpochInfoProvider {
}

/// Mode of the trie cache.
-#[derive(Debug, Copy, Clone)]
+#[derive(Debug, Copy, Clone, PartialEq, Eq)]
pub enum TrieCacheMode {
    /// In this mode we put each visited node to LRU cache to optimize performance.
    /// Presence of any exact node is not guaranteed.
3 changes: 3 additions & 0 deletions core/store/src/metrics.rs
@@ -18,6 +18,7 @@ pub(crate) static DATABASE_OP_LATENCY_HIST: Lazy<HistogramVec> = Lazy::new(|| {
    .unwrap()
});

+// TODO(#9054): Rename the metric to be consistent with "accounting cache".
pub static CHUNK_CACHE_HITS: Lazy<IntCounterVec> = Lazy::new(|| {
    try_create_int_counter_vec(
        "near_chunk_cache_hits",
@@ -27,6 +28,7 @@ pub static CHUNK_CACHE_HITS: Lazy<IntCounterVec> = Lazy::new(|| {
    .unwrap()
});

+// TODO(#9054): Rename the metric to be consistent with "accounting cache".
pub static CHUNK_CACHE_MISSES: Lazy<IntCounterVec> = Lazy::new(|| {
    try_create_int_counter_vec(
        "near_chunk_cache_misses",
@@ -68,6 +70,7 @@ pub static SHARD_CACHE_SIZE: Lazy<IntGaugeVec> = Lazy::new(|| {
    .unwrap()
});

+// TODO(#9054): Rename the metric to be consistent with "accounting cache".
pub static CHUNK_CACHE_SIZE: Lazy<IntGaugeVec> = Lazy::new(|| {
    try_create_int_gauge_vec("near_chunk_cache_size", "Chunk cache size", &["shard_id", "is_view"])
        .unwrap()
32 changes: 31 additions & 1 deletion core/store/src/test_utils.rs
@@ -1,15 +1,17 @@
use std::collections::{HashMap, HashSet};
use std::sync::Arc;

+use near_primitives::state::{FlatStateValue, ValueRef};
+use near_primitives::trie_key::TrieKey;
use rand::seq::SliceRandom;
use rand::Rng;

use crate::db::TestDB;
use crate::flat::{store_helper, BlockInfo, FlatStorageReadyStatus};
use crate::metadata::{DbKind, DbVersion, DB_VERSION};
use crate::{get, get_delayed_receipt_indices, DBCol, NodeStorage, ShardTries, Store};
use near_primitives::account::id::AccountId;
-use near_primitives::hash::CryptoHash;
+use near_primitives::hash::{hash, CryptoHash};
use near_primitives::receipt::{DataReceipt, Receipt, ReceiptEnum};
use near_primitives::shard_layout::{ShardUId, ShardVersion};
use near_primitives::types::{NumShards, StateRoot};
@@ -87,6 +89,34 @@ pub fn test_populate_trie(
    root
}

+pub fn test_populate_flat_storage(
+    tries: &ShardTries,
+    shard_uid: ShardUId,
+    block_hash: &CryptoHash,
+    prev_block_hash: &CryptoHash,
+    changes: &Vec<(Vec<u8>, Option<Vec<u8>>)>,
+) {
+    let mut store_update = tries.store_update();
+    store_helper::set_flat_storage_status(
+        &mut store_update,
+        shard_uid,
+        crate::flat::FlatStorageStatus::Ready(FlatStorageReadyStatus {
+            flat_head: BlockInfo { hash: *block_hash, prev_hash: *prev_block_hash, height: 1 },
+        }),
+    );
+    for (key, value) in changes {
+        store_helper::set_flat_state_value(
+            &mut store_update,
+            shard_uid,
+            key.clone(),
+            value.as_ref().map(|value| {
+                FlatStateValue::Ref(ValueRef { hash: hash(value), length: value.len() as u32 })
+            }),
+        );
+    }
+    store_update.commit().unwrap();
+}
+
/// Insert values to non-reference-counted columns in the store.
pub fn test_populate_store(store: &Store, data: &[(DBCol, Vec<u8>, Vec<u8>)]) {
    let mut update = store.store_update();
122 changes: 122 additions & 0 deletions core/store/src/trie/accounting_cache.rs
@@ -0,0 +1,122 @@
use near_o11y::metrics::prometheus;
use near_o11y::metrics::prometheus::core::{GenericCounter, GenericGauge};
use near_primitives::errors::StorageError;
use near_primitives::hash::CryptoHash;
use near_primitives::shard_layout::ShardUId;
use near_vm_runner::logic::TrieNodesCount;
use std::collections::HashMap;
use std::sync::Arc;

use crate::{metrics, TrieStorage};

/// Deterministic cache to store trie nodes that have been accessed so far
/// during the cache's lifetime. It is used for deterministic gas accounting
/// so that previously accessed trie nodes and values are charged at a
/// cheaper gas cost.
///
/// This cache's correctness is critical as it contributes to the gas
/// accounting of storage operations during contract execution. For that
/// reason, a new TrieAccountingCache must be created at the beginning of a
/// chunk's execution, and the db_read_nodes and mem_read_nodes must be taken
/// into account whenever a storage operation is performed to calculate what
/// kind of operation it was.
///
/// Note that we don't have a size limit for values in the accounting cache.
/// There are two reasons:
/// - for nodes, value size is an implementation detail. If we change
/// internal representation of a node (e.g. change `memory_usage` field
/// from `RawTrieNodeWithSize`), this would have to be a protocol upgrade.
/// - total size of all values is limited by the runtime fees. More
/// thoroughly:
/// - number of nodes is limited by receipt gas limit / touching trie
/// node fee ~= 500 Tgas / 16 Ggas = 31_250;
/// - size of trie keys and values is limited by receipt gas limit /
/// lowest per byte fee (`storage_read_value_byte`) ~=
/// (500 * 10**12 / 5611005) / 2**20 ~= 85 MB.
/// All values are given as of 16/03/2022. We may consider more precise limit
/// for the accounting cache as well.
///
/// Note that in general, it is NOT true that all storage access is either a
/// db read or mem read. It can also be a flat storage read, which is not
/// tracked via TrieAccountingCache.
pub struct TrieAccountingCache {
    /// Whether the cache is enabled. By default it is not, but it can be
    /// turned on or off on the fly.
    enable: bool,
    /// Cache of trie node hash -> trie node body, or a leaf value hash ->
    /// leaf value.
    cache: HashMap<CryptoHash, Arc<[u8]>>,
    /// The number of times a key was accessed by reading from the underlying
    /// storage. (This does not necessarily mean it was accessed from *disk*,
    /// as the underlying storage layer may have a best-effort cache.)
    db_read_nodes: u64,
    /// The number of times a key was accessed when it was deterministically
    /// already cached during the processing of this chunk.
    mem_read_nodes: u64,
    /// Prometheus metrics. It's optional - in testing it can be None.
    metrics: Option<TrieAccountingCacheMetrics>,
}

struct TrieAccountingCacheMetrics {
    accounting_cache_hits: GenericCounter<prometheus::core::AtomicU64>,
    accounting_cache_misses: GenericCounter<prometheus::core::AtomicU64>,
    accounting_cache_size: GenericGauge<prometheus::core::AtomicI64>,
}

impl TrieAccountingCache {
    /// Constructs a new accounting cache. By default it is not enabled.
    /// The optional parameter is passed in if prometheus metrics are desired.
    pub fn new(shard_uid_and_is_view: Option<(ShardUId, bool)>) -> Self {
        let metrics = shard_uid_and_is_view.map(|(shard_uid, is_view)| {
            let mut buffer = itoa::Buffer::new();
            let shard_id = buffer.format(shard_uid.shard_id);

            let metrics_labels: [&str; 2] = [&shard_id, if is_view { "1" } else { "0" }];
            TrieAccountingCacheMetrics {
                accounting_cache_hits: metrics::CHUNK_CACHE_HITS.with_label_values(&metrics_labels),
                accounting_cache_misses: metrics::CHUNK_CACHE_MISSES
                    .with_label_values(&metrics_labels),
                accounting_cache_size: metrics::CHUNK_CACHE_SIZE.with_label_values(&metrics_labels),
            }
        });
        Self { enable: false, cache: HashMap::new(), db_read_nodes: 0, mem_read_nodes: 0, metrics }
    }

    pub fn set_enabled(&mut self, enabled: bool) {
        self.enable = enabled;
    }

    /// Retrieve raw bytes from the cache if it exists, otherwise retrieve it
    /// from the given storage, and count it as a db access.
    pub fn retrieve_raw_bytes_with_accounting(
        &mut self,
        hash: &CryptoHash,
        storage: &dyn TrieStorage,
    ) -> Result<Arc<[u8]>, StorageError> {
        if let Some(node) = self.cache.get(hash) {
            self.mem_read_nodes += 1;
            if let Some(metrics) = &self.metrics {
                metrics.accounting_cache_hits.inc();
            }
            Ok(node.clone())
        } else {
            self.db_read_nodes += 1;
            if let Some(metrics) = &self.metrics {
                metrics.accounting_cache_misses.inc();
            }
            let node = storage.retrieve_raw_bytes(hash)?;

            if self.enable {
                self.cache.insert(*hash, node.clone());
                if let Some(metrics) = &self.metrics {
                    metrics.accounting_cache_size.set(self.cache.len() as i64);
                }
            }
            Ok(node)
        }
    }

    pub fn get_trie_nodes_count(&self) -> TrieNodesCount {
        TrieNodesCount { db_reads: self.db_read_nodes, mem_reads: self.mem_read_nodes }
    }
}
7 changes: 2 additions & 5 deletions core/store/src/trie/iterator.rs
@@ -372,7 +372,7 @@ impl<'a> TrieIterator<'a> {
                if self.key_nibbles[prefix..] >= path_end[prefix..] {
                    break;
                }
-               self.trie.storage.retrieve_raw_bytes(&hash)?;
+               self.trie.retrieve_value(&hash)?;
                nodes_list.push(TrieTraversalItem {
                    hash,
                    key: self.has_value().then(|| self.key()),
@@ -417,10 +417,7 @@ impl<'a> Iterator for TrieIterator<'a> {
                },
                (IterStep::Value(hash), true) => {
                    return Some(
-                       self.trie
-                           .storage
-                           .retrieve_raw_bytes(&hash)
-                           .map(|value| (self.key(), value.to_vec())),
+                       self.trie.retrieve_value(&hash).map(|value| (self.key(), value.to_vec())),
                    )
                }
            }