
Unify and lower state caches #5313

Closed
wants to merge 10 commits

Conversation


@dknopik (Member) commented Feb 26, 2024

Issue Addressed

Proposed Changes

  1. Rewrite the PromiseCache: instead of holding finished values for an indefinite amount of time, we only supply them to threads that are waiting for them, and then discard them. The caller may or may not store the finished value in another cache. (See the sketch after this list.)

    • Rationale: The PromiseCache can easily be combined with a "long-term" caching solution such as an LruCache or another situationally appropriate cache.
    • Drawback: This separation goes directly against the goal of the issue addressed: unification of caches. It could be mitigated by providing a wrapper that automatically links a PromiseCache with some kind of "long-term" cache.
  2. Introduce the new PromiseCache in HotColdDB::get_hot_state.

  3. Introduce the new PromiseCache in HotColdDB::load_hdiff_buffer_for_slot to accelerate parallel cold state loads.

    • Rationale: Introducing the cache here instead of in e.g. load_cold_state_by_slot lets us benefit from it not only when the state of a specific slot is requested in parallel, but also when we request states that require the same diff that is currently being computed.
    • Drawback: Compared to introducing it in load_cold_state_by_slot, we perform more write accesses to the cache, which are superfluous if there is no parallel access.
  4. Remove the ShufflingCache.

    • Rationale: The new caches in the HotColdDB are sufficient, so we save memory by removing this cache.
    • Drawback: Especially for cold states, relying on the low-level caches is slower (on my machine: ~10ms for infrequent requests and ~20ms-50ms for rapid requests to the same state).
  5. Arc the HDiffBuffers in the diff_buffer_cache.

    • Rationale: I found that some of the performance regression mentioned in the drawbacks of the previous point was due to avoidable clones out of the diff_buffer_cache. This change avoids those copies.
    • Drawback: In some code paths, the Arc only adds unnecessary indirection and does not save a clone.
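For illustration, here is a minimal sketch of the reworked PromiseCache semantics from point 1 (this is not the code from this PR; the names and the Condvar-based implementation are simplified assumptions). An in-flight map de-duplicates concurrent computations, the finished value is handed only to threads that were already waiting, and nothing is retained afterwards, so the caller can layer a long-term cache such as an LruCache on top:

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::{Arc, Condvar, Mutex};

/// Simplified sketch: tracks only in-flight computations and hands the
/// finished value to threads that are already waiting, then forgets it.
pub struct PromiseCache<K: Eq + Hash + Clone, V: Clone> {
    inflight: Mutex<HashMap<K, Arc<Promise<V>>>>,
}

struct Promise<V> {
    value: Mutex<Option<V>>,
    cond: Condvar,
}

pub enum Attempt<V> {
    /// Nobody is computing this value: the caller must compute it and then
    /// call `resolve` (also on failure, or waiters would block forever).
    Compute,
    /// Another thread finished the computation while we were waiting.
    Ready(V),
}

impl<K: Eq + Hash + Clone, V: Clone> PromiseCache<K, V> {
    pub fn new() -> Self {
        Self { inflight: Mutex::new(HashMap::new()) }
    }

    pub fn get_or_wait(&self, key: &K) -> Attempt<V> {
        let promise = {
            let mut inflight = self.inflight.lock().unwrap();
            match inflight.get(key) {
                Some(promise) => promise.clone(),
                None => {
                    let promise = Arc::new(Promise {
                        value: Mutex::new(None),
                        cond: Condvar::new(),
                    });
                    inflight.insert(key.clone(), promise);
                    return Attempt::Compute;
                }
            }
        };
        // Block until the computing thread resolves the promise.
        let mut value = promise.value.lock().unwrap();
        while value.is_none() {
            value = promise.cond.wait(value).unwrap();
        }
        Attempt::Ready(value.clone().unwrap())
    }

    /// Wake all waiters with the finished value and discard the promise.
    /// The caller may additionally store `value` in a long-term cache
    /// (e.g. an LruCache) if desired.
    pub fn resolve(&self, key: &K, value: V) {
        if let Some(promise) = self.inflight.lock().unwrap().remove(key) {
            *promise.value.lock().unwrap() = Some(value);
            promise.cond.notify_all();
        }
    }
}
```

On a miss, the computing thread would typically also insert the result into the long-term cache around its call to resolve, which is essentially the wrapper mentioned in the drawback of point 1.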

Additional Info

  • I considered reintroducing the historic_state_cache, but decided against it, as I believe the performance of the new caching is still sufficient. Reintroducing it is, however, still a viable option to aid with rapid requests for the same slot.
  • Keep in mind that tasks waiting in a PromiseCache block their thread, i.e. the thread is not available to the tokio executor. Avoiding this is currently not possible, as the store is not async. So while waiting tasks save system resources by not running parallel computations, their threads still cannot be used by tokio to run other tasks. Future work might make the state retrieval interface async, allowing an async PromiseCache variant that frees these threads for the tokio executor (see the sketch below this list).
  • I tested the performance mostly on local testnets, but tried to keep in mind that real states tend to be larger...
  • As always, I am happy about any feedback :)
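As referenced above, a rough sketch of what an async PromiseCache variant might look like once state retrieval is async (purely hypothetical and not part of this PR; it assumes tokio's watch channel so that waiters yield the thread back to the executor instead of blocking it):

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::Mutex;
use tokio::sync::watch;

/// Hypothetical async variant: waiters await a watch channel instead of
/// blocking, so the tokio executor can reuse the thread for other tasks.
pub struct AsyncPromiseCache<K: Eq + Hash + Clone, V: Clone> {
    inflight: Mutex<HashMap<K, watch::Sender<Option<V>>>>,
}

pub enum Attempt<V> {
    /// The caller must compute the value and then call `resolve`.
    Compute,
    /// Another task finished the computation while we were waiting.
    Ready(V),
}

impl<K: Eq + Hash + Clone, V: Clone> AsyncPromiseCache<K, V> {
    pub fn new() -> Self {
        Self { inflight: Mutex::new(HashMap::new()) }
    }

    pub async fn get_or_wait(&self, key: &K) -> Attempt<V> {
        // Subscribe (or register a new in-flight computation) without
        // holding the lock across an await point.
        let mut rx = {
            let mut inflight = self.inflight.lock().unwrap();
            match inflight.get(key) {
                Some(tx) => tx.subscribe(),
                None => {
                    let (tx, _rx) = watch::channel(None);
                    inflight.insert(key.clone(), tx);
                    return Attempt::Compute;
                }
            }
        };
        loop {
            if let Some(value) = rx.borrow_and_update().clone() {
                return Attempt::Ready(value);
            }
            if rx.changed().await.is_err() {
                // The computing task dropped the sender without resolving.
                return Attempt::Compute;
            }
        }
    }

    /// Publish the finished value to all waiting tasks and forget the
    /// entry, as in the blocking version above.
    pub fn resolve(&self, key: &K, value: V) {
        if let Some(tx) = self.inflight.lock().unwrap().remove(key) {
            tx.send_replace(Some(value));
        }
    }
}
```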

@michaelsproul added the optimization, tree-states, and ready-for-review labels on Feb 27, 2024
@michaelsproul (Member) left a comment


Thanks for implementing this! It looks awesome on the whole!

I'll do some benchmarks for the committee cache removal on mainnet or Holesky. I think the hot committees should be fine as we will use the head or some state from the state_cache, but as you said via DM the cold committees are going to be slower due to the non-zero cost of going from HDiffBuffer -> BeaconState -> CommitteeCache.

I guess a hybrid approach could be to retain the shuffling_cache as a simpler LRU, which relies on the de-dupe at the store level for parallel requests?
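For reference, that simpler LRU could be as plain as the following sketch (it assumes the lru crate's NonZeroUsize-based API and leaves the key and value types generic rather than naming the actual shuffling types):

```rust
use lru::LruCache;
use std::hash::Hash;
use std::num::NonZeroUsize;
use std::sync::Arc;

/// Plain LRU wrapper with no promise machinery: parallel requests for the
/// same shuffling are expected to be de-duplicated at the store level, so a
/// miss here simply falls through to the (de-duplicated) state load.
pub struct SimpleShufflingCache<K: Hash + Eq, V> {
    cache: LruCache<K, Arc<V>>,
}

impl<K: Hash + Eq, V> SimpleShufflingCache<K, V> {
    pub fn new(capacity: NonZeroUsize) -> Self {
        Self { cache: LruCache::new(capacity) }
    }

    pub fn get(&mut self, key: &K) -> Option<Arc<V>> {
        self.cache.get(key).cloned()
    }

    pub fn insert(&mut self, key: K, value: Arc<V>) {
        self.cache.put(key, value);
    }
}
```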

beacon_node/store/src/hdiff.rs (outdated)
```diff
  // Load diff and apply it to buffer.
  let diff = self.load_hdiff_for_slot(slot)?;
- diff.apply(&mut buffer)?;
+ diff.apply(Arc::make_mut(&mut buffer))?;
```
Member


I like the addition of the Arcs here. In future we might be able to go further and just clone the buffer.balances, because that's the only one we actually mutate.

The main state diff only needs to be mutable because we re-assign it here:

```rust
pub fn apply_xdelta(&self, source: &[u8], target: &mut Vec<u8>) -> Result<(), Error> {
    *target = xdelta3::decode(&self.bytes, source).ok_or(Error::UnableToApplyDiff)?;
    Ok(())
}
```

i.e. it's a bit wasteful as-is, because we clone the buffer.state and then don't even use that memory
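One hypothetical way to avoid that wasted clone would be a variant that returns the decoded bytes instead of overwriting a &mut target (the name and signature below are illustrative, not something in this PR):

```rust
/// Returns the decoded state bytes instead of overwriting a caller-provided
/// buffer, so the caller no longer needs a mutable (i.e. cloned) copy of the
/// old state bytes only to discard them.
pub fn apply_xdelta_owned(&self, source: &[u8]) -> Result<Vec<u8>, Error> {
    xdelta3::decode(&self.bytes, source).ok_or(Error::UnableToApplyDiff)
}
```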

Member Author


Aaah, I missed that! I committed a quick suggestion, but am not sure if it's the best approach.

common/promise_cache/src/lib.rs (outdated)
common/promise_cache/src/lib.rs (outdated)
@michaelsproul added the waiting-on-author label and removed the ready-for-review label on Feb 27, 2024

dknopik commented Feb 29, 2024

> I guess a hybrid approach could be to retain the shuffling_cache as a simpler LRU, which relies on the de-dupe at the store level for parallel requests?

Hmm. Over that, I think I would prefer re-enabling the historic state cache with computed committee caches and a default size of 1, as the shuffling_cache would otherwise again cache the shufflings from hot states as well, which is kind of wasteful. That approach would also benefit users who are collecting all kinds of info for single cold states.

But it is kind of hard to judge that, as I don't know the API usage profile of the average user interested in cold states.


dknopik commented Mar 13, 2024

This is tagged as "waiting-on-author". How do we want to proceed with it?

@realbigsean added the ready-for-review label and removed the waiting-on-author label on Mar 13, 2024

realbigsean commented Mar 13, 2024

> This is tagged as "waiting-on-author". How do we want to proceed with it?

looks ready for a re-review, I think the tag was just outdated

@michaelsproul

Hey @dknopik, sorry for the slow review on this one.

I was testing out your changes while also trying to get tree-states to help rescue the Goerli network. I found we were still getting slogged by lots of cache misses and parallel state loads, so I tried re-working the cache to de-duplicate even more requests on this branch: https://github.com/michaelsproul/lighthouse/commits/tree-states-goerli-special/

Even with those changes, Goerli was still basically untenable. This has led me to reconsider whether we want to continue using state diffs in the hot database. The problem is that, during periods of long non-finality, when you get a cache miss you need to load potentially hundreds of diffs (one every few epochs) back to the finalized state. This takes a lot of time.

In tandem, we're also working on merging tree-states down to unstable gradually. The plan there is to split it into 3 parts:

  1. Single-pass epoch processing and optimised block processing #5279
  2. In-memory tree-states (I'm working on this now)
  3. Database changes (everything from tree-states, and the migration from Tree states database upgrade 🏗️ #5067)

For (2) we don't need all the complexity of the state diff handling in the cache, so I will try to incorporate some form of your changes. To keep the change gradual I think I will also keep the attester shuffling caches, until we can show they're not necessary.

For the disk-based changes, I think if we abandon the state diffing and store full CompactBeaconStates every N epochs, this will accomplish some space saving compared to unstable and be faster during non-finality, because we just need to load one state and replay up to N epochs' worth of blocks in case of a cache miss. I think it will also be good to drip the DB improvements in gradually after we have (2). The ones we're confident in, like fixing the pubkey cache (#3505), could go in first.
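To make the trade-off concrete, a cache miss under that scheme would look roughly like the following sketch (the trait and its method names are placeholders for illustration, not the HotColdDB API):

```rust
use std::ops::RangeInclusive;

/// Placeholder slot type standing in for Lighthouse's Slot.
type Slot = u64;

/// Hypothetical minimal view of the store, just enough to express the
/// "full state every N epochs + block replay" reconstruction described above.
trait ColdStore {
    type State;
    type Block;
    type Error;

    /// Load the most recent stored full state at or below `slot`.
    fn load_full_state_at_or_below(&self, slot: Slot) -> Result<(Slot, Self::State), Self::Error>;
    /// Load the blocks needed to advance through the given slot range.
    fn load_blocks(&self, range: RangeInclusive<Slot>) -> Result<Vec<Self::Block>, Self::Error>;
    /// Apply blocks (and empty slots) to bring `state` up to `to`.
    fn replay_blocks(&self, state: &mut Self::State, blocks: Vec<Self::Block>, to: Slot) -> Result<(), Self::Error>;
}

/// Reconstruct the state at `slot` from the nearest stored full state.
fn reconstruct_state<S: ColdStore>(store: &S, slot: Slot) -> Result<S::State, S::Error> {
    let (base_slot, mut state) = store.load_full_state_at_or_below(slot)?;
    let blocks = store.load_blocks(base_slot + 1..=slot)?;
    store.replay_blocks(&mut state, blocks, slot)?;
    Ok(state)
}
```

The miss cost is bounded by one full-state load plus at most N epochs of block replay, whereas a diff chain during long non-finality can require hundreds of diff applications.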

@michaelsproul added the work-in-progress and under-review labels and removed the ready-for-review label on Apr 8, 2024

dknopik commented Apr 8, 2024

Hi @michaelsproul, thanks for the detailed update!

Unfortunately I'll be pretty busy this week, but if I find some time I'll try to properly catch up. It probably makes no sense for me to get involved through coding right now, but if I have any ideas regarding caching (or the current state of affairs in general) I'll let you know.

@michaelsproul mentioned this pull request Apr 8, 2024
@michaelsproul

No worries @dknopik! I think I can handle merging stuff down in a satisfactory way.

The in-memory PR is here for reference:


dknopik commented Nov 27, 2024

Closing, obsolete

@dknopik closed this Nov 27, 2024
Labels: optimization, tree-states, under-review, work-in-progress