
Unify and lower state caches #5313

Closed
wants to merge 10 commits

Conversation


@dknopik (Member) commented Feb 26, 2024

Issue Addressed

Proposed Changes

  1. Rewrite the PromiseCache: instead of holding finished values for an indefinite amount of time, we only supply them to threads that are waiting for them, and then discard them. The caller may or may not store the finished value in another cache. (See the sketch after this list.)

    • Rationale: The PromiseCache can easily be combined with a "long-term" caching solution such as an LruCache or another situationally appropriate cache.
    • Drawback: This separation goes directly against the goal of the issue addressed: unification of caches. It could be mitigated by providing a wrapper that automatically links a PromiseCache with some kind of "long-term" cache.
  2. Introduce the new PromiseCache in HotColdDB::get_hot_state.

  3. Introduce the new PromiseCache in HotColdDB::load_hdiff_buffer_for_slot to accelerate parallel cold state loads.

    • Rationale: Introducing the cache here instead of in e.g. load_cold_state_by_slot lets us benefit from it not only when the state of a specific slot is requested in parallel, but also when we request states that require the same diff that is currently being computed.
    • Drawback: Compared to introducing it in load_cold_state_by_slot, we perform more write accesses to the cache, which are superfluous if there is no parallel access.
  4. Remove the ShufflingCache.

    • Rationale: The new caches in the HotColdDB are sufficient, so we save memory by removing this cache.
    • Drawback: Especially for cold states, relying on the low-level caches is slower (on my machine: ~10ms for infrequent requests and ~20ms-50ms for rapid requests to the same state).
  5. Arc the HDiffBuffers in the diff_buffer_cache.

    • Rationale: I found that some of the performance regression mentioned in the drawbacks of the previous point was due to avoidable clones out of the diff_buffer_cache. This change avoids those copies.
    • Drawback: In some code paths, the Arc only adds unnecessary indirection and does not save a clone.
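For illustration, here is a minimal sketch of the reworked PromiseCache semantics from point 1 (this is not the code from this PR; the names and the Condvar-based implementation are simplified assumptions). An in-flight map de-duplicates concurrent computations, the finished value is handed only to threads that were already waiting, and nothing is retained afterwards, so the caller can layer a long-term cache such as an LruCache on top:

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::{Arc, Condvar, Mutex};

/// Simplified sketch: tracks only in-flight computations and hands the
/// finished value to threads that are already waiting, then forgets it.
pub struct PromiseCache<K: Eq + Hash + Clone, V: Clone> {
    inflight: Mutex<HashMap<K, Arc<Promise<V>>>>,
}

struct Promise<V> {
    value: Mutex<Option<V>>,
    cond: Condvar,
}

pub enum Attempt<V> {
    /// Nobody is computing this value: the caller must compute it and then
    /// call `resolve` (also on failure, or waiters would block forever).
    Compute,
    /// Another thread finished the computation while we were waiting.
    Ready(V),
}

impl<K: Eq + Hash + Clone, V: Clone> PromiseCache<K, V> {
    pub fn new() -> Self {
        Self { inflight: Mutex::new(HashMap::new()) }
    }

    pub fn get_or_wait(&self, key: &K) -> Attempt<V> {
        let promise = {
            let mut inflight = self.inflight.lock().unwrap();
            match inflight.get(key) {
                Some(promise) => promise.clone(),
                None => {
                    let promise = Arc::new(Promise {
                        value: Mutex::new(None),
                        cond: Condvar::new(),
                    });
                    inflight.insert(key.clone(), promise);
                    return Attempt::Compute;
                }
            }
        };
        // Block until the computing thread resolves the promise.
        let mut value = promise.value.lock().unwrap();
        while value.is_none() {
            value = promise.cond.wait(value).unwrap();
        }
        Attempt::Ready(value.clone().unwrap())
    }

    /// Wake all waiters with the finished value and discard the promise.
    /// The caller may additionally store `value` in a long-term cache
    /// (e.g. an LruCache) if desired.
    pub fn resolve(&self, key: &K, value: V) {
        if let Some(promise) = self.inflight.lock().unwrap().remove(key) {
            *promise.value.lock().unwrap() = Some(value);
            promise.cond.notify_all();
        }
    }
}
```

On a miss, the computing thread would typically also insert the result into the long-term cache around its call to resolve, which is essentially the wrapper mentioned in the drawback of point 1.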

Additional Info

  • I considered reintroducing the historic_state_cache, but decided against it, as I believe the performance of the new caching is still sufficient. Reintroducing it is, however, still a viable option to aid with rapid requests for the same slot.
  • Keep in mind that tasks waiting in a PromiseCache block their thread, i.e. the thread is not available to the tokio executor. Avoiding this is currently not possible, as the store is not async. So while waiting tasks save system resources by not running parallel computations, their threads still cannot be used by tokio to run other tasks. Future work might make the state retrieval interface async, allowing an async PromiseCache variant that frees these threads for the tokio executor (see the sketch below this list).
  • I tested the performance mostly on local testnets, but tried to keep in mind that real states tend to be larger...
  • As always, I am happy about any feedback :)
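As referenced above, a rough sketch of what an async PromiseCache variant might look like once state retrieval is async (purely hypothetical and not part of this PR; it assumes tokio's watch channel so that waiters yield the thread back to the executor instead of blocking it):

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::Mutex;
use tokio::sync::watch;

/// Hypothetical async variant: waiters await a watch channel instead of
/// blocking, so the tokio executor can reuse the thread for other tasks.
pub struct AsyncPromiseCache<K: Eq + Hash + Clone, V: Clone> {
    inflight: Mutex<HashMap<K, watch::Sender<Option<V>>>>,
}

pub enum Attempt<V> {
    /// The caller must compute the value and then call `resolve`.
    Compute,
    /// Another task finished the computation while we were waiting.
    Ready(V),
}

impl<K: Eq + Hash + Clone, V: Clone> AsyncPromiseCache<K, V> {
    pub fn new() -> Self {
        Self { inflight: Mutex::new(HashMap::new()) }
    }

    pub async fn get_or_wait(&self, key: &K) -> Attempt<V> {
        // Subscribe (or register a new in-flight computation) without
        // holding the lock across an await point.
        let mut rx = {
            let mut inflight = self.inflight.lock().unwrap();
            match inflight.get(key) {
                Some(tx) => tx.subscribe(),
                None => {
                    let (tx, _rx) = watch::channel(None);
                    inflight.insert(key.clone(), tx);
                    return Attempt::Compute;
                }
            }
        };
        loop {
            if let Some(value) = rx.borrow_and_update().clone() {
                return Attempt::Ready(value);
            }
            if rx.changed().await.is_err() {
                // The computing task dropped the sender without resolving.
                return Attempt::Compute;
            }
        }
    }

    /// Publish the finished value to all waiting tasks and forget the
    /// entry, as in the blocking version above.
    pub fn resolve(&self, key: &K, value: V) {
        if let Some(tx) = self.inflight.lock().unwrap().remove(key) {
            tx.send_replace(Some(value));
        }
    }
}
```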

@michaelsproul added the optimization, tree-states, and ready-for-review labels on Feb 27, 2024
@michaelsproul (Member) left a comment


Thanks for implementing this! It looks awesome on the whole!

I'll do some benchmarks for the committee cache removal on mainnet or Holesky. I think the hot committees should be fine as we will use the head or some state from the state_cache, but as you said via DM the cold committees are going to be slower due to the non-zero cost of going from HDiffBuffer -> BeaconState -> CommitteeCache.

I guess a hybrid approach could be to retain the shuffling_cache as a simpler LRU, which relies on the de-dupe at the store level for parallel requests?
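For reference, that simpler LRU could be as plain as the following sketch (it assumes the lru crate's NonZeroUsize-based API and leaves the key and value types generic rather than naming the actual shuffling types):

```rust
use lru::LruCache;
use std::hash::Hash;
use std::num::NonZeroUsize;
use std::sync::Arc;

/// Plain LRU wrapper with no promise machinery: parallel requests for the
/// same shuffling are expected to be de-duplicated at the store level, so a
/// miss here simply falls through to the (de-duplicated) state load.
pub struct SimpleShufflingCache<K: Hash + Eq, V> {
    cache: LruCache<K, Arc<V>>,
}

impl<K: Hash + Eq, V> SimpleShufflingCache<K, V> {
    pub fn new(capacity: NonZeroUsize) -> Self {
        Self { cache: LruCache::new(capacity) }
    }

    pub fn get(&mut self, key: &K) -> Option<Arc<V>> {
        self.cache.get(key).cloned()
    }

    pub fn insert(&mut self, key: K, value: Arc<V>) {
        self.cache.put(key, value);
    }
}
```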

beacon_node/store/src/hdiff.rs (outdated)
```diff
  // Load diff and apply it to buffer.
  let diff = self.load_hdiff_for_slot(slot)?;
- diff.apply(&mut buffer)?;
+ diff.apply(Arc::make_mut(&mut buffer))?;
```
Member


I like the addition of the Arcs here. In future we might be able to go further and just clone the buffer.balances, because that's the only one we actually mutate.

The main state diff only needs to be mutable because we re-assign it here:

```rust
pub fn apply_xdelta(&self, source: &[u8], target: &mut Vec<u8>) -> Result<(), Error> {
    *target = xdelta3::decode(&self.bytes, source).ok_or(Error::UnableToApplyDiff)?;
    Ok(())
}
```

i.e. it's a bit wasteful as-is, because we clone the buffer.state and then don't even use that memory
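One hypothetical way to avoid that wasted clone would be a variant that returns the decoded bytes instead of overwriting a &mut target (the name and signature below are illustrative, not something in this PR):

```rust
/// Returns the decoded state bytes instead of overwriting a caller-provided
/// buffer, so the caller no longer needs a mutable (i.e. cloned) copy of the
/// old state bytes only to discard them.
pub fn apply_xdelta_owned(&self, source: &[u8]) -> Result<Vec<u8>, Error> {
    xdelta3::decode(&self.bytes, source).ok_or(Error::UnableToApplyDiff)
}
```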

Member Author


Aaah, I missed that! I committed a quick suggestion, but am not sure if it's the best approach.

common/promise_cache/src/lib.rs (outdated)
common/promise_cache/src/lib.rs (outdated)
@michaelsproul added the waiting-on-author label and removed the ready-for-review label on Feb 27, 2024

dknopik commented Feb 29, 2024

> I guess a hybrid approach could be to retain the shuffling_cache as a simpler LRU, which relies on the de-dupe at the store level for parallel requests?

Hmm. Over that, I think I would prefer re-enabling the historic state cache with computed committee caches and a default size of 1, as the shuffling_cache would otherwise again cache the shufflings from hot states as well, which is kind of wasteful. That approach would also benefit users who are collecting all kinds of info for single cold states.

But it is kind of hard to judge that, as I don't know the API usage profile of the average user interested in cold states.


dknopik commented Mar 13, 2024

This is tagged as "waiting-on-author". How do we want to proceed with it?

@realbigsean added the ready-for-review label and removed the waiting-on-author label on Mar 13, 2024

realbigsean commented Mar 13, 2024

> This is tagged as "waiting-on-author". How do we want to proceed with it?

looks ready for a re-review, I think the tag was just outdated

@michaelsproul

Hey @dknopik, sorry for the slow review on this one.

I was testing out your changes while also trying to get tree-states to help rescue the Goerli network. I found we were still getting slogged by lots of cache misses and parallel state loads, so I tried re-working the cache to de-duplicate even more requests on this branch: https://github.com/michaelsproul/lighthouse/commits/tree-states-goerli-special/

Even with those changes, Goerli was still basically untenable. This has led me to reconsider whether we want to continue using state diffs in the hot database. The problem is that, during periods of long non-finality, when you get a cache miss you need to load potentially hundreds of diffs (one every few epochs) back to the finalized state. This takes a lot of time.

In tandem, we're also working on merging tree-states down to unstable gradually. The plan there is to split it into 3 parts:

  1. Single-pass epoch processing and optimised block processing #5279
  2. In-memory tree-states (I'm working on this now)
  3. Database changes (everything from tree-states, and the migration from Tree states database upgrade 🏗️ #5067)

For (2) we don't need all the complexity of the state diff handling in the cache, so I will try to incorporate some form of your changes. To keep the change gradual I think I will also keep the attester shuffling caches, until we can show they're not necessary.

For the disk-based changes, I think if we abandon the state diffing and store full CompactBeaconStates every N epochs, this will accomplish some space saving compared to unstable and be faster during non-finality, because we just need to load one state and replay up to N epochs' worth of blocks in case of a cache miss. I think it will also be good to drip the DB improvements in gradually after we have (2). The ones we're confident in, like fixing the pubkey cache (#3505), could go in first.
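To make the trade-off concrete, a cache miss under that scheme would look roughly like the following sketch (the trait and its method names are placeholders for illustration, not the HotColdDB API):

```rust
use std::ops::RangeInclusive;

/// Placeholder slot type standing in for Lighthouse's Slot.
type Slot = u64;

/// Hypothetical minimal view of the store, just enough to express the
/// "full state every N epochs + block replay" reconstruction described above.
trait ColdStore {
    type State;
    type Block;
    type Error;

    /// Load the most recent stored full state at or below `slot`.
    fn load_full_state_at_or_below(&self, slot: Slot) -> Result<(Slot, Self::State), Self::Error>;
    /// Load the blocks needed to advance through the given slot range.
    fn load_blocks(&self, range: RangeInclusive<Slot>) -> Result<Vec<Self::Block>, Self::Error>;
    /// Apply blocks (and empty slots) to bring `state` up to `to`.
    fn replay_blocks(&self, state: &mut Self::State, blocks: Vec<Self::Block>, to: Slot) -> Result<(), Self::Error>;
}

/// Reconstruct the state at `slot` from the nearest stored full state.
fn reconstruct_state<S: ColdStore>(store: &S, slot: Slot) -> Result<S::State, S::Error> {
    let (base_slot, mut state) = store.load_full_state_at_or_below(slot)?;
    let blocks = store.load_blocks(base_slot + 1..=slot)?;
    store.replay_blocks(&mut state, blocks, slot)?;
    Ok(state)
}
```

The miss cost is bounded by one full-state load plus at most N epochs of block replay, whereas a diff chain during long non-finality can require hundreds of diff applications.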

@michaelsproul added the work-in-progress and under-review labels and removed the ready-for-review label on Apr 8, 2024

dknopik commented Apr 8, 2024

Hi @michaelsproul, thanks for the detailed update!

Unfortunately I'll be pretty busy this week, but if I find some time I'll try to properly catch up. It probably makes no sense for me to get involved through coding right now, but if I have any ideas regarding caching (or the current state of affairs in general) I'll let you know.

@michaelsproul mentioned this pull request Apr 8, 2024
@michaelsproul

No worries @dknopik! I think I can handle merging stuff down in a satisfactory way.

The in-memory PR is here for reference:


dknopik commented Nov 27, 2024

Closing, obsolete

@dknopik closed this Nov 27, 2024
Labels: optimization, tree-states, under-review, work-in-progress