
[Merged by Bors] - Use the forwards iterator more often #2376

Closed
wants to merge 14 commits

Conversation

paulhauner
Member

@paulhauner paulhauner commented May 29, 2021

Issue Addressed

NA

Primary Change

When investigating memory usage, I noticed that retrieving a block from an early slot (e.g., slot 900) would cause a sharp increase in the memory footprint (from 400 MB to 800 MB+) that seemed to persist indefinitely.

After some investigation, I found that the reverse iteration from the head back to that slot was the likely culprit. To counter this, I've switched the BeaconChain::block_root_at_slot to use the forwards iterator, instead of the reverse one.

I also noticed that the networking stack is using BeaconChain::root_at_slot to check if a peer is relevant (check_peer_relevance). Perhaps the steep, seemingly-random-but-consistent increases in memory usage are caused by the use of this function.

Using the forwards iterator with the HTTP API alleviated the sharp increases in memory usage. It also made the response much faster (before it felt like it took 1-2s; now it feels instant).
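
To make the direction trade-off concrete, here is a minimal, self-contained sketch (illustrative names only, not the Lighthouse types or iterators): the reverse walk has to visit every (root, slot) pair between the head and the target, whereas a forwards iterator started at the target only needs to yield a single item.

```rust
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Slot(u64);

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Root(u64); // stand-in for a Hash256

/// Old approach: walk head -> genesis and stop at the target slot.
/// Everything between the head and the target gets visited (and loaded).
fn root_via_reverse_iter(
    mut head_to_genesis: impl Iterator<Item = (Root, Slot)>,
    target: Slot,
) -> Option<Root> {
    head_to_genesis
        .find(|(_, slot)| *slot == target)
        .map(|(root, _)| root)
}

/// New approach: start a forwards iterator *at* the target slot and take the
/// first item; nothing between the target and the head is touched.
fn root_via_forwards_iter(
    mut from_target: impl Iterator<Item = (Root, Slot)>,
    target: Slot,
) -> Option<Root> {
    from_target
        .next()
        .filter(|(_, slot)| *slot == target) // a skip slot yields no exact match
        .map(|(root, _)| root)
}

fn main() {
    // Toy chain: slots 0..=10 with root(i) = i, pretending slot 7 was skipped.
    let chain: Vec<(Root, Slot)> = (0..=10u64)
        .filter(|s| *s != 7)
        .map(|s| (Root(s), Slot(s)))
        .collect();

    let target = Slot(3);
    let rev = root_via_reverse_iter(chain.iter().rev().copied(), target);
    let fwd = root_via_forwards_iter(
        chain.iter().copied().skip_while(|&(_, s)| s < target),
        target,
    );
    assert_eq!(rev, fwd); // both: Some(Root(3))
}
```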

Additional Changes

In the process I also noticed that we have two functions for getting block roots:

  • BeaconChain::block_root_at_slot: returns None for a skip slot.
  • BeaconChain::root_at_slot: returns the previous root for a skip slot.

I unified these two functions into block_root_at_slot and added the WhenSlotSkipped enum. Now, the caller must be explicit about the skip-slot behaviour when requesting a root.
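
For reference, the shape of the unified API is roughly the following (a sketch based on this description; the exact signatures in the merged code may differ):

```rust
/// How `block_root_at_slot` should treat a slot that has no block in it.
pub enum WhenSlotSkipped {
    /// A skip slot yields `None` (the old `block_root_at_slot` behaviour).
    None,
    /// A skip slot yields the root of the most recent prior block
    /// (the old `root_at_slot` behaviour).
    Prev,
}

// Single entry point, with the skip-slot behaviour explicit at the call site:
//
// pub fn block_root_at_slot(
//     &self,
//     request_slot: Slot,
//     skips: WhenSlotSkipped,
// ) -> Result<Option<Hash256>, Error> { ... }
```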

Additionally, I replaced vec![] with Vec::with_capacity in store::chunked_vector::range_query. I stumbled across this whilst debugging and made this modification to see what effect it would have (not much). It seems like a decent change to keep around, but I'm not concerned either way.
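
The `range_query` tweak is the usual pre-sizing pattern; a minimal sketch of the idea (not the actual `chunked_vector` code):

```rust
/// Collect `load(i)` for each `i` in `start..end`, reserving the final length
/// up front instead of letting the Vec grow (and re-allocate) as items are
/// pushed, which is what `vec![]` would do.
fn collect_range<T, E>(
    start: usize,
    end: usize,
    mut load: impl FnMut(usize) -> Result<T, E>,
) -> Result<Vec<T>, E> {
    let mut values = Vec::with_capacity(end.saturating_sub(start));
    for i in start..end {
        values.push(load(i)?);
    }
    Ok(values)
}
```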

Also, BeaconChain::get_ancestor_block_root is unused, so I got rid of it 🗑️.

Additional Info

I also haven't done the same for state roots here. Whilst it's possible and a good idea, it's more work since the forwards iterators are presently block-roots-specific.

Whilst there are a few places a reverse iteration of state roots could be triggered (e.g., attestation production, HTTP API), they're nowhere near as common as the check_peer_relevance call. As such, I think we should get this PR merged first, then come back for the state root iterators. I made an issue for this: #2377.

@paulhauner paulhauner added the work-in-progress PR is a work-in-progress label May 29, 2021
@paulhauner
Member Author

paulhauner commented May 29, 2021

Results on mainnet are looking quite promising so far! The blue line gets this update shortly before 16:00.

[graph: fwd-iter]

@paulhauner
Member Author

paulhauner commented May 30, 2021

There's also a nice reflection in the number of bytes read from the database:

[graph: disk]

And also the count of read operations (ignore the kB scale, it's just a count of operations):

[graph: disk-count]

@paulhauner
Member Author

> Results on mainnet are looking quite promising so far! The blue line gets this update shortly before 16:00.

After almost 2 days this node is still showing the same memory usage!

@paulhauner paulhauner marked this pull request as ready for review May 31, 2021 00:27
@paulhauner paulhauner added ready-for-review The code is ready for review and removed work-in-progress PR is a work-in-progress labels May 31, 2021
@paulhauner
Member Author

In my last commit (6c73e22) I added a "fast lookup" path. It makes things a little more complicated, but it should save a bunch of overhead when doing queries near the head (e.g., reversing the roots for the forwards iterator).

I think it's worth it, but I'm open to suggestions :)
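
The gist of the fast path (an illustrative sketch only, not the code in 6c73e22): if the requested slot is recent enough that its root still lives in the head state's `block_roots` circular buffer, answer straight from the cached head state and skip building an iterator; otherwise fall back to the forwards iterator.

```rust
const SLOTS_PER_HISTORICAL_ROOT: u64 = 8_192;

/// Try to answer a block-root query directly from the cached head state.
/// Returns `None` when the slot is outside the buffer's window, in which
/// case the caller falls back to the forwards iterator.
fn fast_block_root_lookup(
    head_slot: u64,
    head_block_roots: &[[u8; 32]], // state.block_roots, a circular buffer
    request_slot: u64,
) -> Option<[u8; 32]> {
    let in_window = request_slot < head_slot
        && head_slot - request_slot <= SLOTS_PER_HISTORICAL_ROOT;
    if in_window {
        let index = (request_slot % SLOTS_PER_HISTORICAL_ROOT) as usize;
        head_block_roots.get(index).copied()
    } else {
        None // too old (or not behind the head): use the iterator instead
    }
}
```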

Member

@michaelsproul michaelsproul left a comment


Great find! It's so satisfying to see that memory usage graph drop, and to know the root cause at last!

I've carefully reviewed the fast path and I'm confident that it's correct. I have a few tips for improving readability further.


> It makes things a little more complicated, but it should save a bunch of overhead when doing queries near the head (e.g., reversing the roots for the forwards iterator).

The fast path won't help the forwards iterator, because the forwards iterator always clones the head state, and then uses a regular backwards iterator on it, here:

// Iterate backwards from the end state, stopping at the start slot.
let values = process_results(
    std::iter::once(Ok((end_block_root, end_state.slot)))
        .chain(BlockRootsIterator::owned(store, end_state)),
    |iter| {
        iter.take_while(|(_, slot)| *slot >= start_slot)
            .collect::<Vec<_>>()
    },
)?;
Ok(Self { values })

That state gets passed in when the "hybrid" forwards iterator gets constructed:

pub fn new(
    store: Arc<HotColdDB<E, Hot, Cold>>,
    start_slot: Slot,
    end_state: BeaconState<E>,
    end_block_root: Hash256,
    spec: &ChainSpec,
) -> Result<Self> {

Even so, I think the fast path optimisation is worth keeping for other cases where we query close to the head state.

EDIT: realised I misinterpreted what you were saying

beacon_node/store/src/chunked_vector.rs (review comment resolved)
beacon_node/beacon_chain/src/beacon_chain.rs (outdated; resolved)
beacon_node/beacon_chain/src/beacon_chain.rs (outdated; resolved)
@paulhauner
Member Author

Thanks for the readability tips and the extra sanity checks for Slot. I've addressed your comments but haven't done anything for #2376 (comment) yet.

It's good to go assuming you don't have more changes :)

Member

@michaelsproul michaelsproul left a comment


LGTM! 🚀

@michaelsproul michaelsproul added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels May 31, 2021
@paulhauner
Member Author

bors r+

bors bot pushed a commit that referenced this pull request May 31, 2021
@paulhauner
Member Author

bors r-

(Batching)

@bors

bors bot commented May 31, 2021

Canceled.

@paulhauner
Member Author

bors r+

bors bot pushed a commit that referenced this pull request May 31, 2021
@bors

bors bot commented May 31, 2021

Build failed (retrying...):

bors bot pushed a commit that referenced this pull request May 31, 2021
@paulhauner
Member Author

bors r-

@bors

bors bot commented May 31, 2021

Canceled.

@paulhauner
Member Author

bors r+

@bors

bors bot commented May 31, 2021

🕐 Waiting for PR status (Github check) to be set, probably by CI. Bors will automatically try to run when all required PR statuses are set.

bors bot pushed a commit that referenced this pull request May 31, 2021
@paulhauner
Member Author

bors r+

@bors

bors bot commented May 31, 2021

Already running a review

@paulhauner
Member Author

bors -

@bors

bors bot commented May 31, 2021

Did you mean "r-"?

@paulhauner
Member Author

bors r-

@bors

bors bot commented May 31, 2021

Canceled.

@paulhauner
Member Author

bors r+

🤞

bors bot pushed a commit that referenced this pull request May 31, 2021
@paulhauner paulhauner mentioned this pull request May 31, 2021
@bors bors bot changed the title Use the forwards iterator more often [Merged by Bors] - Use the forwards iterator more often May 31, 2021
@bors bors bot closed this May 31, 2021
@paulhauner paulhauner mentioned this pull request May 31, 2021
bors bot pushed a commit that referenced this pull request Jun 3, 2021
## Issue Addressed

NA

## Proposed Changes

Bump versions.

## Additional Info

This is not exactly the v1.4.0 release described in [Lighthouse Update #36](https://lighthouse.sigmaprime.io/update-36.html).

Whilst it contains:

- Beta Windows support
- A reduction in Eth1 queries
- A reduction in memory footprint

It does not contain:

- Altair
- Doppelganger Protection
- The remote signer

We have decided to release some features early. This is primarily due to the desire to allow users to benefit from the memory saving improvements as soon as possible.

## TODO

- [x] Wait for #2340, #2356 and #2376 to merge and then rebase on `unstable`. 
- [x] Ensure discovery issues are fixed (see #2388)
- [x] Ensure #2382 is merged/removed.
- [x] Ensure #2383 is merged/removed.
- [x] Ensure #2384 is merged/removed.
- [ ] Double-check eth1 cache is carried between boots
paulhauner added a commit to paulhauner/lighthouse that referenced this pull request Jun 21, 2021
commit ea6838b
Author: Paul Hauner <paul@paulhauner.com>
Date:   Mon Jun 21 17:45:55 2021 +1000

    Flip bool

commit 89afc26
Author: Paul Hauner <paul@paulhauner.com>
Date:   Mon Jun 21 15:22:48 2021 +1000

    Avoid taking the write-lock on val monitor

commit b84ff9f
Author: realbigsean <seananderson33@GMAIL.com>
Date:   Fri Jun 18 05:58:01 2021 +0000

    rust 1.53.0 updates (sigp#2411)

    ## Issue Addressed

    `make lint` failing on rust 1.53.0.

    ## Proposed Changes

    1.53.0 updates

    ## Additional Info

    I haven't figured out why yet, but we are now hitting the recursion limit in a few crates, so I had to add `#![recursion_limit = "256"]` in a few places.

    Co-authored-by: realbigsean <seananderson33@gmail.com>
    Co-authored-by: Michael Sproul <michael@sigmaprime.io>

commit 3dc1eb5
Author: Michael Sproul <michael@sigmaprime.io>
Date:   Thu Jun 17 02:10:48 2021 +0000

    Ignore inactive validators in validator monitor (sigp#2396)

    ## Proposed Changes

    A user on Discord (`@ChewsMacRibs`) reported that the validator monitor was logging `WARN Attested to an incorrect head` for their validator while it was awaiting activation.

    This PR modifies the monitor so that it ignores inactive validators, by the logic that they are either awaiting activation, or have already exited. Either way, there's no way for an inactive validator to have their attestations included on chain, so no need for the monitor to report on them.

    ## Additional Info

    To reproduce the bug requires registering validator keys manually with `--validator-monitor-pubkeys`. I don't think the bug will present itself with `--validator-monitor-auto`.

commit 98ab00c
Author: Jack <jmcph4.github@gmail.com>
Date:   Thu Jun 17 02:10:47 2021 +0000

    Handle Geth pre-EIP-155 block sync error condition (sigp#2304)

    ## Issue Addressed

    sigp#2293

    ## Proposed Changes

     - Modify the handler for the `eth_chainId` RPC (i.e., `get_chain_id`) to explicitly match against the Geth error string returned for pre-EIP-155 synced Geth nodes
     - ~~Add a new helper function, `rpc_error_msg`, to aid in the above point~~
     - Refactor `response_result` into `response_result_or_error` and patch reliant RPC handlers accordingly (thanks to @pawanjay176)

    ## Additional Info

    Geth, as of Pangaea Expanse (v1.10.0), returns an explicit error when it is not synced past the EIP-155 block (2675000). Previously, Geth simply returned a chain ID of 0 (which was obviously much easier to handle on Lighthouse's part).

    Co-authored-by: Paul Hauner <paul@paulhauner.com>

commit b1657a6
Author: realbigsean <seananderson33@GMAIL.com>
Date:   Thu Jun 17 02:10:46 2021 +0000

    Reorg events (sigp#2090)

    ## Issue Addressed

    Resolves sigp#2088

    ## Proposed Changes

    Add the `chain_reorg` SSE event topic

    ## Additional Info

    Co-authored-by: realbigsean <seananderson33@gmail.com>
    Co-authored-by: Paul Hauner <paul@paulhauner.com>

commit 3261eff
Author: divma <divma@protonmail.com>
Date:   Thu Jun 17 00:40:16 2021 +0000

    split outbound and inbound codecs encoded types (sigp#2410)

    Splits the inbound and outbound requests, for maintainability.

commit a526145
Author: Clifton King <cliftonk@gmail.com>
Date:   Wed Jun 16 10:42:55 2021 +0000

    Fix remote signer test (sigp#2400)

    ## Proposed Changes

    Unescape text for json comparison in:

    https://github.com/sigp/lighthouse/blob/3a24ca5f14c6e6d6612fd43eca82aa0c1e6aba16/remote_signer/tests/sign.rs#L282-L285

    Which causes this error:

    ```
    ---- sign::invalid_field_fork stdout ----
    thread 'sign::invalid_field_fork' panicked at 'assertion failed: `(left == right)`
      left: `"Unable to parse body message from JSON: Error(\"invalid hex (InvalidHexCharacter { c: 'I', index: 0 })\", line: 1, column: 237097)"`,
     right: `"Unable to parse body message from JSON: Error(\"invalid hex (InvalidHexCharacter { c: \\'I\\', index: 0 })\", line: 1, column: 237097)"`', testing/remote_signer_test/src/consumer.rs:144:5
    ```

    This is my first contribution and happy to receive feedback if you have any. Thanks

commit dffe31c
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Wed Jun 16 09:16:51 2021 +0000

    Add an account command to enable/disable validators (sigp#2386)

    ## Issue Addressed

    Resolves sigp#2322

    ## Proposed Changes

    Adds a `modify` command to `lighthouse account validator` with subcommands to enable and disable specific or all pubkeys.

commit 3b600ac
Author: Paul Hauner <paul@paulhauner.com>
Date:   Thu Jun 10 01:44:49 2021 +0000

    v1.4.0 (sigp#2402)

    ## Issue Addressed

    NA

    ## Proposed Changes

    - Bump versions and update `Cargo.lock`

    ## Additional Info

    NA

    ## TODO

    - [x] Ensure sigp#2398 gets merged successfully

commit b383836
Author: Paul Hauner <paul@paulhauner.com>
Date:   Wed Jun 9 02:30:06 2021 +0000

    Modify Malloc Tuning (sigp#2398)

    ## Issue Addressed

    NA

    ## Proposed Changes

    I've noticed some of the SigP Prater nodes struggling on v1.4.0-rc.0. I suspect this is due to the changes in sigp#2296. Specifically, the trade-off which lowered the memory footprint whilst increasing runtime on some functions.

    Presently, this PR is documenting my testing on Prater.

    ## Additional Info

    NA

commit 4a6f2fa
Author: Paul Hauner <paul@paulhauner.com>
Date:   Mon Jun 7 02:34:10 2021 +0000

    Only perform malloc tuning for beacon node (sigp#2397)

    ## Issue Addressed

    NA

    ## Proposed Changes

    Only run `configure_memory_alllocator` for the BN process.

    I noticed that VC memory usage increases significantly with the new malloc tuning parameters. This was also raised by a user on [r/ethstaker](https://www.reddit.com/r/ethstaker/comments/nr8998/lighthouse_prerelease_v140rc0/h0fnt9l?utm_source=share&utm_medium=web2x&context=3).

    There wasn't any issue with memory usage by the VC before we implemented sigp#2296, so I think we were a bit overzealous when we allowed these changes to affect it. This PR allows things that weren't broken to remain unfixed.

    ## Additional Info

    NA

commit 93100f2
Author: Paul Hauner <paul@paulhauner.com>
Date:   Mon Jun 7 02:34:09 2021 +0000

    Make less logs for attn with unknown head (sigp#2395)

    ## Issue Addressed

    NA

    ## Proposed Changes

    I am starting to see a lot of slog-async overflows (i.e., too many logs) on Prater whenever we see attestations for an unknown block. Since these logs are identical (except for peer id) and we expose volume/count of these errors via `metrics::GOSSIP_ATTESTATION_ERRORS_PER_TYPE`, I took the following actions to remove them from `DEBUG` logs:

    - Push the "Attestation for unknown block" log to trace.
    - Add a debug log in `search_for_block`. In effect, this should serve as a de-duped version of the previous, downgraded log.

    ## Additional Info

    TBC

commit 502402c
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Fri Jun 4 00:10:59 2021 +0000

    Fix options for `--eth1-endpoints` flag (sigp#2392)

    ## Issue Addressed

    N/A

    ## Proposed Changes

    Set `config.sync_eth1_chain` to true when using just the  `--eth1-endpoints` flag (without `--eth1`).

commit f6280aa
Author: Paul Hauner <paul@paulhauner.com>
Date:   Thu Jun 3 00:13:02 2021 +0000

    v1.4.0-rc.0 (sigp#2379)

    ## Issue Addressed

    NA

    ## Proposed Changes

    Bump versions.

    ## Additional Info

    This is not exactly the v1.4.0 release described in [Lighthouse Update sigp#36](https://lighthouse.sigmaprime.io/update-36.html).

    Whilst it contains:

    - Beta Windows support
    - A reduction in Eth1 queries
    - A reduction in memory footprint

    It does not contain:

    - Altair
    - Doppelganger Protection
    - The remote signer

    We have decided to release some features early. This is primarily due to the desire to allow users to benefit from the memory saving improvements as soon as possible.

    ## TODO

    - [x] Wait for sigp#2340, sigp#2356 and sigp#2376 to merge and then rebase on `unstable`.
    - [x] Ensure discovery issues are fixed (see sigp#2388)
    - [x] Ensure sigp#2382 is merged/removed.
    - [x] Ensure sigp#2383 is merged/removed.
    - [x] Ensure sigp#2384 is merged/removed.
    - [ ] Double-check eth1 cache is carried between boots

commit 90ea075
Author: Paul Hauner <paul@paulhauner.com>
Date:   Wed Jun 2 01:07:28 2021 +0000

    Revert "Network protocol upgrades (sigp#2345)" (sigp#2388)

    ## Issue Addressed

    NA

    ## Proposed Changes

    Reverts sigp#2345 in the interests of getting v1.4.0 out this week. Once we have released that, we can go back to testing this again.

    ## Additional Info

    NA

commit d34f922
Author: Paul Hauner <paul@paulhauner.com>
Date:   Wed Jun 2 01:07:27 2021 +0000

    Add early check for RPC block relevancy (sigp#2289)

    ## Issue Addressed

    NA

    ## Proposed Changes

    When observing `jemallocator` heap profiles and Grafana, it became clear that Lighthouse is spending significant RAM/CPU on processing blocks from the RPC. On investigation, it seems that we are loading the parent of the block *before* we check to see if the block is already known. This is a big waste of resources.

    This PR adds an additional `check_block_relevancy` call as the first thing we do when we try to process a `SignedBeaconBlock` via the RPC (or other similar methods). Ultimately, `check_block_relevancy` will be called again later in the block processing flow. It's a very light function and I don't think trying to optimize it out is worth the risk of a bad block slipping through.

    Also adds a `New RPC block received` info log when we process a new RPC block. This seems like interesting and infrequent info.

    ## Additional Info

    NA

commit bf4e02e
Author: Paul Hauner <paul@paulhauner.com>
Date:   Tue Jun 1 06:59:43 2021 +0000

    Return a specific error for frozen attn states (sigp#2384)

    ## Issue Addressed

    NA

    ## Proposed Changes

    Return a very specific error when an attestation reads shuffling from a frozen `BeaconState`. Previously, this was returning `MissingBeaconState`, which indicates a much more serious issue.

    ## Additional Info

    Since `get_inconsistent_state_for_attestation_verification_only` is only called once in `BeaconChain::with_committee_cache`, it is quite easy to reason about the impact of this change.

commit ba9c4c5
Author: Paul Hauner <paul@paulhauner.com>
Date:   Tue Jun 1 06:59:41 2021 +0000

    Return more detail in Eth1 HTTP errors (sigp#2383)

    ## Issue Addressed

    NA

    ## Proposed Changes

    Whilst investigating sigp#2372, I [learned](sigp#2372 (comment)) that the error messages returned from some failed Eth1 requests are always `NotReachable`. This makes debugging quite painful.

    This PR adds more detail to these errors. For example:

    - Bad infura key: `ERRO Failed to update eth1 cache             error: Failed to update Eth1 service: "All fallback errored: https://mainnet.infura.io/ => EndpointError(RequestFailed(\"Response HTTP status was not 200 OK:  401 Unauthorized.\"))", retry_millis: 60000, service: eth1_rpc`
    - Unreachable server: `ERRO Failed to update eth1 cache             error: Failed to update Eth1 service: "All fallback errored: http://127.0.0.1:8545/ => EndpointError(RequestFailed(\"Request failed: reqwest::Error { kind: Request, url: Url { scheme: \\\"http\\\", cannot_be_a_base: false, username: \\\"\\\", password: None, host: Some(Ipv4(127.0.0.1)), port: Some(8545), path: \\\"/\\\", query: None, fragment: None }, source: hyper::Error(Connect, ConnectError(\\\"tcp connect error\\\", Os { code: 111, kind: ConnectionRefused, message: \\\"Connection refused\\\" })) }\"))", retry_millis: 60000, service: eth1_rpc`
    - Bad server: `ERRO Failed to update eth1 cache             error: Failed to update Eth1 service: "All fallback errored: http://127.0.0.1:8545/ => EndpointError(RequestFailed(\"Response HTTP status was not 200 OK:  501 Not Implemented.\"))", retry_millis: 60000, service: eth1_rpc`

    ## Additional Info

    NA

commit 4c7bb49
Author: Paul Hauner <paul@paulhauner.com>
Date:   Mon May 31 04:18:20 2021 +0000

    Use the forwards iterator more often (sigp#2376)

    ## Issue Addressed

    NA

    ## Primary Change

    When investigating memory usage, I noticed that retrieving a block from an early slot (e.g., slot 900) would cause a sharp increase in the memory footprint (from 400mb to 800mb+) which seemed to be ever-lasting.

    After some investigation, I found that the reverse iteration from the head back to that slot was the likely culprit. To counter this, I've switched the `BeaconChain::block_root_at_slot` to use the forwards iterator, instead of the reverse one.

    I also noticed that the networking stack is using `BeaconChain::root_at_slot` to check if a peer is relevant (`check_peer_relevance`). Perhaps the steep, seemingly-random-but-consistent increases in memory usage are caused by the use of this function.

    Using the forwards iterator with the HTTP API alleviated the sharp increases in memory usage. It also made the response much faster (before it felt like it took 1-2s; now it feels instant).

    ## Additional Changes

    In the process I also noticed that we have two functions for getting block roots:

    - `BeaconChain::block_root_at_slot`: returns `None` for a skip slot.
    - `BeaconChain::root_at_slot`: returns the previous root for a skip slot.

    I unified these two functions into `block_root_at_slot` and added the `WhenSlotSkipped` enum. Now, the caller must be explicit about the skip-slot behaviour when requesting a root.

    Additionally, I replaced `vec![]` with `Vec::with_capacity` in `store::chunked_vector::range_query`. I stumbled across this whilst debugging and made this modification to see what effect it would have (not much). It seems like a decent change to keep around, but I'm not concerned either way.

    Also, `BeaconChain::get_ancestor_block_root` is unused, so I got rid of it :wastebasket:.

    ## Additional Info

    I haven't also done the same for state roots here. Whilst it's possible and a good idea, it's more work since the fwds iterators are presently block-roots-specific.

    Whilst there are a few places a reverse iteration of state roots could be triggered (e.g., attestation production, HTTP API), they're nowhere near as common as the `check_peer_relevance` call. As such, I think we should get this PR merged first, then come back for the state root iters. I made an issue here sigp#2377.

commit 320a683
Author: Kevin Lu <klu93@seas.upenn.edu>
Date:   Mon May 31 04:18:19 2021 +0000

    Minimum Outbound-Only Peers Requirement (sigp#2356)

    ## Issue Addressed

    sigp#2325

    ## Proposed Changes

    This pull request changes the behavior of the Peer Manager by including a minimum outbound-only peers requirement. The peer manager will continue querying for peers if this outbound-only target number hasn't been met. Additionally, when peers are being removed, an outbound-only peer will not be disconnected if doing so brings us below the minimum.

    ## Additional Info

    Unit test for heartbeat function tests that disconnection behavior is correct. Continual querying for peers if outbound-only hasn't been met is not directly tested, but indirectly through unit testing of the helper function that counts the number of outbound-only peers.

    EDIT: Am concerned about the behavior of ```update_peer_scores```. If we have connected to a peer with a score below the disconnection threshold (-20), then its connection status will remain connected, while its score state will change to disconnected.

    ```rust
    let previous_state = info.score_state();
    // Update scores
    info.score_update();
    Self::handle_score_transitions(
        previous_state,
        peer_id,
        info,
        &mut to_ban_peers,
        &mut to_unban_peers,
        &mut self.events,
        &self.log,
    );
    ```

    ```previous_state``` will be set to Disconnected, and then because ```handle_score_transitions``` only changes connection status for a peer if the state changed, the peer remains connected. Then in the heartbeat code, because we only disconnect healthy peers if we have too many peers, these peers don't get disconnected. I'm not sure realistically how often this scenario would occur, but it might be better to adjust the logic to account for scenarios where the score state implies a connection status different from the current connection status.

    Co-authored-by: Kevin Lu <kevlu93@gmail.com>

commit 0847986
Author: Mac L <mjladson@pm.me>
Date:   Mon May 31 04:18:18 2021 +0000

    Reduce outbound requests to eth1 endpoints (sigp#2340)

    ## Issue Addressed

    sigp#2282

    ## Proposed Changes

    Reduce the outbound requests made to eth1 endpoints by caching the results from `eth_chainId` and `net_version`.
    Further reduce the overall request count by increasing `auto_update_interval_millis` from `7_000` (7 seconds) to `60_000` (1 minute).
    This will result in a reduction from ~2000 requests per hour to 360 requests per hour (during normal operation). A reduction of 82%.

    ## Additional Info

    If an endpoint fails, its state is dropped from the cache and the `eth_chainId` and `net_version` calls will be made for that endpoint again during the regular update cycle (once per minute) until it is back online.

    Co-authored-by: Paul Hauner <paul@paulhauner.com>

commit ec5cceb
Author: Age Manning <Age@AgeManning.com>
Date:   Sat May 29 07:25:06 2021 +0000

    Correct issue with dialing peers (sigp#2375)

    The ordering of adding new peers to the peerdb and deciding when to dial them was not considered in a previous update.

    This adds the condition that if a peer is not in the peer-db then it is an acceptable peer to dial.

    This makes sigp#2374 obsolete.

commit d12e746
Author: Age Manning <Age@AgeManning.com>
Date:   Fri May 28 22:02:10 2021 +0000

    Network protocol upgrades (sigp#2345)

    This provides a number of upgrades to gossipsub and discovery.

    The updates are extensive and this needs thorough testing.

commit 456b313
Author: Paul Hauner <paul@paulhauner.com>
Date:   Fri May 28 05:59:45 2021 +0000

    Tune GNU malloc (sigp#2299)

    ## Issue Addressed

    NA

    ## Proposed Changes

    Modify the configuration of [GNU malloc](https://www.gnu.org/software/libc/manual/html_node/The-GNU-Allocator.html) to reduce memory footprint.

    - Set `M_ARENA_MAX` to 4.
        - This reduces memory fragmentation at the cost of contention between threads.
    - Set `M_MMAP_THRESHOLD` to 2mb
        - This means that any allocation >= 2mb is allocated via an anonymous mmap, instead of on the heap/arena. This reduces memory fragmentation since we don't need to keep growing the heap to find big contiguous slabs of free memory.
    - ~~Run `malloc_trim` every 60 seconds.~~
        - ~~This shaves unused memory from the top of the heap, preventing the heap from constantly growing.~~
        - Removed, see: sigp#2299 (comment)

    *Note: this only provides memory savings on the Linux (glibc) platform.*

    ## Additional Info

    I'm going to close sigp#2288 in favor of this for the following reasons:

    - I've managed to get the memory footprint *smaller* here than with jemalloc.
    - This PR seems to be less of a dramatic change than bringing in the jemalloc dep.
    - The changes in this PR are strictly runtime changes, so we can create CLI flags which disable them completely. Since this change is wide-reaching and complex, it's nice to have an easy "escape hatch" if there are undesired consequences.

    ## TODO

    - [x] Allow configuration via CLI flags
    - [x] Test on Mac
    - [x] Test on RasPi.
    - [x] Determine if GNU malloc is present?
        - I'm not quite sure how to detect for glibc.. This issue suggests we can't really: rust-lang/rust#33244
    - [x] Make a clear argument regarding the effect of this on CPU utilization.
    - [x] Test with higher `M_ARENA_MAX` values.
    - [x] Test with longer trim intervals
    - [x] Add some stats about memory savings
    - [x] Remove `malloc_trim` calls & code

commit fdaeec6
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Wed May 26 05:58:41 2021 +0000

    Monitoring service api (sigp#2251)

    ## Issue Addressed

    N/A

    ## Proposed Changes

    Adds a client side api for collecting system and process metrics and pushing it to a monitoring service.

commit 55aada0
Author: Age Manning <Age@AgeManning.com>
Date:   Wed May 26 14:21:44 2021 +1000

    More stringent dialing (sigp#2363)

    * More stringent dialing

    * Cover cached enr dialing

commit 5d9a1bc
Author: Michael Sproul <michael@sigmaprime.io>
Date:   Thu May 20 00:23:08 2021 +0000

    Add Windows to Bors config (sigp#2358)

    We accidentally omitted the new Windows tests (sigp#2333) from the Bors config, meaning that PRs will merge before the tests pass. This PR corrects that.

commit ba55e14
Author: ethDreamer <ethereumdreamer@gmail.com>
Date:   Wed May 19 23:05:16 2021 +0000

    Enable Compatibility with Windows (sigp#2333)

    ## Issue Addressed

    Windows incompatibility.

    ## Proposed Changes

    On windows, lighthouse needs to default to STDIN as tty doesn't exist. Also Windows uses ACLs for file permissions. So to mirror chmod 600, we will remove every entry in a file's ACL and add only a single SID that is an alias for the file owner.

    Beyond that, there were several changes made to different unit tests because windows has slightly different error messages as well as frustrating nuances around killing a process :/

    ## Additional Info

    Tested on my Windows VM and it appears to work, also compiled & tested on Linux with these changes. Permissions look correct on both platforms now. Just waiting for my validator to activate on Prater so I can test running full validator client on windows.

    Co-authored-by: ethDreamer <37123614+ethDreamer@users.noreply.github.com>
    Co-authored-by: Michael Sproul <micsproul@gmail.com>
bors bot pushed a commit that referenced this pull request Jul 6, 2021
## Issue Addressed

#2377 

## Proposed Changes

Implement the same code used for block root lookups (from #2376) to state root lookups in order to improve performance and reduce associated memory spikes (e.g. from certain HTTP API requests).

## Additional Changes

- Tests using `rev_iter_state_roots` and `rev_iter_block_roots` have been refactored to use their `forwards` versions instead.
- The `rev_iter_state_roots` and `rev_iter_block_roots` functions are now unused and have been removed.
- The `state_at_slot` function has been changed to use the `forwards` iterator.

## Additional Info

- Some tests still need to be refactored to use their `forwards_iter` versions. These tests start their iteration from a specific beacon state and thus use the `rev_iter_state_roots_from` and `rev_iter_block_roots_from` functions. If they can be refactored, those functions can also be removed.
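
The state-root change mirrors the block-root one: a lookup that previously reverse-iterated from the head now starts a forwards iterator at the requested slot, e.g. (illustrative sketch only, not the merged code):

```rust
/// Resolve the state root at `request_slot` from a forwards iterator that
/// starts at that slot, instead of walking backwards from the head.
fn state_root_at_slot(
    mut forwards_from_request_slot: impl Iterator<Item = ([u8; 32], u64)>, // (root, slot)
    request_slot: u64,
) -> Option<[u8; 32]> {
    forwards_from_request_slot
        .next()
        .filter(|&(_, slot)| slot == request_slot)
        .map(|(root, _)| root)
}
```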
Labels: ready-for-merge (This PR is ready to merge.)
Projects: None yet
2 participants