[Merged by Bors] - Disallow attestation production earlier than head #2130

paulhauner · 2020-12-29T07:04:41Z

Issue Addressed

The non-finality period on Pyrmont between epochs 9114 and 9182 was contributed to by all the lighthouse_team validators going down. The nodes saw excessive CPU and RAM usage, resulting in the system to kill the lighthouse bn process. The Restart=on-failure directive for systemd caused the process to bounce in ~10-30m intervals.

Diagnosis with heaptrack showed that the BeaconChain::produce_unaggregated_attestation function was calling store::beacon_state::get_full_state and sometimes resulting in a tree hash cache allocation. These allocations were approximately the size of the hosts physical memory and still allocated when lighthouse bn was killed by the OS.

There was no CPU analysis (e.g., perf), but the BeaconChain::produce_unaggregated_attestation is very CPU-heavy so it is reasonable to assume it is the cause of the excessive CPU usage, too.

Proposed Changes

BeaconChain::produce_unaggregated_attestation has two paths:

Fast path: attesting to the head slot or later.
Slow path: attesting to a slot earlier than the head block.

Path (2) is the only path that calls store::beacon_state::get_full_state, therefore it is the path causing this excessive CPU/RAM usage.

This PR removes the current functionality of path (2) and replaces it with a static error (BeaconChainError::AttestingPriorToHead).

This change reduces the generality of BeaconChain::produce_unaggregated_attestation (and therefore /eth/v1/validator/attestation_data), but I argue that this functionality is an edge-case and arguably a violation of the Honest Validator spec.

It's possible that a validator goes back to a prior slot to "catch up" and submit some missed attestations. This change would prevent such behaviour, returning an error. My concerns with this catch-up behaviour is that it is:

Not specified as "honest validator" attesting behaviour.
Is behaviour that is risky for slashing (although, all validator clients should have slashing protection and will eventually fail if they do not).
It disguises clock-sync issues between a BN and VC.

Additional Info

It's likely feasible to implement path (2) if we implement some sort of caching mechanism. This would be a multi-week task and this PR gets the issue patched in the short term. I haven't created an issue to add path (2), instead I think we should implement it if we get user-demand.

michaelsproul

100% agree with your assessment of the situation. This is a good simplification. That codepath was unlikely to be useful, and getting rid of it removes DoS risk.

paulhauner · 2021-01-20T06:52:19Z

bors r+

## Issue Addressed The non-finality period on Pyrmont between epochs [`9114`](https://pyrmont.beaconcha.in/epoch/9114) and [`9182`](https://pyrmont.beaconcha.in/epoch/9182) was contributed to by all the `lighthouse_team` validators going down. The nodes saw excessive CPU and RAM usage, resulting in the system to kill the `lighthouse bn` process. The `Restart=on-failure` directive for `systemd` caused the process to bounce in ~10-30m intervals. Diagnosis with `heaptrack` showed that the `BeaconChain::produce_unaggregated_attestation` function was calling `store::beacon_state::get_full_state` and sometimes resulting in a tree hash cache allocation. These allocations were approximately the size of the hosts physical memory and still allocated when `lighthouse bn` was killed by the OS. There was no CPU analysis (e.g., `perf`), but the `BeaconChain::produce_unaggregated_attestation` is very CPU-heavy so it is reasonable to assume it is the cause of the excessive CPU usage, too. ## Proposed Changes `BeaconChain::produce_unaggregated_attestation` has two paths: 1. Fast path: attesting to the head slot or later. 2. Slow path: attesting to a slot earlier than the head block. Path (2) is the only path that calls `store::beacon_state::get_full_state`, therefore it is the path causing this excessive CPU/RAM usage. This PR removes the current functionality of path (2) and replaces it with a static error (`BeaconChainError::AttestingPriorToHead`). This change reduces the generality of `BeaconChain::produce_unaggregated_attestation` (and therefore [`/eth/v1/validator/attestation_data`](https://ethereum.github.io/eth2.0-APIs/#/Validator/produceAttestationData)), but I argue that this functionality is an edge-case and arguably a violation of the [Honest Validator spec](https://github.com/ethereum/eth2.0-specs/blob/dev/specs/phase0/validator.md). It's possible that a validator goes back to a prior slot to "catch up" and submit some missed attestations. This change would prevent such behaviour, returning an error. My concerns with this catch-up behaviour is that it is: - Not specified as "honest validator" attesting behaviour. - Is behaviour that is risky for slashing (although, all validator clients *should* have slashing protection and will eventually fail if they do not). - It disguises clock-sync issues between a BN and VC. ## Additional Info It's likely feasible to implement path (2) if we implement some sort of caching mechanism. This would be a multi-week task and this PR gets the issue patched in the short term. I haven't created an issue to add path (2), instead I think we should implement it if we get user-demand.

bors · 2021-01-20T07:58:34Z

Pull request successfully merged into unstable.

Build succeeded:

## Issue Addressed - Resolves #2064 ## Proposed Changes Adds a `ValidatorMonitor` struct which provides additional logging and Grafana metrics for specific validators. Use `lighthouse bn --validator-monitor` to automatically enable monitoring for any validator that hits the [subnet subscription](https://ethereum.github.io/eth2.0-APIs/#/Validator/prepareBeaconCommitteeSubnet) HTTP API endpoint. Also, use `lighthouse bn --validator-monitor-pubkeys` to supply a list of validators which will always be monitored. See the new docs included in this PR for more info. ## TODO - [x] Track validator balance, `slashed` status, etc. - [x] ~~Register slashings in current epoch, not offense epoch~~ - [ ] Publish Grafana dashboard, update TODO link in docs - [x] ~~#2130 is merged into this branch, resolve that~~

Disallow attestation prior to head

48d8509

paulhauner added work-in-progress PR is a work-in-progress blocked labels Dec 29, 2020

paulhauner changed the base branch from stable to unstable December 29, 2020 07:21

paulhauner removed the blocked label Dec 29, 2020

michaelsproul approved these changes Jan 7, 2021

View reviewed changes

paulhauner mentioned this pull request Jan 8, 2021

[Merged by Bors] - Detailed validator monitoring #2151

Closed

4 tasks

Fix attestation_production tests

b961d11

paulhauner added ready-for-review The code is ready for review and removed work-in-progress PR is a work-in-progress labels Jan 19, 2021

paulhauner marked this pull request as ready for review January 19, 2021 03:31

paulhauner self-assigned this Jan 19, 2021

paulhauner added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels Jan 20, 2021

bors bot changed the title ~~Disallow attestation production earlier than head~~ [Merged by Bors] - Disallow attestation production earlier than head Jan 20, 2021

bors bot closed this Jan 20, 2021

paulhauner mentioned this pull request Jan 21, 2021

Allow attesting to older slots #2169

Closed

paulhauner deleted the att-prod-mem branch February 15, 2021 07:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Merged by Bors] - Disallow attestation production earlier than head #2130

[Merged by Bors] - Disallow attestation production earlier than head #2130

paulhauner commented Dec 29, 2020 •

edited

Loading

michaelsproul left a comment

paulhauner commented Jan 20, 2021

bors bot commented Jan 20, 2021

[Merged by Bors] - Disallow attestation production earlier than head #2130

[Merged by Bors] - Disallow attestation production earlier than head #2130

Conversation

paulhauner commented Dec 29, 2020 • edited Loading

Issue Addressed

Proposed Changes

Additional Info

michaelsproul left a comment

Choose a reason for hiding this comment

paulhauner commented Jan 20, 2021

bors bot commented Jan 20, 2021

paulhauner commented Dec 29, 2020 •

edited

Loading