Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Merged by Bors] - Disallow attestation production earlier than head #2130

Closed
wants to merge 2 commits into from

Conversation

paulhauner
Copy link
Member

@paulhauner paulhauner commented Dec 29, 2020

Issue Addressed

The non-finality period on Pyrmont between epochs 9114 and 9182 was contributed to by all the lighthouse_team validators going down. The nodes saw excessive CPU and RAM usage, resulting in the system to kill the lighthouse bn process. The Restart=on-failure directive for systemd caused the process to bounce in ~10-30m intervals.

Diagnosis with heaptrack showed that the BeaconChain::produce_unaggregated_attestation function was calling store::beacon_state::get_full_state and sometimes resulting in a tree hash cache allocation. These allocations were approximately the size of the hosts physical memory and still allocated when lighthouse bn was killed by the OS.

There was no CPU analysis (e.g., perf), but the BeaconChain::produce_unaggregated_attestation is very CPU-heavy so it is reasonable to assume it is the cause of the excessive CPU usage, too.

Proposed Changes

BeaconChain::produce_unaggregated_attestation has two paths:

  1. Fast path: attesting to the head slot or later.
  2. Slow path: attesting to a slot earlier than the head block.

Path (2) is the only path that calls store::beacon_state::get_full_state, therefore it is the path causing this excessive CPU/RAM usage.

This PR removes the current functionality of path (2) and replaces it with a static error (BeaconChainError::AttestingPriorToHead).

This change reduces the generality of BeaconChain::produce_unaggregated_attestation (and therefore /eth/v1/validator/attestation_data), but I argue that this functionality is an edge-case and arguably a violation of the Honest Validator spec.

It's possible that a validator goes back to a prior slot to "catch up" and submit some missed attestations. This change would prevent such behaviour, returning an error. My concerns with this catch-up behaviour is that it is:

  • Not specified as "honest validator" attesting behaviour.
  • Is behaviour that is risky for slashing (although, all validator clients should have slashing protection and will eventually fail if they do not).
  • It disguises clock-sync issues between a BN and VC.

Additional Info

It's likely feasible to implement path (2) if we implement some sort of caching mechanism. This would be a multi-week task and this PR gets the issue patched in the short term. I haven't created an issue to add path (2), instead I think we should implement it if we get user-demand.

@paulhauner paulhauner added work-in-progress PR is a work-in-progress blocked labels Dec 29, 2020
@paulhauner paulhauner changed the base branch from stable to unstable December 29, 2020 07:21
Copy link
Member

@michaelsproul michaelsproul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% agree with your assessment of the situation. This is a good simplification. That codepath was unlikely to be useful, and getting rid of it removes DoS risk.

@paulhauner paulhauner added ready-for-review The code is ready for review and removed work-in-progress PR is a work-in-progress labels Jan 19, 2021
@paulhauner paulhauner marked this pull request as ready for review January 19, 2021 03:31
@paulhauner paulhauner self-assigned this Jan 19, 2021
@paulhauner
Copy link
Member Author

bors r+

@paulhauner paulhauner added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels Jan 20, 2021
bors bot pushed a commit that referenced this pull request Jan 20, 2021
## Issue Addressed

The non-finality period on Pyrmont between epochs [`9114`](https://pyrmont.beaconcha.in/epoch/9114) and [`9182`](https://pyrmont.beaconcha.in/epoch/9182) was contributed to by all the `lighthouse_team` validators going down. The nodes saw excessive CPU and RAM usage, resulting in the system to kill the `lighthouse bn` process. The `Restart=on-failure` directive for `systemd` caused the process to bounce in ~10-30m intervals.

Diagnosis with `heaptrack` showed that the `BeaconChain::produce_unaggregated_attestation` function was calling `store::beacon_state::get_full_state` and sometimes resulting in a tree hash cache allocation. These allocations were approximately the size of the hosts physical memory and still allocated when `lighthouse bn` was killed by the OS.

There was no CPU analysis (e.g., `perf`), but the `BeaconChain::produce_unaggregated_attestation` is very CPU-heavy so it is reasonable to assume it is the cause of the excessive CPU usage, too.

## Proposed Changes

`BeaconChain::produce_unaggregated_attestation` has two paths:

1. Fast path: attesting to the head slot or later.
2. Slow path: attesting to a slot earlier than the head block.

Path (2) is the only path that calls `store::beacon_state::get_full_state`, therefore it is the path causing this excessive CPU/RAM usage.

This PR removes the current functionality of path (2) and replaces it with a static error (`BeaconChainError::AttestingPriorToHead`).

This change reduces the generality of `BeaconChain::produce_unaggregated_attestation` (and therefore [`/eth/v1/validator/attestation_data`](https://ethereum.github.io/eth2.0-APIs/#/Validator/produceAttestationData)), but I argue that this functionality is an edge-case and arguably a violation of the [Honest Validator spec](https://github.com/ethereum/eth2.0-specs/blob/dev/specs/phase0/validator.md).

It's possible that a validator goes back to a prior slot to "catch up" and submit some missed attestations. This change would prevent such behaviour, returning an error. My concerns with this catch-up behaviour is that it is:

- Not specified as "honest validator" attesting behaviour.
- Is behaviour that is risky for slashing (although, all validator clients *should* have slashing protection and will eventually fail if they do not).
- It disguises clock-sync issues between a BN and VC.

## Additional Info

It's likely feasible to implement path (2) if we implement some sort of caching mechanism. This would be a multi-week task and this PR gets the issue patched in the short term. I haven't created an issue to add path (2), instead I think we should implement it if we get user-demand.
@bors bors bot changed the title Disallow attestation production earlier than head [Merged by Bors] - Disallow attestation production earlier than head Jan 20, 2021
@bors bors bot closed this Jan 20, 2021
bors bot pushed a commit that referenced this pull request Jan 20, 2021
## Issue Addressed

- Resolves #2064

## Proposed Changes

Adds a `ValidatorMonitor` struct which provides additional logging and Grafana metrics for specific validators.

Use `lighthouse bn --validator-monitor` to automatically enable monitoring for any validator that hits the [subnet subscription](https://ethereum.github.io/eth2.0-APIs/#/Validator/prepareBeaconCommitteeSubnet) HTTP API endpoint.

Also, use `lighthouse bn --validator-monitor-pubkeys` to supply a list of validators which will always be monitored.

See the new docs included in this PR for more info.

## TODO

- [x] Track validator balance, `slashed` status, etc.
- [x] ~~Register slashings in current epoch, not offense epoch~~
- [ ] Publish Grafana dashboard, update TODO link in docs
- [x] ~~#2130 is merged into this branch, resolve that~~
@paulhauner paulhauner deleted the att-prod-mem branch February 15, 2021 07:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-for-merge This PR is ready to merge.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants