Concerns about eth1 voting in the context of caching #1537

Closed
paulhauner opened this issue Dec 18, 2019 · 5 comments

@paulhauner

paulhauner commented Dec 18, 2019

I think the current eth1 voting mechanism has some undesirable properties when we consider that staking eth2 clients should cache eth1 blocks. They should cache because:

  1. Requiring a call to some external server during block production is risky. If the eth1 node is unavailable at that time, we miss a beacon slot.
  2. The amount of data required to make the vote is >1,000 blocks. Doing this remote call during block production will add significant (likely infeasible) lag to block production.

First I will state two undesirable properties of the current system in the context of caching and then suggest a simpler system.

Undesirable property 1: Clients must cache the eth1 chain all the way up to the head

Consider the first block proposer in the eth1 voting period (slot % SLOTS_PER_ETH1_VOTING_PERIOD == 0). In order to calculate get_eth1_data(distance) it needs to know the block number of the eth1 block at the start of the voting period (now). That block is the head of the eth1 chain.

This is the primary problem I have and it has two effects:

  • The eth2 client caches must be concerned with eth1 re-orgs, whilst the core eth2 protocol is not.
  • There are points in time (the start of the voting period) where an eth2 node's ability to vote is reduced by not having a cache that is immediately up to date.

Undesirable property 2: You need to cache all the way back to the current eth1_data

In order to cast a vote (i.e., not trigger an exception in the spec), a node must have in their cache all descendants of the block represented by state.eth1_data (at the start of the current voting period). So, the cache grows linearly with the time since a successful eth1 voting period.

Additionally, if a node wants their cache to be safe in the case of an eth2 re-org, they should cache all the way back to the eth1_data in the last finalized block. Therefore, the cache also grows linearly with the time since finalization.
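As a rough illustration of that linear growth, here is a back-of-the-envelope sketch. The constant values and the `estimated_cache_blocks` helper are assumptions made for illustration only, not spec values, and the estimate ignores eth1 block-time variance.

```python
# Illustrative constants (assumed for this sketch, not authoritative).
SLOTS_PER_ETH1_VOTING_PERIOD = 1024
SECONDS_PER_SLOT = 12
SECONDS_PER_ETH1_BLOCK = 15
ETH1_FOLLOW_DISTANCE = 128


def estimated_cache_blocks(failed_voting_periods: int) -> int:
    """Roughly estimate how many eth1 blocks lie between state.eth1_data and the eth1 head.

    Hypothetical helper: assumes the last successful vote landed about
    ETH1_FOLLOW_DISTANCE blocks behind the head, and that each failed
    voting period adds one period's worth of eth1 blocks on top.
    """
    period_seconds = SLOTS_PER_ETH1_VOTING_PERIOD * SECONDS_PER_SLOT
    extra_blocks = (failed_voting_periods * period_seconds) // SECONDS_PER_ETH1_BLOCK
    return ETH1_FOLLOW_DISTANCE + extra_blocks
```

Under these assumed numbers, each voting period that fails to update eth1_data adds roughly 800 more blocks to the cache, with no upper bound.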

My proposal:

Below is some rough Python code that I think is minimal and viable. I'm not convinced this should be the final solution, but it's a starting point at least.

It has the following properties that the present solution does not:

  • Clients can go for ETH1_FOLLOW_DISTANCE * SECONDS_PER_SLOT without contacting an eth1 node and still vote perfectly.
  • The size of this cache is limited to the number of eth1 blocks in ETH1_FOLLOW_DISTANCE * SECONDS_PER_SLOT.
  • It is reasonable to assume that a re-org will never affect this cache.

def voting_period_start(state: BeaconState):
    eth1_voting_period_start_slot = state.slot % SLOTS_PER_ETH1_VOTING_PERIOD
    state.genesis_time + eth1_voting_period_start_slot * SECONDS_PER_SLOT

def is_candidate_block(block: Eth1Block, period_start: uint64) -> Bool:
    block.timestamp <= period_start - SECONDS_PER_ETH1_BLOCK * ETH1_FOLLOW_DISTANCE and \
    block.timestamp >= period_start - SECONDS_PER_ETH1_BLOCK * ETH1_FOLLOW_DISTANCE * 2

def get_eth1_vote(state: BeaconState) -> Eth1Data:
    period_start = voting_period_start(state)
    # `eth1_chain` abstractly represents all blocks in the eth1 chain.
    votes_to_consider = [get_eth1_data(block) for block in eth1_chain if
                         is_candidate_block(block, period_start)]

    valid_votes = [vote for vote in state.eth1_data_votes if vote in votes_to_consider]

    return max(
        valid_votes,
        key=lambda v: (valid_votes.count(v), -valid_votes.index(v)),  # Tiebreak by smallest distance
        default=get_eth1_data(ETH1_FOLLOW_DISTANCE),
    )
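
For anyone who wants to run the idea, here is a self-contained sketch of the rough code above. It is not the spec: the dataclasses, the constant values, the explicit `eth1_chain` argument and the simplified `Eth1Data` (reduced to a block number) are all assumptions made for illustration. It also adds explicit `return` statements, computes the period start slot as `slot - slot % SLOTS_PER_ETH1_VOTING_PERIOD` (which I believe is the intent), and falls back to the newest candidate block when there are no valid votes, which is likewise an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative constants (assumed for this sketch, not authoritative).
SLOTS_PER_ETH1_VOTING_PERIOD = 1024
SECONDS_PER_SLOT = 12
SECONDS_PER_ETH1_BLOCK = 15
ETH1_FOLLOW_DISTANCE = 128


@dataclass(frozen=True)
class Eth1Block:
    number: int
    timestamp: int


@dataclass(frozen=True)
class Eth1Data:
    # Hypothetical stand-in for (deposit_root, deposit_count, block_hash).
    block_number: int


@dataclass
class BeaconState:
    genesis_time: int
    slot: int
    eth1_data_votes: List[Eth1Data] = field(default_factory=list)


def voting_period_start(state: BeaconState) -> int:
    """Timestamp of the first slot in the current voting period."""
    start_slot = state.slot - state.slot % SLOTS_PER_ETH1_VOTING_PERIOD
    return state.genesis_time + start_slot * SECONDS_PER_SLOT


def is_candidate_block(block: Eth1Block, period_start: int) -> bool:
    """True if the block is between 1x and 2x the follow distance (in seconds) old."""
    follow_seconds = SECONDS_PER_ETH1_BLOCK * ETH1_FOLLOW_DISTANCE
    return period_start - 2 * follow_seconds <= block.timestamp <= period_start - follow_seconds


def get_eth1_data(block: Eth1Block) -> Eth1Data:
    return Eth1Data(block_number=block.number)


def get_eth1_vote(state: BeaconState, eth1_chain: List[Eth1Block]) -> Optional[Eth1Data]:
    period_start = voting_period_start(state)
    votes_to_consider = [get_eth1_data(b) for b in eth1_chain
                         if is_candidate_block(b, period_start)]
    valid_votes = [v for v in state.eth1_data_votes if v in votes_to_consider]
    # Assumed fallback: the newest candidate block, or None if the cache is empty.
    default_vote = votes_to_consider[-1] if votes_to_consider else None
    return max(
        valid_votes,
        # Most votes wins; ties go to the vote seen earliest in the period.
        key=lambda v: (valid_votes.count(v), -valid_votes.index(v)),
        default=default_vote,
    )
```

Note that only blocks inside the timestamp window are ever consulted, so `eth1_chain` can be a bounded local cache rather than a live view of the eth1 head.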

Basically, this solution makes the following changes:

  1. We select the range of viable votes using only timestamp, instead of the timestamp/block number hybrid we have now. This frees us from undesirable property 1.
  2. Don't try and vote on blocks prior to the current voting period (i.e., ditch all_eth1_data). This frees us from undesirable property 2.

WRT (2), it's not clear to me why we bother with all_eth1_data. I have some ideas, but I'd be keen to hear the original motivations.

@paulhauner paulhauner changed the title from "Concerns about eth1 voting" to "Concerns about eth1 voting in the context of caching" Dec 18, 2019
@djrtwo

djrtwo commented Dec 18, 2019

why all_eth1_data

The original intent of new_eth1_data vs all_eth1_data is to attempt to only cast repeat votes for new eth1 data if at the "start" of the period, and then in the tail of the period, favor just coming to consensus on anything valid.

This is to fight against the case in which an attacker has control of some proposers early in a voting period and just selects say the next block past the last eth1data. This attack is relatively easy to conduct, costs nothing, and would majorly stall the induction of new validators into the beacon chain.

Proposer honesty assumption as justification of new design

What we lose by only considering "new" when casting repeat votes instead of opening up to "all" in the tail seems marginal, at best. Even in the case in which an attacker has all of the proposers during the first sqrt(SLOTS_PER_ETH1_VOTING_PERIOD) (32) slots, it is very likely that at least one honest proposer shows up before we get substantially further into the voting period, after which honest votes would coalesce on that proposer's vote. In fact, we already assume at least 1 honest proposer per 32 slots as a liveness assumption for getting attestations on chain to facilitate FFG finality. If this assumption consistently fails, we (1) are unlikely to be able to wrest control of the eth1 voting mechanism from the attacker and (2) might have bigger fish to fry.
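
To put a rough number on "very likely", here is a small sketch. It assumes each slot's proposer is attacker-controlled independently with probability `p`, which simplifies the real stake-weighted proposer selection; `prob_attacker_controls_all` is a hypothetical helper, not part of any spec.

```python
def prob_attacker_controls_all(p: float, slots: int = 32) -> float:
    """Probability that all of `slots` consecutive proposers are attacker-controlled.

    Assumes independent selection with per-slot attacker probability `p`.
    """
    return p ** slots


if __name__ == "__main__":
    for p in (1 / 3, 1 / 2, 2 / 3):
        print(f"attacker share {p:.2f}: {prob_attacker_controls_all(p):.3e}")
```

Under that independence assumption, even an attacker with two thirds of the proposers holds 32 slots in a row only a few times per million attempts.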

Will the change affect the finality gadget?

One thing to consider is how this operates when SLOTS_PER_ETH1_VOTING_PERIOD and ETH1_FOLLOW_DISTANCE are reduced when introducing the finality gadget.

When SLOTS_PER_ETH1_VOTING_PERIOD is reduced, we rely more and more on our honesty assumption because our random sampling of proposers decreases. This seems unaffected by the proposal.

When ETH1_FOLLOW_DISTANCE is reduced into the range of potential forking (say 25-50 eth1 blocks), to ensure that proposers can coalesce on one vote even if the chain has a high degree of forking, we might need to have validators consider the recent eth1 block tree instead of just the PoW canonical chain. This requirement is debatable, but I don't think this proposal affects that potential requirement. It might require building a local cache (both with block depth or timestamp) so we should at least keep cache requirements in mind when researching and spec'ing finality gadget.

I'm in favor

I'm in favor.

This seems like something that we can release in a spec ASAP because it won't actually interfere with interoperability between clients in 99%+ of cases. With different clients running the existing algorithm and this new algorithm, validators might end up disagreeing on early votes, but would start to agree during the period_tail.

Let's gauge the temperature on the call tomorrow


edit: where are your return statements?! Do you even python?

@paulhauner

paulhauner commented Dec 19, 2019

> This is to fight against the case in which an attacker has control of some proposers early in a voting period and just selects say the next block past the last eth1data.

I'm not following how all_eth1_data prevents this. As I understand it, the presence of all_eth1_data would make it more likely that honest clients end up voting on older (lower block #) eth1 blocks. This is because all_eth1_data must contain blocks that are either (a) in new_eth1_data or (b) older than all blocks in eth1_data.

> Let's gauge the temperature on the call tomorrow

SGTM

> edit: where are your return statements?! Do you even python?

Return statements are so passé.

@prestonvanloon

I like it. This is close to what we do in Prysm, except for the tie breaking for vote decisions. We don't consider anyone else's vote at the moment when determining our eth1data vote.

I am a bit concerned about caching and eth1 re-orgs. If everyone caches the same way and no one invalidates their cache on an eth1 re-org, that could be a problem. Could you elaborate on the assumption that eth1 re-orgs cannot affect this cache method?

@paulhauner

> I like it.

Glad to hear that!

> Could you elaborate on the assumption that eth1 re-orgs cannot affect this cache method?

The present voting mechanism (when considered alongside the validator on-boarding logic) makes an assumption that eth1 will never re-org a block that is more than 128 blocks deep. This is fine, but my gripe is that eth1 voting is structured in such a way that a client is most likely going to need to keep a cache that occasionally includes all blocks up to the current eth1 head.

The mechanism I have proposed addresses my gripe by making a slightly different assumption; instead of saying "eth1 will never re-org a block that is more than 128 blocks deep", it says "eth1 will never re-org a block that is more than 128 times the expected block time (15s) deep". In other words, we judge a block's depth in the eth1 chain by eth1_block.timestamp instead of eth1_block.number.

The important part about my mechanism is that it only requires the client to cache blocks that are at least SECONDS_PER_ETH1_BLOCK * ETH1_FOLLOW_DISTANCE seconds prior to the current eth1 head. So, if we can make the assumption that eth1 will never re-org a block with a timestamp more than 32 minutes old (15 secs * 128) then we can also assume that our caches will never experience a re-org.

Note: the assumption is not necessarily "eth1 will never re-org a block 32 minutes deep", it's more along the lines of "if eth1 re-orgs past 32 min then we need an extra-protocol solution to patch eth2".
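
The arithmetic behind the 32-minute figure, and the resulting "old enough to cache" check, can be sketched as below. The constant values mirror the numbers quoted in this comment; `is_safe_to_cache` is a hypothetical helper name, not something from the spec or any client.

```python
SECONDS_PER_ETH1_BLOCK = 15   # expected eth1 block time quoted above
ETH1_FOLLOW_DISTANCE = 128    # follow distance in blocks quoted above

# 15 s * 128 = 1920 s, i.e. 32 minutes.
FOLLOW_SECONDS = SECONDS_PER_ETH1_BLOCK * ETH1_FOLLOW_DISTANCE


def is_safe_to_cache(block_timestamp: int, head_timestamp: int) -> bool:
    """True if the block is at least FOLLOW_SECONDS older than the eth1 head.

    Under the assumption stated above, such a block is treated as
    beyond the reach of any eth1 re-org.
    """
    return head_timestamp - block_timestamp >= FOLLOW_SECONDS
```

A cache that only ever admits blocks passing this check never has to handle re-org invalidation, provided the timestamp-depth assumption holds.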

@protolambda

Closed via #1553