-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarify get_pow_block
block-not-found case
#2676
Conversation
Co-authored-by: Hsiao-Wei Wang <hsiaowei.eth@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might want to add more clarity around transition block in the context of optimistic sync. But this could be done in a separate PR
specs/merge/fork-choice.md
Outdated
@@ -110,8 +117,12 @@ def on_block(store: Store, signed_block: SignedBeaconBlock) -> None: | |||
|
|||
# [New in Merge] | |||
if is_merge_block(pre_state, block.body): | |||
# Note: unavailable PoW block(s) may later become available. | |||
# Nodes should queue such beacon blocks for later processing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This case is a bit more tricky and we should think what would be a better way of specifying it. For the case when CL and EL are in sync with the network, current specification works. But for the case when they are syncing we might want to add more clarity to the spec.
Queueing these blocks could raise a deadlock in an optimistic sync process. Suppose, the optimistic sync is happening and EL uses state sync to catch up with the head and the network passed the transition point a while ago, so the state for transition block is no longer available due to state trie pruning. By queueing beacon blocks at this point, CL will stuck and won't inform EL about the most recent head that is available in the network.
The right behaviour during the sync stage would be as follows:
- process transition block optimistically without terminal block verification
- once fully synced, get back to the transition block and check if it's been finalized or not
- verify terminal block if the transition block hasn't been finalized
- do re-org if the terminal block is invalid
We also may add this clarity in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I consider the optimistic sync process as independent from the consensus specs.
"queuing blocks for later processing" in optimistic beacon chain sync means that you will eventually go back to fully validate them later. The architecture of CL/EL split client is independent of core consensus specs and I want to find the right balance here to not taint the specs with implementation details.
That said, I could see how this is confusing. But I do think that more notes about optimistic processing can/should be included in the Engine API as that is where this implemntation specific architecture design lives
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the intention in general. But would also like to keep the Engine API spec lean. Probably, it should be described in an optimistic sync specification or an architecture design doc of a split client.
What confuses me is that with this change the spec considers the case when PowBlock
is unavailable after state_transition
have passed, which means that process_execution_payload
have succeeded with unavailable payload. This could only happen in the optimistic sync scenario. So, the spec already supports an optimistic sync implicitly. I would prefer not supporting it at all, i.e. keeping everything as it is in the spec, or add an explicit notion of it. IMO, something in the middle is a source of confusion for implementers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we can keep the function as assuming availability here implicitly just blocking until it can be found. To avoid it all together.
Also committing to specifying optimistic sync wrt engineAPI somewhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I consider the optimistic sync process as independent from the consensus specs.
A thought on that. Optimistic nature of the sync process is caused by the CL/EL separation which is deeper than CL/EL client software split hence could be reflected in the specs. Imagine the client that is represented by a single software unit or the world where there is only one physical blockchain, i.e. consensus and execution blockchains have been merged. In both cases optimistic sync would be still needed to be able to pull the head of the chain and then pull the EL state.
The analog is [fast | snap] sync today. It optimistically downloads the state using the information about the head from the wire, at the same time it pulls the consensus state which is currently a chain of headers originated from the genesis. And for the sync procedure to get converged it is required that headers constitute a valid PoW chain and the state has been pulled. We can replace headers with beacon state and blocks required for it verification in this analog and get the sync procedure that we are working on at the moment. Arguably we already have optimistic sync today that is heavily used by the network.
There is a couple of changes appear when we move from the analog to our case. First, we can't initiate the EL state pulling process until the consensus state is downloaded and verified, and it gives us an advantage because pulling the consensus state is cheaper and gives us more guarantee than in the PoW network. So, the optimistic sync in our case is rather applied to the consensus state than to the EL state. Second, we can do a sync only in an optimistic way as an EL block sync without CL isn't reliable anymore.
Recap, IMO it's worth considering an optimistic nature of the sync in the specs. As this nature is unlikely a subject to change until we have statelessness. Even in a stateless world it depends on the kind of the statelessness (weak vs strong) that is maintained and the type of the node (proposer/producer vs user).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point and agree with it in general. But we have a specific case here. The logic we're changing in this proposal and are arguing about is Merge specific; it doesn't and shouldn't apply to a Ethereum Networks in general as some of them are started from scratch as a PoS networks and this is where your argument make a lot of sense to me.
But the Merge is a specific upgrade on the Mainnet and some time after and during the Merge Mainnet will rely on optimistic sync. I think it worth writing the spec around get_pow_block
more clear and close to the Mainnet rather than to the general network as clarity is always good for implementers.
Your point on lean specs made me think about the spec clean up after the Merge, i.e. removing transition logic from spec as a part of clean-up fork.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Off the topic, I think it's difficult to generate test vectors for syncing scenarios from consensus-specs
. What do you guys think about adding "client teams must implement the following integration testing cases: ..." to merge readiness for the known edge cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this tests should be a part of Hive test scenarios and we definitely want some test plan to exist. I would link a document with test plan to the merge readiness. what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On a side topic, what do you think about moving the code under if is_merge_block(pre_state, block.body):
condition in a separate function? This code becomes more and more sophisticated and enclosing it in a function should aid unit test writing where we can cover with tests this particular function with e.g. mocked get_pow_block
.
This method could probably be called validate_merge_block
, for instance:
def validate_merge_block(block: BeaconBlock) -> None:
# Check the parent PoW block of execution payload is a valid terminal PoW block.
# Note: unavailable PoW block(s) may later become available.
# Nodes should queue such beacon blocks for later processing.
pow_block = get_pow_block(block.body.execution_payload.parent_hash)
assert pow_block is not None
pow_parent = get_pow_block(pow_block.parent_hash)
assert pow_parent is not None
assert is_valid_terminal_pow_block(pow_block, pow_parent)
# If `TERMINAL_BLOCK_HASH` is used as an override, the activation epoch must be reached.
if TERMINAL_BLOCK_HASH != Hash32():
assert compute_epoch_at_slot(block.slot) >= TERMINAL_BLOCK_HASH_ACTIVATION_EPOCH
if is_merge_block(pre_state, block.body):
validate_merge_block(block)
Back to the main discussion, validate_merge_block
is basically what CL must do after its fully synced. CL will have to scan its block tree (basically, every fork) for the merge block(s) and call to validate_merge_block
over this block. There are three possible cases:
- Such a block isn't available -- either it's behind the WS checkpoint the node has started with or it's not yet happened; already no need OR nothing yet to validate
- Such a block exist and it's valid -- great the transition went smooth on this chain
- Such a block exist and it's invalid -- invalidate the chain containing this block and do re-org and other according actions; if such a chain has been finalized then signal critical failure to the user and stop processing the chain
Validating this block doesn't require the BeaconState
object which is good and can be done with the merge beacon block only. But is_merge_block
function would probably need to be replaced to remove the dependency on the state.
And we could probably state something like that: "Note: unavailable PoW block(s) may later become available and a client software MAY delay a call to validate_merge_block
until the PoW block(s) become available" and remove "Nodes should queue such beacon blocks for later processing". As the latter will likely cause a deadlock in an optimistic sync.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the validate_merge_block
! It makes the warning more clear. :)
I implemented it in 6016347
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Issue
Discussed in #2636
It is possible that the requested block hash is not found via
get_pow_block
.How did I fix it
get_pow_block
returnsOptional[PowBlock]
None
, the operation is invalid.