Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Add retry mechanism for pov-recovery, fix full-node pov-recovery #2164

Merged
merged 29 commits into from
Feb 9, 2023

Conversation

skunert
Copy link
Contributor

@skunert skunert commented Feb 6, 2023

This PR introduces some changes to the PoV recovery mechanism.

Retry in case of unavailability

PoV recovery is listening to a stream of pending candidates that is derived from the relay chain import notification stream. Since the candidates are pending, availability is not yet guaranteed. In that case we now just retry the operation.

Currently we retry only once.

Ordered recovery queue

Until now for each recovery operation we added a randomized delay after which the recovery would start. This was not a problem for collators. For full nodes however, we have a configured recovery delay of 150 - 300 seconds. Since the recovery can start anywhere in that range, it could happen that we attempted to recover a child before its parent was available. There is already a mechanism to wait for parents, but it was only waiting for parent currently in recovery, which is not true for queued ones.

This is not changed into an ordered queue that will add randomized recovery slots on insertion, but start recoveries in order.

Finalization

When full-nodes recover a block via PoV (which should be very rare), they are behind by several minutes. Since we are following relay chain final and best block, the node had no way of knowing the status of the recovered block. We are now maintaining a cache of seen finalized blocks. If a block is imported and its hash is in the cache, we immediately set it as finalized.

Closes #2142

@skunert skunert added B0-silent Changes should not be mentioned in any release notes A0-please_review Pull request needs code review. C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit labels Feb 6, 2023
@skunert skunert requested review from bkchr, davxy and a team February 6, 2023 11:29
client/pov-recovery/src/lib.rs Outdated Show resolved Hide resolved
client/pov-recovery/src/lib.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@melekes melekes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 (code-wise; don't know enough to reason about business logic)

client/pov-recovery/src/lib.rs Outdated Show resolved Hide resolved
client/pov-recovery/src/lib.rs Outdated Show resolved Hide resolved
client/pov-recovery/src/lib.rs Outdated Show resolved Hide resolved
client/pov-recovery/src/lib.rs Outdated Show resolved Hide resolved
client/pov-recovery/src/lib.rs Outdated Show resolved Hide resolved
client/pov-recovery/src/lib.rs Outdated Show resolved Hide resolved
client/consensus/common/src/parachain_consensus.rs Outdated Show resolved Hide resolved
client/pov-recovery/src/lib.rs Outdated Show resolved Hide resolved
client/pov-recovery/src/lib.rs Outdated Show resolved Hide resolved
client/pov-recovery/src/lib.rs Show resolved Hide resolved
client/pov-recovery/src/lib.rs Outdated Show resolved Hide resolved
client/pov-recovery/src/lib.rs Outdated Show resolved Hide resolved
test/service/src/lib.rs Show resolved Hide resolved
client/service/src/lib.rs Show resolved Hide resolved
Copy link
Member

@davxy davxy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good 🟢

Copy link
Member

@bkchr bkchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some last things.

/// Clear `waiting_for_parent` from the given `hash` and do this recursively for all child
/// blocks.
fn clear_waiting_for_parent(&mut self, hash: Block::Hash) {
/// Clear `waiting_for_parent` and `waiting_recovery` for the candidate with `hash`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Clear `waiting_for_parent` and `waiting_recovery` for the candidate with `hash`.
/// Clear `waiting_for_parent` and `waiting_recovery` for the candidate with `hash`.
///

Comment on lines 407 to 410
let parent_scheduled_for_recovery =
self.candidates.get(&parent).map_or(false, |parent| parent.waiting_recovery);
if self.active_candidate_recovery.is_recovering(&parent) ||
parent_scheduled_for_recovery
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let parent_scheduled_for_recovery =
self.candidates.get(&parent).map_or(false, |parent| parent.waiting_recovery);
if self.active_candidate_recovery.is_recovering(&parent) ||
parent_scheduled_for_recovery
let parent_being_recovered =
self.candidates.get(&parent).map_or(false, |parent| parent.waiting_recovery);
if parent_being_recovered

You just updated the comment above waiting_recovery saying that this is only true when you either waiting for recovery or it is being recovered.

@@ -425,22 +533,7 @@ where

if do_recover {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This do_recover can be removed by replacing the break false above with return. The logic above also has an error, we set waiting_recovery to true directly, but later in the loop we may encounter an error and never reset this flag.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

Comment on lines 449 to 451
// This is not necessary in the happy case, as the flag will be cleared on import
// If import fails for any reason however, we still want to have the flag cleared.
self.clear_waiting_recovery(&block_hash);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should assume that import works, if that fails we are trapped here anyway. But we should remove this code, as otherwise we may start downloading the block again while it is imported.

client/pov-recovery/src/lib.rs Show resolved Hide resolved
client/pov-recovery/src/lib.rs Show resolved Hide resolved
client/pov-recovery/src/lib.rs Show resolved Hide resolved
client/pov-recovery/src/lib.rs Show resolved Hide resolved
skunert and others added 4 commits February 9, 2023 09:11
Co-authored-by: Bastian Köcher <git@kchr.de>
This reverts commit 3809eed.
@skunert skunert merged commit afcfd24 into paritytech:master Feb 9, 2023
aurexav added a commit to darwinia-network/darwinia that referenced this pull request May 22, 2023
aurexav added a commit to darwinia-network/darwinia that referenced this pull request May 23, 2023
* Remove `account`

* Update and companions

* Companion of paritytech/cumulus#2164

* Companion of paritytech/substrate#13159

* Companion of paritytech/cumulus#1909

* Fmt

* Companion of paritytech/polkadot#6744

* Companion of paritytech/cumulus#2245

* Companion of paritytech/substrate#12828

* Companion of paritytech/cumulus#2287

* Companion of paritytech/substrate#13592

* Companion of paritytech/cumulus#2308

* Companion of paritytech/substrate#13410

* Companion of paritytech/substrate#13305

* Companion of polkadot-evm/frontier#1050

* TODO weight

* TODO weight

* Companion of polkadot-evm/frontier#1040

Signed-off-by: Xavier Lau <xavier@inv.cafe>

* Remove unused dep

* Try fix dev node paritytech/substrate#12828

* Fix the frontier part (#1154)

* Fix dev node

* Fmt

* Bump moonbeam

* Bump moonbeam

* Remove unnecessary clone

* Fix tests

---------

Signed-off-by: Xavier Lau <xavier@inv.cafe>
Co-authored-by: Guantong <i@gt.email>
Co-authored-by: bear <boundless.forest@outlook.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add retry mechanism for pov-recovery
5 participants