
Populate topology after restart if finality is lagging behind current session #6913

Merged (10 commits into master) Feb 7, 2025

Conversation

@alexggh (Contributor) commented Dec 16, 2024

There is a small issue on restart: if finality is lagging across a session boundary and the validator restarts, the validator can no longer contribute assignments/approvals and gossip for blocks from the previous session. After restart it builds the topology only for the new session, and everything in approval-distribution is gated on having a topology for the block, so without one it cannot distribute assignments and approvals.

The fix is to also keep track of the last finalized block and its session; if that session is not in the list of encountered sessions, build its topology as well and send it to the rest of the subsystems.
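A minimal sketch of that check, with hypothetical, pared-down stand-ins for the real gossip-support types (the actual subsystem state and topology construction are more involved):

```rust
use std::collections::HashMap;

type SessionIndex = u32;

// Hypothetical, simplified stand-in for the real topology type.
#[derive(Clone, Debug, PartialEq)]
struct SessionGridTopology {
    session: SessionIndex,
}

// Sessions for which a topology has already been built and gossiped.
struct GossipSupportState {
    built: HashMap<SessionIndex, SessionGridTopology>,
}

impl GossipSupportState {
    // On (re)start: if the last finalized block belongs to a session we
    // have not yet encountered, build its topology too, so that
    // approval-distribution can keep gossiping for those blocks.
    // Returns the newly built topology, or None if it already existed.
    fn ensure_finalized_session_topology(
        &mut self,
        finalized_session: SessionIndex,
    ) -> Option<&SessionGridTopology> {
        if !self.built.contains_key(&finalized_session) {
            let topo = SessionGridTopology { session: finalized_session };
            self.built.insert(finalized_session, topo);
            return self.built.get(&finalized_session);
        }
        None
    }
}
```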

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
@alexggh alexggh requested review from ordian, sandreim and AndreiEres and removed request for ordian December 16, 2024 16:20
@alindima (Contributor)

How will this work for subsystems like statement-distribution and bitfield-distribution, which need to keep backing new candidates in the latest session? AFAIU, with this PR they would receive the topologies of the old session that has some unfinalized blocks, which I expect will break them.

@alexggh (Contributor, Author) commented Jan 6, 2025

> How will this work for subsystems like statement-distribution and bitfield-distribution, which need to keep backing new candidates in the latest session? AFAIU, with this PR they would receive the topologies of the old session that has some unfinalized blocks, which I expect will break them.

Good catch, thanks! In statement-distribution we keep the topology in a per-session structure, so there is no problem there; bitfield-distribution, however, keeps only the last received topology, and that is a problem. Looking into how to fix that.
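The difference between the two storage shapes can be sketched as follows; the type and field names are illustrative, not the actual subsystem code:

```rust
use std::collections::HashMap;

type SessionIndex = u32;

#[derive(Clone, Debug, PartialEq)]
struct SessionGridTopology {
    session: SessionIndex,
}

// statement-distribution style: one topology per session, so a
// late-arriving topology for an old session cannot clobber the current one.
#[derive(Default)]
struct PerSessionTopologies {
    map: HashMap<SessionIndex, SessionGridTopology>,
}

impl PerSessionTopologies {
    fn update(&mut self, topo: SessionGridTopology) {
        self.map.insert(topo.session, topo);
    }
    fn get(&self, session: SessionIndex) -> Option<&SessionGridTopology> {
        self.map.get(&session)
    }
}

// bitfield-distribution style: only the last received topology is kept,
// so a topology sent for an old session would overwrite the current one.
#[derive(Default)]
struct LastTopologyOnly {
    last: Option<SessionGridTopology>,
}

impl LastTopologyOnly {
    fn update(&mut self, topo: SessionGridTopology) {
        self.last = Some(topo);
    }
}
```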

@alindima (Contributor) commented Jan 6, 2025

Maybe a better approach would be to have approval-distribution request an old topology from the gossip-support subsystem. In its current form it's also a bit confusing that we're sending NewGossipTopology for an old session.

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
@alexggh (Contributor, Author) commented Jan 7, 2025

> Maybe a better approach would be to have approval-distribution request an old topology from the gossip-support subsystem.

I've looked closer and it is not that simple, because the flow right now is:

  1. gossip-support sends NetworkBridgeRxMessage::NewGossipTopology
  2. network-bridge-rx receives this message, adds some useful information, and converts it to NetworkBridgeEvent::NewGossipTopology
  3. approval-distribution, statement-distribution and bitfield-distribution receive NetworkBridgeEvent::NewGossipTopology

Requesting would mean reproducing the same flow in reverse: approval-distribution -> network-bridge-rx -> gossip-support, with each subsystem blocking while it waits for the other to respond. I don't think we should introduce this, because it is easy to produce deadlocks with this type of interaction.
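The forward flow, and the bridge's "enrichment" step in particular, can be sketched with hypothetical, heavily simplified message types (the real ones carry peer IDs, grid neighbours, and more):

```rust
type SessionIndex = u32;
type ValidatorIndex = u32;

// Step 1: what gossip-support sends to the network bridge (simplified).
enum NetworkBridgeRxMessage {
    NewGossipTopology {
        session: SessionIndex,
        canonical_shuffling: Vec<ValidatorIndex>,
    },
}

// Step 3: what the distribution subsystems receive (simplified).
#[derive(Debug, PartialEq)]
enum NetworkBridgeEvent {
    NewGossipTopology {
        session: SessionIndex,
        local_index: Option<ValidatorIndex>,
    },
}

// Step 2: the bridge adds useful information (here: locating our own
// validator index in the shuffling) and converts the message into the
// event that the distribution subsystems receive.
fn convert(msg: NetworkBridgeRxMessage, our_validator: ValidatorIndex) -> NetworkBridgeEvent {
    match msg {
        NetworkBridgeRxMessage::NewGossipTopology { session, canonical_shuffling } => {
            let local_index = canonical_shuffling
                .iter()
                .position(|v| *v == our_validator)
                .map(|i| i as ValidatorIndex);
            NetworkBridgeEvent::NewGossipTopology { session, local_index }
        }
    }
}
```

A request/response path would have to run this pipeline backwards with each hop awaiting the next, which is where the deadlock risk comes from.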

> With the current form it's also a bit confusing that we're sending NewGossipTopology for an old session

The message has a session_index on it that clearly tells you which session the topology is for, which allows callers to determine whether they need the older ones or not.

In the end I opted to ignore older topologies in bitfield-distribution, so let me know what you think about going this route.
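The "ignore older topologies" guard amounts to something like the following sketch; the struct and method names are illustrative, not the actual bitfield-distribution code:

```rust
type SessionIndex = u32;

struct BitfieldDistributionState {
    // Session of the newest topology received so far, if any.
    current_session: Option<SessionIndex>,
}

impl BitfieldDistributionState {
    // Accept only topologies at least as new as what we already track.
    // Older ones are relevant to approval-distribution, not to us.
    // Returns whether the topology was applied.
    fn handle_new_topology(&mut self, session: SessionIndex) -> bool {
        match self.current_session {
            Some(cur) if session < cur => false, // stale: ignore
            _ => {
                self.current_session = Some(session);
                true
            }
        }
    }
}
```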

@alindima (Contributor) commented Jan 8, 2025

> Requesting would mean reproducing the same flow, but in reverse: approval-distribution -> network-bridge-rx -> gossip-support, where each subsystem blocks waiting for the other to respond. I don't think we should introduce this because it is easy to produce deadlocks with this type of interaction.

To avoid this multi-hop, we could store them with the already enriched data in network-bridge-rx.

> The message has session_index on it that clearly tells you which session the topology is for, so that allows callers to determine if they need the older ones or not.

You do have the session index there, but the naming is confusing. Subsystems that use this code could easily overlook it (like a new leaf update containing an old leaf).

> In the end I opted to ignore older topologies for bitfield-distribution, so let me know what you think about going this route.

There is still code that runs in bitfield-distribution even if update_topologies did nothing. Maybe a good compromise would be adding a new message type, so that this change is explicit and is only handled by subsystems that need it?

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
@alexggh (Contributor, Author) commented Jan 10, 2025

> > Requesting would mean reproducing the same flow, but in reverse: approval-distribution -> network-bridge-rx -> gossip-support, where each subsystem blocks waiting for the other to respond. I don't think we should introduce this because it is easy to produce deadlocks with this type of interaction.
>
> To avoid this multi-hop, we could store them with the already enriched data in network-bridge-rx.
>
> > The message has session_index on it that clearly tells you which session the topology is for, so that allows callers to determine if they need the older ones or not.
>
> You do have the session index there but the naming is confusing. Subsystems that use this code could easily overlook this (like a new leaf update containing an old leaf).
>
> > In the end I opted to ignore older topologies for bitfield-distribution, so let me know what you think about going this route.
>
> There is still code that is being run in bitfield-distribution even if update_topologies did nothing. Maybe a good compromise would be adding a new message type? So that this change is explicit and is only handled by subsystems that need it

As discussed on chat, I ended up with a solution where only approval-distribution receives old topologies; no other subsystem receives them, because they are not interested in them.
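The final routing decision can be paraphrased as a small sketch; the `Recipient` enum and `route_topology` function are illustrative names, not the actual bridge code:

```rust
type SessionIndex = u32;

#[derive(Debug, PartialEq)]
enum Recipient {
    AllDistributionSubsystems,
    ApprovalDistributionOnly,
}

// A topology for the current (or a newer) session goes to every
// distribution subsystem as before; a topology rebuilt for an older
// session (finality lagging over a session boundary at restart) is
// delivered to approval-distribution only.
fn route_topology(
    current_session: SessionIndex,
    topology_session: SessionIndex,
) -> Recipient {
    if topology_session < current_session {
        Recipient::ApprovalDistributionOnly
    } else {
        Recipient::AllDistributionSubsystems
    }
}
```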

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Resolved review threads: polkadot/node/network/bridge/src/rx/mod.rs, polkadot/node/network/bridge/src/rx/tests.rs (outdated)
Co-authored-by: ordian <write@reusable.software>
alexggh and others added 2 commits February 6, 2025 17:19
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
@alexggh alexggh added the T8-polkadot This PR/Issue is related to/affects the Polkadot network. label Feb 6, 2025
@paritytech-workflow-stopper
All GitHub workflows were cancelled due to the failure of one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/13183098604
Failed job name: test-linux-stable

@alexggh alexggh enabled auto-merge February 7, 2025 08:45
@alexggh alexggh added this pull request to the merge queue Feb 7, 2025
auto-merge was automatically disabled February 7, 2025 09:23

Pull Request is not mergeable

Merged via the queue into master with commit bb5e05c Feb 7, 2025
204 of 207 checks passed
@alexggh alexggh deleted the alexggh/update_topology_at_restart branch February 7, 2025 11:04
5 participants