Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

break the cycle of bounded channels for network bridge #3304

Closed
wants to merge 1 commit into from

Conversation

ordian
Copy link
Member

@ordian ordian commented Jun 18, 2021

Potential cause of #3242.

Previously, we had bounded channels between network bridge and distribution subsystems (statement, bitfield and approval). Imagine if network bridge's channel is full and it awaits on dispatch_validation_events_to_all().await, whereas a distribution subsystem awaits on the bridge with report_peer().await. This causes a deadlock.

This PR changes all distribution subsystems to use unbounded channels for communication with the bridge.

I'm worried a bit about unbounded channel growth, but we need proper backpressure implemented for peersets at the libp2p level.

@ordian ordian added B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. labels Jun 18, 2021
@github-actions github-actions bot added the A3-in_progress Pull request is in progress. No review needed at this stage. label Jun 18, 2021
Copy link
Contributor

@drahnr drahnr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rphmeier
Copy link
Contributor

rphmeier commented Jun 18, 2021

I'm interested to see if this change has a practical effect, but if the underlying cause is correct, then it probably indicates a bug in the network-bridge subsystem.

The network bridge subsystem is designed like this:

Task Out: SubsystemContext -> NetworkService 
Task In: SubsystemSender <- NetworkService

Imagine if network bridge's channel is full and it awaits on dispatch_validation_events_to_all().await, whereas a distribution subsystem awaits on the bridge with report_peer().await. This causes a deadlock.

Given the above architecture, I don't believe that the explanation here is sufficient.

Task In should block on dispatch_validation_events_to_all, but that should not block Task Out. Task Out will then process the report_peer() message from the statement distribution subsystem, and then Task In will be unblocked.

At least, this is how it's meant to work in theory. If the explanation given is happening in practice, then I believe it indicates that Task In and Task Out are blocking each other in some way. Perhaps requests are a factor?

@ordian
Copy link
Member Author

ordian commented Jun 18, 2021

@rphmeier I suspect that you're right and the issue with #3270 was caused by (lack of) 6e02e99. I'll continue looking for the cause of #3242.

@ordian ordian closed this Jun 22, 2021
@ordian ordian deleted the ao-unbreak-the-bounded-cycle branch June 22, 2021 08:44
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A3-in_progress Pull request is in progress. No review needed at this stage. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants