This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Use new block requests protocol #5760

Merged
merged 3 commits into paritytech:master from use-new-block-requests
Apr 24, 2020

Conversation

tomaka
Contributor

@tomaka tomaka commented Apr 23, 2020

Ok, this is the last step towards #5670 that needs to be done ASAP for backcompat reasons. The rest of this issue can be tackled later.

Remarks:

  • I added a CLI flag to restore the legacy behaviour, just in case.

  • Unless the CLI flag is enabled, this PR breaks compatibility with Polkadot 0.7.20 and below. I think that's fine. According to the telemetry, only 16 nodes are affected by this breakage, and only 6 of them seem to stay in sync while the others are more or less in limbo. It seems that compatibility with versions 0.7.19 and below is already broken anyway (for a reason I don't know).

  • The protocol.rs file now emits requests that are then handled by the block_requests and finality_requests modules. When these modules notify us of a response, we pass it back to protocol.rs.

  • Any failure (bad response, protocol not supported, timeout, ...) is handled by disconnecting the node. This keeps compatibility with the current protocol, but we should obviously do something about that later.

  • I removed the Toggle from finality_proofs, otherwise I can't use it to send requests. I made it so that if the finality proof provider is None then the list of negotiable inbound protocols is empty.

  • The limit of 16MiB for block response sizes is the one currently in use in the legacy protocol. I initially put it at 1MiB but that was too little and we were discarding responses.

  • I had to add an is_empty_justification field to the protocol, because tests send out empty justifications and we don't differentiate between an absence of justification and an empty justification. While this is not great, it was done in a backwards-compatible way (see the decoding sketch after this list).

  • The formats of a "block request" and a "block response" are message::BlockRequest and message::BlockResponse. I unfortunately didn't have the courage to do the deeper refactorings required to have something a bit more strongly-typed.

  • I'm testing this on my Google Cloud node at the moment, and it seems to work totally fine.
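
To make the is_empty_justification hack concrete for reviewers, here is a minimal decoding sketch. This is not the PR's actual code: only the justification and is_empty_justification field names come from the diff; the function and everything else is hypothetical.

// Hypothetical helper showing how a receiver can distinguish the three cases
// once `is_empty_justification` exists. Only the two field names are taken
// from the protobuf schema touched by this PR; the rest is illustrative.
fn decode_justification(
    justification: Vec<u8>,
    is_empty_justification: bool,
) -> Option<Vec<u8>> {
    if !justification.is_empty() {
        // A real, non-empty justification was sent.
        Some(justification)
    } else if is_empty_justification {
        // Protobuf cannot distinguish "field absent" from "field present but
        // empty", so the sender sets this flag to mean "present but empty".
        Some(Vec::new())
    } else {
        // No justification at all.
        None
    }
}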

@tomaka tomaka added the A0-please_review and B1-clientnoteworthy labels Apr 23, 2020
@tomaka tomaka requested review from twittner and mxinden April 23, 2020 14:25
@tomaka tomaka requested a review from cecton as a code owner April 23, 2020 14:25
@tomaka tomaka added this to the 2.0 milestone Apr 23, 2020
Contributor

@twittner twittner left a comment

LGTM.

Contributor

@mxinden mxinden left a comment

For what my review is worth, this looks good to me.

@@ -51,5 +51,8 @@ message BlockData {
bytes message_queue = 5; // optional
// Justification if requested.
bytes justification = 6; // optional
// True if justification should be treated as present be empty.
Contributor

Comment does not make sense to me. Did you mean:

Suggested change
// True if justification should be treated as present be empty.
// True if justification should be treated as empty.

Contributor Author

That was meant to be:

Suggested change
// True if justification should be treated as present be empty.
// True if justification should be treated as present but empty.

@@ -51,5 +51,8 @@ message BlockData {
bytes message_queue = 5; // optional
// Justification if requested.
bytes justification = 6; // optional
// True if justification should be treated as present be empty.
// This hack is unfortunately necessary because of shortcomings in the protobuf format.
Contributor

Would you mind adding your details from the pull request description here?

@tomaka tomaka added the A8-mergeoncegreen label and removed the A0-please_review label Apr 24, 2020
@tomaka tomaka merged commit ee098e9 into paritytech:master Apr 24, 2020
@tomaka tomaka deleted the use-new-block-requests branch April 24, 2020 11:48
@arkpar
Member

arkpar commented Apr 24, 2020

Looks like timeout handling logic is broken. Also, Peer::block_request should be set immediately upon queueing the request. This is broken too as far as I can see.

response: message::BlockResponse<B>,
},
}

/// Configuration options for `BlockRequests`.
#[derive(Debug, Clone)]
pub struct Config {
max_block_data_response: u32,
Member

What happens if this is reached? How's it different from max_response_len?

Contributor Author

@tomaka tomaka Apr 25, 2020

That's the maximum number of blocks that we respond with, by default 128. If a node requests 129 blocks, we will only answer with 128. That's exactly the same as the old behaviour.
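
A rough sketch of that capping rule, using a hypothetical helper (this is the described behaviour, not the PR's actual code):

// Hypothetical: cap the number of blocks in a response at
// `max_block_data_response` (128 by default). A request for 129 blocks is
// answered with 128, matching the old behaviour.
fn blocks_to_send(requested: Option<u32>, max_block_data_response: u32) -> u32 {
    requested
        .unwrap_or(max_block_data_response)
        .min(max_block_data_response)
}
// e.g. blocks_to_send(Some(129), 128) == 128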

@@ -127,6 +151,8 @@ pub struct BlockRequests<B: Block> {
chain: Arc<dyn Client<B>>,
/// Futures sending back the block request response.
outgoing: FuturesUnordered<BoxFuture<'static, ()>>,
/// Events to return as soon as possible from `poll`.
pending_events: VecDeque<NetworkBehaviourAction<OutboundProtocol<B>, Event<B>>>,
Member

Is this unbounded?

Contributor Author

@tomaka tomaka Apr 25, 2020

Yes, but libp2p guarantees that poll is called every time after inject_event.
Consequently, the only way to make this list grow past one element is to call send_request multiple times in a row.
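
The pattern being described is roughly the following sketch; the real NetworkBehaviour plumbing is omitted and every name other than pending_events is an assumption:

use std::collections::VecDeque;

// Sketch of the "queue in inject_event / drain in poll" pattern under
// discussion. Because libp2p polls the behaviour after delivering each event,
// the queue normally holds at most one entry unless several requests are
// queued back to back before the next poll.
struct PendingQueue<E> {
    pending_events: VecDeque<E>,
}

impl<E> PendingQueue<E> {
    /// Called from the equivalent of `send_request` / `inject_event`.
    fn push(&mut self, event: E) {
        self.pending_events.push_back(event);
    }

    /// Called from the equivalent of `poll`; returns queued events one at a
    /// time until the queue is empty.
    fn next_event(&mut self) -> Option<E> {
        self.pending_events.pop_front()
    }
}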

Member

Which we totally do for gossiping, propagating extrinsics and such, don't we?

response: message::BlockResponse<B>,
},
}

/// Configuration options for `BlockRequests`.
#[derive(Debug, Clone)]
pub struct Config {
max_block_data_response: u32,
max_request_len: usize,
Member

What happens if it is reached?

Contributor Author

@tomaka tomaka Apr 25, 2020

It's considered a protocol error and we disconnect the node. There has to be some sort of limit, since we have to allocate the memory that contains the request. I think requests are typically less than 100 bytes, so the default 1MiB is way more than enough.
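
Conceptually the guard looks like this (a sketch with assumed names, not the actual implementation):

// Hypothetical request-size guard: anything larger than `max_request_len`
// (1MiB by default) is treated as a protocol error, and protocol errors
// currently lead to disconnecting the peer.
fn check_request_len(request: &[u8], max_request_len: usize) -> Result<(), &'static str> {
    if request.len() > max_request_len {
        // Refuse to decode an oversized request.
        Err("block request exceeds maximum length")
    } else {
        Ok(())
    }
}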

@arkpar
Member

arkpar commented Apr 24, 2020

Please keep to existing style when adding log messages. In particular, each message should be capitalized. Existing messages with target: "sync" should be preserved, and not replaced with new messages with no target.
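
For illustration, the requested style looks like this; the message text and variables below are made up, only the capitalization and the existing target: "sync" convention come from the current code:

// Illustration only: capitalized message, existing `target: "sync"` preserved.
fn log_request(peer_id: &str, max_blocks: u32) {
    log::trace!(target: "sync", "Handling block request from {}: up to {} blocks", peer_id, max_blocks);
}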

@tomaka
Contributor Author

tomaka commented Apr 25, 2020

Looks like timeout handling logic is broken. Also, Peer::block_request should be set immediately upon queueing the request. This is broken too as far as I can see.

I've made sure that these fields were only ever used with the single substream. In other words, if use_legacy_network is true we use these fields (that's why I didn't remove them) and if it is false we never use them.

The logic of the timeout is handled directly by libp2p now, in the OneShotHandler. A timeout is a protocol error (which I agree isn't great if we have to debug that) and will result in a disconnect.
I'll improve the whole debugging story and add metrics next week.

In particular, each message should be capitalized. Existing messages with target: "sync" should be preserved, and not replaced with new messages with no target.

Will fix that too.

@arkpar
Member

arkpar commented Apr 25, 2020

So what's the timeout in seconds for block requests now and where is it set? I've observed that timeouts are never triggered and sync stalls.

I've made sure that these fields were only ever used with the single substream. In other words, if use_legacy_network is true we use these fields (that's why I didn't remove them) and if it is false we never use them.

What about the tracking of obsolete requests that these fields were handling? Is there an alternative in the new protocol? I could not find it.
