-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Extra timeout handling in block_requests #5794
Conversation
if let Some(connections) = self.peers.get_mut(&peer) { | ||
if let Some(connection) = connections.iter_mut().find(|c| c.id == connection_id) { | ||
if let Some(ongoing_request) = &mut connection.ongoing_request { | ||
if ongoing_request.request == original_request { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this compare IDs or all of the request data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It compares all the data, but that's basically just a bitfield of the requested fields, the source block hash or number, and the number of blocks.
outgoing: FuturesUnordered::new(), | ||
pending_events: VecDeque::new(), | ||
} | ||
} | ||
|
||
/// Issue a new block request. | ||
/// | ||
/// Cancels any existing request targeting the same `PeerId`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it safe to invalidate requests to a peer if more requests are sent to the same peer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not a matter of safety or correctness. It's a design decision by the sync code to only allow one request per node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From an API user point of view, previous requests are "cancelled", but internally we just forget about them. If a response comes and it doesn't match a known request, we discard that response.
It's been reported in #5760 that timeouts aren't working properly, which indeed seems the case.
There's indeed a problematic situation caused by the fact that we handle timeouts by disconnecting: even if a request times out, we will only kill that specific connection, but a different connection with the same peer might have been opened in the meanwhile (which would also be a logic explanation for the timeout happening). When that happens, we don't actually report a disconnect to the sync/protocol state machine (hacks on top of hacks).
This PR changes that mechanism by explicitly emitting an event when a request times out, and also an event when a request has been cancelled.
I've had planned to do this exact change anyway in order to add proper metrics to block requests. This will be done on top of this PR.
Additionally, responses to overridden requests are now discarded instead of being reported anyway.
Also changes the log target, as demanded