Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Network improvements #1036

Merged
merged 5 commits into from
Jan 15, 2021
Merged

Network improvements #1036

merged 5 commits into from
Jan 15, 2021

Conversation

holgerd77
Copy link
Member

No description provided.

@codecov
Copy link

codecov bot commented Jan 8, 2021

Codecov Report

Merging #1036 (c110b70) into master (5ac5bc7) will increase coverage by 0.19%.
The diff coverage is 87.09%.

Impacted file tree graph

Flag Coverage Δ
block 77.65% <ø> (-0.28%) ⬇️
blockchain 77.92% <ø> (ø)
client 88.25% <100.00%> (+0.29%) ⬆️
common 91.87% <ø> (-0.25%) ⬇️
devp2p 82.73% <85.71%> (+0.39%) ⬆️
ethash 82.08% <ø> (ø)
tx 90.15% <ø> (+0.15%) ⬆️
vm 83.09% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

@holgerd77 holgerd77 force-pushed the network-improvements branch 2 times, most recently from 3398978 to 4afa40f Compare January 8, 2021 18:44
@holgerd77
Copy link
Member Author

Ah, forgot a bit about this PR, this is actually ready for review already for some days. 😄

As mentioned yesterday in the call, the initial goal here was to improve network connection quality on a level that there is a reliable connection to non-mainnet networks possible, before it was rather not possible at all any more to connect to e.g. Rinkey or Goerli. I didn't quite get there but the networking quality is generally improved with the PR for all networks and also the non-mainnet networks got better and now it is - sometimes - possible to connect after 1-2 minutes. As mentioned in the call yesterday a way to improve from some other angle would be here to add DNS discovery support, will open a dedicated issue on that.

This PR does the following:

  • Some additional RLPx type improvements along 75d224b
  • Analogue to the latest discovery changes from Client Network Improvements #1031 the refill() calls for RLPx are subdivided to better distribute networking traffic, this perceptibly improves the connection 838d6c4
  • There was a bug in the client reporting a repeated first block number on executions (in case you already wondered, this was just a logging error), this has been fixed in 88e5f03
  • "devp2p -> connection reliability: fixed an error in DPT not properly banning old peers and replacing with a new peer on KBucket ping" (from the commit message): this is a super important bug fix, before the ban functionality for discovery for peers just didn't work due to a generic catch block shading the error messages. This lead to bad peers just being used for connections over and over again. Bad. There are some more of these generic catch blocks (so not containing any code) in the code base. It's a bit less trivial to fix though than one would think on first thought since this needs some thought on well-balancing how to instead handling potential errors, since there is some mix of errors which are just expected, some which one might want to know about and some which are severe. And we have to decide how to react properly (re-throw, emit an error event, display a debug message, do nothing) without polluting the client log e.g.. 06f31ea
  • There was another traffic cluster caused along neighbour retrieval from the DPT server in _onServerPeers, this has been smoothened in c110b70

Copy link
Contributor

@cgewecke cgewecke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me!

(Just have a question about the how the refill interval changes work.)

} else {
// Still keep peer in queue
return true
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, do you know why this technique improves connection reliability?

With this change the connection attempts are spaced out and for any given peer there's a 1 in 10 chance it will be made?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This distributes the networking traffic over time, so there's not all connections hammered out at once - then followed by a longer non-networking-activity - and subsequently many of the replies from the peers also coming in at a similar PIT (in the old version). Instead the traffic gets out and in more frequently and distributed. I am not 100% sure about the reliability impacts in a narrower sense, so e.g. if this directly prevents connection losses by avoiding concurrent network handling, my deeper networking experience is too limited for that. But this definitely leads to a healthier protocol flow - this long waiting period is avoided - becoming e.g. very apparent when the last peer connection gets lost and there is another 30 second wait - but there is always some activity/connection reattempts going on. I would also think that these distributes better on available protocol respectively implementation-given resource restrictions like the maximum number of allowed peer connections or the number of slots for the queue, but not completely sure here.

However benefit reasons are distributed: this is one of the changes which can be really "felt" when running the client, so the first connection pickup after start is just directly and apparently getting so much faster that it gets a bit out of question if there "might be" some positive effect. 😄

@holgerd77 holgerd77 merged commit 816e0bc into master Jan 15, 2021
@holgerd77 holgerd77 deleted the network-improvements branch January 15, 2021 08:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants