-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement the GDD governor #1015
Conversation
6846b81
to
d7c0a75
Compare
87f61c9
to
54a9604
Compare
e4fb67a
to
cd6818b
Compare
8ed9833
to
a31843a
Compare
} | ||
where | ||
canAwaitTimeout :: Maybe DiffTime | ||
canAwaitTimeout = shortWait -- 10s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor, this comment might get outdated if shortWait
changes. Would it make sense to use something like Just _10Seconds
or just the numeric literal. Alternatively, could shortWait
be a local definition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I instead removed the comment to avoid it going out of sync. shortWait
is used in various places, so knowing its value is already necessary in various places.
ouroboros-consensus-diffusion/test/consensus-test/Test/Consensus/PeerSimulator/BlockFetch.hs
Show resolved
Hide resolved
ouroboros-consensus-diffusion/test/consensus-test/Test/Consensus/PeerSimulator/Trace.hs
Show resolved
Hide resolved
ouroboros-consensus/src/ouroboros-consensus/Ouroboros/Consensus/Genesis/Governor.hs
Outdated
Show resolved
Hide resolved
...os-consensus-diffusion/test/consensus-test/Test/Consensus/Genesis/Tests/DensityDisconnect.hs
Show resolved
Hide resolved
pure Unit | ||
ReadGsmState -> do | ||
fmap ReadThisGsmState $ atomically $ readTVar varGsmState | ||
ReadMarker -> do | ||
fmap ReadThisMarker $ atomically $ readTVar varMarker | ||
StartIdling peer -> do | ||
atomically $ modifyTVar varIdlers $ Set.insert peer | ||
StartIdling peer -> atomically $ do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to guard against the case in which peer
is not in the map?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually part of the QSM precondition
👍
@@ -0,0 +1,7 @@ | |||
### Non-Breaking | |||
|
|||
- Set Genesis window for Byron and Shelley. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should either use past tense here, or present in the other fragments :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think "Set ..." is conveniently both present and past tense 😅
Co-authored-by: Nicolas BACQUEY <nicolas.bacquey@tweag.io>
This also adds the `GenesisWindow` to the `EraParams`, together with backwards-compatible serialisation. Co-authored-by: Facundo Domínguez <facundo.dominguez@tweag.io>
Co-authored-by: Torsten Schmits <git@tryp.io>
Co-authored-by: Facundo Domínguez <facundo.dominguez@tweag.io> Co-authored-by: Torsten Schmits <git@tryp.io>
Co-authored-by: Alexander Esgen <alexander.esgen@iohk.io>
Co-authored-by: Nicolas BACQUEY <nicolas.bacquey@tweag.io> Co-authored-by: Torsten Schmits <git@tryp.io>
696280a
to
cae9d98
Compare
The main contribution of this PR is the implementation of ChainSync Jumping (0e265c4), a mechanism to avoid overloading the honest peers in the network when one or more peers connect to them for syncing. The details of ChainSync Jumping are discussed in the [Jumping module](https://github.com/IntersectMBO/ouroboros-consensus/blob/78e16d0/ouroboros-consensus/src/ouroboros-consensus/Ouroboros/Consensus/MiniProtocol/ChainSync/Client/Jumping.hs). A range of supporting commits are provided in addition. ## More tests * The time limited leashing attack test is configured and enabled in 6c25310. This test is aimed at the GDD governor implemented in #1015. * A test to show that a node can resume syncing after being disconnected for a while has been implemented in 5d3c200. * There is a test checking that the GDD governor doesn't regret disconnecting nodes as more headers are known to it (dbbc3ab ). This was included in the previous milestone but we discovered later it wasn't checking as much as we intended. * In da7a6db and 3220e7d there is a test for ChainSync Jumping that ensures that a syncing node downloads headers from at most one peer when there is no disagreement between the peers. * There are various fixes to tests and documentation in d6f6ddf, b0ed1c8, d70e604 , cee466f, and dbbc3ab. ## GDD refinement The ability of the GDD governor to disconnect nodes has increased in dcb20f5. With respect to #1015, the GDD governor can now disconnect peers in the following scenarios: 1. when a peer has a chain that can't be denser than another chain served by another peer (before, we would disconnect the peer only if there was a denser chain); or 2. when a peer has a chain of density 0 and it claims to have more headers after the genesis window; or 3. when a peer sent a header for the last slot of the genesis window or later, and loses the density comparison to another peer (before, we would disconnect the peer only after it loses the density comparison to a peer that sent more than k headers in the genesis window). These changes help ChainSync Jumping make progress while downloading headers from only two disagreeing peers (even if both are adversarial). ## Other changes * A version of shrinking of honest schedules of messages has been implemented in 9794292. * 9572256 contains a fix to the implementation of Limit on Patience. * 915238e implements a recording tracer that doesn't affect scheduling in IOSim.
This PR improves the code documentation of Genesis, and implements some fixes and optimizations. ### Documentation, refactoring, and testing The code documentation of GDD, CSJ, and LoE is better connected in cb5b69a. Additionally, cd032b1 brings small changes to the LoE, motivated by issues found while integrating the LoE into the ChainDB q-s-m tests (see #1119). 2f7152b enables CSJ in all of the uniform tests and tests run more tries now. Shrinking of peer schedules has been simplified in 627634f, and 6b44bfb adds an extra field to the `PointSchedule` type to specify the time at the end of a test, allowing to clean up a lot of schedules. ### Profiling and optimizations We consider our baseline, the fastest possible syncing, to be the scenario in which the syncing peer connects to a caught up peer and syncs using Praos. From the profiling of Genesis on mainnet data, the GDD governor and chain selection was revealed as one of the major sinks of CPU time when enabling Genesis and syncing from multiple peers (between 2 and 30 peers). Chain selection has been made idempotent again, to save on useless recomputations in 82376e6. Idempotency had been lost in #1015. 64f3da3 also calls to change selection less often, by avoiding calls when the LoE fragment doesn't change in a significant way. The overhead of GDD is eliminated in a592c65 by replacing a frequent call to `AnchoredFragment.takeWhileOldest` with a faster variant. With these changes, the only significant overhead of syncing from multiple peers comes from the BlockFetch logic, which is the topic of ongoing work. ### CSJ tests The testing code has been adapted in a969b37 to allow to specify multiple honest peers in CSJ tests. CSJ tests are the first time we care of running multiple honest peers to check that header are downloaded from only one of them. We also added a test in 1d7fc95 to check that headers are downloaded from at most one peer even in the presence of adversarial peers. ### GSM integration In b525959 Limit on Eagerness is enabled only while GSM is in syncing state, and 83b12ee does similarly for LoP. f69fd1f makes possible to toggle genesis components with environment variables. This was helpful to measure the overheads of the individual components.
This PR builds upon #979 to provide the GDD governor, the component responsible for identifying and disconnecting peers offering sparser chains than the best. This has the effect of unblocking the Limit on Eagerness, since removing disagreeing peers allows the current selection to advance.
The GDD governor runs in a background thread and evaluates candidate chains whenever they change, or whenever a peer claims to have no more headers, or whenever a peer starts sending headers beyond the forecast horizon.
Whenever GDD disconnects peers, the chain selection is updated.
The implementation of the GDD governor is in commit 3fc6d28. A range of supporting changes is provided as well.
Peer simulator improvements in f02d8df include:
Post-GDD improvements in ea12e12 include:
Tests for the GDD in b8af4b4 include:
There is one other leashing attack test called "time limited leashing attack", which it is enabled in a follow up PR.
The classifiers added in cfefa6c keep track of whether peers are being disconnected by the GDD governor, and the LoP, and timeouts. It also keep tracks of adversaries who did a rollback of their chain. The commit also implements a check ensuring no peer is killed because of an unforeseen exception.
Chain generators implement the extended praos chain growth assumption (EPCG) in d8c6546. This causes the honest chain to have at least
k+1
blocks in every stability window. The former Praos Chain Growth assumption required at leastk
, and in practice chains with less thank+1
blocks in a stability window are still very rare. The EPCG assumption makes a difference to testing the GDD governor.We cannot let the almost caught up GDD governor disconnect honest peers which disagree on a few final blocks. Therefore, we ask the governor to require a disagreement of more than
k
blocks before deciding a disconnection. But asking for more thank
blocks without the EPCG assumption means that the GDD governor might not be able to make progress if thek+1
st block is after the first stability window after the intersection and after the forecast horizon. In this case, the candidate fragments wouldn't be extended with thek+1
st block, and the GDD governor wouldn't realize that there is a discrepancy of more thank
blocks.In c8468c2 we don't allow the chain selection to fork more than k blocks from the honest chain. The honest chain is considered to be the one containing the first intersection at which candidate fragments fork into different chains.
It used to be the case that we allowed the selection to choose up to k blocks after the intersection. But it turned out that if the selection tip is still X blocks behind the intersection, allowing chain selection to pick
X+k
blocks, there is no guarantee that the selection will only add blocks from the candidate fragments. It could select more thank
blocks of an older fork in the chain DB, and this is catastrophic because now the immutable tip would leave the honest chain.bf3f09a refactors and properly bundles the various state that a ChainSync client maintains.