Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement the GDD governor #1015

Merged
merged 9 commits into from
Apr 16, 2024
Merged

Conversation

facundominguez
Copy link
Contributor

@facundominguez facundominguez commented Mar 22, 2024

This PR builds upon #979 to provide the GDD governor, the component responsible for identifying and disconnecting peers offering sparser chains than the best. This has the effect of unblocking the Limit on Eagerness, since removing disagreeing peers allows the current selection to advance.

The GDD governor runs in a background thread and evaluates candidate chains whenever they change, or whenever a peer claims to have no more headers, or whenever a peer starts sending headers beyond the forecast horizon.

Whenever GDD disconnects peers, the chain selection is updated.

The implementation of the GDD governor is in commit 3fc6d28. A range of supporting changes is provided as well.


Peer simulator improvements in f02d8df include:

  • Our simulator of peers now kills all miniprotocol threads whenever one of the miniprotocols is terminated. The GDD governor might terminate a ChainSync client, and this termination should also cause the termination of the BlockFetch client.
  • Our tests now also check that no threads are left lingering when the peer simulator is terminated.
  • Traces now print a timestamp when events occur between the start of consecutive ticks. This need arose when Limit on Patience started disconnecting peers because of exhausted buckets, and these events could happen at any moment between ticks. See this comment of Introduce a time-aware tracer #990 for an example trace.

Post-GDD improvements in ea12e12 include:

  • Started tracing the information about BlockFetch timeouts and BlockFetch logic decisions. This information became relevant when we had to analyze why blocks wouldn't be downloaded from the honest peer, or why a quiet adversarial peer wasn't being disconnected.
  • Have the Limit on Eagerness recalculated when the change selection changes. Limit on Eagerness does not depend on the chain selection, but this mechanism has the desired side effect of retriggering chain selection in succession until no more blocks are added to it. This change is necessary after chain selection stopped being idempotent in 7ace045 (later on this same PR). This change is included here to demonstrate that the GDD governor prevents the node from being leashed, but otherwise we intend to recover idempotency of chain selection in a later PR.

Tests for the GDD in b8af4b4 include:

  • Added a new test to check that GDD actually disconnects nodes in small examples. This test is subsumed by the already existing test "serve adversarial branches", but it tries only small peer schedules which are easier to approach on test failures.
  • Enabled one of the leashing attack tests. This test was waiting for the GDD governor to be implemented. It generates an honest peer schedule for every peer, and then drops from the adversary schedules some ticks chosen at random. This has the effect of having the adversaries introduce delays to send headers or blocks that shouldn't stop the node from reaching the honest tip by the end of the test.

There is one other leashing attack test called "time limited leashing attack", which it is enabled in a follow up PR.


The classifiers added in cfefa6c keep track of whether peers are being disconnected by the GDD governor, and the LoP, and timeouts. It also keep tracks of adversaries who did a rollback of their chain. The commit also implements a check ensuring no peer is killed because of an unforeseen exception.


Chain generators implement the extended praos chain growth assumption (EPCG) in d8c6546. This causes the honest chain to have at least k+1 blocks in every stability window. The former Praos Chain Growth assumption required at least k, and in practice chains with less than k+1 blocks in a stability window are still very rare. The EPCG assumption makes a difference to testing the GDD governor.

We cannot let the almost caught up GDD governor disconnect honest peers which disagree on a few final blocks. Therefore, we ask the governor to require a disagreement of more than k blocks before deciding a disconnection. But asking for more than k blocks without the EPCG assumption means that the GDD governor might not be able to make progress if the k+1st block is after the first stability window after the intersection and after the forecast horizon. In this case, the candidate fragments wouldn't be extended with the k+1st block, and the GDD governor wouldn't realize that there is a discrepancy of more than k blocks.


In c8468c2 we don't allow the chain selection to fork more than k blocks from the honest chain. The honest chain is considered to be the one containing the first intersection at which candidate fragments fork into different chains.

It used to be the case that we allowed the selection to choose up to k blocks after the intersection. But it turned out that if the selection tip is still X blocks behind the intersection, allowing chain selection to pick X+k blocks, there is no guarantee that the selection will only add blocks from the candidate fragments. It could select more than k blocks of an older fork in the chain DB, and this is catastrophic because now the immutable tip would leave the honest chain.


bf3f09a refactors and properly bundles the various state that a ChainSync client maintains.

@facundominguez facundominguez added the Genesis PRs related to Genesis testing and implementation label Mar 22, 2024
@Niols Niols force-pushed the genesis/milestone-7-9-10-squashed branch 5 times, most recently from 6846b81 to d7c0a75 Compare March 26, 2024 13:30
Base automatically changed from genesis/milestone-7-9-10-squashed to main March 27, 2024 12:20
@Niols Niols force-pushed the genesis/milestone-11-gdd-governor-squashed branch 2 times, most recently from 87f61c9 to 54a9604 Compare March 27, 2024 16:50
@nbacquey nbacquey force-pushed the genesis/milestone-11-gdd-governor-squashed branch from e4fb67a to cd6818b Compare March 28, 2024 14:07
@amesgen amesgen force-pushed the genesis/milestone-11-gdd-governor-squashed branch from 8ed9833 to a31843a Compare March 28, 2024 15:27
@amesgen amesgen marked this pull request as ready for review April 3, 2024 11:34
@amesgen amesgen requested a review from a team as a code owner April 3, 2024 11:34
}
where
canAwaitTimeout :: Maybe DiffTime
canAwaitTimeout = shortWait -- 10s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, this comment might get outdated if shortWait changes. Would it make sense to use something like Just _10Seconds or just the numeric literal. Alternatively, could shortWait be a local definition?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I instead removed the comment to avoid it going out of sync. shortWait is used in various places, so knowing its value is already necessary in various places.

pure Unit
ReadGsmState -> do
fmap ReadThisGsmState $ atomically $ readTVar varGsmState
ReadMarker -> do
fmap ReadThisMarker $ atomically $ readTVar varMarker
StartIdling peer -> do
atomically $ modifyTVar varIdlers $ Set.insert peer
StartIdling peer -> atomically $ do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to guard against the case in which peer is not in the map?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually part of the QSM precondition 👍

@@ -0,0 +1,7 @@
### Non-Breaking

- Set Genesis window for Byron and Shelley.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should either use past tense here, or present in the other fragments :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "Set ..." is conveniently both present and past tense 😅

Niols and others added 9 commits April 15, 2024 17:05
Co-authored-by: Nicolas BACQUEY <nicolas.bacquey@tweag.io>
This also adds the `GenesisWindow` to the `EraParams`, together with
backwards-compatible serialisation.

Co-authored-by: Facundo Domínguez <facundo.dominguez@tweag.io>
Co-authored-by: Torsten Schmits <git@tryp.io>
Co-authored-by: Facundo Domínguez <facundo.dominguez@tweag.io>
Co-authored-by: Torsten Schmits <git@tryp.io>
Co-authored-by: Alexander Esgen <alexander.esgen@iohk.io>
Co-authored-by: Nicolas BACQUEY <nicolas.bacquey@tweag.io>
Co-authored-by: Torsten Schmits <git@tryp.io>
@amesgen amesgen force-pushed the genesis/milestone-11-gdd-governor-squashed branch from 696280a to cae9d98 Compare April 15, 2024 15:06
@amesgen amesgen added this pull request to the merge queue Apr 16, 2024
Merged via the queue into main with commit 505a33f Apr 16, 2024
15 checks passed
@amesgen amesgen deleted the genesis/milestone-11-gdd-governor-squashed branch April 16, 2024 10:46
github-merge-queue bot pushed a commit that referenced this pull request May 29, 2024
The main contribution of this PR is the implementation of ChainSync
Jumping (0e265c4), a mechanism to avoid overloading the honest peers in
the network when one or more peers connect to them for syncing. The
details of ChainSync Jumping are discussed in the [Jumping
module](https://github.com/IntersectMBO/ouroboros-consensus/blob/78e16d0/ouroboros-consensus/src/ouroboros-consensus/Ouroboros/Consensus/MiniProtocol/ChainSync/Client/Jumping.hs).

A range of supporting commits are provided in addition.

## More tests

* The time limited leashing attack test is configured and enabled in
6c25310. This test is aimed at the GDD governor implemented in #1015.
* A test to show that a node can resume syncing after being disconnected
for a while has been implemented in 5d3c200.
* There is a test checking that the GDD governor doesn't regret
disconnecting nodes as more headers are known to it (dbbc3ab ). This was
included in the previous milestone but we discovered later it wasn't
checking as much as we intended.
* In da7a6db and 3220e7d there is a test for ChainSync Jumping that
ensures that a syncing node downloads headers from at most one peer when
there is no disagreement between the peers.
* There are various fixes to tests and documentation in d6f6ddf,
b0ed1c8, d70e604 , cee466f, and dbbc3ab.

## GDD refinement

The ability of the GDD governor to disconnect nodes has increased in
dcb20f5. With respect to #1015, the GDD governor can now disconnect
peers in the following scenarios:
1. when a peer has a chain that can't be denser than another chain
served by another peer (before, we would disconnect the peer only if
there was a denser chain); or
2. when a peer has a chain of density 0 and it claims to have more
headers after the genesis window; or
3. when a peer sent a header for the last slot of the genesis window or
later, and loses the density comparison to another peer (before, we
would disconnect the peer only after it loses the density comparison to
a peer that sent more than k headers in the genesis window).

These changes help ChainSync Jumping make progress while downloading
headers from only two disagreeing peers (even if both are adversarial).

## Other changes

* A version of shrinking of honest schedules of messages has been
implemented in 9794292.
* 9572256 contains a fix to the implementation of Limit on Patience.
* 915238e implements a recording tracer that doesn't affect scheduling
in IOSim.
github-merge-queue bot pushed a commit that referenced this pull request Jul 11, 2024
This PR improves the code documentation of Genesis, and implements some
fixes and optimizations.

### Documentation, refactoring, and testing

The code documentation of GDD, CSJ, and LoE is better connected in
cb5b69a.

Additionally, cd032b1 brings small changes to the LoE, motivated by
issues found while integrating the LoE into the ChainDB q-s-m tests (see
#1119).

2f7152b enables CSJ in all of the uniform tests and tests run more tries
now.

Shrinking of peer schedules has been simplified in 627634f, and 6b44bfb
adds an extra field to the `PointSchedule` type to specify the time at
the end of a test, allowing to clean up a lot of schedules.

### Profiling and optimizations

We consider our baseline, the fastest possible syncing, to be the
scenario in which the syncing peer connects to a caught up peer and
syncs using Praos.

From the profiling of Genesis on mainnet data, the GDD governor and
chain selection was revealed as one of the major sinks of CPU time when
enabling Genesis and syncing from multiple peers (between 2 and 30
peers).

Chain selection has been made idempotent again, to save on useless
recomputations in 82376e6. Idempotency had been lost in #1015. 64f3da3
also calls to change selection less often, by avoiding calls when the
LoE fragment doesn't change in a significant way.

The overhead of GDD is eliminated in a592c65 by replacing a frequent
call to `AnchoredFragment.takeWhileOldest` with a faster variant.

With these changes, the only significant overhead of syncing from
multiple peers comes from the BlockFetch logic, which is the topic of
ongoing work.

### CSJ tests

The testing code has been adapted in a969b37 to allow to specify
multiple honest peers in CSJ tests. CSJ tests are the first time we care
of running multiple honest peers to check that header are downloaded
from only one of them.

We also added a test in 1d7fc95 to check that headers are downloaded
from at most one peer even in the presence of adversarial peers.

### GSM integration

In b525959 Limit on Eagerness is enabled only while GSM is in syncing
state, and 83b12ee does similarly for LoP.

f69fd1f makes possible to toggle genesis components with environment
variables. This was helpful to measure the overheads of the individual
components.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Genesis PRs related to Genesis testing and implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants