Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Genesis milestone 13: documentation / performance optimization #1113

Merged
merged 37 commits into from
Jul 11, 2024

Conversation

nbacquey
Copy link
Contributor

@nbacquey nbacquey commented May 21, 2024

This PR improves the code documentation of Genesis, and implements some fixes and optimizations.

Documentation, refactoring, and testing

The code documentation of GDD, CSJ, and LoE is better connected in cb5b69a.

Additionally, cd032b1 brings small changes to the LoE, motivated by issues found while integrating the LoE into the ChainDB q-s-m tests (see #1119).

2f7152b enables CSJ in all of the uniform tests and tests run more tries now.

Shrinking of peer schedules has been simplified in 627634f, and 6b44bfb adds an extra field to the PointSchedule type to specify the time at the end of a test, allowing to clean up a lot of schedules.

Profiling and optimizations

We consider our baseline, the fastest possible syncing, to be the scenario in which the syncing peer connects to a caught up peer and syncs using Praos.

From the profiling of Genesis on mainnet data, the GDD governor and chain selection was revealed as one of the major sinks of CPU time when enabling Genesis and syncing from multiple peers (between 2 and 30 peers).

Chain selection has been made idempotent again, to save on useless recomputations in 82376e6. Idempotency had been lost in #1015. 64f3da3 also calls to change selection less often, by avoiding calls when the LoE fragment doesn't change in a significant way.

The overhead of GDD is eliminated in a592c65 by replacing a frequent call to AnchoredFragment.takeWhileOldest with a faster variant.

With these changes, the only significant overhead of syncing from multiple peers comes from the BlockFetch logic, which is the topic of ongoing work.

CSJ tests

The testing code has been adapted in a969b37 to allow to specify multiple honest peers in CSJ tests. CSJ tests are the first time we care of running multiple honest peers to check that header are downloaded from only one of them.

We also added a test in 1d7fc95 to check that headers are downloaded from at most one peer even in the presence of adversarial peers.

GSM integration

In b525959 Limit on Eagerness is enabled only while GSM is in syncing state, and 83b12ee does similarly for LoP.

f69fd1f makes possible to toggle genesis components with environment variables. This was helpful to measure the overheads of the individual components.

@nbacquey nbacquey added documentation Improvements or additions to documentation do-not-merge Genesis PRs related to Genesis testing and implementation labels May 21, 2024
@nbacquey nbacquey changed the base branch from main to genesis/milestone-12-chain-sync-jumping May 21, 2024 13:12
@nbacquey nbacquey force-pushed the genesis/milestone-12-chain-sync-jumping branch from 980c35a to ce3b2c6 Compare May 29, 2024 15:37
Base automatically changed from genesis/milestone-12-chain-sync-jumping to main May 29, 2024 18:14
@facundominguez facundominguez force-pushed the genesis/milestone-13 branch 5 times, most recently from db11bc1 to 2acc4b3 Compare May 30, 2024 19:23
@Niols Niols marked this pull request as ready for review June 10, 2024 15:42
@Niols Niols requested a review from a team as a code owner June 10, 2024 15:42
@dnadales dnadales self-assigned this Jun 14, 2024
Copy link
Contributor

@jasagredo jasagredo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments here and there. Also:

  • there are hlint warnings, please fix them before merge
  • There are commits which are Unverified, I doubt branch protection rules will allow you to merge this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this file will result in the same as the ResourceRegistry: we have an isolated component that does not depend on consensus, then it becomes difficult (in terms of bureaucracy) extracting it. I wonder if this should be a separate library.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all of our abstractions initially started in Consensus (eg the Fuse you added most recently); but we could change that and put it at least into a different library. Is it fine to do that later in a separate PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't the GSM and GsmState live under a Node/Genesis/. directory?

Copy link
Member

@amesgen amesgen Jul 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -247,7 +253,7 @@ initNodeKernel args@NodeKernelArgs { registry, cfg, tracers
, GSM.setCaughtUpPersistentMark = \upd ->
(if upd then GSM.touchMarkerFile else GSM.removeMarkerFile)
gsmMarkerFileView
, GSM.writeGsmState = \gsmState ->
, GSM.writeGsmState = \gsmState -> do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unneeded?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this about the extraneous do at the end of the line?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
, GSM.writeGsmState = \gsmState -> do
, GSM.writeGsmState = \gsmState ->

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in general a bit concerned that Genesis seems to permeate into ouroboros-consensus and ouroboros-consensus-diffusion with subtle subtleties on which concepts go into each package

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, partly, I think this is because we don't really have a great distinction already in main (eg why does the forging loop live in ouroboros-consensus-diffusion?). We might think about this and shuffle some things around, but probably in a later PR in case that is fine.

Copy link
Member

@dnadales dnadales left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Co-authored with @jasagredo

@@ -194,8 +237,14 @@ data DensityBounds blk =
-- ChainSync instruction the peer sent, and whether the peer is idling (i.e. it
-- sent @MsgAwaitReply@).
--
-- @loeFrag@ is the fragment from the immutable tip to the first intersection
-- with a candidate fragment.
-- @loeFrag@ is the fragment anchored at the immutable tip and ending in the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, variable loeFrag is not visible in the resulting haddock.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is introduced in the first line of this comment.

-- | @densityDisconnect genWin k states candidateSuffixes loeFrag@                                                                                                                            
-- yields the list of peers which are known to lose the density comparison with                                                                                                               
-- any other peer, when looking at the genesis window after @loeFrag@.
-- ...

Let me know if that doesn't work still.

--
-- This type indicates whether the feature is enabled, and contains a value
-- if it is.
-- This type indicates whether LoE is enabled, and contains a value if it is.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we explain what this value represents and give some examples?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The meaning of a depends on the context on which the LoE is used. There is no intended use at the definition.

LoE a is isomorphic to Maybe a, with the added meaning that Just is only used if LoE is enabled. I elaborated this point in 20d86b3.

@@ -649,7 +648,7 @@ chainSelectionForBlock cdb@CDB{..} blockCache hdr punish = electric $ do
-- ^ The current chain and ledger
-> LoE (AnchoredFragment (Header blk))
-- ^ LoE fragment
-> LoELimit
-> LoE Word64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which cases will LoE be parametrized with other types?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example here:

type GetLoEFragment m blk = m (LoE (AnchoredFragment (Header blk)))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this file not part of ouroboros-consensus like the GSM state?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a good answer (note that the GSM was introduced in #808, not in this PR), also see #1113 (comment)

-- Evaluating the GDD rule might cause peers to be disconnected if they have
-- sparser chains than the best chain.
--
-- The LoE fragment is the fragment anchored at the immutable tip and ending at
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for the reference to LoE.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bit of the comment has been removed, but there is also a reference for LoE in the module header.

import Ouroboros.Network.AnchoredFragment (AnchoredFragment)
import qualified Ouroboros.Network.AnchoredFragment as AF

-- We have multiple other Genesis-related types of a similar shape ('LoE', LoP
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a REVIEW or TODO keyword maybe?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment has been removed.

, gcsCSJConfig :: !CSJConfig
}

-- TODO justification/derivation from other parameters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When do we want to address this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably after "real world" performance benchmarks on resource-limited machines. For now, the values are conservative; tuning them could be one part of a phased rollout of Genesis.

@@ -0,0 +1,4 @@
### Breaking

- Internal refactorings in the CSJ and GDD.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use the same grammatical form across all changelogs (at least within the same PR :))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, addressed 👍

Niols and others added 13 commits July 10, 2024 09:32
This also involves some fairly important changes to the leaky bucket and
to the leaky bucket API. These changes make the leaky bucket
significantly more robust.

Co-authored-by: Alexander Esgen <alexander.esgen@iohk.io>
Also:

* Simplify chainDiffs computation in chainSelectionForBlock
* Revert "Don't allow to extend the selection by more than k blocks"
  This reverts commit c8468c2.
* Revert "Retrigger GDD when the selection changes"
  This reverts commit f990d41.
* Rewrite `Peers` to accept arbitrary number of peers
* Actually generate honest peers in CSJ happy path
* Support a field for extra honest peers in `GenesisTest`
* Allow `uniformPoints` to generate schedules with multiple honest peers
* Adapt CSJ test to use native multiple honest peers generation
* Share partial accessor functions used in tests
* Use partial accessor to retrieve the only honest peer
Also:
* Remove ill-defined LoELimit
* Remove unused updateLoEFragStall
* Rename runGdd to runGDDGovernor
* Print GDD in traces instead of GDG
* Remove the UpdateLoEFrag callback
* Reorder functions in the Governor module
…' and disable timeouts as they are supposed to be
Also:
* LoEEnabled: make payload strict to avoid NoThunks failures
* LoE: allow to dynamically en-/disable
* Depending on the state of the GSM, we want to either en- or disable the LoE.
* Refactor `runGdd` to use `Watcher`, share trigger logic
  Previously, we would duplicate the logic for when to trigger the GDD between the
  NodeKernel and the peer simulator.
* Add GDD tracing
  The `Show` instances are probably way too large ATM, but there are currently
  unused, so that isn't a pressing concern.
* LoP rate: fix typo
  500/s = 1/2ms, not 2/ms 🤦
Previously, it wasn't possible to eg run *just* CSJ.
Modify the honest shrinking function by no longer speeding up the other
schedules when an honest tick is deleted. This simplifies a lot of code
in the `Shrinking` module, at the cost of no longer ensuring that shrunk
schedules preserve the overall order of events.

Additionally, we re-enable shrinking in CSJ tests

Also:
* Extend adversarial schedules when shrinking an honest one
* Document the cases when we don't use shrinking
@facundominguez
Copy link
Contributor

I rebased main.

@facundominguez
Copy link
Contributor

I'm labeling "no changelog" since I don't think the changes to ouroboros-consensus-cardano deserve a changelog entry in this PR. But let me know if something else should be done about it.

@facundominguez facundominguez added this pull request to the merge queue Jul 11, 2024
Merged via the queue into main with commit 1fb0c16 Jul 11, 2024
15 checks passed
@facundominguez facundominguez deleted the genesis/milestone-13 branch July 11, 2024 19:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge documentation Improvements or additions to documentation Genesis PRs related to Genesis testing and implementation no changelog
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

6 participants