Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add LoE to the ChainDB QSM test #1119

Merged
merged 2 commits into from
Aug 30, 2024
Merged

Add LoE to the ChainDB QSM test #1119

merged 2 commits into from
Aug 30, 2024

Conversation

Niols
Copy link
Contributor

@Niols Niols commented May 27, 2024

Closes #541

This PR adds a new instruction to the ChainDB q-s-m test, which updates the Limit on Eagerness (LoE) fragment and retriggers chain selection. This requires a model implementation of the LoE.

As a preparatory change, we add a way to trigger chain selection synchronously (ie such that we can wait for it to finish), which is only used in tests.

@Niols Niols added the Genesis PRs related to Genesis testing and implementation label May 27, 2024
Copy link
Member

@amesgen amesgen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test failure in getChain_addChain makes sense: The test is making sure that for every chain bc, adding its blocks to an empty model causes the selection to be bc. With the LoE enabled and at Genesis (as is the case in the counterexamples), this of course fails when the chain is sufficiently long.

One way to fix this would be to discard these test cases, or even just disable the LoE completely for them.


The failure of alwaysPickPreferredChain seems to be for a very similar reason.

@Niols
Copy link
Contributor Author

Niols commented May 29, 2024

The test failure in getChain_addChain makes sense: The test is making sure that for every chain bc, adding its blocks to an empty model causes the selection to be bc. With the LoE enabled and at Genesis (as is the case in the counterexamples), this of course fails when the chain is sufficiently long.

One way to fix this would be to discard these test cases, or even just disable the LoE completely for them.

The failure of alwaysPickPreferredChain seems to be for a very similar reason.

Indeed, it was as simple as that. Fixed in 660b331. Could you check it, and in particular tell me if the comment of alwaysPickPreferredChain is correct? (The one for getChain_addChain is exactly what you wrote.)

@Niols Niols mentioned this pull request May 29, 2024
@Niols Niols force-pushed the niols/loe-chaindb-qsm branch 2 times, most recently from 80db2f9 to 2da9483 Compare May 29, 2024 13:38
@Niols
Copy link
Contributor Author

Niols commented May 29, 2024

Now rebased on top of #1125.

@Niols Niols changed the base branch from genesis/milestone-13 to niols/fix-loe May 29, 2024 13:39
@Niols
Copy link
Contributor Author

Niols commented May 30, 2024

After 14 hours of tests, I find a counter-example to this PR. Reproducible on branch niols/loe-chaindb-qsm at commit 2da9483 with:

cabal run ouroboros-consensus:storage-test -- \
  -p '/ChainDB q-s-m/' \
  --quickcheck-tests 500_000 \
  --quickcheck-replay="(SMGen 16524770743761658109 16657631389868279761,58)"

@Niols
Copy link
Contributor Author

Niols commented May 30, 2024

Actually, upon inspecting this bug, it turns out that this failure is not LoE-related. The LoE is very much disabled and it looks like this has to do with GCing or streaming. I don't exactly know what this should do, but the behaviour of the model looks wrong to me.

@Niols
Copy link
Contributor Author

Niols commented May 30, 2024

Yup, I can reproduce with the same seed on genesis/milestone-13 and main with an additional:

+source-repository-package
+  type: git
+  location: https://github.com/UnkindPartition/tasty
+  tag: dcbf32078133aa2b569774c417cc49a49f5c573b
+  subdir: quickcheck
+  --sha256: 0rwfcv0iq50ckzdgdssgkil6g9n76w1k6yxs0q4fdrkf5nlh4vv5

in cabal.project. I don't understand why I manage to reproduce when the code of the test changed so much, but oh well.

@Niols Niols force-pushed the niols/loe-chaindb-qsm branch 2 times, most recently from 41085a1 to 41d8a9d Compare May 31, 2024 14:54
@Niols
Copy link
Contributor Author

Niols commented May 31, 2024

As noted by @amesgen, this is probably #299.

@Niols Niols force-pushed the niols/fix-loe branch 2 times, most recently from e85da74 to cd032b1 Compare June 4, 2024 12:08
Base automatically changed from niols/fix-loe to genesis/milestone-13 June 4, 2024 12:08
Copy link
Member

@amesgen amesgen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice to finally get this

@facundominguez facundominguez force-pushed the genesis/milestone-13 branch 2 times, most recently from 5680c0b to 53ea76c Compare July 11, 2024 17:28
github-merge-queue bot pushed a commit that referenced this pull request Jul 11, 2024
This PR improves the code documentation of Genesis, and implements some
fixes and optimizations.

### Documentation, refactoring, and testing

The code documentation of GDD, CSJ, and LoE is better connected in
cb5b69a.

Additionally, cd032b1 brings small changes to the LoE, motivated by
issues found while integrating the LoE into the ChainDB q-s-m tests (see
#1119).

2f7152b enables CSJ in all of the uniform tests and tests run more tries
now.

Shrinking of peer schedules has been simplified in 627634f, and 6b44bfb
adds an extra field to the `PointSchedule` type to specify the time at
the end of a test, allowing to clean up a lot of schedules.

### Profiling and optimizations

We consider our baseline, the fastest possible syncing, to be the
scenario in which the syncing peer connects to a caught up peer and
syncs using Praos.

From the profiling of Genesis on mainnet data, the GDD governor and
chain selection was revealed as one of the major sinks of CPU time when
enabling Genesis and syncing from multiple peers (between 2 and 30
peers).

Chain selection has been made idempotent again, to save on useless
recomputations in 82376e6. Idempotency had been lost in #1015. 64f3da3
also calls to change selection less often, by avoiding calls when the
LoE fragment doesn't change in a significant way.

The overhead of GDD is eliminated in a592c65 by replacing a frequent
call to `AnchoredFragment.takeWhileOldest` with a faster variant.

With these changes, the only significant overhead of syncing from
multiple peers comes from the BlockFetch logic, which is the topic of
ongoing work.

### CSJ tests

The testing code has been adapted in a969b37 to allow to specify
multiple honest peers in CSJ tests. CSJ tests are the first time we care
of running multiple honest peers to check that header are downloaded
from only one of them.

We also added a test in 1d7fc95 to check that headers are downloaded
from at most one peer even in the presence of adversarial peers.

### GSM integration

In b525959 Limit on Eagerness is enabled only while GSM is in syncing
state, and 83b12ee does similarly for LoP.

f69fd1f makes possible to toggle genesis components with environment
variables. This was helpful to measure the overheads of the individual
components.
Base automatically changed from genesis/milestone-13 to main July 11, 2024 19:12
@amesgen amesgen force-pushed the niols/loe-chaindb-qsm branch 2 times, most recently from cd0f027 to 15a9be3 Compare July 25, 2024 08:17
This is useful for tests.

Co-authored-by: Nicolas “Niols” Jeannerod <nicolas.jeannerod@tweag.io>
@amesgen amesgen marked this pull request as ready for review August 9, 2024 13:55
Copy link
Member

@dnadales dnadales left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but we should wait for the synchronous review during a video call.

, maxClockSkew = maxClockSkew m
, isOpen = True
}
addBlock cfg blk m =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would a pattern match on ignoreBlock be more readable than an if?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I condensed this code a bit 👍

diff --git a/ouroboros-consensus/test/storage-test/Test/Ouroboros/Storage/ChainDB/Model.hs b/ouroboros-consensus/test/storage-test/Test/Ouroboros/Storage/ChainDB/Model.hs
index 2c8f5c7eef..05f7cabe74 100644
--- a/ouroboros-consensus/test/storage-test/Test/Ouroboros/Storage/ChainDB/Model.hs
+++ b/ouroboros-consensus/test/storage-test/Test/Ouroboros/Storage/ChainDB/Model.hs
@@ -402,14 +402,10 @@ addBlock :: forall blk. LedgerSupportsProtocol blk
          -> blk
          -> Model blk -> Model blk
 addBlock cfg blk m =
-  if ignoreBlock
-    then m
-    else
-      chainSelection cfg $
-        m
-          { volatileDbBlocks =
-              Map.insert (blockHash blk) blk (volatileDbBlocks m)
-          }
+  | ignoreBlock = m
+  | otherwise   = chainSelection cfg m {
+        volatileDbBlocks = Map.insert (blockHash blk) blk (volatileDbBlocks m)
+      }
   where
     secParam = configSecurityParam cfg
     immBlockNo = immutableBlockNo secParam m

@@ -890,6 +909,10 @@ generator genBlock m@Model {..} = At <$> frequency
, (if empty then 1 else 10, return GetMaxSlotNo)
, (if empty then 1 else 10, genGetIsValid)

, (case loe of
LoEDisabled -> (0, return $ UpdateLoE $ AF.Empty AF.AnchorGenesis)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't setting the frequency to 0 be equivalent to not generating this case at all?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we just use 0 here for convenience as we are inside of a list literal here; a similar pattern is used in a few other places; I slightly refactored this to at least not use a bogus value on the right.

diff --git a/ouroboros-consensus/test/storage-test/Test/Ouroboros/Storage/ChainDB/StateMachine.hs b/ouroboros-consensus/test/storage-test/Test/Ouroboros/Storage/ChainDB/StateMachine.hs
index 510ebac6df..184f018a4b 100644
--- a/ouroboros-consensus/test/storage-test/Test/Ouroboros/Storage/ChainDB/StateMachine.hs
+++ b/ouroboros-consensus/test/storage-test/Test/Ouroboros/Storage/ChainDB/StateMachine.hs
@@ -909,9 +909,10 @@ generator loe genBlock m@Model {..} = At <$> frequency
     , (if empty then 1 else 10, return GetMaxSlotNo)
     , (if empty then 1 else 10, genGetIsValid)
 
-    , (case loe of
-         LoEDisabled   -> (0, return $ UpdateLoE $ AF.Empty AF.AnchorGenesis)
-         LoEEnabled () -> (if empty then 1 else 10, UpdateLoE <$> genLoEFragment))
+    , let freq = case loe of
+            LoEDisabled   -> 0
+            LoEEnabled () -> if empty then 1 else 10
+      in (freq, UpdateLoE <$> genLoEFragment)
 
     -- Iterators
     , (if empty then 1 else 10, uncurry Stream <$> genBounds)

@@ -0,0 +1,3 @@
### Breaking

- ChainDB: allow to trigger chain selection synchronously
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we update the PR description accordingly?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and maybe provide a rationale on how this change relates to adding LoE to the ChainDB tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, updated the PR description. The reason for adding this is just testing (also see the first commit message).

Co-authored-by: Nicolas “Niols” Jeannerod <nicolas.jeannerod@tweag.io>
@amesgen amesgen enabled auto-merge August 30, 2024 15:17
@amesgen amesgen added this pull request to the merge queue Aug 30, 2024
Merged via the queue into main with commit 0cae17c Aug 30, 2024
15 checks passed
@amesgen amesgen deleted the niols/loe-chaindb-qsm branch August 30, 2024 17:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Genesis PRs related to Genesis testing and implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Test the LoE in ChainSel
3 participants