Fix `GetIsValid` bugs (once and for all?) #279

jorisdral · 2023-08-07T16:01:05Z

Description

Closes #123.

This PR changes two parts of the ChainDB model:

We reinstate that the model knows about all valid and invalid blocks for all for all chains that can be constructed using blocks in the model immutable DB and model volatile DB. That is, the valid and invalid fields of the model are now updated more rigorously.
We simplify model chain selection by not approximating in which order candidate chains are considered by real chain selection.

This has two effects:

The fact that the model knows about all valid and invalid blocks is not a problem for the GetIsValid command. The model might know in advance which blocks are valid/invalid even before the SUT knows about them, but the only condition we require for test success in the case of the GetIsValid command is: if the SUT says that a block is valid or invalid, then the model should have the same response. Conversely, if the SUT says it does not know if a block is valid or invalid, then the model's response is ignored (or in other words, it can be anything).
Before this PR, the model already knew about all valid and invalid blocks of candidate fragments before running chain selection. In contrast, the SUT only finds out about valid/invalid blocks on the fly during real chain selection as it validates blocks on chain fragments that haven't been considered before. Known invalid blocks impact the sorting of considered fragments in both model and SUT chain selection. Since the SUT knows about fewer invalid blocks than the model, model and SUT chain selection might consider chain fragments in a different order. However: if the SUT chain selection finds new invalid blocks, then real considered fragments are reduced to valid prefixes and re-sorted into the list of considered chain fragments. The result of model and SUT chain selection is the same regardless: the model merely skips these "re-sorting" steps because it already knows about all invalid blocks. In other words, the model performs "best-case" chain selection where all invalid blocks are always known, whereas the SUT does not. Still, the fact that the SUT knows about some (not necessarily all) invalid blocks serves as an optimisation because longer chains containing invalid blocks can be reduced to valid prefixes before running real chain selection. Note also that knowledge about invalid blocks should not (and does not) influence which chain is eventually selected. Otherwise, chain selection would be non-deterministic.

Future test failures

Assuming that above arguments are correct, if we find a GetIsValid failure again in the future, the first thing to check would be if we are somehow not recording some valid/invalid blocks. Model chain selection should not have to change.

Logs

Known failing test seeds were used to rerun tests, and the tests pass.

ouroboros-consensus#123

Recreate test failure

... $ git checkout 439a15f
... $ cabal run storage-test -- -p 'ChainDB q-s-m' --quickcheck-replay=616511
ouroboros-storage
  Storage
    ChainDB
      ChainDB q-s-m
        sequential: 
          ...

          PostconditionFailed "AnnotateC \"real response didn't match model response\" (PredicateC (Resp {getResp = Right (IsValid (IsValidResult {real = True, isValid = Just True}))} :/= Resp {getResp = Right (IsValid (IsValidResult {real = False, isValid = Nothing}))}))" /= Ok
          Use --quickcheck-replay=616511 to reproduce.

1 out of 1 tests failed (9.74s)

... $ cabal run storage-test -- -p 'ChainDB q-s-m' --quickcheck-replay=616511
ouroboros-storage
  Storage
    ChainDB
      ChainDB q-s-m
        sequential: OK (78.88s)
          +++ OK, passed 100000 tests.
          
         ...

All 1 tests passed (78.89s)

ouroboros-network#3689

ouroboros-network-3689.patch

diff --git a/ouroboros-consensus/test/storage-test/Test/Ouroboros/Storage/ChainDB/Model.hs b/ouroboros-consensus/test/storage-test/Test/Ouroboros/Storage/ChainDB/Model.hs
index ffc17f020..b1c944ef7 100644
--- a/ouroboros-consensus/test/storage-test/Test/Ouroboros/Storage/ChainDB/Model.hs
+++ b/ouroboros-consensus/test/storage-test/Test/Ouroboros/Storage/ChainDB/Model.hs
@@ -873,12 +873,8 @@ validChains cfg m bs =
     -- first, which results in the valid chain A->B', which is then chosen
     -- over the equally preferable A->B as it will be the first in the list
     -- after a stable sort.
-    sortChains $ pruneKnownInvalid <$> chains bs
+    sortChains $ chains bs
   where
-    pruneKnownInvalid chain =
-      -- We don't want known invalid blocks to influence the sorting of candidate chains.
-      Chain.takeWhile (\b -> blockHash b `Map.notMember` (invalid m)) chain
-
     sortChains :: [Chain blk] -> [Chain blk]
     sortChains =
       sortBy $ flip (

Recreate test failure

... $ git checkout 439a15f
... $ git apply ouroboros-network-3689.patch
... $ cabal run storage-test -- -p 'ChainDB q-s-m' --quickcheck-replay=455411
ouroboros-storage
  Storage
    ChainDB
      ChainDB q-s-m
        sequential: 
          ...

          PostconditionFailed "AnnotateC \"real response didn't match model response\" (PredicateC (Resp {getResp = Right (IsValid (IsValidResult {real = True, isValid = Just True}))} :/= Resp {getResp = Right (IsValid (IsValidResult {real = False, isValid = Nothing}))}))" /= Ok
          Use --quickcheck-replay=455411 to reproduce.

1 out of 1 tests failed (39.23s)

... $ cabal run storage-test -- -p 'ChainDB q-s-m' --quickcheck-replay=455411
ouroboros-storage
  Storage
    ChainDB
      ChainDB q-s-m
        sequential: OK (80.63s)
          +++ OK, passed 100000 tests.
          
          ...

All 1 tests passed (80.63s)

ouroboros-network#3389

... $ cabal run storage-test -- -p 'ChainDB q-s-m' --quickcheck-replay=999301
ouroboros-storage
  Storage
    ChainDB
      ChainDB q-s-m
        sequential: OK (83.41s)
          +++ OK, passed 100000 tests.
          
          ...

All 1 tests passed (83.41s)

ouroboros-network#3440

... $ cabal run storage-test -- -p 'ChainDB q-s-m' --quickcheck-replay=758015
ouroboros-storage
  Storage
    ChainDB
      ChainDB q-s-m
        sequential: OK (78.40s)
          +++ OK, passed 100000 tests.
          
          ...

All 1 tests passed (78.40s)

amesgen

This looks very good!

I opened #284 to find out what the cumulative fix of all three approaches to fixing this getIsValid bug is in the end (spoiler: ~~I think it is a four-line diff~~ it is a bit more than that). Please double-check and see if you agree.

Also a good idea to test with the seeds for the previous iterations of this bug; of course, it might be that these do no longer result in the same commands being generated due to unrelated changes.

amesgen · 2023-08-09T16:52:17Z

ouroboros-consensus/test/storage-test/Test/Ouroboros/Storage/ChainDB/Model.hs

@@ -436,109 +440,10 @@ addBlock cfg blk m = Model {
          (currentChain m)
      $ consideredCandidates

-    -- See note [Getting the valid blocks] just below
    valid' = foldl
               (Chain.foldChain (\s b -> Set.insert (blockHash b) s))
               (valid m)


Do we need this to extend the previous valid m? The tests seem to pass even with

Suggested change

(valid m)

Set.empty

Motivation is that this would be a difference to #284.

I'm hesitant towards this, because the model could "forget" valid blocks if the blocks are on a chain that isn't connected to the Genesis block anymore. That was the motivation for IntersectMBO/ouroboros-network#3651 (comment).

Assume we use Set.empty instead. Test.O.C.ChainDB.StateMachine mentions the following about garbage collection:

https://github.com/input-output-hk/ouroboros-consensus/blob/f658b1986cd0035ffad84e04cda52c905cba8b56/ouroboros-consensus/test/storage-test/Test/Ouroboros/Storage/ChainDB/StateMachine.hs#L466-L474

Garbage collection on the model is instantaneous, while garbage collection on the SUT is not. The real VolatileDB knows about the validity of all blocks in it, even if those blocks are disconnected.

https://github.com/input-output-hk/ouroboros-consensus/blob/f658b1986cd0035ffad84e04cda52c905cba8b56/ouroboros-consensus/src/ouroboros-consensus/Ouroboros/Consensus/Storage/ChainDB/API.hs#L209-L211

If we ask the SUT if a disconnected block is valid, and it returns Just True, then the model will return Nothing (the model doesn't know), because the model only collected valid, connected blocks, and this results in a test failure. By extending valid m, we can be reasonably sure that valid' is a non-strict superset of the set of valid blocks that the SUT knows about.

I'm now also unsure whether we should have the Map.member hash (volatileDbBlocks m) check in Model.getIsValid, because it would suffer from the same non-determinism

Very good point, I went back to 90b3bdf, where I can reproduce IntersectMBO/ouroboros-network#3389. Now it turns out that c8a6186 (what I originally thought would be enough as the entire fix, ie that was the fifth commit in #284 before I force-pushed it) does not fix the bug.

Instead, maintaining the set of valid blocks cumulatively does fix the bug, I pushed that know to #284. Note that it does not restrict the model getIsValid to just blocks in the VolDB.

I'm now also unsure whether we should have the Map.member hash (volatileDbBlocks m) check in Model.getIsValid, because it would suffer from the same non-determinism

So I think we could indeed remove this.

amesgen · 2023-08-09T16:55:46Z

ouroboros-consensus/test/storage-test/Test/Ouroboros/Storage/ChainDB/Model.hs

+    -- Considered candidates are candidates that extend the immutable part of
+    -- the chain. Furthermore, the current selected chain is not considered as a
+    -- candidate.


Maybe this could say that sortChains takes the current chain separately?

If we add your suggestion in #279 (comment), then the comment can be reverted to what it previously was.

amesgen · 2023-08-10T09:46:11Z

ouroboros-consensus/test/storage-test/Test/Ouroboros/Storage/ChainDB/Model.hs

+    -- Considered candidates are candidates that extend the immutable part of
+    -- the chain. Furthermore, the current selected chain is not considered as a
+    -- candidate.
+    consideredCandidates = filter ((/= CPS.chainState (cps m)) . fst)


With the other changes, is this now actually necessary to make any tests pass? Again, motivation is that this would be a difference to #284.

You're right, I don't think it's necessary. If the current chain is the best candidate, then chain selection is idempotent (not sure if it's the right word). Including the current chain if it isn't the best candidate shouldn't have any effect on the outcome.

If we use consideredCandidates for valid' as in https://github.com/input-output-hk/ouroboros-consensus/pull/279/files#r1291126148, that would also motivate not filtering out the current chain.

jorisdral · 2023-08-10T13:56:07Z

I opened #284 to find out what the cumulative fix of all three approaches to fixing this getIsValid bug is in the end (spoiler: I think it is a four-line diff). Please double-check and see if you agree.

Good idea! It makes sense to compile everything (once and for all?)

Also a good idea to test with the seeds for the previous iterations of this bug; of course, it might be that these do no longer result in the same commands being generated due to unrelated changes.

I wasn't able to reproduce ouroboros-network#3389 and ouroboros-network#3440 with the seeds listed there, but the two more recent issues I could reproduce by reverting some changes to the model.

Maybe we should test the seeds on #284 too?

amesgen · 2023-08-11T09:41:39Z

ouroboros-consensus/test/storage-test/Test/Ouroboros/Storage/ChainDB/Model.hs

--    by real algorithm, but the set of known invalid blocks is a superset of
--    the ones known by the real algorithm. See the note
--    Ouroboros.Storage.ChainDB.StateMachine.[Invalid blocks].
+               (map fst candidates)


As valid m is cumulative, shouldn't it always be enough to use consideredCandidates here? Both the model and the SUT will only ever validate blocks that are on a chain that extends the immutable chain.

I am asking this both as this is a difference to #284, and as consideredChains is currently only used as an input to selectChain, but not to build valid'.

dnadales · 2023-08-14T10:30:33Z

ouroboros-consensus/test/storage-test/Test/Ouroboros/Storage/ChainDB/Model.hs

@@ -873,12 +778,8 @@ validChains cfg m bs =
    -- first, which results in the valid chain A->B', which is then chosen
    -- over the equally preferable A->B as it will be the first in the list
    -- after a stable sort.
-    sortChains $ pruneKnownInvalid <$> chains bs
+    sortChains $ chains bs


Is the comment above still accurate? Wouldn't the reader be puzzled about why we don't filter the invalid chains first as in the SUT?

dnadales · 2023-08-14T10:32:38Z

ouroboros-consensus/test/storage-test/Test/Ouroboros/Storage/ChainDB/Model.hs

+    -- Considered candidates are candidates that extend the immutable part of
+    -- the chain. Furthermore, the current selected chain is not considered as a
+    -- candidate.


Suggested change

-- Considered candidates are candidates that extend the immutable part of

-- the chain. Furthermore, the current selected chain is not considered as a

-- candidate.

-- Considered candidates are candidates that extend the immutable part of

-- the chain, __except for__ the current selected chain.

jorisdral · 2023-08-14T13:34:45Z

Closed in favour of #284, which describes the cumulative fix of this and previous PRs related to GetIsValid bugs

jorisdral added 2 commits August 7, 2023 17:46

ChainDB QSM: don't use the currently selected chain as a candidate.

8633b90

ChainDB QSM: always record valid blocks.

d3c9999

jorisdral added the bug Something isn't working label Aug 7, 2023

jorisdral self-assigned this Aug 7, 2023

jorisdral marked this pull request as ready for review August 8, 2023 10:35

jorisdral requested a review from a team as a code owner August 8, 2023 10:35

jorisdral requested a review from amesgen August 8, 2023 10:39

amesgen mentioned this pull request Aug 10, 2023

"Cumulative" fix for the ChainDB q-s-m getIsValid model vs SUT discrepancy #284

Merged

amesgen approved these changes Aug 10, 2023

View reviewed changes

amesgen reviewed Aug 11, 2023

View reviewed changes

dnadales reviewed Aug 14, 2023

View reviewed changes

jorisdral closed this Aug 14, 2023

jorisdral deleted the jdral/123-getisvalid branch August 14, 2023 13:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix `GetIsValid` bugs (once and for all?) #279

Fix `GetIsValid` bugs (once and for all?) #279

jorisdral commented Aug 7, 2023 •

edited

Loading

amesgen left a comment •

edited

Loading

amesgen Aug 9, 2023

jorisdral Aug 10, 2023 •

edited

Loading

jorisdral Aug 10, 2023

amesgen Aug 11, 2023

amesgen Aug 9, 2023

jorisdral Aug 10, 2023

amesgen Aug 10, 2023

jorisdral Aug 10, 2023

amesgen Aug 11, 2023

jorisdral commented Aug 10, 2023 •

edited

Loading

amesgen Aug 11, 2023

dnadales Aug 14, 2023

dnadales Aug 14, 2023

jorisdral commented Aug 14, 2023

Fix GetIsValid bugs (once and for all?) #279

Fix GetIsValid bugs (once and for all?) #279

Conversation

jorisdral commented Aug 7, 2023 • edited Loading

Description

Future test failures

Logs

amesgen left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisdral Aug 10, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisdral commented Aug 10, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisdral commented Aug 14, 2023

Fix `GetIsValid` bugs (once and for all?) #279

Fix `GetIsValid` bugs (once and for all?) #279

jorisdral commented Aug 7, 2023 •

edited

Loading

amesgen left a comment •

edited

Loading

jorisdral Aug 10, 2023 •

edited

Loading

jorisdral commented Aug 10, 2023 •

edited

Loading