Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix GetIsValid bugs (once and for all?) #279

Closed
wants to merge 2 commits into from
Closed

Conversation

jorisdral
Copy link
Contributor

@jorisdral jorisdral commented Aug 7, 2023

Description

Closes #123.

This PR changes two parts of the ChainDB model:

  • We reinstate that the model knows about all valid and invalid blocks for all for all chains that can be constructed using blocks in the model immutable DB and model volatile DB. That is, the valid and invalid fields of the model are now updated more rigorously.
  • We simplify model chain selection by not approximating in which order candidate chains are considered by real chain selection.

This has two effects:

  • The fact that the model knows about all valid and invalid blocks is not a problem for the GetIsValid command. The model might know in advance which blocks are valid/invalid even before the SUT knows about them, but the only condition we require for test success in the case of the GetIsValid command is: if the SUT says that a block is valid or invalid, then the model should have the same response. Conversely, if the SUT says it does not know if a block is valid or invalid, then the model's response is ignored (or in other words, it can be anything).
  • Before this PR, the model already knew about all valid and invalid blocks of candidate fragments before running chain selection. In contrast, the SUT only finds out about valid/invalid blocks on the fly during real chain selection as it validates blocks on chain fragments that haven't been considered before. Known invalid blocks impact the sorting of considered fragments in both model and SUT chain selection. Since the SUT knows about fewer invalid blocks than the model, model and SUT chain selection might consider chain fragments in a different order. However: if the SUT chain selection finds new invalid blocks, then real considered fragments are reduced to valid prefixes and re-sorted into the list of considered chain fragments. The result of model and SUT chain selection is the same regardless: the model merely skips these "re-sorting" steps because it already knows about all invalid blocks. In other words, the model performs "best-case" chain selection where all invalid blocks are always known, whereas the SUT does not. Still, the fact that the SUT knows about some (not necessarily all) invalid blocks serves as an optimisation because longer chains containing invalid blocks can be reduced to valid prefixes before running real chain selection. Note also that knowledge about invalid blocks should not (and does not) influence which chain is eventually selected. Otherwise, chain selection would be non-deterministic.

Future test failures

Assuming that above arguments are correct, if we find a GetIsValid failure again in the future, the first thing to check would be if we are somehow not recording some valid/invalid blocks. Model chain selection should not have to change.

Logs

Known failing test seeds were used to rerun tests, and the tests pass.

ouroboros-consensus#123

Recreate test failure
... $ git checkout 439a15f
... $ cabal run storage-test -- -p 'ChainDB q-s-m' --quickcheck-replay=616511
ouroboros-storage
  Storage
    ChainDB
      ChainDB q-s-m
        sequential: 
          ...

          PostconditionFailed "AnnotateC \"real response didn't match model response\" (PredicateC (Resp {getResp = Right (IsValid (IsValidResult {real = True, isValid = Just True}))} :/= Resp {getResp = Right (IsValid (IsValidResult {real = False, isValid = Nothing}))}))" /= Ok
          Use --quickcheck-replay=616511 to reproduce.

1 out of 1 tests failed (9.74s)
... $ cabal run storage-test -- -p 'ChainDB q-s-m' --quickcheck-replay=616511
ouroboros-storage
  Storage
    ChainDB
      ChainDB q-s-m
        sequential: OK (78.88s)
          +++ OK, passed 100000 tests.
          
         ...

All 1 tests passed (78.89s)

ouroboros-network#3689

ouroboros-network-3689.patch
diff --git a/ouroboros-consensus/test/storage-test/Test/Ouroboros/Storage/ChainDB/Model.hs b/ouroboros-consensus/test/storage-test/Test/Ouroboros/Storage/ChainDB/Model.hs
index ffc17f020..b1c944ef7 100644
--- a/ouroboros-consensus/test/storage-test/Test/Ouroboros/Storage/ChainDB/Model.hs
+++ b/ouroboros-consensus/test/storage-test/Test/Ouroboros/Storage/ChainDB/Model.hs
@@ -873,12 +873,8 @@ validChains cfg m bs =
     -- first, which results in the valid chain A->B', which is then chosen
     -- over the equally preferable A->B as it will be the first in the list
     -- after a stable sort.
-    sortChains $ pruneKnownInvalid <$> chains bs
+    sortChains $ chains bs
   where
-    pruneKnownInvalid chain =
-      -- We don't want known invalid blocks to influence the sorting of candidate chains.
-      Chain.takeWhile (\b -> blockHash b `Map.notMember` (invalid m)) chain
-
     sortChains :: [Chain blk] -> [Chain blk]
     sortChains =
       sortBy $ flip (
Recreate test failure
... $ git checkout 439a15f
... $ git apply ouroboros-network-3689.patch
... $ cabal run storage-test -- -p 'ChainDB q-s-m' --quickcheck-replay=455411
ouroboros-storage
  Storage
    ChainDB
      ChainDB q-s-m
        sequential: 
          ...

          PostconditionFailed "AnnotateC \"real response didn't match model response\" (PredicateC (Resp {getResp = Right (IsValid (IsValidResult {real = True, isValid = Just True}))} :/= Resp {getResp = Right (IsValid (IsValidResult {real = False, isValid = Nothing}))}))" /= Ok
          Use --quickcheck-replay=455411 to reproduce.

1 out of 1 tests failed (39.23s)
... $ cabal run storage-test -- -p 'ChainDB q-s-m' --quickcheck-replay=455411
ouroboros-storage
  Storage
    ChainDB
      ChainDB q-s-m
        sequential: OK (80.63s)
          +++ OK, passed 100000 tests.
          
          ...

All 1 tests passed (80.63s)

ouroboros-network#3389

... $ cabal run storage-test -- -p 'ChainDB q-s-m' --quickcheck-replay=999301
ouroboros-storage
  Storage
    ChainDB
      ChainDB q-s-m
        sequential: OK (83.41s)
          +++ OK, passed 100000 tests.
          
          ...

All 1 tests passed (83.41s)

ouroboros-network#3440

... $ cabal run storage-test -- -p 'ChainDB q-s-m' --quickcheck-replay=758015
ouroboros-storage
  Storage
    ChainDB
      ChainDB q-s-m
        sequential: OK (78.40s)
          +++ OK, passed 100000 tests.
          
          ...

All 1 tests passed (78.40s)

@jorisdral jorisdral added the bug Something isn't working label Aug 7, 2023
@jorisdral jorisdral self-assigned this Aug 7, 2023
@jorisdral jorisdral marked this pull request as ready for review August 8, 2023 10:35
@jorisdral jorisdral requested a review from a team as a code owner August 8, 2023 10:35
@jorisdral jorisdral requested a review from amesgen August 8, 2023 10:39
Copy link
Member

@amesgen amesgen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very good!

I opened #284 to find out what the cumulative fix of all three approaches to fixing this getIsValid bug is in the end (spoiler: I think it is a four-line diff it is a bit more than that). Please double-check and see if you agree.

Also a good idea to test with the seeds for the previous iterations of this bug; of course, it might be that these do no longer result in the same commands being generated due to unrelated changes.

@@ -436,109 +440,10 @@ addBlock cfg blk m = Model {
(currentChain m)
$ consideredCandidates

-- See note [Getting the valid blocks] just below
valid' = foldl
(Chain.foldChain (\s b -> Set.insert (blockHash b) s))
(valid m)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this to extend the previous valid m? The tests seem to pass even with

Suggested change
(valid m)
Set.empty

Motivation is that this would be a difference to #284.

Copy link
Contributor Author

@jorisdral jorisdral Aug 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm hesitant towards this, because the model could "forget" valid blocks if the blocks are on a chain that isn't connected to the Genesis block anymore. That was the motivation for IntersectMBO/ouroboros-network#3651 (comment).

Assume we use Set.empty instead. Test.O.C.ChainDB.StateMachine mentions the following about garbage collection:

https://github.com/input-output-hk/ouroboros-consensus/blob/f658b1986cd0035ffad84e04cda52c905cba8b56/ouroboros-consensus/test/storage-test/Test/Ouroboros/Storage/ChainDB/StateMachine.hs#L466-L474

Garbage collection on the model is instantaneous, while garbage collection on the SUT is not. The real VolatileDB knows about the validity of all blocks in it, even if those blocks are disconnected.

https://github.com/input-output-hk/ouroboros-consensus/blob/f658b1986cd0035ffad84e04cda52c905cba8b56/ouroboros-consensus/src/ouroboros-consensus/Ouroboros/Consensus/Storage/ChainDB/API.hs#L209-L211

If we ask the SUT if a disconnected block is valid, and it returns Just True, then the model will return Nothing (the model doesn't know), because the model only collected valid, connected blocks, and this results in a test failure. By extending valid m, we can be reasonably sure that valid' is a non-strict superset of the set of valid blocks that the SUT knows about.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm now also unsure whether we should have the Map.member hash (volatileDbBlocks m) check in Model.getIsValid, because it would suffer from the same non-determinism

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good point, I went back to 90b3bdf, where I can reproduce IntersectMBO/ouroboros-network#3389. Now it turns out that c8a6186 (what I originally thought would be enough as the entire fix, ie that was the fifth commit in #284 before I force-pushed it) does not fix the bug.

Instead, maintaining the set of valid blocks cumulatively does fix the bug, I pushed that know to #284. Note that it does not restrict the model getIsValid to just blocks in the VolDB.

I'm now also unsure whether we should have the Map.member hash (volatileDbBlocks m) check in Model.getIsValid, because it would suffer from the same non-determinism

So I think we could indeed remove this.

Comment on lines +427 to +429
-- Considered candidates are candidates that extend the immutable part of
-- the chain. Furthermore, the current selected chain is not considered as a
-- candidate.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this could say that sortChains takes the current chain separately?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we add your suggestion in #279 (comment), then the comment can be reverted to what it previously was.

-- Considered candidates are candidates that extend the immutable part of
-- the chain. Furthermore, the current selected chain is not considered as a
-- candidate.
consideredCandidates = filter ((/= CPS.chainState (cps m)) . fst)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the other changes, is this now actually necessary to make any tests pass? Again, motivation is that this would be a difference to #284.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I don't think it's necessary. If the current chain is the best candidate, then chain selection is idempotent (not sure if it's the right word). Including the current chain if it isn't the best candidate shouldn't have any effect on the outcome.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use consideredCandidates for valid' as in https://github.com/input-output-hk/ouroboros-consensus/pull/279/files#r1291126148, that would also motivate not filtering out the current chain.

@jorisdral
Copy link
Contributor Author

jorisdral commented Aug 10, 2023

I opened #284 to find out what the cumulative fix of all three approaches to fixing this getIsValid bug is in the end (spoiler: I think it is a four-line diff). Please double-check and see if you agree.

Good idea! It makes sense to compile everything (once and for all?)

Also a good idea to test with the seeds for the previous iterations of this bug; of course, it might be that these do no longer result in the same commands being generated due to unrelated changes.

I wasn't able to reproduce ouroboros-network#3389 and ouroboros-network#3440 with the seeds listed there, but the two more recent issues I could reproduce by reverting some changes to the model.

Maybe we should test the seeds on #284 too?

-- by real algorithm, but the set of known invalid blocks is a superset of
-- the ones known by the real algorithm. See the note
-- Ouroboros.Storage.ChainDB.StateMachine.[Invalid blocks].
(map fst candidates)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As valid m is cumulative, shouldn't it always be enough to use consideredCandidates here? Both the model and the SUT will only ever validate blocks that are on a chain that extends the immutable chain.

I am asking this both as this is a difference to #284, and as consideredChains is currently only used as an input to selectChain, but not to build valid'.

@@ -873,12 +778,8 @@ validChains cfg m bs =
-- first, which results in the valid chain A->B', which is then chosen
-- over the equally preferable A->B as it will be the first in the list
-- after a stable sort.
sortChains $ pruneKnownInvalid <$> chains bs
sortChains $ chains bs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the comment above still accurate? Wouldn't the reader be puzzled about why we don't filter the invalid chains first as in the SUT?

Comment on lines +427 to +429
-- Considered candidates are candidates that extend the immutable part of
-- the chain. Furthermore, the current selected chain is not considered as a
-- candidate.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
-- Considered candidates are candidates that extend the immutable part of
-- the chain. Furthermore, the current selected chain is not considered as a
-- candidate.
-- Considered candidates are candidates that extend the immutable part of
-- the chain, __except for__ the current selected chain.

@jorisdral
Copy link
Contributor Author

Closed in favour of #284, which describes the cumulative fix of this and previous PRs related to GetIsValid bugs

@jorisdral jorisdral closed this Aug 14, 2023
@jorisdral jorisdral deleted the jdral/123-getisvalid branch August 14, 2023 13:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ChainDB q-s-m failure with GetIsValid (again)
3 participants