This repository has been archived by the owner on Jun 6, 2023. It is now read-only.

prevent faults from double subtracting cc upgrade power #1129

Merged
acruikshank merged 25 commits into master from bug/power_deducted_from_replaced_sector_twice on Sep 14, 2020

Conversation

@acruikshank (Contributor) commented Sep 9, 2020

closes #1100

Motivation

All sectors live within an expiration set, within an expiration queue, within a partition, within a deadline. This expiration set is the set at the sector's Expiration field. When we reschedule expirations (say for a CC upgrade or a fault), we move the sector into a different set to trigger an earlier expiration. We do not change the sector's Expiration field, so the new set cannot necessarily be determined from the sector's on-chain info.

Whenever we perform an operation that affects sector expirations, we group the sectors by expiration and operate on each set en masse. If sectors have been rescheduled, we end up altering expiration set stats for sectors that are no longer there, and failing to update the stats for the sets to which they have moved. In the case of declaring a fault for a CC-upgraded sector, the end result is that we subtract the sector's power twice from the miner's claim (among other accounting mistakes).

The solution implemented in this PR is to check whether sectors expected to be in an expiration set are actually there. Sectors that are not found there are instead grouped by iterating through the expiration queue's sets. Not adding sectors to groups for expiration sets that don't contain them should be sufficient to avoid breaking accounting invariants, while finding the sectors in their correct sets should deliver the intended behavior.
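
Below is a minimal, self-contained sketch of that lookup strategy, using plain Go maps as stand-ins for the AMT-backed expiration queue and bitfield types; the function name mirrors findSectorsByExpiration, but none of the types here are the real specs-actors structures. It checks each sector's declared expiration set first, then scans the remaining sets in epoch order for anything that was rescheduled, and errors only if a sector cannot be found at all.

package main

import (
    "fmt"
    "sort"
)

// epoch stands in for abi.ChainEpoch; map[uint64]bool stands in for a bitfield of sector numbers.
type epoch = int64

// findSectorsByExpiration returns, for each sector number, the epoch of the expiration set
// that actually contains it: the declared expiration when the sector is still there,
// otherwise whichever set it was rescheduled into.
func findSectorsByExpiration(queue map[epoch]map[uint64]bool, declared map[uint64]epoch) (map[uint64]epoch, error) {
    found := make(map[uint64]epoch, len(declared))
    notFound := make(map[uint64]bool, len(declared))

    // Pass 1: look in each sector's declared expiration set.
    for num, exp := range declared {
        if queue[exp][num] {
            found[num] = exp
        } else {
            notFound[num] = true
        }
    }

    // Pass 2: scan the rest of the queue, in epoch order, for rescheduled sectors.
    epochs := make([]epoch, 0, len(queue))
    for e := range queue {
        epochs = append(epochs, e)
    }
    sort.Slice(epochs, func(i, j int) bool { return epochs[i] < epochs[j] })
    for _, e := range epochs {
        for num := range notFound {
            if queue[e][num] {
                found[num] = e
                delete(notFound, num)
            }
        }
    }

    if len(notFound) > 0 {
        return nil, fmt.Errorf("%d sectors not found anywhere in expiration queue", len(notFound))
    }
    return found, nil
}

func main() {
    queue := map[epoch]map[uint64]bool{
        100: {1: true}, // sector 1 still sits at its declared expiration
        40:  {2: true}, // sector 2 was rescheduled into an earlier set
    }
    declared := map[uint64]epoch{1: 100, 2: 100}
    fmt.Println(findSectorsByExpiration(queue, declared)) // map[1:100 2:40] <nil>
}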

Proposed Changes

  1. Implement findSectorsByExpiration that groups sectors in the correct expiration set even if they have been rescheduled.
  2. Use findSectorsByExpiration in ExpirationQueue.RescheduleAsFaults and ExpirationQueue.removeActiveSectors.
  3. Fix regression discovered in this PR that prevented faulty expiring sectors from being penalized in their last proving period cron.
  4. Add test for cc upgrading a sector before it is proven.
  5. Add test for declaring a fault for a cc replaced sector before it expires.
  6. Add test for skipping a cc replaced sector in its last PoSt.
  7. Add test for missed PoSt in proving deadline where cc replaced sector expires.
  8. Add test for terminating cc replaced sector before it expires.
  9. Add test for extending expiration of cc replaced sector before it expires.
  10. Add test for faulting and recovering a cc replaced sector before it expires.
  11. Rearranged miner tests so tests related to cc upgrades have their own section.

Without the changes to grouping, only tests 4 and 7 would have behaved as expected.

Note on 11: tests that do not appear in the list have been moved without change and don't require the same scrutiny as the rest.

@codecov-commenter commented Sep 9, 2020

Codecov Report

Merging #1129 into master will decrease coverage by 0.0%.
The diff coverage is 68.6%.

@@           Coverage Diff            @@
##           master   #1129     +/-   ##
========================================
- Coverage    75.3%   75.3%   -0.1%     
========================================
  Files          51      51             
  Lines        6320    6396     +76     
========================================
+ Hits         4763    4819     +56     
- Misses        927     938     +11     
- Partials      630     639      +9     

@anorth (Member) left a comment


I haven't fully reviewed this yet, but I think this problem is more widespread. Every call to groupSectorsByExpiration is suspect if it expects to find the sectors there. Checking the call tree:

  • EQ.removeActiveSectors, must be updated to support replaced sectors for:
    • EQ.RemoveSectors (from TerminateSectors)
    • EQ.ReplaceSectors (from ExtendSectorExpiration)
    • EQ.RescheduleExpirations (from CC upgrade - should work but do nothing)
  • EQ.AddActiveSectors: ok, these sectors aren't in the queue yet
  • EQ.RescheduleAsFaults: handled in this PR already

I think we should rename for clarity:

  • groupSectorsByExpiration -> groupNewSectorsByDeclaredExpiration (only for AddActiveSectors)
  • groupSectorsByExpirationWithRescheduled -> eq.findSectorsByExpiration

Tests we need:

  • Terminate a replaced sector before it expires
  • Extend a replaced sector's expiration
  • Replace a sector that's already been replaced before it expires
  • Declare fault on a replaced sector before it expires
  • Fail to prove a replaced sector before it expires (detected fault)

cc @wadealexc

@acruikshank force-pushed the bug/power_deducted_from_replaced_sector_twice branch from 7b4a5c9 to 4ae141e on September 9, 2020 13:56
@wadealexc commented

I don't have the bandwidth to review this fully yet (but I will make time before the end of the week).

Regarding @anorth's comments, yes - I agree that the problem is more widespread than what was mentioned in #1100:

  • TerminateSectors invokes EQ.RemoveSectors, which doesn't support replaced CC (I believe it aborts)
  • ExtendSectorExpiration invokes EQ.ReplaceSectors, which doesn't support replaced CC (I believe it aborts)
  • SubmitWindowedPoSt invokes:
    • EQ.RescheduleAsFaults - handled in this PR
    • EQ.RescheduleRecovered - doesn't support replacements. This means that declaring a replacement faulty, then recovering it, would result in its expiry being set back to its original, on-time expiry (via EQ.AddActiveSectors)
  • ConfirmSectorProofsValid invokes EQ.RescheduleExpirations, which won't find replacements (I think it aborts). I think this one is working as intended (but a test would be nice).
  • CompactPartitions invokes EQ.AddActiveSectors - I didn't get a chance to review this method fully, so tests would be good to make sure this works as intended.
  • handleProvingDeadline invokes:
    • EQ.RescheduleAllAsFaults: this iterates over the EQ rather than using groupSectorsByExpiration. I think it works, but a test is needed.
    • EQ.PopUntil: (when popping expired sectors). This iterates over the EQ rather than grouping by expiry. I think it works, but a test is needed.

Edge cases to consider:

  • The "new expiry" calculated for a replaced CC sector may be either before, equal to, or even after its expected expiry
  • Replaced CC sectors may be Unproven

Tests needed (@anorth's tests plus a few more):

  • In the same epoch a CC sector is proven (ConfirmSPValid), precommit a replacement for it
  • Extend a replaced sector's expiry via ExtendSectorExpiration
  • Terminate a replaced sector before it expires
  • Replace a sector that's already been replaced before it expires
  • Replace an Unproven sector
  • Declare a replaced sector faulty via DeclareFaults
    • Then declare it recovered
  • "Declare" a replaced sector faulty via "skipping" (SubmitWindowedPoSt)
    • Then declare it recovered
  • "Declare" a replaced sector faulty via missed post (picked up in handleProvingDeadline)
    • Then declare it recovered

@acruikshank force-pushed the bug/power_deducted_from_replaced_sector_twice branch from 98d87d6 to 9287481 on September 9, 2020 19:54
@acruikshank marked this pull request as ready for review September 9, 2020 21:43
@acruikshank (Contributor Author) commented

@wadealexc I believe I've added all the tests you've requested. "Replace a sector that's already been replaced before it expires" was already in place.

@acruikshank force-pushed the bug/power_deducted_from_replaced_sector_twice branch from 5f90105 to 4c45988 on September 10, 2020 14:49
@anorth (Member) left a comment


Great work, nearly there, ❤️ the tests.

if err != nil {
    return false, err
}
sectorEpochSets = append(sectorEpochSets, group)
Member


What if there's already a sectorEpochSet for this epoch, because some other sector had a declared expiration there? I think this will produce a second sectorEpochSet for the same epoch.

It might be that callers are ok with this, but it's a bit of a trap. Also, the sort.Slice below is unstable, so the result is non-deterministic with equal keys.

I think it's best to accumulate these in a map by epoch instead, and handle the possibility of the sectorEpochSet already existing at this point.

Contributor Author


Good catch. I reworked the algorithm altogether. It now just intersects es.OnTimeSectors with all the sectors it hasn't found yet to form the group, and then computes the remainder. It iterates through the expected expirations first and then goes sequentially through what is left. I think it's a little more straightforward, and it fixes this issue.
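
A rough sketch of that intersect-and-remainder step, with map-based sets standing in for the bitfields the real code manipulates via helpers like bitfield.IntersectBitField: each expiration set claims whichever of the still-unfound sectors it contains, and the remainder is carried forward to the next set.

package main

import "fmt"

// group returns the members of wanted that appear in set (the group for this
// expiration set) and removes them from wanted, leaving the remainder to be
// sought in later sets. Maps stand in for the bitfield intersect/subtract
// operations used by the actual expiration_queue.go code.
func group(set map[uint64]bool, wanted map[uint64]bool) []uint64 {
    var found []uint64
    for num := range wanted {
        if set[num] {
            found = append(found, num)
            delete(wanted, num)
        }
    }
    return found
}

func main() {
    wanted := map[uint64]bool{1: true, 2: true, 3: true}
    onTimeAt100 := map[uint64]bool{1: true, 3: true}
    onTimeAt40 := map[uint64]bool{2: true}

    fmt.Println(group(onTimeAt100, wanted)) // e.g. [1 3] (order may vary)
    fmt.Println(group(onTimeAt40, wanted))  // [2]
    fmt.Println(len(wanted))                // 0: nothing left unfound
}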

intersect, err := bitfield.IntersectBitField(es.OnTimeSectors, allRescheduled)
if err != nil {
    return false, err
}
Member


Blocking: this needs to check es.EarlySectors as well. Especially for the case where the sector has moved because it was rescheduled. I'm a bit surprised this escaped the great testing.

I think to hit this case we need to fault a sector (thus rescheduling it) and then terminate it (the other call path to this point, since it can't be faulted again).


As far as I can tell, findSectorsByExpiration is only used in the context of non-faulty sectors.


i.e. when terminating a faulty sector as you mentioned, EQ.RemoveSectors treats faulty vs non-faulty terminations differently. The faulty sectors are removed by traversing the expiration queue.

Contributor Author


I'm with @wadealexc that I don't think we could exercise this code. If we stop assuming that grouped sectors are all non-faulty, our downstream accounting gets a lot more complicated. We could add an assertion that no sectors in the set are early. I tried it and the tests still pass.

Member


Ignoring es.EarlySectors (now down in groupExpirationSet) still makes me quite uncomfortable. Adding an error check that no such sectors are being sought would help.
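
A small sketch of the kind of guard being discussed, again with maps standing in for bitfields rather than the real types: before grouping against an expiration set's on-time sectors, fail loudly if any of the sectors being sought appear among that set's early (rescheduled) sectors.

package main

import "fmt"

// checkNoEarlySectors returns an error if any sector we are looking for is
// sitting in the set's early sectors, since the grouping logic assumes it only
// ever deals with on-time (non-faulty) sectors.
func checkNoEarlySectors(early map[uint64]bool, sought map[uint64]bool) error {
    for num := range sought {
        if early[num] {
            return fmt.Errorf("sector %d found in early sectors of expiration set", num)
        }
    }
    return nil
}

func main() {
    early := map[uint64]bool{7: true}
    sought := map[uint64]bool{7: true, 9: true}
    fmt.Println(checkNoEarlySectors(early, sought)) // error mentioning sector 7
}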

@wadealexc left a comment


Looks good for the most part. Left some comments.

Comment on lines 739 to 786
if err := q.mustGet(expiration, &es); err != nil {
    return nil, err
}


re mustGet: Is it always the case that an ExpirationSet will exist in the queue at this expiration?

I know that when expiration sets are empty they get deleted - if we then query for that set here it'll fail. This would occur when querying for the on-time expiry after the sectors in question were rescheduled and the set deleted.


Same question goes for the use of mustGet in EQ.remove. I wonder if this method is safe to use - would appreciate input from @anorth

Contributor Author


There is no consequence for using mayGet here. I went ahead and changed it. If a set is erroneously missing, the sector will be left in allNotFound and trigger an error at the end of the grouping.

Member


Yes, I'm not sure mustGet is safe to use, now that we've internalized that we don't know where the queue entries are. I'm in favour of removing it (inlining into the one remaining call site).
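
A few lines illustrating the tolerant-lookup behaviour described above, with a plain map standing in for the queue's AMT: a missing expiration set is treated as empty rather than as an error, so a sector that should have been there simply stays unfound and is reported once the whole queue has been searched.

package main

import "fmt"

// getOrEmpty mirrors the "mayGet" idea: a missing expiration set is not an
// error, it just contributes nothing to the grouping.
func getOrEmpty(queue map[int64]map[uint64]bool, epoch int64) map[uint64]bool {
    if set, ok := queue[epoch]; ok {
        return set
    }
    return map[uint64]bool{}
}

func main() {
    queue := map[int64]map[uint64]bool{100: {1: true}}
    fmt.Println(len(getOrEmpty(queue, 40)))  // 0: deleted/absent set treated as empty
    fmt.Println(len(getOrEmpty(queue, 100))) // 1
}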


quantizedExpiration: {pIdx},
}, dQueue)

// but the partition's expiration set at that epoch is empty


Why is it empty?


Shouldn't DeclareFaults place the old sector in EarlySectors at the fault expiration epoch?

Contributor Author


Since the sector had an on-time expiration earlier than the fault expiration, it is left in its original expiration set, but its power is moved from active to faulty. This test is checking that the expiration set at the fault expiration is empty, indicating the sector was not moved. Prior to this PR, it would have been added to this expiration set while also being left in its original expiration set, causing it to expire twice.
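
A tiny sketch of the decision being described, using hypothetical field names rather than the real miner state: a faulted sector is only moved into an earlier (early) expiration set when the fault expiration actually precedes its on-time expiration; otherwise it stays where it is and only its power is re-classified from active to faulty.

package main

import "fmt"

type sectorSchedule struct {
    OnTimeExpiration int64
    ScheduledEpoch   int64 // epoch of the expiration set that currently holds the sector
    Faulty           bool
}

// rescheduleAsFault sketches the behaviour discussed above: move the sector
// into an early expiration only if the fault expiration is sooner than its
// on-time expiration; either way its power is now counted as faulty.
func rescheduleAsFault(s *sectorSchedule, faultExpiration int64) {
    s.Faulty = true
    if faultExpiration < s.OnTimeExpiration {
        s.ScheduledEpoch = faultExpiration // rescheduled into an early expiration set
    }
    // else: left in its original set, so the set at faultExpiration stays empty
}

func main() {
    s := sectorSchedule{OnTimeExpiration: 80, ScheduledEpoch: 80}
    rescheduleAsFault(&s, 120)              // fault expiration after the on-time expiration
    fmt.Println(s.ScheduledEpoch, s.Faulty) // 80 true: not moved, power now faulty
}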

declaredExpirations := make(map[abi.ChainEpoch]bool, len(sectors))
sectorsByNumber := make(map[uint64]*SectorOnChainInfo, len(sectors))
emptyBitfield := bitfield.New()
allNotFound := &emptyBitfield
Member


We changed to pass bitfields by value a month or so back. Change groupExpirationSet to return a non-pointer bitfield and remove the pointers here.

But also consider just using a map[uint64]struct{}, which avoids the unnecessary underlying RLE encoding.

Contributor Author


It's a little sad we can't efficiently do proper set math on bitfields. I spent some time trying to figure out how this could work. If we could copy RunIterators, we could use runs to perform multiple set operations and then convert back to bitfields efficiently (if necessary).

Using a map here ended up being less code anyway, so I went ahead and made that optimization.


@acruikshank force-pushed the bug/power_deducted_from_replaced_sector_twice branch from 122afdb to 01f6169 on September 14, 2020 14:24
@acruikshank merged commit 9b29212 into master Sep 14, 2020
@acruikshank (Contributor Author) commented

@dkkapur Yes, this PR fixes all the issues brought up in #1100. In our migration to V2 actors we now have a mechanism for correcting miner actors affected by this problem. We've added quite a bit of testing around the issue as well.

filecoin-project deleted a comment from dkkapur Oct 2, 2020
Linked issue: Miner: Declaring a replaced CC sector faulty can result in a sector existing in an expiration queue twice