This repository has been archived by the owner on Jun 6, 2023. It is now read-only.

prevent faults from double subtracting cc upgrade power #1129

Merged
acruikshank merged 25 commits into master from bug/power_deducted_from_replaced_sector_twice on Sep 14, 2020

Conversation

@acruikshank (Contributor) commented Sep 9, 2020

closes #1100

Motivation

All sectors live within an expiration set, within an expiration queue, within a partition, within a deadline. This expiration set is the set at the sector's Expiration field. When we reschedule expirations (say for a CC upgrade or a fault), we move the sector into a different set to trigger an earlier expiration. We do not change the sector's Expiration field, so the new set cannot necessarily be determined from the sector's on-chain info.

Whenever we perform an operation that affects sector expirations, we group the sectors by expiration and operate on each set en masse. If sectors have been rescheduled, we end up altering expiration set stats for sectors that are no longer there, and failing to update the stats for the sets to which they have moved. In the case of declaring a fault for a CC-upgraded sector, the end result is that we subtract the sector's power twice from the miner's claim (among other accounting mistakes).

The solution implemented in this PR is to check whether sectors expected to be in an expiration set are actually there. Sectors that are not found there are instead grouped by iterating through the expiration queue's sets. Not adding sectors to groups for expiration sets that don't contain them should be sufficient to avoid breaking accounting invariants, while finding the sectors in their correct sets should deliver the intended behavior.
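
Below is a minimal, self-contained sketch of that lookup strategy, using plain Go maps as stand-ins for the AMT-backed expiration queue and bitfield types; the function name mirrors findSectorsByExpiration, but none of the types here are the real specs-actors structures. It checks each sector's declared expiration set first, then scans the remaining sets in epoch order for anything that was rescheduled, and errors only if a sector cannot be found at all.

package main

import (
    "fmt"
    "sort"
)

// epoch stands in for abi.ChainEpoch; map[uint64]bool stands in for a bitfield of sector numbers.
type epoch = int64

// findSectorsByExpiration returns, for each sector number, the epoch of the expiration set
// that actually contains it: the declared expiration when the sector is still there,
// otherwise whichever set it was rescheduled into.
func findSectorsByExpiration(queue map[epoch]map[uint64]bool, declared map[uint64]epoch) (map[uint64]epoch, error) {
    found := make(map[uint64]epoch, len(declared))
    notFound := make(map[uint64]bool, len(declared))

    // Pass 1: look in each sector's declared expiration set.
    for num, exp := range declared {
        if queue[exp][num] {
            found[num] = exp
        } else {
            notFound[num] = true
        }
    }

    // Pass 2: scan the rest of the queue, in epoch order, for rescheduled sectors.
    epochs := make([]epoch, 0, len(queue))
    for e := range queue {
        epochs = append(epochs, e)
    }
    sort.Slice(epochs, func(i, j int) bool { return epochs[i] < epochs[j] })
    for _, e := range epochs {
        for num := range notFound {
            if queue[e][num] {
                found[num] = e
                delete(notFound, num)
            }
        }
    }

    if len(notFound) > 0 {
        return nil, fmt.Errorf("%d sectors not found anywhere in expiration queue", len(notFound))
    }
    return found, nil
}

func main() {
    queue := map[epoch]map[uint64]bool{
        100: {1: true}, // sector 1 still sits at its declared expiration
        40:  {2: true}, // sector 2 was rescheduled into an earlier set
    }
    declared := map[uint64]epoch{1: 100, 2: 100}
    fmt.Println(findSectorsByExpiration(queue, declared)) // map[1:100 2:40] <nil>
}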

Proposed Changes

  1. Implement findSectorsByExpiration that groups sectors in the correct expiration set even if they have been rescheduled.
  2. Use findSectorsByExpiration in ExpirationQueue.RescheduleAsFaults and ExpirationQueue.removeActiveSectors.
  3. Fix regression discovered in this PR that prevented faulty expiring sectors from being penalized in their last proving period cron.
  4. Add test for cc upgrading a sector before it is proven.
  5. Add test for declaring a fault for a cc replaced sector before it expires.
  6. Add test for skipping a cc replaced sector in its last PoSt.
  7. Add test for missed PoSt in proving deadline where cc replaced sector expires.
  8. Add test for terminating cc replaced sector before it expires.
  9. Add test for extending expiration of cc replaced sector before it expires.
  10. Add test for faulting and recovering a cc replaced sector before it expires.
  11. Rearranged miner tests so tests related to cc upgrades have their own section.

Without the changes to grouping, only tests 4 and 7 would have behaved as expected.

Note on 11: tests that do not appear in the list have been moved without change and don't require the same scrutiny as the rest.

@codecov-commenter commented Sep 9, 2020

Codecov Report

Merging #1129 into master will decrease coverage by 0.0%.
The diff coverage is 68.6%.

@@           Coverage Diff            @@
##           master   #1129     +/-   ##
========================================
- Coverage    75.3%   75.3%   -0.1%     
========================================
  Files          51      51             
  Lines        6320    6396     +76     
========================================
+ Hits         4763    4819     +56     
- Misses        927     938     +11     
- Partials      630     639      +9     

@anorth (Member) left a comment


I haven't fully reviewed this yet, but I think this problem is more widespread. Every call to groupSectorsByExpiration is suspect if it expects to find the sectors there. Checking the call tree:

  • EQ.removeActiveSectors, must be updated to support replaced sectors for:
    • EQ.RemoveSectors (from TerminateSectors)
    • EQ.ReplaceSectors (from ExtendSectorExpiration)
    • EQ.RescheduleExpirations (from CC upgrade - should work but do nothing)
  • EQ.AddActiveSectors: ok, these sectors aren't in the queue yet
  • EQ.RescheduleAsFaults: handled in this PR already

I think we should rename for clarity:

  • groupSectorsByExpiration -> groupNewSectorsByDeclaredExpiration (only for AddActiveSectors)
  • groupSectorsByExpirationWithRescheduled -> eq.findSectorsByExpiration

Tests we need:

  • Terminate a replaced sector before it expires
  • Extend a replaced sector's expiration
  • Replace a sector that's already been replaced before it expires
  • Declare fault on a replaced sector before it expires
  • Fail to prove a replaced sector before it expires (detected fault)

cc @wadealexc

@acruikshank force-pushed the bug/power_deducted_from_replaced_sector_twice branch from 7b4a5c9 to 4ae141e on September 9, 2020 13:56
@wadealexc commented

I don't have the bandwidth to review this fully yet (but I will make time before the end of the week).

Regarding @anorth's comments, yes - I agree that the problem is more widespread than what was mentioned in #1100:

  • TerminateSectors invokes EQ.RemoveSectors, which doesn't support replaced CC (I believe it aborts)
  • ExtendSectorExpiration invokes EQ.ReplaceSectors, which doesn't support replaced CC (I believe it aborts)
  • SubmitWindowedPoSt invokes:
    • EQ.RescheduleAsFaults - handled in this PR
    • EQ.RescheduleRecovered - doesn't support replacements. This means that declaring a replacement faulty, then recovering it, would result in its expiry being set back to its original, on-time expiry (via EQ.AddActiveSectors)
  • ConfirmSectorProofsValid invokes EQ.RescheduleExpirations, which won't find replacements (I think it aborts). I think this one is working as intended (but a test would be nice).
  • CompactPartitions invokes EQ.AddActiveSectors - I didn't get a chance to review this method fully, so tests would be good to make sure this works as intended.
  • handleProvingDeadline invokes:
    • EQ.RescheduleAllAsFaults: this iterates over the EQ rather than using groupSectorsByExpiration. I think it works, but a test is needed.
    • EQ.PopUntil: (when popping expired sectors). This iterates over the EQ rather than grouping by expiry. I think it works, but a test is needed.

Edge cases to consider:

  • The "new expiry" calculated for a replaced CC sector may be either before, equal to, or even after its expected expiry
  • Replaced CC sectors may be Unproven

Tests needed (@anorth's tests plus a few more):

  • In the same epoch a CC sector is proven (ConfirmSPValid), precommit a replacement for it
  • Extend a replaced sector's expiry via ExtendSectorExpiration
  • Terminate a replaced sector before it expires
  • Replace a sector that's already been replaced before it expires
  • Replace an Unproven sector
  • Declare a replaced sector faulty via DeclareFaults
    • Then declare it recovered
  • "Declare" a replaced sector faulty via "skipping" (SubmitWindowedPoSt)
    • Then declare it recovered
  • "Declare" a replaced sector faulty via missed post (picked up in handleProvingDeadline)
    • Then declare it recovered

@acruikshank force-pushed the bug/power_deducted_from_replaced_sector_twice branch from 98d87d6 to 9287481 on September 9, 2020 19:54
@acruikshank marked this pull request as ready for review September 9, 2020 21:43
@acruikshank (Contributor Author) commented

@wadealexc I believe I've added all the tests you've requested. "Replace a sector that's already been replaced before it expires" was already in place.

@acruikshank force-pushed the bug/power_deducted_from_replaced_sector_twice branch from 5f90105 to 4c45988 on September 10, 2020 14:49
@anorth (Member) left a comment


Great work, nearly there, ❤️ the tests.

if err != nil {
    return false, err
}
sectorEpochSets = append(sectorEpochSets, group)
Member


What if there's already a sectorEpochSet for this epoch, because some other sector had a declared expiration there? I think this will produce a second sectorEpochSet for the same epoch.

It might be that callers are ok with this, but it's a bit of a trap. Also, the sort.Slice below is unstable, so the result is non-deterministic with equal keys.

I think it's best to accumulate these in a map by epoch instead, and handle the possibility of the sectorEpochSet already existing at this point.

Contributor Author


Good catch. I reworked the algorithm altogether. It now just intersects es.OnTimeSectors with all the sectors it hasn't found yet to form the group, and then computes the remainder. It iterates through the expected expirations first and then goes sequentially through what is left. I think it's a little more straightforward, and it fixes this issue.
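
A rough sketch of that intersect-and-remainder step, with map-based sets standing in for the bitfields the real code manipulates via helpers like bitfield.IntersectBitField: each expiration set claims whichever of the still-unfound sectors it contains, and the remainder is carried forward to the next set.

package main

import "fmt"

// group returns the members of wanted that appear in set (the group for this
// expiration set) and removes them from wanted, leaving the remainder to be
// sought in later sets. Maps stand in for the bitfield intersect/subtract
// operations used by the actual expiration_queue.go code.
func group(set map[uint64]bool, wanted map[uint64]bool) []uint64 {
    var found []uint64
    for num := range wanted {
        if set[num] {
            found = append(found, num)
            delete(wanted, num)
        }
    }
    return found
}

func main() {
    wanted := map[uint64]bool{1: true, 2: true, 3: true}
    onTimeAt100 := map[uint64]bool{1: true, 3: true}
    onTimeAt40 := map[uint64]bool{2: true}

    fmt.Println(group(onTimeAt100, wanted)) // e.g. [1 3] (order may vary)
    fmt.Println(group(onTimeAt40, wanted))  // [2]
    fmt.Println(len(wanted))                // 0: nothing left unfound
}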

intersect, err := bitfield.IntersectBitField(es.OnTimeSectors, allRescheduled)
if err != nil {
    return false, err
}
Member


Blocking: this needs to check es.EarlySectors as well. Especially for the case where the sector has moved because it was rescheduled. I'm a bit surprised this escaped the great testing.

I think to hit this case we need to fault a sector (thus rescheduling it) and then terminate it (the other call path to this point, since it can't be faulted again).


As far as I can tell, findSectorsByExpiration is only used in the context of non-faulty sectors.


i.e. when terminating a faulty sector as you mentioned, EQ.RemoveSectors treats faulty vs non-faulty terminations differently. The faulty sectors are removed by traversing the expiration queue.

Contributor Author


I'm with @wadealexc that I don't think we could exercise this code. If we stop assuming that grouped sectors are all non-faulty, our downstream accounting gets a lot more complicated. We could add an assertion that no sectors in the set are early. I tried it and the tests still pass.

Member


Ignoring es.EarlySectors (now down in groupExpirationSet) still makes me quite uncomfortable. Adding an error check that no such sectors are being sought would help.
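
A small sketch of the kind of guard being discussed, again with maps standing in for bitfields rather than the real types: before grouping against an expiration set's on-time sectors, fail loudly if any of the sectors being sought appear among that set's early (rescheduled) sectors.

package main

import "fmt"

// checkNoEarlySectors returns an error if any sector we are looking for is
// sitting in the set's early sectors, since the grouping logic assumes it only
// ever deals with on-time (non-faulty) sectors.
func checkNoEarlySectors(early map[uint64]bool, sought map[uint64]bool) error {
    for num := range sought {
        if early[num] {
            return fmt.Errorf("sector %d found in early sectors of expiration set", num)
        }
    }
    return nil
}

func main() {
    early := map[uint64]bool{7: true}
    sought := map[uint64]bool{7: true, 9: true}
    fmt.Println(checkNoEarlySectors(early, sought)) // error mentioning sector 7
}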

@wadealexc left a comment


Looks good for the most part. Left some comments.

Comment on lines 739 to 786
if err := q.mustGet(expiration, &es); err != nil {
    return nil, err
}


re mustGet: Is it always the case that an ExpirationSet will exist in the queue at this expiration?

I know that when expiration sets are empty they get deleted - if we then query for that set here it'll fail. This would occur when querying for the on-time expiry after the sectors in question were rescheduled and the set deleted.


Same question goes for the use of mustGet in EQ.remove. I wonder if this method is safe to use - would appreciate input from @anorth

Contributor Author


There is no consequence for using mayGet here. I went ahead and changed it. If a set is erroneously missing, the sector will be left in allNotFound and trigger an error at the end of the grouping.

Member


Yes, I'm not sure mustGet is safe to use, now that we've internalized that we don't know where the queue entries are. I'm in favour of removing it (inlining into the one remaining call site).
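
A few lines illustrating the tolerant-lookup behaviour described above, with a plain map standing in for the queue's AMT: a missing expiration set is treated as empty rather than as an error, so a sector that should have been there simply stays unfound and is reported once the whole queue has been searched.

package main

import "fmt"

// getOrEmpty mirrors the "mayGet" idea: a missing expiration set is not an
// error, it just contributes nothing to the grouping.
func getOrEmpty(queue map[int64]map[uint64]bool, epoch int64) map[uint64]bool {
    if set, ok := queue[epoch]; ok {
        return set
    }
    return map[uint64]bool{}
}

func main() {
    queue := map[int64]map[uint64]bool{100: {1: true}}
    fmt.Println(len(getOrEmpty(queue, 40)))  // 0: deleted/absent set treated as empty
    fmt.Println(len(getOrEmpty(queue, 100))) // 1
}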


quantizedExpiration: {pIdx},
}, dQueue)

// but the partition's expiration set at that epoch is empty


Why is it empty?


Shouldn't DeclareFaults place the old sector in EarlySectors at the fault expiration epoch?

Contributor Author


Since the sector had an on-time expiration earlier than the fault expiration, it is left in its original expiration set, but its power is moved from active to faulty. This test is checking that the expiration set at the fault expiration is empty, indicating the sector was not moved. Prior to this PR, it would have been added to this expiration set while also being left in its original expiration set, causing it to expire twice.
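
A tiny sketch of the decision being described, using hypothetical field names rather than the real miner state: a faulted sector is only moved into an earlier (early) expiration set when the fault expiration actually precedes its on-time expiration; otherwise it stays where it is and only its power is re-classified from active to faulty.

package main

import "fmt"

type sectorSchedule struct {
    OnTimeExpiration int64
    ScheduledEpoch   int64 // epoch of the expiration set that currently holds the sector
    Faulty           bool
}

// rescheduleAsFault sketches the behaviour discussed above: move the sector
// into an early expiration only if the fault expiration is sooner than its
// on-time expiration; either way its power is now counted as faulty.
func rescheduleAsFault(s *sectorSchedule, faultExpiration int64) {
    s.Faulty = true
    if faultExpiration < s.OnTimeExpiration {
        s.ScheduledEpoch = faultExpiration // rescheduled into an early expiration set
    }
    // else: left in its original set, so the set at faultExpiration stays empty
}

func main() {
    s := sectorSchedule{OnTimeExpiration: 80, ScheduledEpoch: 80}
    rescheduleAsFault(&s, 120)              // fault expiration after the on-time expiration
    fmt.Println(s.ScheduledEpoch, s.Faulty) // 80 true: not moved, power now faulty
}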

declaredExpirations := make(map[abi.ChainEpoch]bool, len(sectors))
sectorsByNumber := make(map[uint64]*SectorOnChainInfo, len(sectors))
emptyBitfield := bitfield.New()
allNotFound := &emptyBitfield
Member


We changed to pass bitfields by value a month or so back. Change groupExpirationSet to return a non-pointer bitfield and remove the pointers here.

But also consider just using a map[uint64]struct{}, which avoids the unnecessary underlying RLE encoding.

Contributor Author


It's a little sad we can't efficiently do proper set math on bitfields. I spent some time trying to figure out how this could work. If we could copy RunIterators, we could use runs to perform multiple set operations and then convert back to bitfields efficiently (if necessary).

Using a map here ended up being less code anyway, so I went ahead and made that optimization.


@acruikshank force-pushed the bug/power_deducted_from_replaced_sector_twice branch from 122afdb to 01f6169 on September 14, 2020 14:24
@acruikshank merged commit 9b29212 into master Sep 14, 2020
@acruikshank (Contributor Author) commented

@dkkapur Yes, this PR fixes all the issues brought up in #1100. In our migration to V2 actors we now have a mechanism for correcting miner actors affected by this problem. We've added quite a bit of testing around the issue as well.

filecoin-project deleted a comment from dkkapur Oct 2, 2020
Linked issue: Miner: Declaring a replaced CC sector faulty can result in a sector existing in an expiration queue twice