prevent faults from double subtracting cc upgrade power #1129
Conversation
Codecov Report

@@           Coverage Diff            @@
##           master    #1129   +/-   ##
========================================
- Coverage    75.3%    75.3%   -0.1%
========================================
  Files          51       51
  Lines        6320     6396     +76
========================================
+ Hits         4763     4819     +56
- Misses        927      938     +11
- Partials     630      639      +9
I haven't fully reviewed this yet, but I think this problem is more widespread. Every call to groupSectorsByExpiration is suspect if it expects to find the sectors there. Checking the call tree:
- EQ.removeActiveSectors, must be updated to support replaced sectors for:
- EQ.RemoveSectors (from TerminateSectors)
- EQ.ReplaceSectors (from ExtendSectorExpiration)
- EQ.RescheduleExpirations (from CC upgrade - should work but do nothing)
- EQ.AddActiveSectors: ok, these sectors aren't in the queue yet
- EQ.RescheduleAsFaults: handled in this PR already
I think we should rename for clarity:
- groupSectorsByExpiration -> groupNewSectorsByDeclaredExpiration (only for AddActiveSectors)
- groupSectorsByExpirationWithRescheduled -> eq.findSectorsByExpiration
Tests we need:
- Terminate a replaced sector before it expires
- Extend a replaced sector's expiration
- Replace a sector that's already been replaced before it expires
- Declare fault on a replaced sector before it expires
- Fail to prove a replaced sector before it expires (detected fault)
cc @wadealexc
Force-pushed from 7b4a5c9 to 4ae141e
I don't have the bandwidth to review this fully yet (but I will make time before the end of the week). Regarding @anorth's comments, yes - I agree that the problem is more widespread than what was mentioned in #1100:
Edge cases to consider:
Tests needed (@anorth's tests plus a few more):
Force-pushed from 98d87d6 to 9287481
@wadealexc I believe I've added all the tests you've requested. "Replace a sector that's already been replaced before it expires" was already in place.
Force-pushed from 5f90105 to 4c45988
Great work, nearly there, ❤️ the tests.
```go
if err != nil {
	return false, err
}
sectorEpochSets = append(sectorEpochSets, group)
```
What if there's already a sectorEpochSet for this epoch, because some other sector had a declared expiration there? I think this will produce another sectorEpochSet for the same epoch.
It might be that callers are ok with this, but it's a bit of a trap. Also, the sort.Slice below is unstable, so it is non-deterministic with equal keys.
I think it's best to accumulate these in a map by epoch instead, and handle the possibility of the sectorEpochSet already existing at this point.
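The map-based accumulation suggested here could look roughly like this sketch (epoch and sector types simplified to stdlib ints; the sectorEpochSet fields and groupByEpoch name are illustrative, not the actor code's actual API):

```go
package main

import (
	"fmt"
	"sort"
)

// Simplified stand-in for the actors' sectorEpochSet.
type sectorEpochSet struct {
	epoch   int64
	sectors []uint64
}

// groupByEpoch merges all sectors sharing an epoch into a single set, so two
// groups for the same epoch can never appear in the output. The final sort is
// over unique epoch keys, so it is deterministic even with sort.Slice.
func groupByEpoch(expirations map[uint64]int64) []sectorEpochSet {
	byEpoch := map[int64]*sectorEpochSet{}
	for sector, epoch := range expirations {
		ses, ok := byEpoch[epoch]
		if !ok {
			ses = &sectorEpochSet{epoch: epoch}
			byEpoch[epoch] = ses
		}
		ses.sectors = append(ses.sectors, sector)
	}
	groups := make([]sectorEpochSet, 0, len(byEpoch))
	for _, ses := range byEpoch {
		sort.Slice(ses.sectors, func(i, j int) bool { return ses.sectors[i] < ses.sectors[j] })
		groups = append(groups, *ses)
	}
	// Keys are unique per group, so equal-key instability cannot arise here.
	sort.Slice(groups, func(i, j int) bool { return groups[i].epoch < groups[j].epoch })
	return groups
}

func main() {
	// Sectors 1 and 2 share epoch 10; they end up in one group, not two.
	for _, g := range groupByEpoch(map[uint64]int64{1: 10, 2: 10, 3: 20}) {
		fmt.Println(g.epoch, g.sectors)
	}
}
```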
Good catch. I reworked the algorithm altogether. It now just intersects es.OnTimeSectors with all the sectors it hasn't found yet to form the group, then computes the remainder. It checks the expected expirations first and then goes sequentially through what is left. I think it's a little more straightforward, and it fixes this issue.
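The intersect-then-remainder step described above can be sketched with plain sets standing in for bitfields (the real code uses go-bitfield operations such as bitfield.IntersectBitField; the set type and function name here are illustrative):

```go
package main

import "fmt"

type set map[uint64]struct{}

// visitExpirationSet models one queue-entry visit: the sought sectors present
// in this entry's on-time sectors form the group; the rest remain to be found
// in later queue entries.
func visitExpirationSet(onTime, notYetFound set) (group, remainder set) {
	group, remainder = set{}, set{}
	for s := range notYetFound {
		if _, ok := onTime[s]; ok {
			group[s] = struct{}{} // found at this entry
		} else {
			remainder[s] = struct{}{} // keep looking in later entries
		}
	}
	return group, remainder
}

func main() {
	onTime := set{1: {}, 2: {}, 5: {}}
	sought := set{2: {}, 5: {}, 9: {}}
	group, rest := visitExpirationSet(onTime, sought)
	fmt.Println(len(group), len(rest)) // group {2,5}, remainder {9}
}
```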
```go
intersect, err := bitfield.IntersectBitField(es.OnTimeSectors, allRescheduled)
if err != nil {
	return false, err
}
```
Blocking: this needs to check es.EarlySectors as well, especially for the case where the sector has moved because it was rescheduled. I'm a bit surprised this escaped the great testing.
I think to hit this case we need to fault a sector (thus rescheduling it) and then terminate it (the other call path to this point, since it can't be faulted again).
As far as I can tell, findSectorsByExpiration is only used in the context of non-faulty sectors.
i.e. when terminating a faulty sector as you mentioned, EQ.RemoveSectors treats faulty vs non-faulty terminations differently. The faulty sectors are removed by traversing the expiration queue.
I'm with @wadealexc that I don't think we could exercise this code. If we stop assuming that grouped sectors can be a mix of faulty and non-faulty sectors, our downstream accounting gets a lot more complicated. We could add an assertion that no sectors in the set are early. I tried it and the tests still pass.
Ignoring es.EarlySectors (now down in groupExpirationSet) still makes me quite uncomfortable. Adding an error check that no such sectors are being sought would help.
Looks good for the most part. Left some comments.
```go
if err := q.mustGet(expiration, &es); err != nil {
	return nil, err
}
```
re mustGet: is it always the case that an ExpirationSet will exist in the queue at this expiration?
I know that when expiration sets are empty they get deleted - if we then query for that set here it'll fail. This would occur when querying for the on-time expiry, but the sectors in question were rescheduled and the set deleted.
Same question goes for the use of mustGet in EQ.remove. I wonder if this method is safe to use - would appreciate input from @anorth
There is no consequence for using mayGet here. I went ahead and changed it. If a set is erroneously missing, the sector will be left in allNotFound and trigger an error at the end of the grouping.
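The mayGet-style tolerance being described could be sketched like this (a toy queue keyed by epoch; the findAt/seek names and types are illustrative, not the actor's API):

```go
package main

import "fmt"

type expirationSet struct{ onTime map[uint64]struct{} }

// A toy expiration queue keyed by quantized epoch.
type queue map[int64]*expirationSet

// findAt tolerates a missing entry (mayGet semantics): callers get
// (nil, false) instead of an error when an empty set was deleted.
func (q queue) findAt(epoch int64) (*expirationSet, bool) {
	es, ok := q[epoch]
	return es, ok
}

// seek removes any sectors found at this epoch from notFound; anything still
// present after all lookups signals a genuinely missing sector.
func seek(q queue, epoch int64, notFound map[uint64]struct{}) {
	if es, ok := q.findAt(epoch); ok {
		for s := range es.onTime {
			delete(notFound, s)
		}
	}
}

func main() {
	q := queue{10: {onTime: map[uint64]struct{}{1: {}}}}
	notFound := map[uint64]struct{}{1: {}, 2: {}}
	seek(q, 10, notFound)
	seek(q, 20, notFound) // deleted/missing set: no error, just no progress
	if len(notFound) > 0 {
		fmt.Println("error: sectors not found in queue:", len(notFound))
	}
}
```

The point is that a missing set is only an error if, after every lookup, some sought sector was never located anywhere.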
Yes, I'm not sure mustGet is safe to use, now that we've internalized that we don't know where the queue entries are. I'm in favour of removing it (inlining into the one remaining call site).
```go
	quantizedExpiration: {pIdx},
}, dQueue)

// but partition's expiration set at that epoch is empty
```
Why is it empty?
Shouldn't DeclareFaults place the old sector in EarlySectors at the fault expiration epoch?
Since the sector had an on-time expiration earlier than the fault expiration, it is left in its original expiration set, but its power is moved from active to faulty. This is checking that the expiration set at the fault expiration is empty, indicating the sector was not moved there. Prior to this PR, it would have been added to this expiration set while also remaining in its original expiration set, causing it to expire twice.
```go
declaredExpirations := make(map[abi.ChainEpoch]bool, len(sectors))
sectorsByNumber := make(map[uint64]*SectorOnChainInfo, len(sectors))
emptyBitfield := bitfield.New()
allNotFound := &emptyBitfield
```
We changed to pass bitfields by value type a month or so back. Change groupExpirationSet to return a non-pointer bitfield and remove the pointers here.
But also consider just using a map[uint64]struct{}, which will save the unnecessary underlying RLE.
It's a little sad we can't efficiently do proper set math on bitfields. I spent some time trying to figure out how this could work. If we could copy RunIterators, we could use runs to perform multiple set operations and then convert back to bitfields efficiently (if necessary).
Using a map here ended up being less code anyway, so I went ahead and made that optimization.
Force-pushed from 122afdb to 01f6169
closes #1100
Motivation
All sectors live within an expiration set, within an expiration queue, within a partition, within a deadline. This expiration set is the set at the sector's Expiration field. When we reschedule expirations (say for a CC upgrade or a fault), we move the sector into a different set to trigger an earlier expiration. We do not change the sector's Expiration field, so the new set cannot necessarily be determined from the sector's on-chain info.

Whenever we perform an operation that affects sector expirations, we group the sectors by expiration and operate on each set en masse. If sectors have been rescheduled, we end up altering expiration set stats for sectors that are no longer there, and failing to update the stats for the sets to which they have moved. In the case of declaring a fault for a CC-upgraded sector, the end result is that we subtract the sector's power twice from the miner's claim (among other accounting mistakes).
The solution implemented in this PR is to check whether sectors expected to be in an expiration set are actually there. Sectors that are not are grouped by iterating through the expiration queue's sets instead. Not adding sectors to groups for expiration sets that don't contain them should be sufficient to avoid breaking accounting invariants, while finding the sectors in the correct sets should deliver the intended behavior.
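Under the simplifying assumption that the queue is a sorted list of (epoch, sector-set) entries (the real code uses AMT-backed bitfields), the two-phase search described above could be sketched as:

```go
package main

import "fmt"

// A toy expiration queue entry: an epoch and the sectors expiring then.
type entry struct {
	epoch   int64
	sectors map[uint64]struct{}
}

// findByExpiration first probes each sector's declared epoch, then walks the
// queue entries in order to locate rescheduled sectors. It returns a map of
// epoch -> sectors actually found there, or an error if any sector is absent.
func findByExpiration(queue []entry, declared map[uint64]int64) (map[int64][]uint64, error) {
	byEpoch := map[int64]map[uint64]struct{}{}
	for _, e := range queue {
		byEpoch[e.epoch] = e.sectors
	}
	found := map[int64][]uint64{}
	notFound := map[uint64]struct{}{}
	// Phase 1: look where each sector's declared Expiration says it should be.
	for sector, epoch := range declared {
		if set, ok := byEpoch[epoch]; ok {
			if _, in := set[sector]; in {
				found[epoch] = append(found[epoch], sector)
				continue
			}
		}
		notFound[sector] = struct{}{}
	}
	// Phase 2: scan the queue sequentially for rescheduled sectors.
	for _, e := range queue {
		for sector := range notFound {
			if _, in := e.sectors[sector]; in {
				found[e.epoch] = append(found[e.epoch], sector)
				delete(notFound, sector)
			}
		}
	}
	if len(notFound) > 0 {
		return nil, fmt.Errorf("%d sectors not found in expiration queue", len(notFound))
	}
	return found, nil
}

func main() {
	q := []entry{
		{epoch: 5, sectors: map[uint64]struct{}{7: {}}},  // sector 7 rescheduled early
		{epoch: 10, sectors: map[uint64]struct{}{1: {}}}, // sector 1 where declared
	}
	got, err := findByExpiration(q, map[uint64]int64{1: 10, 7: 10})
	fmt.Println(got, err)
}
```

Sector 7 declared epoch 10 but was rescheduled to epoch 5; phase 2 finds it there instead of silently mutating the wrong set.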
Proposed Changes
- Add findSectorsByExpiration, which groups sectors into the correct expiration sets even if they have been rescheduled.
- Use findSectorsByExpiration in ExpirationQueue.RescheduleAsFaults and ExpirationQueue.removeActiveSectors.

Without the changes to grouping, only 4 and 7 would have behaved as expected.
Note: Tests that do not appear in the list have been moved without change and don't require the same scrutiny as the rest.