
fully deterministic, reproducible test scenarios #146

raulk opened this issue Jul 15, 2020 · 2 comments


raulk commented Jul 15, 2020

Context

With #143 (proper synchronized mining), we can synchronise mining across a fleet of miners so that they advance in lockstep, driven by a global clock.

However, there are other chain-dependent async processes, such as window PoSt (fault declaration, recovery declaration, posting proofs) and sealing, that we need to wait for at every chain height. We need to hook into those processes before we allow the clock to advance.

Additionally, some or all of those processes generate messages asynchronously, so we may need to coordinate across miners and only advance the global clock once all those messages have been received in the corresponding mempools.

Dependent downstream processes

Windowed PoSt runner (Lotus: storage package)

It currently subscribes to head changes and uses them to drive (at least) these three processes:

  • sector proving
  • fault declaration
  • recovery declaration

The way it works: upon a new head that opens a new proving window, we wait for StartConfidence epochs before actually doing anything (to avoid wasted computation in case of reorgs). We then:

  1. check recoveries of sectors up for challenge in the NEXT proving window.
  2. check faults of sectors up for challenge in the NEXT proving window.
  3. calculate the proofs of sectors up for challenge in THIS proving window.

Each of those steps generates and broadcasts messages. For each step that generated a message, we wait for build.MessageConfidence epochs on top of it before continuing with the next step.
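To make the shape of the problem concrete, here is a minimal, self-contained sketch of the blocking flow described above. All identifiers are placeholders (the real code lives in Lotus's storage package), and the constants merely stand in for StartConfidence and build.MessageConfidence:

```go
package main

import (
	"context"
	"fmt"
)

type msgCID string

// Each step pushes a message to the mpool and returns its CID.
type step func(ctx context.Context) (msgCID, error)

// Placeholder steps; the real ones live in Lotus's storage package.
func declareRecoveries(ctx context.Context) (msgCID, error) { return "recoveries-msg", nil }
func declareFaults(ctx context.Context) (msgCID, error)     { return "faults-msg", nil }
func submitPoSt(ctx context.Context) (msgCID, error)        { return "post-msg", nil }

// waitEpochs stands in for blocking until n epochs have elapsed.
func waitEpochs(ctx context.Context, n int) error { return nil }

func runDeadline(ctx context.Context) error {
	// Defer work to survive shallow reorgs of the triggering head.
	const startConfidence = 4 // stands in for StartConfidence
	if err := waitEpochs(ctx, startConfidence); err != nil {
		return err
	}
	for _, s := range []step{declareRecoveries, declareFaults, submitPoSt} {
		m, err := s(ctx)
		if err != nil {
			return err
		}
		// The catch-22: each step blocks on chain progress (epochs on
		// top of its message) before the next step is allowed to run.
		const messageConfidence = 5 // stands in for build.MessageConfidence
		if err := waitEpochs(ctx, messageConfidence); err != nil {
			return err
		}
		fmt.Println("completed step, message:", m)
	}
	return nil
}

func main() {
	_ = runDeadline(context.Background())
}
```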

^^ This poses a catch-22 for synchronised mining: the logic whose completion we're waiting on before advancing the chain is, in turn, waiting for the chain to advance. IMO this logic is wrong to begin with: we should have a message sentinel to which we delegate watching and rebroadcasting messages. We should not BLOCK window PoSt waiting for messages to appear. The current logic also has other weaknesses.

Proposed mechanics

  1. Make the window PoSt runner push messages to the mpool, but not wait for them to appear on chain. Make it run linearly, in one shot.
  2. On every run, emit an event that reports how many sectors were faulted, recovered, and which sectors we proved. Also include the messages we posted (both message CID and full message).
  3. An mpool sentinel would subscribe to these events to learn which messages it needs to watch for on chain.
  4. Our synchronised mining logic would subscribe to these events to know when window PoSt has run and which messages were pushed. All instances would wait for window PoSt to run and for all generated messages to appear in their local mpool (via MpoolSub) before advancing to the next epoch; a sketch follows below.
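A minimal sketch of these mechanics, using a plain channel as a stand-in for the event plumbing. WdPoStEvent and its fields are hypothetical, not existing Lotus types; MpoolSub is the real Lotus API that step 4 alludes to:

```go
package main

import "fmt"

// WdPoStEvent is a hypothetical event type. Its fields mirror the
// proposal: fault/recovery counts, proved sectors, and the messages
// that were pushed to the mpool.
type WdPoStEvent struct {
	Faulted, Recovered int
	Proven             []uint64 // sector numbers proved this run
	MessageCIDs        []string // CIDs of the pushed messages
}

func main() {
	events := make(chan WdPoStEvent)
	done := make(chan struct{})

	// (3) The mpool sentinel subscribes to learn which messages it must
	// watch (and rebroadcast) until they land on chain.
	go func() {
		defer close(done)
		for ev := range events {
			for _, c := range ev.MessageCIDs {
				fmt.Println("sentinel: watching", c)
			}
		}
	}()

	// (1)+(2) The runner pushes its messages, runs linearly in one shot,
	// and emits an event instead of blocking on chain progress.
	events <- WdPoStEvent{Recovered: 1, Proven: []uint64{7}, MessageCIDs: []string{"<cid>"}}
	close(events)
	<-done

	// (4) The synchronised-mining logic would subscribe in the same way,
	// holding the clock until every reported message appears in the local
	// mpool (observed via MpoolSub) before releasing the next epoch.
}
```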

Deal/sector sealing

Deals have epoch deadlines. If the chain advances too fast (as is the case with #143), sealing will never have enough time to run, and therefore deals will always fail. Ideas:

  1. Fake/dummy sealing => reduces the time it takes to seal. Oni is not testing the sealing procedures themselves, so we should be fine stubbing this out (see the sketch after this list).
  2. Sealing callbacks/subscriptions.
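For idea 1, the gist is dependency injection of a no-op sealer. A sketch under the assumption of a hypothetical Sealer interface (the real Lotus sealing stack is structured differently):

```go
package main

import (
	"fmt"
	"time"
)

// Sealer is a hypothetical interface; the point is that a no-op
// implementation can be swapped in wherever sealing is invoked.
type Sealer interface {
	Seal(sector uint64) error
}

// realSealer stands in for the proofs-backed sealer, which can take
// far longer than a deal's epoch deadline under a fast global clock.
type realSealer struct{}

func (realSealer) Seal(sector uint64) error {
	time.Sleep(time.Hour) // placeholder for expensive PoRep work
	return nil
}

// fakeSealer completes instantly: Oni isn't testing sealing itself,
// so a dummy result keeps deals within their deadlines.
type fakeSealer struct{}

func (fakeSealer) Seal(sector uint64) error {
	fmt.Println("fake-sealed sector", sector)
	return nil
}

func main() {
	var s Sealer = fakeSealer{} // injected in test builds
	_ = s.Seal(1)
}
```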

raulk commented Jul 15, 2020

Had a chat with @magik6k.


raulk commented Jul 15, 2020

I quite like the journal idea, as it stops the proliferation of ad-hoc subscriptions for watching purposes. I think of it as an authoritative audit trail of system processes and decisions.

A better, normalised event subscription solution long-term might be to use https://github.com/libp2p/go-eventbus for typed events all around (killing the ad-hoc methods like SubHeadChanges, SubscribeHeadChanges). And the audit trail / journal could do a wildcard subscription and dump to disk. But this is out of scope right now — maybe in the future when a refactor is due.
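For reference, typed emit/subscribe with go-eventbus looks roughly like this; EpochAdvanced is a made-up event type for illustration:

```go
package main

import (
	"fmt"

	eventbus "github.com/libp2p/go-eventbus"
)

// EpochAdvanced is a hypothetical typed event.
type EpochAdvanced struct{ Epoch int64 }

func main() {
	bus := eventbus.NewBus()

	// Typed subscription: a journal (or any watcher) consumes from Out().
	sub, err := bus.Subscribe(new(EpochAdvanced))
	if err != nil {
		panic(err)
	}
	defer sub.Close()

	// Typed emitter: would replace ad-hoc methods like SubscribeHeadChanges.
	em, err := bus.Emitter(new(EpochAdvanced))
	if err != nil {
		panic(err)
	}
	defer em.Close()

	if err := em.Emit(EpochAdvanced{Epoch: 42}); err != nil {
		panic(err)
	}

	evt := <-sub.Out() // delivered as interface{}
	fmt.Printf("got %+v\n", evt.(EpochAdvanced))
}
```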

I'll probably create an in-memory journal that I can clear/flush on every epoch to avoid unnecessary history build-up, and on the Oni side we can use https://github.com/mitchellh/mapstructure to consume journal entries into typed structs.
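A sketch of that consumption path, assuming a hypothetical WdPoStEntry schema for journal entries (mapstructure.Decode is the real API):

```go
package main

import (
	"fmt"

	"github.com/mitchellh/mapstructure"
)

// WdPoStEntry is a hypothetical shape for a journal entry about a
// window PoSt run; the real schema would be defined on the Lotus side.
type WdPoStEntry struct {
	Epoch     int64
	Recovered int
	Proven    []uint64
}

func main() {
	// A journal entry as it might arrive on the Oni side: an untyped map.
	raw := map[string]interface{}{
		"Epoch":     120,
		"Recovered": 1,
		"Proven":    []uint64{7, 9},
	}

	// mapstructure decodes the untyped entry into the typed struct.
	var entry WdPoStEntry
	if err := mapstructure.Decode(raw, &entry); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", entry)
}
```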

@raulk raulk added the workstream/e2e-tests Workstream: End-to-end Tests label Jul 28, 2020