This repository has been archived by the owner on Nov 26, 2020. It is now read-only.

proper synchronized mining #143

Open · wants to merge 11 commits into master
Conversation

@raulk (Member) commented Jul 13, 2020

This PR introduces proper synchronized mining via two control planes:

  • global clock control plane.
  • mining control plane.

With this PR, we no longer need to backdate genesis or decrease the BlockDelaySecs and PropagationDelaySecs configurations to accelerate mining.

Global clock control plane

Both miners and clients participate in this control plane. It is used to drive time in mining. Time starts at the current wall-clock time and moves into the future, one BlockDelaySecs at a time.

This is essentially modelled as a distributed semaphore, where all miners and clients wait until they have locally advanced their clocks. Once they all signal that fact via a topic, the semaphore allows them all to proceed at once.

The semaphore fires through a channel, which is then fed to the mining control plane.
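
Below is a minimal sketch of what this distributed semaphore could look like in Go. Everything here (the SyncService and MockClock interfaces, function and parameter names) is an illustrative assumption, not the actual testkit code:

package testkit

import (
	"context"
	"time"
)

// SyncService abstracts the pub/sub barrier used as a distributed semaphore.
// This interface is an illustrative assumption, not the real sync API.
type SyncService interface {
	// SignalEpoch announces that this instance has advanced its local clock
	// to the given epoch.
	SignalEpoch(ctx context.Context, epoch int64) error
	// WaitForAll blocks until all instances have signalled the given epoch.
	WaitForAll(ctx context.Context, epoch int64, instances int) error
}

// MockClock is a hypothetical stand-in for the manipulable clock injected
// into the Lotus build (build.Clock = n.MockClock).
type MockClock interface {
	Add(d time.Duration)
}

// synchronizeClock drives the global clock control plane: each time the
// local node asks to advance (localAdvance), it bumps the mock clock by one
// BlockDelaySecs, signals that fact on a topic, waits for every other
// instance to do the same, and only then releases the new epoch on
// globalEpoch (the semaphore firing through a channel described above).
func synchronizeClock(ctx context.Context, sync SyncService, clock MockClock,
	blockDelay time.Duration, instances int,
	localAdvance <-chan int64, globalEpoch chan<- int64) {

	for {
		select {
		case epoch := <-localAdvance:
			clock.Add(blockDelay) // advance our local clock by one epoch
			if err := sync.SignalEpoch(ctx, epoch); err != nil {
				return
			}
			if err := sync.WaitForAll(ctx, epoch, instances); err != nil {
				return // context cancelled or sync service failure
			}
			globalEpoch <- epoch // all instances are now at this epoch
		case <-ctx.Done():
			return
		}
	}
}

The key property is that globalEpoch only fires after every instance has signalled, which is what lets all participants proceed at once.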

Mining control plane

Only miners participate in this control plane. It feeds off the global clock control plane channel to conduct mining rounds.

As soon as the clock progresses, we attempt to mine a block. We then publish a message on a topic indicating who we are (our address) and whether we produced a block.

Next, we subscribe to HEAD changes (which sends us a snapshot of our current head, plus any subsequent changes), and wait until our HEAD becomes the tipset that includes all blocks we know were produced in that round.

At that point, we release a signal via a channel that feeds back into the global clock control plane, which triggers the next collective epoch.
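
A sketch of a single mining round, continuing the sketch above (same hypothetical package and imports); the Miner interface, MiningResult type, and helper names are stand-ins for the LotusMiner/testkit calls, not the PR's actual API:

// Block and MiningResult are illustrative types; MiningResult is what each
// miner publishes on the topic after attempting to mine.
type Block struct{ Cid string }

type MiningResult struct {
	Miner    string // miner address
	MinedCID string // CID of the produced block; empty if nothing was mined
}

// Miner is a hypothetical view of the operations used in a round.
type Miner interface {
	Address() string
	TotalMiners() int
	MineOne(ctx context.Context, epoch int64) (*Block, error)
	PublishMiningResult(ctx context.Context, r MiningResult) error
	AwaitHeadIncludes(ctx context.Context, cids []string) error
}

// collectResults gathers one MiningResult per miner for the current round
// and returns the CIDs of the blocks that were actually produced.
func collectResults(ctx context.Context, results <-chan MiningResult, total int) []string {
	var produced []string
	for i := 0; i < total; i++ {
		select {
		case r := <-results:
			if r.MinedCID != "" {
				produced = append(produced, r.MinedCID)
			}
		case <-ctx.Done():
			return produced
		}
	}
	return produced
}

// mineRound condenses one round of the mining control plane: attempt to mine,
// advertise the result, wait until HEAD includes every block produced this
// round, then signal epoch end so the clock control plane can advance.
func mineRound(ctx context.Context, m Miner, epoch int64,
	results <-chan MiningResult, epochEnd chan<- int64) error {

	blk, err := m.MineOne(ctx, epoch) // may legitimately produce no block
	if err != nil {
		return err
	}

	res := MiningResult{Miner: m.Address()}
	if blk != nil {
		res.MinedCID = blk.Cid
	}
	if err := m.PublishMiningResult(ctx, res); err != nil {
		return err
	}

	produced := collectResults(ctx, results, m.TotalMiners())
	if err := m.AwaitHeadIncludes(ctx, produced); err != nil {
		return err
	}

	epochEnd <- epoch
	return nil
}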


Fixes #135.
Partially addresses #87.
Uses filecoin-project/lotus#2362.


Future

This solution synchronises all miners perfectly, and does not allow us to simulate disruptions, faults, delays, etc. That said, the modularity and the delineation through clean input/output channels should allow us to mangle and manipulate things locally. However, we will need to think of ways to advertise to peers when we deliberately "opt out" of a round, because right now the code expects full participation of all miners in every mining round.

@raulk raulk added this to the Sprint 5 milestone Jul 14, 2020
@raulk raulk marked this pull request as ready for review July 14, 2020 14:11
@raulk raulk changed the title WIP proper synchronized mining proper synchronized mining Jul 14, 2020
@yusefnapora (Contributor) left a comment:


This is awesome.

I notice a lot of messages like this:

Jul 14 14:52:33.298557  INFO    101.6153s      ERROR << instance   3 >> 2020-07-14T10:52:33.298-0400    WARN    miner   miner/miner.go:179      mined block in the past {"block-time": "2020-07-15T10:44:12.000-0400", "time": "2020-07-15T10:44:12.100-0400", "duration": 0.1}   {"req_id": "8ac15ff3"}
Jul 14 14:52:33.312636  INFO    101.6294s      ERROR << instance   4 >> 2020-07-14T10:52:33.312-0400    WARN    sub     sub/incoming.go:81      Received block with large delay 25 from miner t01001    {"req_id": "8ac15ff3"}

I think the mined block in the past message is happening because we start at genesisTime + 100ms and advance by 1 epoch interval at each step, so we're always 100ms after the epoch when we actually mine the block. Maybe we should mine at a point midway between epochs? I think that would be closer to the real-world scenario, since miners want to push blocks before the epoch so they can propagate.
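
For example (illustrative only, assuming the same setup context as the n.MockClock.Set call discussed below), the clock's starting point could be placed halfway into the first epoch rather than 100ms past genesis:

// Hypothetical variant of the clock initialisation: start halfway into the
// first epoch, so blocks are mined (and can propagate) before the next
// epoch boundary rather than just after the previous one.
blockDelay := time.Duration(build.BlockDelaySecs) * time.Second
n.MockClock.Set(genesisTime.Add(blockDelay / 2))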

Review thread on lotus-soup/testkit/node_time.go (outdated, resolved)
@raulk (Member, Author) commented Jul 14, 2020

@yusefnapora

Jul 14 14:52:33.312636  INFO    101.6294s      ERROR << instance   4 >> 2020-07-14T10:52:33.312-0400    WARN    sub     sub/incoming.go:81      Received block with large delay 25 from miner t01001    {"req_id": "8ac15ff3"}

is a bit of a red herring. The block validator is not discarding blocks we've already received, so it appears that gossipsub can deliver the same block multiple times. The first time we receive the block, we incorporate it into our HEAD, and when all blocks are received, we advance time. If, due to gossipsub amplification, a duplicate block is received after we've fast-forwarded the clock by 25s, it will still be delivered and will cause this warning.

@schomatis (Contributor) commented:

The block validator is not discarding blocks we've already received,

Not totally familiar with gossipsub internals but we should drop (valid) repeated blocks in

https://github.com/filecoin-project/lotus/blob/13de81b3b27926fd5db1e7334d540f8e6d72c774/chain/sub/incoming.go#L220-L225

build.Clock = n.MockClock

// start at genesis + 100ms.
n.MockClock.Set(genesisTime.Add(100 * time.Millisecond))
Contributor:

This still triggers (after the last set of patches) the mined block in the past warning and related ones. In total we are advancing BlockDelaySecs plus this slight drift, putting us outside of the mining epoch window: the miner only advances BlockDelaySecs from its tipset and sees that it lost the mining window by 100ms. Ideally this should be a subtraction, but in my local tests that breaks other assumptions (understandably, because I'm drifting before genesis time). Maybe we should jump a smaller amount than BlockDelaySecs the first time (to remain inside the mining window when adding the full BlockDelaySecs amount in subsequent rounds).
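
A standalone illustration of that arithmetic (hypothetical values; BlockDelaySecs assumed to be 25s, consistent with the 25s delays in the logs above):

package main

import (
	"fmt"
	"time"
)

// Illustrates why the 100ms drift puts every advance just past the epoch
// boundary, and how shortening only the first jump keeps later full
// BlockDelaySecs jumps aligned with the boundaries.
func main() {
	genesis := time.Now()
	blockDelay := 25 * time.Second // stand-in for build.BlockDelaySecs
	drift := 100 * time.Millisecond

	// Current behaviour: start at genesis+drift, then add blockDelay per round.
	clock := genesis.Add(drift).Add(blockDelay)
	fmt.Println(clock.Sub(genesis)) // 25.1s: 100ms past the epoch-1 boundary

	// Suggested behaviour: make the first jump shorter by the drift.
	clock = genesis.Add(drift).Add(blockDelay - drift)
	fmt.Println(clock.Sub(genesis)) // exactly 25s: on the boundary
}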

Contributor:

I think the simplest solution here would be to advance the clock after the mining round and not before as it seems to be done now. (I think the change would translate to something like moving the clock advance logic from the local to the global case in SynchronizeClock.)

Member Author:

That would make sense. I did try different variants and in some cases I ended up hitting the "got a block from the future" error. I think we do need to get this absolutely right, so I'm willing to invest time in nailing the state machine right now so that we're all happy and we can build on a solid base.

Member Author:

I think advancing the clock just before the mining round won't work because of races between miners. Miners who advance their clock sooner will end up propagating what appear to be "blocks from the future" to miners who still haven't advanced their clocks. That's why I opted for this approach, as we have the assurance that all miners are in the right epoch when mining commences in concert.

After removing the drift, and testing atop this patch, we no longer have the spurious logs.

The only one we're left with is the occasional:

WARN	sub	sub/incoming.go:81	Received block with large delay 25 from miner t01001	{"req_id": "a305c648"}

which I think has to do with gossipsub propagation and is not worth investigating right now, IMO.

@schomatis (Contributor) commented:

I'm going to give it an in-depth review and run tomorrow but the new logic decoupling of this last iteration is beautiful 🎨

@schomatis (Contributor) left a comment:

Looking good, left some comments before giving this a run, nothing major.

Resolved review threads on lotus-soup/testkit/node_time.go, lotus-soup/testkit/role_miner_sync.go, and lotus-soup/testkit/role_miner.go.
@schomatis (Contributor) left a comment:

This is looking very good. All the major comments have been addressed. I still need to run this and check the "mined in the past" delay but from previous comments it seems this has been fixed.

One last request: I need some more clarity on the channels that control this system. I left some untidy notes about this which might not add too much value (apologies for that; I can add more clarifications later when I have more time); their intent is to illustrate some of the confusion I have when I try to reason about this code. I actually know what the code is doing, but that is mainly because I'm very familiar with the protocol and can imagine how I would go about doing this, which is basically aligned with this implementation. But when I actually read through the synchronization logic, I still take a noticeable cognitive hit following the channels from one function to another.

// all blocks that were produced in this epoch.
// * once that happens, we emit an epoch end signal, which the caller can
// interpret as a signal to advance the clock and begin a new epoch.
func (m *LotusMiner) SynchronizeMiner(ctx context.Context, epochStart <-chan abi.ChainEpoch, epochEnd chan<- abi.ChainEpoch) {
Contributor:

epochStart:

  • what do we start? Mining. It is basically a "go mine" signal.
  • epoch is confusing; it may seem to imply that we mine on the epoch passed in the channel, when in fact the epoch is used only for debugging purposes.
  • the channel seems to be doing two things at once: signalling to mine (this could just be an empty struct) and also telling you which epoch the miner is standing on (do we need to obtain that information from here?)

Contributor:

epochEnd: similar. Do we even read the epoch we're sending, or is this just "I'm done syncing"? The terms epoch and end seem to imply to me that we are controlling the clock here, but I think this is just a signal that I (the miner) did something (either mined or synced the chain), unrelated to starting or ending an epoch.

Member Author:

Basically this routine is intended to behave like an "actor", as in the actor-oriented programming model. It receives signals to work on epochStart and emits "work done" signals on epochEnd. In Akka terms, you would liken the epochStart channel to an inbox and epochEnd to an outbox (using their mailbox analogy).

I haven't thought too much about how these epochs are actually used; I started making some of these channels of element type struct{}, but then I realised that I might as well propagate useful info. Although in hindsight, I see how that decision (and loose usage of the sent/recvd values) might confuse others.

How about we name these channels mineStart and mineDone? From the viewpoint of this routine, that establishes a clear boundary of signals demarcating each unit of work.
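
For reference, a sketch of the renamed signature (mirroring the existing SynchronizeMiner signature quoted above, with only the channel names changed as proposed):

// The channels now signal units of mining work rather than epoch boundaries;
// the epoch value they carry would be kept purely for logging/debugging.
func (m *LotusMiner) SynchronizeMiner(ctx context.Context, mineStart <-chan abi.ChainEpoch, mineDone chan<- abi.ChainEpoch)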

Contributor:

Those names sound good.

//
// This method will return immediately, and will spawn a background goroutine
// that performs the clock synchronisation work.
func (n *LotusNode) SynchronizeClock(ctx context.Context) (localAdvance chan<- abi.ChainEpoch, globalEpoch <-chan abi.ChainEpoch, err error) {
Contributor:

global vs local: I have trouble grappling with the topology implied by this locality.

"the user requests local advancement": it doesn't seem to me that the user (miner) requests advancement; it just says "I've synced to this epoch". Yes, you can interpret that as "I'm done, let's move on", but local/global implies to me some symmetry between peers, differentiated only by how we order the group. Here, instead, we have miners and a central time/chain control; the miners don't have any leverage on that control (in more advanced use cases, even if a miner says "I'm done" in this round we may still not advance until some other event happens). I don't see the relation that could classify them into local/global within a centralized control system.

@raulk (Member, Author) commented Jul 17, 2020:

SynchronizeClock is another actor (actor-oriented programming). It encapsulates all the control of the clock. No other process in the system (not even the SynchronizeMiner actor) is entitled to change the clock (although we don't enforce that prohibition).

The idea behind this is that when the local node decides it's ready to advance the clock (once its enlisted predicates are satisfied), it sends a signal to localAdvance.

Right now we ACTUALLY advance the local clock and advertise that fact to the world. In the future, we could use some form of 3PC (three-phase commit), where we PREPARE [phase 1] the advancement and nodes opt in or out of that round via the sync service; once all nodes have PREPARED, we actually COMMIT [phase 2] the local clock advancement by manipulating the mock clock, and then CERTIFY [phase 3] that we are at the new epoch.

Once all nodes have passed the CERTIFY phase, that's when we release the signal on globalEpoch to trigger mining and/or other downstream processes.
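
A purely hypothetical sketch of those phases; the Barrier and Clock interfaces and all names here are illustrative, and nothing like this exists in the PR:

package sketch

import (
	"context"
	"fmt"
	"time"
)

// Barrier abstracts "every instance signals a state, then everyone proceeds".
type Barrier interface {
	SignalAndWait(ctx context.Context, state string, instances int) error
}

// Clock is a stand-in for the manipulable mock clock.
type Clock interface {
	Add(d time.Duration)
}

// advanceClock3PC walks the three phases described above for a single epoch.
func advanceClock3PC(ctx context.Context, b Barrier, clock Clock,
	blockDelay time.Duration, epoch int64, instances int) error {

	// Phase 1: PREPARE. Announce the intent to advance; nodes could opt in
	// or out of the round here via the sync service.
	if err := b.SignalAndWait(ctx, fmt.Sprintf("prepare-%d", epoch), instances); err != nil {
		return err
	}

	// Phase 2: COMMIT. Every node has prepared, so actually move the clock.
	clock.Add(blockDelay)
	if err := b.SignalAndWait(ctx, fmt.Sprintf("commit-%d", epoch), instances); err != nil {
		return err
	}

	// Phase 3: CERTIFY. Confirm all nodes observe the new epoch; only after
	// this would the globalEpoch signal be released downstream.
	return b.SignalAndWait(ctx, fmt.Sprintf("certify-%d", epoch), instances)
}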

Contributor:

Yes, I think I had the 3PC model you're describing in mind. I'm fine if this is not implemented right now.

@schomatis (Contributor) left a comment:

90% of the Received block with large delay warnings are coming from the client (instance 4), which is running one epoch ahead (not sure why), so I'm not sure I agree with the earlier diagnosis of gossipsub amplification. I would like to understand and preferably fix (or justify) this so I can look in depth at the other 10% of the warnings, which come from miner delays. (From what I've seen, miner delays come from blocks the same miner mined, which the validator does not discard because it is in fact the first time it sees them, but I would like to confirm this.)

edit: There is also a soft race condition (for lack of a better term) where each node advances its clock once all miners are done mining, but not necessarily after all nodes are done syncing, which might also explain the reported delays (not totally sure; I'm still having trouble following the local/end and global/start channels back and forth).

Comment on lines +151 to +158
go func() {
// for now, clients are always one epoch ahead locally with regards to miners.
// keeping the flywheel spinning!
for epoch := range globalEpochCh {
localAdvanceCh <- epoch + 1
}
}()
}
Contributor:

I don't follow the rationale behind this advancement from the comment.

@raulk (Member, Author) commented Jul 17, 2020

@schomatis Good job catching the fact that it's the client that reports the "Received block with large delay" message. In general, we need to revisit the client's behaviour in all this synchronisation. I'm not convinced by the current solution.

The decision I had to take was: "do I want the client to wait until its HEAD is updated, just like miners do?" That seemed a bit too heavy-handed to me, so I decided not to wait for that condition right now, because this PR is just the first step to get fully deterministic tests (a) working [see #146], and (b) doing useful stuff (we will need to simulate faults, forks, mining halting, etc.)

Right now the advancement logic is hardcoded, but to enable (b), we need some richer semantics/API where we can attach dynamically-evaluated predicates such that:

  1. Miners can decide to opt-out of a mining round (advance the clock but do not attempt to mine a block).
  2. Clients can watch the mining happen passively, then 1000 tipsets into the chain, decide to enroll in the ensemble, forward their clock to the current time, bootstrap against the network, and sync.
  3. etc.

In conclusion, I'm not so worried about getting everything right at this precise instance, because this PR is just laying the groundwork and we'll iterate on this code a lot as we introduce those predicates, checkpoints, etc.
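
A purely hypothetical illustration of what such attachable predicates might look like (none of these names exist in the codebase):

// EpochCtx carries whatever a predicate might need to decide about a round.
type EpochCtx struct {
	Epoch     int64
	HeadCount int // tipsets observed so far, for "1000 tipsets in" style rules
}

// MinePredicate decides whether a miner participates in a given round;
// JoinPredicate decides whether a passive client enrolls, fast-forwards its
// clock, bootstraps against the network, and starts syncing.
type MinePredicate func(e EpochCtx) bool
type JoinPredicate func(e EpochCtx) bool

// Examples: a miner that opts out of every 10th round, and a client that
// joins the ensemble after 1000 tipsets.
var (
	skipEveryTenth MinePredicate = func(e EpochCtx) bool { return e.Epoch%10 != 0 }
	joinAfter1000  JoinPredicate = func(e EpochCtx) bool { return e.HeadCount >= 1000 }
)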

@schomatis (Contributor) commented:

In conclusion, I'm not so worried about getting everything right at this precise instance, because this PR is just laying the groundwork and we'll iterate on this code a lot as we introduce those predicates, checkpoints, etc.

I don't fully understand all the details here, so I still don't have complete confidence in this implementation (that is on me, not an issue with the code), but I likely won't have time to review anything else today, so if this is blocking the ongoing effort feel free to go ahead and merge this; we can iterate on it as we go.

Comment on lines +18 to +19
// build.BlockDelaySecs = 2
// build.PropagationDelaySecs = 1
Member Author:

Need to sort this out. Synchronised mining doesn't need to reduce the block delay and propagation delay.

@schomatis (Contributor) left a comment:

In general, none of the future features proposed here seem to require two different nodes to see a different time. The biggest goal of this PR should just be to make sure all the nodes are in sync; what each node does (mine or not mine, stay in consensus or fork) can be controlled by other means. Having these "local" clocks seems like adding too much complexity for a first iteration without much benefit, and as shown in #154 it also causes clock misalignments that result in unwanted behavior. Still, I'm approving this PR to explicitly signal that I don't want to stall this effort, and what I'm describing here could potentially be fixed in later iterations (at a bigger cost, IMO, than discussing and fixing it here).

@raulk raulk added the workstream/e2e-tests Workstream: End-to-end Tests label Jul 28, 2020
Labels: workstream/e2e-tests Workstream: End-to-end Tests

Successfully merging this pull request may close these issues: basic synchronized mining is broken