
single/global clock advancement #154

Open · wants to merge 1 commit into base: fix/sync-mining

Conversation

schomatis
Contributor

This PR fixes a race condition where each node (attached to a miner) advances its clock once all miners are done mining and that same node is done syncing, without waiting for all nodes to be done syncing. The result was a temporary disruption of synchronization in which some nodes advanced their clock to the next epoch before others, while blocks were still being relayed and validated. This manifested as the Received block with large delay logs reported by non-client roles (the client roles were running one epoch ahead, so the error was expected in their case).

With this patch the clock is advanced only once all nodes are synced to the blocks mined in the epoch. This hopefully also illustrates my apprehension about the local/global distinction: in this version the case <-localEpochAdvanceCh is reduced to an empty relay to the "global" ClockSyncTopic (handled in the case after it in the select). It seems simpler to me (at least for a first iteration) to have a single ("global") time reflected in all clocks, all the time, to avoid these misalignments. The resulting algorithm (sketched after the list) is just:

  • Set epoch.
  • Adjust all clocks to the time reflecting that epoch.
  • Signal miners to start mining.
  • Wait for all miners to be done.
  • Wait for all nodes to be synced.
  • Restart process with next epoch.
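To make the ordering concrete, here is a minimal Go sketch of that loop. The constants, channel names and the waitAll helper are hypothetical stand-ins for the real testground/lotus-soup sync-service plumbing; the only point is the sequencing, i.e. that no clock moves again until all miners have mined and all nodes have synced:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical sketch of the single/global clock loop described above.
// The constants and channel names are illustrative only and do not
// correspond to the real testground/lotus-soup sync-service APIs.

const (
	genesisTimestamp = 1000000 // arbitrary genesis time (unix seconds)
	blockDelaySec    = 30      // seconds per epoch
	numMiners        = 3
	numNodes         = 5
	numEpochs        = 4
)

// waitAll blocks until n completion signals have arrived on ch.
func waitAll(ch <-chan struct{}, n int) {
	for i := 0; i < n; i++ {
		<-ch
	}
}

func main() {
	minerDone := make(chan struct{}, numMiners) // one signal per miner per epoch
	nodeSynced := make(chan struct{}, numNodes) // one signal per node per epoch

	for epoch := 1; epoch <= numEpochs; epoch++ {
		// 1–2. Set the epoch and adjust every clock to the time of that epoch.
		epochTime := time.Unix(genesisTimestamp+int64(epoch*blockDelaySec), 0)
		fmt.Printf("epoch %d: all clocks set to %v\n", epoch, epochTime.UTC())

		// 3. Signal the miners to start mining; each reports back once its
		//    block for this epoch has been produced and published.
		for m := 0; m < numMiners; m++ {
			go func() {
				// ... mine and publish a block for this epoch ...
				minerDone <- struct{}{}
			}()
		}

		// 4. Wait for ALL miners to be done mining.
		waitAll(minerDone, numMiners)

		// Every node now receives and validates the new blocks and reports
		// back once its chain head has caught up.
		for n := 0; n < numNodes; n++ {
			go func() {
				// ... sync to the blocks mined in this epoch ...
				nodeSynced <- struct{}{}
			}()
		}

		// 5. Wait for ALL nodes to be synced before any clock moves on.
		waitAll(nodeSynced, numNodes)

		// 6. Only now restart the process with the next epoch.
	}
}
```

With a single coordinator driving every clock there is no per-node decision about when to advance, which is what removes the misalignment described above.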

The motivation here is not to improve #143 or strain that PR to achieve more than it can at the moment, I agree with the iterative approach charted. The motivation is to gain confidence in our understanding of the current implementation (whatever its state of completeness) before moving forward. The logs we were seeing in non-clients that we couldn't explain were just a symptom that we were having trouble grappling with parts of the code (or at least I was, especially with the channels dynamic). I'm fine with going with a different approach than this one as long as we can understand and justify the results we see.

@schomatis schomatis requested a review from raulk July 18, 2020 01:53
Member

@raulk raulk left a comment


Hm, this seems to me like it would cause a race that would lead to a "received a block from the future" error? Let me take it out for a spin to verify that.

Behaviourally speaking, I am 100% with you that this is where we want to go. But in order to make it not racy, we need to introduce a 2PC/3PC-like procedure. We need a third state in this state machine that is "advanced and ready".

Once all nodes have reached that state, that's when we have the guarantee that all nodes have forwarded their clock.
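For illustration, a rough Go sketch of what that third phase could look like, assuming a simple in-process barrier (the barrier type and all names here are hypothetical, not part of the actual sync service): nodes first all reach "synced", then each forwards its clock and waits at a second, "advanced and ready" barrier before accepting anything for the next epoch.

```go
package main

import (
	"fmt"
	"sync"
)

// Illustrative sketch of the extra phase: besides "done mining" and "done
// syncing", every node must also report "advanced and ready" (clock already
// forwarded) before any node starts treating next-epoch blocks as valid.
// The barrier type and names are hypothetical, not the real sync service.

// barrier releases all waiters once n participants have arrived.
type barrier struct {
	mu    sync.Mutex
	cond  *sync.Cond
	count int
	n     int
}

func newBarrier(n int) *barrier {
	b := &barrier{n: n}
	b.cond = sync.NewCond(&b.mu)
	return b
}

// arriveAndWait registers this participant and blocks until all n have arrived.
func (b *barrier) arriveAndWait() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.count++
	if b.count == b.n {
		b.cond.Broadcast()
		return
	}
	for b.count < b.n {
		b.cond.Wait()
	}
}

func main() {
	const nodes = 4
	synced := newBarrier(nodes)   // phase 2: all nodes synced to this epoch
	advanced := newBarrier(nodes) // phase 3: all clocks forwarded ("advanced and ready")

	var wg sync.WaitGroup
	for i := 0; i < nodes; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()

			// ... receive and validate this epoch's blocks ...
			synced.arriveAndWait()

			// Forward the local clock only once everyone is synced.
			fmt.Printf("node %d: clock advanced\n", id)
			advanced.arriveAndWait()

			// Only past this point is it safe to accept (or mine) blocks for
			// the next epoch: every peer is known to have advanced its clock,
			// so no one can see a "block from the future".
		}(i)
	}
	wg.Wait()
}
```

Only after the second barrier does any node treat next-epoch timestamps as valid, which is the guarantee described above.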

@schomatis
Contributor Author

Hm, this seems to me like it would cause a race that would lead to a "received a block from the future" error?

Could you expand on how, please?

Let me take it out for a spin to verify that.

Also, please confirm whether this fixes the Received block with large delay race that we were seeing in the original PR but couldn't actually explain.

We need a third state in this state machine that is "advanced and ready".

What does this state represent? Done mining, done syncing? Something else beyond what has been described here?

@raulk raulk added the workstream/e2e-tests Workstream: End-to-end Tests label Jul 28, 2020
Labels
workstream/e2e-tests Workstream: End-to-end Tests
2 participants