
single/global clock advancement #154

Open · wants to merge 1 commit into base: fix/sync-mining

Conversation

schomatis
Contributor

This PR fixes a race condition where each node (attached to a miner) advances its clock once all miners are done mining and that same node is done syncing, without waiting for all nodes to be done syncing. The result was a temporary disruption of synchronization in which some nodes advanced their clock to the next epoch before others, while blocks were still being relayed and validated. This manifested as the Received block with large delay logs reported by non-client roles (the client roles were running one epoch ahead, so the error was expected in their case).

With this patch the clock is advanced only once all nodes are synced to the blocks mined in the epoch. This hopefully also illustrates my apprehension about the local/global distinction: in this version the case <-localEpochAdvanceCh is reduced to an empty relay to the "global" ClockSyncTopic (handled in the case after it in the select). It seems simpler to me (at least for a first iteration) to have a single ("global") time reflected in all clocks, all the time, to avoid these misalignments. The resulting algorithm (sketched after the list) is just:

  • Set epoch.
  • Adjust all clocks to the time reflecting that epoch.
  • Signal miners to start mining.
  • Wait for all miners to be done.
  • Wait for all nodes to be synced.
  • Restart process with next epoch.
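To make the ordering concrete, here is a minimal Go sketch of that loop. The constants, channel names and the waitAll helper are hypothetical stand-ins for the real testground/lotus-soup sync-service plumbing; the only point is the sequencing, i.e. that no clock moves again until all miners have mined and all nodes have synced:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical sketch of the single/global clock loop described above.
// The constants and channel names are illustrative only and do not
// correspond to the real testground/lotus-soup sync-service APIs.

const (
	genesisTimestamp = 1000000 // arbitrary genesis time (unix seconds)
	blockDelaySec    = 30      // seconds per epoch
	numMiners        = 3
	numNodes         = 5
	numEpochs        = 4
)

// waitAll blocks until n completion signals have arrived on ch.
func waitAll(ch <-chan struct{}, n int) {
	for i := 0; i < n; i++ {
		<-ch
	}
}

func main() {
	minerDone := make(chan struct{}, numMiners) // one signal per miner per epoch
	nodeSynced := make(chan struct{}, numNodes) // one signal per node per epoch

	for epoch := 1; epoch <= numEpochs; epoch++ {
		// 1–2. Set the epoch and adjust every clock to the time of that epoch.
		epochTime := time.Unix(genesisTimestamp+int64(epoch*blockDelaySec), 0)
		fmt.Printf("epoch %d: all clocks set to %v\n", epoch, epochTime.UTC())

		// 3. Signal the miners to start mining; each reports back once its
		//    block for this epoch has been produced and published.
		for m := 0; m < numMiners; m++ {
			go func() {
				// ... mine and publish a block for this epoch ...
				minerDone <- struct{}{}
			}()
		}

		// 4. Wait for ALL miners to be done mining.
		waitAll(minerDone, numMiners)

		// Every node now receives and validates the new blocks and reports
		// back once its chain head has caught up.
		for n := 0; n < numNodes; n++ {
			go func() {
				// ... sync to the blocks mined in this epoch ...
				nodeSynced <- struct{}{}
			}()
		}

		// 5. Wait for ALL nodes to be synced before any clock moves on.
		waitAll(nodeSynced, numNodes)

		// 6. Only now restart the process with the next epoch.
	}
}
```

With a single coordinator driving every clock there is no per-node decision about when to advance, which is what removes the misalignment described above.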

The motivation here is not to improve #143 or strain that PR to achieve more than it can at the moment, I agree with the iterative approach charted. The motivation is to gain confidence in our understanding of the current implementation (whatever its state of completeness) before moving forward. The logs we were seeing in non-clients that we couldn't explain were just a symptom that we were having trouble grappling with parts of the code (or at least I was, especially with the channels dynamic). I'm fine with going with a different approach than this one as long as we can understand and justify the results we see.

@schomatis schomatis requested a review from raulk July 18, 2020 01:53
Member

@raulk raulk left a comment


Hm, this seems to me like it would cause a race that would lead to a "received a block from the future" error? Let me take it out for a spin to verify that.

Behaviourally speaking, I am 100% with you that this is where we want to go. But in order to make it not racy, we need to introduce a 2PC/3PC-like procedure. We need a third state in this state machine that is "advanced and ready".

Once all nodes have reached that state, that's when we have the guarantee that all nodes have forwarded their clock.
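For illustration, a rough Go sketch of what that third phase could look like, assuming a simple in-process barrier (the barrier type and all names here are hypothetical, not part of the actual sync service): nodes first all reach "synced", then each forwards its clock and waits at a second, "advanced and ready" barrier before accepting anything for the next epoch.

```go
package main

import (
	"fmt"
	"sync"
)

// Illustrative sketch of the extra phase: besides "done mining" and "done
// syncing", every node must also report "advanced and ready" (clock already
// forwarded) before any node starts treating next-epoch blocks as valid.
// The barrier type and names are hypothetical, not the real sync service.

// barrier releases all waiters once n participants have arrived.
type barrier struct {
	mu    sync.Mutex
	cond  *sync.Cond
	count int
	n     int
}

func newBarrier(n int) *barrier {
	b := &barrier{n: n}
	b.cond = sync.NewCond(&b.mu)
	return b
}

// arriveAndWait registers this participant and blocks until all n have arrived.
func (b *barrier) arriveAndWait() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.count++
	if b.count == b.n {
		b.cond.Broadcast()
		return
	}
	for b.count < b.n {
		b.cond.Wait()
	}
}

func main() {
	const nodes = 4
	synced := newBarrier(nodes)   // phase 2: all nodes synced to this epoch
	advanced := newBarrier(nodes) // phase 3: all clocks forwarded ("advanced and ready")

	var wg sync.WaitGroup
	for i := 0; i < nodes; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()

			// ... receive and validate this epoch's blocks ...
			synced.arriveAndWait()

			// Forward the local clock only once everyone is synced.
			fmt.Printf("node %d: clock advanced\n", id)
			advanced.arriveAndWait()

			// Only past this point is it safe to accept (or mine) blocks for
			// the next epoch: every peer is known to have advanced its clock,
			// so no one can see a "block from the future".
		}(i)
	}
	wg.Wait()
}
```

Only after the second barrier does any node treat next-epoch timestamps as valid, which is the guarantee described above.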

@schomatis
Contributor Author

Hm, this seems to me like it would cause a race that would lead to a "received a block from the future" error?

Could you expand on how, please?

Let me take it out for a spin to verify that.

Also, please confirm whether this fixes the Received block with large delay race that we were seeing in the original PR but couldn't actually explain.

We need a third state in this state machine that is "advanced and ready".

What does this state represent? Done mining, done syncing? Something else beyond what has been described here?

@raulk raulk added the workstream/e2e-tests Workstream: End-to-end Tests label Jul 28, 2020
Labels
workstream/e2e-tests Workstream: End-to-end Tests
2 participants