Skip to content
This repository has been archived by the owner on Sep 26, 2019. It is now read-only.

[NC-1273] Start of fast sync downloader #613

Merged
merged 27 commits into from
Jan 23, 2019
Merged

Conversation

ajsutton
Copy link
Contributor

@ajsutton ajsutton commented Jan 21, 2019

PR description

Introduces a FastSyncDownloader and the first few steps of the fast sync process. Specifically:

  • NC-234 - wait for suitable peers to be available, starting fast sync after either 5 peers are connected or a timeout is reached (both set in SynchronizerConfiguration
  • NC-2135 - select an appropriate pivot block. Based on pivot distance in SynchronizerConfiguration
  • NC-2136 - Request the block header. The header is requested from each connected peer (excluding those not sync'd that far along the chain yet) and at least half the peers have to provide the same header before we start fast sync. This provides a little extra confirmation that we're syncing the right chain.

Together this gets us to the point where the world state download could be started if we wanted to see how it integrates and be able to test it in the real client.

The --sync-mode flag has been re-enabled but set to hidden so that we can test fast sync progress without it being fully exposed as a supported and working option yet. If fast sync fails or reaches the end of what we've implemented it falls back to full sync.

… it can return a single header, handle retrying and provide a place to do additional validation.
@@ -80,6 +81,7 @@ public void fullSyncFromGenesis() throws Exception {
}

@Test
@Ignore("Fast sync implementation in progress.")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that NC-2178 has been filed to re-enable this once the configuration options are fully exposed (currently it fails because it's waiting for 5 peers and the test times out before fast sync does).

syncConfig, protocolSchedule, protocolContext, ethContext, syncState, ethTasksTimer);

ChainHeadTracker.trackChainHeadForPeers(
ethContext, protocolSchedule, protocolContext.getBlockchain(), syncConfig, ethTasksTimer);
if (syncConfig.syncMode().equals(SyncMode.FAST)) {
if (syncConfig.syncMode() == SyncMode.FAST) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious - why prefer == here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Purely ease of reading and because I was confused by the .equals for a bit thinking it wasn't actually an enum.

if (error != null) {
LOG.error("Fast sync failed. Switching to full sync.", error);
}
if (!result.isPresent()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a new pattern - we usually return any errors via the throwable. I think this pattern is a little counter-intuitive because the future can appear to complete "successfully" (non-exceptionally) when it actually failed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, it was more meaningful in earlier drafts and got whittled away to this which isn't particularly useful at all. Switched to exceptions.

LOG.warn(
"Maximum wait time for fast sync reached but no peers available. Continuing to wait for any available peer.");
final WaitForPeerTask waitForPeerTask = WaitForPeerTask.create(ethContext, ethTasksTimer);
return ethContext.getScheduler().scheduleSyncWorkerTask(waitForPeerTask::run);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you scheduling this in the worker pool? Also, I don't think we should be running any tasks that can't timeout. If we can't find any peers for some reason, we could just hang here waiting forever.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can't find any peers then there is literally nothing we can do but wait. If we added a timeout, when it's reached we'd simply have to go back to waiting for a peer again, so why not just keep waiting? We can't even fall back to full sync because it needs a peer to sync from as well.

It went to the worker pool for consistency but it doesn't actually need to. Changed to call directly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly I'm just thinking about how debugging would go in this case. It might look like the synchronizer has hung if it just stops printing any logs. If we did add a timeout, I would just go back and start a new WaitForPeerTask when the timeout triggered, but at least it's more obvious that it's still trying to sync. Not a huge deal though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's a good point. I've changed it to repeatedly wait for peers so it prints the "Waiting for 1 peers" message repeatedly while it's waiting.

.handle(
(waitResult, error) -> {
if (ExceptionUtils.rootCause(error) instanceof TimeoutException) {
if (ethContext.getEthPeers().bestPeer().isPresent()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you just checking that we have any peers here? If so, getEthPeers().availablePeerCount() > 0 might be clearer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


import com.google.common.base.MoreObjects;

public class FastSyncState {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's another FastSyncState in tech.pegasys.pantheon.ethereum.eth.sync.state - these 2 classes should be merged.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the old one which wasn't used anywhere.

return fastSyncActions
.waitForSuitablePeers()
.thenApply(state -> fastSyncActions.selectPivotBlock())
.thenCompose(fastSyncActions::downloadPivotBlockHeader)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking about how the full sync downloader works and how that maps over to this downloader. I think for the most part the flow should be the same. In the full sync downloader, the first thing we do is find the "sync target". We use this sync target to make sure that we're downloading a coherent chain (downloading the chain based off of hashes from our best peer rather than block numbers). I think we probably want to download the pivot block based on a header that we get from our target peer. We can still "confirm" that block header, but the confirmation would be based on the header / hash rather than block number.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The risk is that we'll pick a header that's on a non-canonical chain. If we request it by hash, other clients will likely have it and return it "confirming" it, but what we actually want is to check that they consider the canonical block at that number to be the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bear in mind that when you fast-sync to a block, you can't apply any re-org that goes back before the fast-sync point. This was a problem in the Ropsten fork where geth clients wound up fast-syncing off Parity nodes and so got stuck on the incorrect chain with no ability to re-org back even when the correct chain eventually had a greater total difficulty.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point! Definitely makes sense to query for the pivot based on block number. I guess my main point is just that whatever sync target we pick needs to agree that the pivot we find is on the main chain. I was looking at this from the perspective of sync target, but just as valid to choose the pivot first and make sure our sync target is consistent with that.

.filter(peer -> peer.chainState().getEstimatedHeight() >= pivotBlockNumber)
.collect(Collectors.toList());

final int confirmationsRequired = peersToQuery.size() / 2 + 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might want to have a cap on how many peers we want to query for this. For example, if we have 50 peers, do we really need 25 confirmations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to be as certain as possible that it's the right node before fast syncing - it's a big point of weakness because we're trusting that the pivot block we select is correct, we can't perform any real checks that it actually is right. I was actually wondering if we should abort if any of our peers returned a header that didn't match and maybe have a minimum number of peers that confirm the block (currently we might have 10 peers connected but 9 of them haven't reached the pivot block yet).

}

@Override
protected boolean isRetryableError(final Throwable error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should probably fold the implementation of this method into AbstractRetryingPeerTask along with the assignPeer() functionality in CompleteBlocksTask and that implemenation of isRetryableError. The difference in that implementation is that if there is an assigned peer, and that peer is missing or errors out there's no reason to retry.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

this.pivotBlockNumber = pivotBlockNumber;
}

public static GetPivotBlockHeaderFromPeerTask forPivotBlock(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would name this task more generically - something like RetryingGetHeadersFromPeerByNumberTask since the functionality isn't actually tied to the pivot. Better yet, you could probably make a completely generic retrying wrapper that takes a supplier for the retryable task.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a generic retrying wrapper would be nice but I've had a couple of goes and keep hitting issues making it work. It's starting to introduce delegation vs the current inheritance approach which I generally like but it keeps leading to more knock on effects. I'm not keen to take on that big a change as part of this work and I suspect the better approach is to just move to something like RxJava which appears to give us a lot of benefit.

For now I have renamed this to RetryingGetHeaderFromPeerByNumberTask.

"Fast sync timed out before minimum peer count was reached. Continuing with reduced peers.");
result.complete(null);
} else {
waitForAnyPeer()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would just have the action fail at this point and get handled at a higher level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with that is you wind up having to duplicate most of this handle lambda at the higher level so you'd basically have to have this in FastSyncDownloader:

private CompletableFuture<Void> ensurePeersAvailable() {
    final CompletableFuture<Void> result = new CompletableFuture<>();

    fastSyncActions
        .waitForSuitablePeers()
        .handle(
            (waitResult, error) -> {
              if (ExceptionUtils.rootCause(error) instanceof TimeoutException) {
                fastSyncActions
                    .waitForAnyPeer()
                    .thenAccept(result::complete)
                    .exceptionally(
                        taskError -> {
                          result.completeExceptionally(error);
                          return null;
                        });
              } else if (error != null) {
                result.completeExceptionally(error);
              } else {
                result.complete(null);
              }
              return null;
            });
    return result;
  }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking we'd just cut the waitForAnyPeer() part - seems a little iffy to continue with fast sync if we don't have enough peers. Then the calling code could decide whether to run another round of waiting, or to abort fast sync. Anyway - just thinking out loud.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it's a bit of a toss up. The penalty for not waiting is a month long full sync so it's unlikely the user would want that, but maybe they'd want to be strict about waiting for a minimum number of peers.

We probably should also allow the user to specifically set a pivot block by hash and we wait indefinitely until that block is available. That would be the most secure option since then users can guarantee they sync onto a chain they actually trust.

return fastSyncActions
.waitForSuitablePeers()
.thenApply(state -> fastSyncActions.selectPivotBlock())
.thenCompose(fastSyncActions::downloadPivotBlockHeader)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point! Definitely makes sense to query for the pivot based on block number. I guess my main point is just that whatever sync target we pick needs to agree that the pivot we find is on the main chain. I was looking at this from the perspective of sync target, but just as valid to choose the pivot first and make sure our sync target is consistent with that.

"Fast sync timed out before minimum peer count was reached. Continuing with reduced peers.");
result.complete(null);
} else {
waitForAnyPeer()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking we'd just cut the waitForAnyPeer() part - seems a little iffy to continue with fast sync if we don't have enough peers. Then the calling code could decide whether to run another round of waiting, or to abort fast sync. Anyway - just thinking out loud.

…eer to connect to avoid the appearance of hanging forever.
@ajsutton ajsutton merged commit aa58d67 into PegaSysEng:master Jan 23, 2019
@ajsutton ajsutton deleted the NC-2136 branch January 23, 2019 22:01
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants