Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sync 2.0 #534

Open
3 of 4 tasks
arkpar opened this issue Jan 26, 2022 · 11 comments
Open
3 of 4 tasks

Sync 2.0 #534

arkpar opened this issue Jan 26, 2022 · 11 comments
Labels
D2-substantial Can be fixed by an experienced coder with a working knowledge of the codebase. I4-refactor Code needs refactoring.

Comments

@arkpar
Copy link
Member

arkpar commented Jan 26, 2022

This is a tracking for the new implementation of syncing protocol and algorithm in substrate. It mostly concerns full/fast sync, but not warp sync.

Current implementation issues.

  1. Reliance on the longest chain fork rule.
  2. No clear separation between initial and keep-up sync.
  3. Does not consider finality in full sync mode.
  4. Complicated common ancestry search.

Proposed new protocol.

On connection peers exchange their best finalized block and all leaf blocks that are not yet finalized or discarded.
On import of each block, node sends out an announcement to all connected peers. Announcement includes newly imported block information along with the best finalized block for a peer.

Proposed syncing algorithm.

For each connected peer we maintain a set of leaves that they have and common finalized number. These are updated on each announcement.

There are two major syncing modes:

Initial sync. In this mode the node syncs a single finalized chain. It is triggered on startup or when majority of peers start announcing blocks with unknown parents (i.e. after an offline period). Since this is a canonical chain, series of blocks may be requested by number allowing for parallel block download from multiple peers. There's no need to do ancestry search as we can simply assume that the common block is the one with a min number in a finalized chain.

Keep-up sync. Node switches to this mode when it finalizes a recent block. In this mode sync simply requests chains that peers have, starting from each leaf and back to the best finalized block (or existing parent block).

Tasks

Preview Give feedback
  1. I4-refactor
    dmitry-markin
  2. I4-refactor
    dmitry-markin
  3. I4-refactor
    dmitry-markin
  4. I4-refactor
    dmitry-markin
@arkpar
Copy link
Member Author

arkpar commented Jan 26, 2022

cc @bkchr

@nazar-pc
Copy link
Contributor

nazar-pc commented Aug 9, 2022

Is this supposed to work with chain that doesn't have finality? I'm worried about growing number of places where finality is implicitly assumed to exist in Substrate, making it less generic.

Also I'm interested in clearer separation between initial and keep-up sync here.

@arkpar
Copy link
Member Author

arkpar commented Aug 9, 2022

It may not be finality in the same sense, as the the actual finality engine used. Substrate indeed already has a limit on maximum possible re-org. So in the simplest case sync may consider "final" any block past that threshold. But yes, this strategy uses the notion of finality to make thing simpler. We may make the sync strategy pluggable to support other cases though.

@burdges
Copy link

burdges commented Aug 10, 2022

We must support chains without their own finality, since this describes parachains. We do not afaik need chains without any notion of finality, because afaik they're not really secure anyways.

A priori, I think this proposal makes sync easier to understand, which maybe helps us do async time thing by Handan (and Peter Czaban) and do my slashing reform idea. I've not really thought about the details though.. At least slashing reform cares most about sync ahead of finality, which is not really the topic here I guess, so maybe just orthogonal, not sure about async time.

@nazar-pc
Copy link
Contributor

We do not afaik need chains without any notion of finality, because afaik they're not really secure anyways.

Do you suggest for Substrate to not support PoW and similar consensus mechanisms that don't have deterministic finality at all?

@burdges
Copy link

burdges commented Aug 13, 2022

Yes exactly..

Proof-of-work is not secure. It's especially bad on smaller chains ala https://www.crypto51.app/ but bitcoin shall be double spent eventually too ala https://economics.princeton.edu/working-papers/on-the-instability-of-bitcoin-without-the-block-reward/

We'll never fully support proof-of-work parachains anyways. We need pre-collation proofs in too many places, like parachain block submission, XCMP message transport, etc, so attackers could halt proof-of-work parachains fairly easily. And proof-of-work parachains would de facto pay 10x more for their parachain slot, assuming they miss like 9/10 relay chain slots.

Intuitively, if you've a properly engineered protocol then you've somewhat tight tolerances everywhere, so proof-of-work anywhere typically enables DoS attacks upon your larger protocol.

@timorl
Copy link

timorl commented Jun 23, 2023

We are working on a similar rewrite in aleph-node for probably very similar reasons. Some crucial differences in our approach:

  1. We want the sync to be using the network as an external resource rather than being intertwined with it.
    1. I'm not completely clear as to how much this intertwining will be the case for you – the handshake will still be the way in which nodes connect to each other at all, right?
  2. We only want to sync/import forks that are in some way interesting, while if I understand correctly you want to get all of them.
    1. Aren't you worried that this will somehow cause a problem due to the limit on forks in the DB?
    2. Due to that we have a data structure for gathering information about known blocks and whether they are of interest. (Sorry for linking to code, we have quite a detailed design doc, but it's not cleaned up properly to be made public yet. :/) This might be of interest, as it's designed, among other goals, to optimize syncing forks (perhaps related to Fork sync targets are handled inefficiently #525).
  3. We are currently not planning to differentiate between initial sync and keep-up sync.
    1. We might do that if it turns out getting respectable performance otherwise is unfeasible.
    2. If we do, we likely won't be switching to the other mode as aggressively as you (if I'm reading your design correctly), but rather only if the node is significantly (~hundreds of blocks) behind.
  4. We want to be able to synchronize justifications by themselves, without dragging a block with them always. This is a relatively minor requirement though.

Some of these differences are due to us using our own finality gadget with somewhat different base assumptions than GRANDPA (thus 2 and 4), some are motivated by a much lower block production time (thus 2.1 and 3.2), and some are probably just random results of different approaches of people doing the design. I'm mentioning this in case we might be able to avoid some doubled effort in this area, or at least maybe share some insights.

@altonen
Copy link
Contributor

altonen commented Jun 23, 2023

We would also like syncing to be decoupled from networking as much as possible but given that syncing is the default set protocol, it has some implications which we can't get rid of such as other protocols expecting it to exist and syncing being "handshake protocol", and it will always be treated as a sort of a different protocol.

As far as keep-up/initial sync is concerned, it doesn't have to be as aggressive as it's the document right now. It can be adjusted to any threshold we find suitable. Obviously if the history download takes a long time, some of the announced blocks can spill over into the initial sync and it only switches to keep-up sync only after finalizing a very recent block. One reason to keep the syncing modes separate is to make the code more clearly separate and I fail to see why this would be unfeasible for your version. Could you elaborate on that?

@bkchr
Copy link
Member

bkchr commented Jun 24, 2023

ut given that syncing is the default set protocol,

Why can we not change this?

as other protocols expecting it to exist and syncing being "handshake protocol"

Other protocols that want to follow sync (like grandpa, transaction sync etc) should use a generic stream of events PeerConnected/PeerDisconnected. This could be provided by everybody. AFAIR we already started doing this and not relying on events from network service telling us that a sync peer was connected.

@altonen
Copy link
Contributor

altonen commented Jun 26, 2023

Why can we not change this?

The main limitation has to do with handshaking (basically the assumption that the Role of the node is learned in the syncing handshaking) but that is fixable. That's also in the Polkadot spec. It's doable for sure I think but it's work that we may do in the future if we find it useful.

Right now the only protocol that is being followed is the sync protocol because it's considered the default but that already happens through a mechanism you're describing. But it's only implemented for syncing because we only need it for the sync protocol. But it can be implemented for every protocol if it's needed

@timorl
Copy link

timorl commented Jun 26, 2023

Why can we not change this?

In our case we were planning to replace the normal sync/default set protocol with a near-noop, that essentially only performs the handshake.

I fail to see why this would be unfeasible

I meant the other way around – we are trying to have a single mode, which would handle both syncing when a node is up to date, as well as when it's behind. This way we don't need any switching mechanism between modes and we don't need separate logic at all. The thing that might be unfeasible, is to have both that and the ability to "catch up" sufficiently fast from being far behind – we suspect we should be able to do that, but in case it's not possible we will introduce a separate "initial sync" mode.

The switching between modes being too aggressive is not hypothetical for us btw – we had some problems with nodes "flapping" between major and normal sync when under load or with some network disruptions (in sync 1.0 with a 1s block prod time, and some other modifications), which sometimes resulted with the node ending up in a broken state.

@altonen altonen transferred this issue from paritytech/substrate Aug 24, 2023
@the-right-joyce the-right-joyce added I4-refactor Code needs refactoring. D2-substantial Can be fixed by an experienced coder with a working knowledge of the codebase. and removed I7-refactor labels Aug 25, 2023
dmitry-markin added a commit that referenced this issue Sep 27, 2023
…e` (#1650)

Move request-response handling from `ChainSync` to `SyncingEngine` as
part of [Sync
2.0](#534) refactoring
aimed at making `ChainSync` a pure state machine.

Resolves #502.

---------

Co-authored-by: Aaro Altonen <48052676+altonen@users.noreply.github.com>
Co-authored-by: Bastian Köcher <git@kchr.de>
dmitry-markin added a commit that referenced this issue Sep 29, 2023
This PR is part of [Sync
2.0](#534) refactoring
aimed at making `ChainSync` a pure state machine.

Resolves #501.
claravanstaden pushed a commit to Snowfork/polkadot-sdk that referenced this issue Dec 8, 2023
* upped the version of tracing

* find/replace

* added scale-info

* added scale-info to more pallets

* more build fixes

* fixes

* fixed imports

* removed metadata

* upped serde and codec versions

* changed BaseCallFilter to Everything

* added max instructions to FixedWeightBounds

* added OperationalFeeMultiplier

* import Everything for BaseCallFilter

* added missing implementations

* added force_unreserve implementation

* removed commented code

* return Vec from WeakBoundedVec

* fixed OpaqueMetadata conversion

* use BenchmarkError::Stop

* using struct instead of method call

* use struct patterns

* changed mock CallFilter to use Everything

* added scale-info/std to std feature

* removed native_executor_instance macro

* updated jsonrpc

* use tokio_handle

* renamed task_executor

* removed unused imports

* updated @polkadot/api to 6.8.1 because of new metadata

* rustfmt

* rustfmt with the right settings
bgallois pushed a commit to duniter/duniter-polkadot-sdk that referenced this issue Mar 25, 2024
…e` (paritytech#1650)

Move request-response handling from `ChainSync` to `SyncingEngine` as
part of [Sync
2.0](paritytech#534) refactoring
aimed at making `ChainSync` a pure state machine.

Resolves paritytech#502.

---------

Co-authored-by: Aaro Altonen <48052676+altonen@users.noreply.github.com>
Co-authored-by: Bastian Köcher <git@kchr.de>
bgallois pushed a commit to duniter/duniter-polkadot-sdk that referenced this issue Mar 25, 2024
This PR is part of [Sync
2.0](paritytech#534) refactoring
aimed at making `ChainSync` a pure state machine.

Resolves paritytech#501.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
D2-substantial Can be fixed by an experienced coder with a working knowledge of the codebase. I4-refactor Code needs refactoring.
Projects
Status: In Progress 🛠
Status: Open
Development

No branches or pull requests

7 participants