State sync improvement plan #8545

Closed
mm-near opened this issue Feb 9, 2023 · 0 comments
Labels: Node (Node team), T-node (Team: issues relevant to the node experience team)

mm-near commented Feb 9, 2023

State sync improvements - current progress & plan

Our goal is to drastically improve the performance of state sync, so that we can both scale to 10 shards and allow chunk producers to track only a single shard.

The work consists of 4 separate pieces:

  • improving the speed of state sync part creation
  • improving the reliability of fetching these parts
  • improving the speed of applying these parts
  • making it work when the chain is under load.

Improving speed of part creation (in progress)

TL;DR - the current code creates the parts by iterating over trie nodes, which is slow because these are basically random accesses.

The idea is to iterate over flat storage instead (which should be faster - as this is a linear scan).

The small-scale experiment has finished successfully: we're able to generate parts from flat storage that match the ones generated from the Trie.
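The equivalence checked by the small-scale experiment can be illustrated with a toy model. This is a minimal sketch, not nearcore code: the nested-dict trie, the sorted list standing in for flat storage, and the `split_into_parts` helper are all assumptions made for illustration, and real state parts carry more than plain key/value pairs.

```python
def build_trie(items):
    """Toy trie: nested dicts keyed by byte; a value lives under the None key."""
    root = {}
    for key, value in items.items():
        node = root
        for byte in key:
            node = node.setdefault(byte, {})
        node[None] = value
    return root


def iter_trie(node, prefix=b""):
    """Depth-first walk in key order; every step hops to another node."""
    if None in node:
        yield prefix, node[None]
    for byte in sorted(k for k in node if k is not None):
        yield from iter_trie(node[byte], prefix + bytes([byte]))


def iter_flat(flat):
    """Linear scan over a key-sorted list of (key, value) pairs."""
    yield from flat


def split_into_parts(entries, num_parts):
    """Hypothetical part boundaries: equal-sized runs of consecutive entries."""
    entries = list(entries)
    size = max(1, -(-len(entries) // num_parts))  # ceiling division
    return [entries[i:i + size] for i in range(0, len(entries), size)]


state = {b"alice": b"10", b"bob": b"7", b"carol": b"3", b"al": b"1", b"dave": b"42"}
trie = build_trie(state)
flat = sorted(state.items())

parts_from_trie = split_into_parts(iter_trie(trie), 2)
parts_from_flat = split_into_parts(iter_flat(flat), 2)
assert parts_from_trie == parts_from_flat
```

Both iterators visit the keys in the same lexicographic order, so the parts are identical; the difference is only in the access pattern (pointer chasing versus a sequential scan).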

Currently working on setting up a large-scale experiment to measure the performance.

@Longarithm: let's track progress on #8899.

Reliability of fetching the state parts (not started)

Currently the parts are sent via our Tier2 network (as RoutedMessages). This might put a very large load on the network - especially on the uplinks from some of the peers.

The idea is to add additional sources (as alternatives) for nodes to download the parts from.

We're thinking about doing it with S3: some Pagoda nodes would put the current parts into S3, and other nodes would be free to download them from that location.

Each part can be verified independently, so there is no additional trust needed.
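A rough sketch of why an untrusted source is acceptable: if every part can be checked against a root hash the node already trusts, a bad download is simply rejected and the next source is tried. The plain binary Merkle tree below is a stand-in chosen for illustration; the actual proof format used for state parts is not described here, and `fetch_part`, `bucket_fetch` and `peer_fetch` are hypothetical names.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves):
    """Root of a binary Merkle tree; an odd node is paired with itself."""
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def merkle_proof(leaves, index):
    """Sibling hashes from leaf to root, each flagged if the sibling is on the left."""
    proof, level = [], list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof


def verify_part(part: bytes, proof, root: bytes) -> bool:
    digest = sha256(part)
    for sibling, sibling_is_left in proof:
        digest = sha256(sibling + digest) if sibling_is_left else sha256(digest + sibling)
    return digest == root


def fetch_part(part_id, sources, proof, root):
    """Try each source in turn and accept the first part that verifies."""
    for name, fetch in sources:
        try:
            part = fetch(part_id)
        except Exception:
            continue  # source unavailable, fall through to the next one
        if verify_part(part, proof, root):
            return name, part
    raise RuntimeError(f"no source returned a valid part {part_id}")


parts = [b"part-0", b"part-1", b"part-2"]
leaves = [sha256(p) for p in parts]
root = merkle_root(leaves)


def bucket_fetch(part_id):  # hypothetical S3-like source serving bad data
    return b"corrupted"


def peer_fetch(part_id):  # hypothetical peer serving the real part
    return parts[part_id]


source, part = fetch_part(
    1, [("bucket", bucket_fetch), ("peer", peer_fetch)], merkle_proof(leaves, 1), root
)
assert (source, part) == ("peer", b"part-1")
```

The corrupted download from the first source fails verification, so the node falls back to the peer without having trusted either of them.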

Improving application speed (not started)

Currently the state sync parts are applied only after all of them are fetched, and this happens 'single threaded'. We can drastically improve it by applying parts in parallel, as soon as each one is received.
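The change in scheduling can be sketched with a thread pool, assuming parts cover disjoint key ranges and can therefore be applied in any order. `download_part` and the in-memory `state` dict are hypothetical placeholders for the network fetch and the node's storage.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_part(part_id):
    """Hypothetical download; returns the key/value pairs held by this part."""
    return {f"key-{part_id}-{i}": i for i in range(3)}


def sync_state(num_parts, workers=4):
    state = {}
    lock = threading.Lock()  # guards the toy dict; a real store would manage its own writes

    def download_and_apply(part_id):
        entries = download_part(part_id)
        with lock:
            state.update(entries)  # applied right away, without waiting for other parts
        return part_id

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(download_and_apply, p) for p in range(num_parts)]
        applied = sorted(f.result() for f in as_completed(futures))
    return state, applied


state, applied = sync_state(8)
assert applied == list(range(8))
assert len(state) == 24
```

Each worker applies its part as soon as the download returns, so application overlaps with the downloads still in flight instead of starting after the last one finishes.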

Making it work when chain is under load (in progress)

Currently, the node downloads the state (which might take a couple of hours) - and afterwards it has to run a 'catchup' - that is - apply all the transactions that happened while it was downloading the state.

This means that if the network is full, nodes are under heavy time pressure to download the state ASAP. (Otherwise, if the epoch is 12h and you spend 7h downloading the state, you have only 5h left to apply all the transactions for this epoch - so you'll have to process transactions at about 2.4x speed (12h / 5h) - which might not be possible if the network is under load.)
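The arithmetic behind that time pressure is small enough to write down. In this sketch (`required_catchup_speedup` is a name made up for illustration), the node must replay a full epoch of transactions in whatever time is left after the download, so with a 12h epoch and a 7h download the ratio is 12 / 5 = 2.4.

```python
def required_catchup_speedup(epoch_hours, download_hours):
    """How much faster than real time the catchup has to run."""
    remaining = epoch_hours - download_hours
    if remaining <= 0:
        raise ValueError("state download took the whole epoch")
    return epoch_hours / remaining


speedup = required_catchup_speedup(12, 7)
print(f"{speedup:.1f}x")  # prints "2.4x"
```

The required speedup grows quickly as the download eats into the epoch: a 9h download would already demand 4x.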

To fix this, we're experimenting with the ShardShadowing after StateSync.

The idea is as follows:

  • node does the state sync
  • then it does the shard-shadow (basically receiving the state-deltas)
  • and then it switches to the transaction application (catchup)

The assumption is that the state-deltas can be downloaded and applied a lot faster than the catchup blocks.
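The three phases can be sketched with a toy account model. Everything here is an assumption made for illustration: transactions are simple transfers, and a state-delta is modelled as the set of keys a block changed together with their new values. The point is only that applying deltas reaches the same state as executing the transactions, while skipping the execution itself.

```python
def execute_block(state, transactions):
    """Catchup path: run every transaction; returns the resulting state delta."""
    delta = {}
    for sender, receiver, amount in transactions:
        if state.get(sender, 0) < amount:
            continue  # skip transfers the sender cannot afford
        state[sender] = state.get(sender, 0) - amount
        state[receiver] = state.get(receiver, 0) + amount
        delta[sender] = state[sender]
        delta[receiver] = state[receiver]
    return delta


def apply_delta(state, delta):
    """Shadowing path: overwrite the changed keys, no transaction execution."""
    state.update(delta)


blocks = [
    [("alice", "bob", 5)],
    [("bob", "carol", 2), ("alice", "carol", 1)],
    [("carol", "alice", 3)],
    [("bob", "alice", 1)],
]

# A node already tracking the shard executes every block and publishes the deltas.
tracking = {"alice": 10, "bob": 0, "carol": 0}
snapshot = dict(tracking)  # phase 1: the state a syncing node downloads
deltas = [execute_block(tracking, txs) for txs in blocks]

syncing = dict(snapshot)
for delta in deltas[:3]:  # phase 2: shadow the shard by applying deltas
    apply_delta(syncing, delta)
for txs in blocks[3:]:  # phase 3: catchup by executing the remaining blocks
    execute_block(syncing, txs)

assert syncing == tracking
```

In this model a delta costs one write per changed key, whereas executing a block requires reading balances and validating each transfer, which is the intuition behind the assumption that shadowing is faster than catchup.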

nikurt self-assigned this on Mar 30, 2023
nikurt added the T-node (Team: issues relevant to the node experience team) label on Mar 30, 2023
gmilescu added the Node (Node team) label on Oct 19, 2023
nikurt closed this as completed on Dec 11, 2023