🔷 [ProjectTracking] Forknet improvements #10542

Open
5 of 20 tasks
posvyatokum opened this issue Jan 31, 2024 · 13 comments

@posvyatokum
Member

posvyatokum commented Jan 31, 2024

Goals

We have a large and versatile toolbox of infrastructure for creating and managing test mock networks, with Forknet and the mirror tool standing out as the most powerful. The goal of this project is to continue the development of Forknet by unifying it with the mirror test infrastructure. The two have a lot in common, but some parts (e.g. the control plane) are done differently because of their different design philosophies. By merging them we will simplify the toolbox and make better use of it.

Background

We had several tests (regular betanet, and the on-demand spoon test) that created a mocknet with simple predefined traffic. We wanted to make the traffic more meaningful, so we developed the mirror toolset: a way to test a binary (or a binary release) via mocknet with a real traffic slice from mainnet. This tool was used only during releases, and only by the Node team, and it was not very easy to use.

Independently, we also developed a toolbox to speed up creation of mainnet forks for testing (Forknet). Now, we want to combine fast test setup with working traffic mirroring, to create a better testing system. In order for it to be as useful as it can be, we are also focusing on user-friendliness.

Context

The project is split into three pillars: Correctness, Performance and Infrastructure.

Why should NEAR One work on this

Short term value:

  • Building confidence in releases
  • Making release testing faster, easier, and cheaper
  • Decreasing debugging time of the mocknet test

Medium term value:

  • Protocol upgrade testing of specific features
  • More granular testing of features included in the next release
  • Improving confidence in the next release before the cut

Long term value:

  • Improving confidence in master without relying on manual testing

What needs to be accomplished

Correctness goal:

  • Gain confidence in the forknet test runs

How we will do that:

  • Build reliable mainnet images for traffic mirroring
  • Support all types of nodes
  • Improve test evaluation

Performance goals:

  • Improve the performance testing on Forknet

How we will do that:

  • Enhance the TPS of mainnet mirrored traffic
  • Support additional traffic from locust
  • Expose metrics for the transactions

Infrastructure goals:

  • Support an easy testing experience for all engineers

How we will do that:

  • One-click setup for forknet tests
  • Improve dev experience of writing a custom test (nayduck)
  • Support multiple test runs on one setup
  • Support multiple types of forknet images (mainnet mirror, custom image, empty image)
  • Unify forknet, localnet and mainnet nodes under the same interface

Links to external documentations and discussions

Assumptions

N/A

Pre-requisites

N/A

Out of scope

Custom test flow development is out of scope. If a feature needs specific network orchestration to be properly tested, we expect the feature engineer to write the orchestration script using the provided tools and examples.

Task list:

Roadmap

Correctness

Performance

Infrastructure

Real life progress

Bugs

  1. wacban
  2. VanBarbascu

Backlog

@posvyatokum
Member Author

2024-01-31 Meeting notes
CC: @marcelo-gonzalez @gmilescu @posvyatokum

We are again focusing our attention on the forknet.
Forknet MVP would support:

  • CI correctness testing of nearcore master
  • correctness and load testing of custom binaries on demand

We have developed two toolboxes for mocknet: Marcelo's mirror tools, and Vlad's forknet tools. We need to combine them into a working solution that hides away most of the complexity from the users, while providing enough flexibility.

To achieve this we will:

  • utilize Vlad's scripts to make creation of a custom network with a slice of mainnet traffic and arbitrary additional load an easy and cheap process
  • utilize Marcelo's neard_runner.py and mirror.py scripts to handle different test scenarios

For a feature developer the flow of using forknet will look like this:

  • Run a GitHub action to create an image with the desired traffic slice, if needed.
  • Run a GitHub action to create a network with the desired setup, slice of traffic, and additional load.
  • Implement supporting functions in neard_runner.py, if needed (start/stop/switch binary/update binary/update config are already implemented).
  • Implement and execute test scenarios using mirror.py. This script uploads and starts neard_runner.py on your forknet cluster. You can then manipulate the nodes however you want.

CI testing using forknet will follow the same flow automatically. A new traffic slice image will be created monthly. A new forknet instance will be created and destroyed weekly. The test flow will simply start every node with the latest master binary.

Our first goal is a stable setup for forknet CI with one of the existing traffic slices.

@posvyatokum
Member Author

2024-02-07 Meeting notes
CC: @marcelo-gonzalez @gmilescu @posvyatokum
The first goal for the project is to test resharding on mocknet with split storage nodes. #10581
This effort will allow us to add split storage nodes to any testing setup in the future.
We are allowing ourselves to build on top of the established mirror infrastructure, as it is fully working at the moment. Its downside is non-optimal performance, which leads to a long wait between starting to set up the test and the start of the test itself. Right now it makes sense for us to gradually adjust the mirror infra to use improved tools (like https://github.com/near/nearcore/tree/master/tools/fork-network), rather than build new testing infrastructure from the ground up. In the end we are aiming for fully optimized performance, without ever losing the ability to do a complete test along the way.

@posvyatokum
Member Author

2024-02-14 Meeting notes
CC: @marcelo-gonzalez @gmilescu @posvyatokum
We are still working on supporting split storage, and have made some progress in issue #10581.
At the same time we took the time to create a better roadmap for the project, and to agree on the order in which features will be delivered.
For people with access, the full doc can be found in Google Drive.

Important previews:

Conclusions

As we limit the scope for the continuation of a forknet project, we hope to have a mocknet test that is:

  • easy to start
  • comfortable to evaluate
  • flexible enough to represent realistic mainnet-like setup

We are actively trying to learn from our mistakes and avoid overengineering. We do not want to improve non-crucial tools, or tools that are not causing significant problems.

Our core values for this project can be summarised in two statements:

  • A result is achieved iff it is part of a complete product that we can use
  • The best resource we have is the energy of our engineers. We should make sure that it is not used in vain.

TLDR table

image

Roadmap table

image

Future plans

@posvyatokum will focus on switching to forknet approach in mirror test setup
@marcelo-gonzalez will focus on supporting all types of nodes in mirror test

@posvyatokum
Member Author

2024-02-21 Meeting notes
CC: @marcelo-gonzalez @gmilescu @posvyatokum
We are focusing on pre-release testing of 1.37.0 #10642. We will use results from #10581 to be sure that all mainnet nodes will be able to go through resharding without problems.
This will close one of our 3 goals for the short term of this project:

Build confidence in resharding on mainnet

@posvyatokum
Member Author

2024-02-28 Update

CC: @marcelo-gonzalez @gmilescu @posvyatokum

Past week

For the past week we were focusing on testing the 1.37.0 release.
We were actively developing tools for mocknet management:

  • the on-node restarting tool automatically restarts nodes based on a metric-based trigger
  • the pour db tool can combine two dbs and delete some columns. That gives us pseudo-archival nodes: nodes that hold a lot of archival data that is meaningless to the blockchain, so it does not interfere with block production, but that approximates the size and key distribution of a real production archival node.
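
The metric-based trigger described above might be sketched roughly as follows. This is only an illustration, not the actual restarting tool: it assumes the node exposes metrics in the Prometheus text format, and the metric name `example_progress_metric`, the threshold, and both helper functions are hypothetical.

```python
def parse_metrics(text):
    """Parse unlabelled Prometheus-style lines into {name: value}.

    Comment lines, labelled series and malformed lines are skipped,
    which keeps this sketch small.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) < 2 or "{" in parts[0]:
            continue
        try:
            metrics[parts[0]] = float(parts[1])
        except ValueError:
            continue
    return metrics


def should_restart(metrics, name, threshold):
    """Return True once the watched metric reaches the threshold.

    A missing metric never triggers a restart.
    """
    value = metrics.get(name)
    return value is not None and value >= threshold


# Hypothetical metrics dump; a real one would be fetched from the node.
sample = """
# HELP example_progress_metric Hypothetical progress gauge
example_progress_metric 0.75
example_labelled{shard="0"} 3
"""
parsed = parse_metrics(sample)
restart_now = should_restart(parsed, "example_progress_metric", 0.5)
```

A real tool would poll the node periodically and restart the process when `should_restart` returns True; how the restart itself is performed is left out here.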

@marcelo-gonzalez discovered problems with resharding, using the restarting tool.
@posvyatokum created pseudo-archival dbs for mirror testing.

Next week

@marcelo-gonzalez will focus on testing the bug fixes for the resharding issue.
@posvyatokum will continue working on a realistic node setup MVP. This includes:

  • Adding a different test initialisation process to mirror.py and neard_runner.py
  • Manually preparing one image for every type of node.
  • Documenting instructions for using the setup.

Our first priority is to thoroughly test resharding before the mainnet release. We will incorporate forknet tools whenever possible, as long as it fits our timeline. We will document every step taken, to make the test easy to reproduce and to automate in the future.

Progress overview

These tasks contribute to all Stage 1 goals:

  • We will build confidence in the 1.37.0 release by testing its biggest feature.
  • We will improve testing experience by adding more types of nodes and allowing a faster setup.
  • We will create draft instructions to follow for the 1.38.0-rc.1 testing.

At the end of the week we aim to complete 1.37.0 testing, and have an updated set of instructions for 1.38 testing.
Testing improvements will come after we have transitioned to 5 shards on mainnet.

@posvyatokum
Member Author

2024-03-04 Update

CC: @gmilescu @marcelo-gonzalez @posvyatokum

Past week

For the past week we were focusing on testing the 1.37.0 release.
@marcelo-gonzalez made sure that issues with resharding after node restart are fixed
@posvyatokum made sure that resharding works on split storage nodes

Next week

@marcelo-gonzalez will focus on helping @VanBarbascu test 1.38.0-rc.1 and create comprehensive documentation for the process
@posvyatokum will create a permanent mocknet for feature testing. This will make mocknet feature testing easier and faster for developers and decrease debugging time during the release testing.

Progress overview

We achieved the goal of increasing confidence in the 1.37.0 release.
@posvyatokum is in the process of making mocknet testing more accessible to the engineers
@marcelo-gonzalez is in the process of creating a clearly established process of pre-release mocknet testing

RoadMap adjustment

@khorolets raised the point of developing easy methods for evaluating test results. Right now all of the work on automatic test evaluation is planned for Stage 3. Looking back, this seems like a very narrow timeline for an important feature.
@posvyatokum will rethink the roadmap for test evaluation, with a focus on a POC in Stage 1. Accuracy of the POC solution is out of scope for Stage 1; we just need to give developers some form of evaluation automation.

@gmilescu gmilescu changed the title 🔷 [ProjectTracking] Forknet 🔷 [ProjectTracking] Forknet improvements Mar 5, 2024
@posvyatokum
Member Author

2024-03-11 Update

CC: @marcelo-gonzalez @gmilescu @posvyatokum

Past week

@marcelo-gonzalez tested resharding of shard 2 in 1.38 release

Next week

@marcelo-gonzalez will continue to test and support 1.38 resharding release
@posvyatokum will hold a protocol discussion about forknet testing for developers. The goal of the discussion is to collect feedback and feature requests from engineers in order to adjust the project roadmap.
@posvyatokum will create a new draft of the forknet testing instructions before the protocol discussion. It will not contain the commands that need to be executed, as they are subject to change, but will instead describe the process that engineers may go through when testing their feature. This document should help us have a more productive protocol discussion.

Progress overview

  • We will focus on defining the scope of issues that we can complete by the end of Stage 1 and that bring the most benefit to the testing experience
  • We have confidence in the 1.37 release
  • We will invest more time in solid testing instructions as we develop new features, but right now we will do some groundwork to make writing instructions easier in the future.

@posvyatokum
Member Author

2024-03-18 Update

CC: @marcelo-gonzalez @gmilescu @posvyatokum

Due to the high density of complicated releases, we have no significant progress to report for this project.
Everything planned in the previous update carries over to this week.

@posvyatokum
Member Author

2024-04-01 Stage 1 Update

CC: @marcelo-gonzalez @gmilescu @posvyatokum

Context

At the start of the project, we broke it down into three stages of increasing length. That allowed us to have concrete long term goals, while not over-focusing on implementing the final product right away. As a result, we prioritized immediate needs, and right now we feel like it is the right approach to this project. Thus, we will again restructure our roadmap in a way that creates some milestones and carves out some vision of the final product, but only gives full definition to the tasks for the next month.

Expectations for Stage 1 (March)

Goals

  • Improve release testing experience
  • Build confidence in resharding on mainnet
  • Have established process for testing 1.38.0-rc.1

Tasks

Not done:

  • Speed up mirror test setup using tools/fork-network

Ad-hoc implementation:

  • Support use of rpc, legacy archival, and split storage archival nodes in the mocknet
  • Allow artificial inflation of DB size to approximate real DB latency in tests

Conclusion

We focused on building the tools that we needed in the moment, and didn't prioritize polishing them or making them part of an established flow. This was mainly due to the extreme workload of the pre-release testing process, which required us to move fast and not break things.

Expectation for April

We see that we were able to successfully manually incorporate different new tools into the established mocknet flow. Now we need to make them a permanent part of the process.
We will focus on creating an end-to-end MVP product tailored to testing the transition from stateful to stateless validation. We will decide on a concrete roadmap in the next Forknet meeting (planned for April 2nd).
We should be mindful about distinguishing the MVP for the whole project from an MVP solution for this particular case, as the full solution requires a lot more automation and areas of freedom for feature developers.

TLDR

In March we did a lot of ad-hoc things to support releases, in April we will create a mocknet CI for stateless validation.

@marcelo-gonzalez
Contributor

#11034 implements the speedup of network startup. After this PR, we can make things faster by changing the way we set up the images, so that in ~/.near/setup/ there's a full NEAR home dir that has had neard fork-network init and neard fork-network amend-access-keys run on it. The scripts should work the same way, with no big difference to the interface.
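
As a rough illustration of the image preparation described above, the two subcommands could be assembled like this. The subcommand names come from the comment itself; passing the home dir through a global `--home` option and the helper name `fork_network_setup_commands` are assumptions made for this sketch, not something the PR confirms.

```python
from pathlib import Path


def fork_network_setup_commands(home, neard="neard"):
    """Build the two commands that pre-process a NEAR home dir for an image.

    Assumption: the home dir is passed with a global `--home` option.
    Nothing is executed here; the caller decides how to run the commands.
    """
    home = str(Path(home).expanduser())
    return [
        [neard, "--home", home, "fork-network", "init"],
        [neard, "--home", home, "fork-network", "amend-access-keys"],
    ]


commands = fork_network_setup_commands("/tmp/near-setup")
for command in commands:
    # To actually run a step: subprocess.run(command, check=True)
    print(" ".join(command))
```

Returning argument lists instead of running them keeps the sketch side-effect free and lets an image build script log or dry-run each step first.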

@VanBarbascu
Contributor

We continue the work on improving Forknet by dividing it into 3 pillars:
Correctness goal: Gain confidence in the Forknet test runs
Performance goals: Improve the performance testing on the Forknet
Infrastructure goals: Support easy testing experience for all engineers

We will start by tackling the primary issue of correctness, with a focus on building reliable mainnet images, as this is currently preventing Forknet from reaching high TPS and supporting a wide range of transaction types.

While running the traffic mirroring, we observed two issues:

  • Accounts running out of balance
  • Invalid accounts being created

The first issue can be addressed by reducing the gas price for the forked network in genesis.json.
For the second issue we need to debug the Forknet image creation.
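
A minimal sketch of the first fix, assuming that genesis.json stores gas prices as decimal strings under the keys `min_gas_price` and `max_gas_price`. Both the key names and the string encoding are assumptions here and should be checked against the actual genesis file; `lower_gas_price` is a hypothetical helper.

```python
import json


def lower_gas_price(genesis, new_price):
    """Return a copy of the genesis config with a reduced minimum gas price.

    Assumed keys: `min_gas_price` and `max_gas_price`, stored as decimal
    strings. The input dict is left unmodified.
    """
    updated = dict(genesis)
    updated["min_gas_price"] = str(new_price)
    # Keep the invariant max >= min in case the cap was set very low.
    if "max_gas_price" in updated and int(updated["max_gas_price"]) < new_price:
        updated["max_gas_price"] = str(new_price)
    return updated


# Illustrative values only, not real mainnet parameters.
example_genesis = {
    "chain_id": "example-forknet",
    "min_gas_price": "100000000",
    "max_gas_price": "10000000000000000000000",
}
patched = lower_gas_price(example_genesis, 1)
print(json.dumps(patched, indent=2))
```

In practice the dict would be loaded from the forked network's genesis.json with `json.load` and written back with `json.dump` before the nodes start.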

@walnut-the-cat
Contributor

Sept 9-13

  • Investigating invalid transactions in the forknet, which mirrors mainnet.
    • Initial tests using 1 mirror node and 1 validator without memtries ran into congestion issues.
    • Switched to a setup with 1 node per shard, including memtries and 1 mirror node, but found that the top of nearcore master is too unstable to keep the chain running beyond two epochs.
    • So far, the only invalid transactions identified are related to insufficient balance.

@VanBarbascu
Contributor

Sept 15-27

The use of Forknet V2 images is currently blocked on 2 items listed in the Roadmap section:

  • Correctness: Reliable Forknet v2 image creation using tools/fork-network
  • Performance: Generate the desired TPS on forknet by mirroring transactions

Addressing these 2 items is key to starting to use forknet for performance testing.
So far we have identified and addressed the invalid accounts in the forknet image (the first blocker). We are now validating the new image.
Last week we identified a crash in the mirror tool and we are working on a fix. This will resolve the second item in the list.

After we are done with these 2 blockers, the next items in the roadmap improve the usability of the forknet, bringing it closer to a one-click setup.

Projects
Status: Prioritised

5 participants