-
Notifications
You must be signed in to change notification settings - Fork 4.5k
Conversation
a9094e3
to
4334d81
Compare
7fd50bc
to
90de432
Compare
90de432
to
0832281
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this. Here's my initial comments:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some thoughts about the general design of this PR. Feel free to schedule a VC if that's something can speed up the discussion.
d25bb3c
to
a6619d1
Compare
9037968
to
7b26a02
Compare
@steviez please see my note here #27729 (comment) Will be happy to test this branch when it is ready for mainnet. Let me know. |
Hi @offerm, no need to post in both places, I saw your comment but just hadn't gotten around to responding. In regards to timing / testing, thank you for offering. I have been running a node for a bit, and have promising data. However, the issue is that this PR changes the layout of SlotMeta in the blockstore. As such, it requires some tact to rollout while leaving a potential path to fallback. See the compatibility & rollout strategy sections of this PR for more details. But, to state plainly, if you start running this branch and wish to revert, you will have to wipe your blockstore (rocksdb). Getting the network fully on v1.13 (with QUIC only) is a priority, once that has happened and we have confidence that there are no issues, I'll begin executing the steps in rollout strategy. Given that the above is a blocker, I haven't been pushing this PR along super hard. |
7b26a02
to
781bea6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this.
I still feel that it's probably better to have three functions:
- a function (I don't have a proper name) that sets
01
bit to true. In the normal case, the condition is is_full.- the only exception where it can bypass is_full check is the
set_and_chain_connected_on_root_and_next_slots
function introduced in this PR, which set both11
bits to true, and the function should only be called when the slot is from a snapshot.
- the only exception where it can bypass is_full check is the
- set_parent_connected() that sets
10
bit to true. - is_connected() checks whether we have both
11
bits to be true.
nice! |
04d19ab
to
060208d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for running experiments. The results look promising!
The PR looks good. Only minor comments.
060208d
to
d1ac6f3
Compare
Pull request has been modified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
d1ac6f3
to
f0b8331
Compare
Pull request has been modified.
f0b8331
to
4ba4777
Compare
Rebased on the tip of master for the sake of leaving a test node running - no code changes, but also no need for reviewers to look again. This previously had ship it, and we are still waiting for mnb to get onto v1.13.6 or v1.14.x |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This previously had ship it, and we are still waiting for mnb to get onto v1.13.6 or v1.14.x
should be fine to merge into master now that 1.15.0 is cut. just hold the bp
After a nudge from @t-nelson to check,
As noted in the main PR description (see compatibility and rollout strategy sections), I wanted to see this happen in the name of being very cautious. But, with above I think we're there so I'm going to rebase on tip of master to make sure nothing broke while this PR has been sitting on the back-burner. With a good CI run, I feel good shipping to master and we can then separately consider backporting |
4ba4777
to
028287d
Compare
The tracking of connected status was previously based upon an assumption that would be practically false for all validators. The connected status of slots played into whether Blockstore would signal ReplayStage that it had new shreds ready to be replayed. Prior to the change, we would never signal and ReplayStage would always wait the entire duration of a 100ms timeout before restarting its' main processing loop. This commit introduces a change where we mark snapshot slots as connected. A validator may not have a path all the way back to genesis itself; however, snapshots are taken at known roots so we extend the connected status to these slots. Once a node has been bootstrapped once to have is connected, the logic persists in Blockstore such that all children on the main fork also get their connected status updated properly.
028287d
to
a692c2c
Compare
Glad I checked, clippy requirements have changed and something I was doing is no longer allowed: |
Codecov Report
@@ Coverage Diff @@
## master #28069 +/- ##
=========================================
- Coverage 81.7% 81.6% -0.1%
=========================================
Files 723 723
Lines 201802 201891 +89
=========================================
+ Hits 164891 164917 +26
- Misses 36911 36974 +63 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Ship it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remind me of the prospects for making this sufficiently for/backward compatible for a safe 1.14 backport? 1.13?
See "Rollout Strategy" for more details, but this PR is compatible in both directions with |
First off, kudos to @offerm for initially spotting something was awry in #27729 and #27786.
Problem
The SlotMeta
is_connected
field was found to not be functioning as expected. Namely,is_connected
was permanently false for nodes starting from snapshots because the supplied definition of is_connected requires a "path" back to genesis. See more details in #27820Summary of Changes
Change
is_connected
boolean to bitflags to allow tracking more stateNow, when
load_frozen_forks()
is called, the first slot will be marked as connected, as starting will either be at slot 0 or from a snapshot. Snapshots are root slots, so we want to allow this slot to start anis_connected
chain as well.Compatability
The offered solution as-is in this PR offers compatibility in the scenario where an operator upgrades their client to include this features, but keeps their ledger from a version that did not include this feature.
However, the scenario of downgrading client to a pre-this-feature version and keeping a ledger around that was populated with the this feature is not supported. Trying to deserialize a byte into a bool where any bit except the LSB is 1 will error.
To address this concern, #29001 was introduced and backported. This PR introduces the new bitflags and fields; however, it only uses one bit (the LSB). With this PR rebased on top of 29001, this PR doesn't introduce any new bits; it just starts using the bit that was introduced in 29001.
Perf Impact
#28069 (comment) details an experiment I ran that shows promising results between two unstaked nodes (one with the change, one without) running against mnb.
Rollout Strategy
The following things need to happen before this commit can be pushed out:
SlotMeta
instead of a boolFixes #27820