Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIMD 0072: Feature Gate Threshold Automation #72

Open
wants to merge 31 commits into
base: main
Choose a base branch
from

Conversation

buffalojoec
Copy link
Contributor

@buffalojoec buffalojoec commented Oct 12, 2023

This is SIMD 3/4 expected for Multi-Client Feature Gates. See #76

Summary

This SIMD outlines a proposal for automating the feature activation process
based on a stake-weighted support threshold, rather than manual human action.

With this new process, contributors no longer have to assess stake support for a
feature before activation. Instead, the assessment is done by the runtime.


A reference implementation for the Feature Gate program introduced in #89 and mentioned in this SIMD can be found here.

@buffalojoec buffalojoec changed the title feature-gate-threshold-automation init 0066: Feature Gate Threshold Automation Oct 12, 2023
proposals/0066-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0066-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0066-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0066-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0066-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0066-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0066-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0066-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0066-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
@buffalojoec buffalojoec changed the title 0066: Feature Gate Threshold Automation 0072: Feature Gate Threshold Automation Oct 19, 2023
Copy link
Contributor

@ripatel-fd ripatel-fd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for submitting this SIMD - I'd like to ask about some clarification.

On first read, it wasn't immediately clear to me whether this change will install more human approval or less. Would it be possible to clarify in the summary whether:

  1. Validator software will automatically signal all of its supported features when started, without further prompt from the operator.
  2. Validator operators will now have to manually arm for feature gates via a validator CLI command.

The summary mentions that the process is automated and manual human action is removed, which would imply 1. But further below it mentions that the operator runbook now includes the new transaction step, which would imply 2.

FWIW, I oppose mechanism 1 because it'd make it virtually impossible to deploy a Labs-independent client safely. For various reasons, I expect Firedancer (validator software) to take a bit longer to implement new feature gates, and have a small minority of stake for the forseeable future. I'm worried that in mechanism 1, Labs nodes would just overrule other nodes all the time, and activate features that are not supported by all implementations.

@CriesofCarrots
Copy link
Contributor

  1. Validator software will automatically signal all of its supported features when started, without further prompt from the operator.

    1. Validator operators will now have to manually arm for feature gates via a validator CLI command.

The summary mentions that the process is automated and manual human action is removed, which would imply 1. But further below it mentions that the operator runbook now includes the new transaction step, which would imply 2.

@ripatel-fd :
In this SIMD, we describe mechanism #1, but that is actually a client-specific design. Other clients or forks of clients could remove that automatic behavior and have validators signal feature support another way or on another cadence.

However, the more fundamental principle of this proposal is that actual activation of the feature is automatic based on percentage of stake support, default 95%. While the Labs client no longer represents anywhere near 95% of stake, it's true that no one entity or client implementation (including Firedancer) would be able to block activation en masse until they represent 5+% of stake.

Even if this SIMD is approved and implemented, feature-gate key-holders can continue to negotiate with all the clients to determine when to queue the feature for activation, as we do now -- and we envision separate SIMDs around formal governance on approving/queuing proposed features. So this SIMD wouldn't prevent the level of human control that currently exists.
But if we should use something other than stake-weight to determine how many validators are compatible with a particular feature, this is the place to discuss it.

Copy link
Contributor

@lheeger-jump lheeger-jump left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this proposal is a much needed one. As I have mentioned though, in my comments, I think we need a separate SIMD for the new program and everything about how it will work.

I also think some consideration as to how this can tie into a governance-based mechanism would be good. For example, governance could approve the creation of new features, the description of such a vote would specify exactly what the new feature will do.

Good stuff so far!

proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
@lheeger-jump
Copy link
Contributor

lheeger-jump commented Oct 19, 2023

We (@ripatel-fd and I) spent some more time discussing this proposal and we had several more comments on it:

  • The proposed changes implicitly hard fork on client software updates. There is a significant issue with the implicit activation of features on hard forks. This delegates the power to change consensus rules entirely to Solana Labs, which contradicts the expectation that the blockchain is governed by validators, stakers, holders, etc. Although it is a conscious choice for validators to upgrade their software, regular updates to software are a forcing function causing the activation of features. Similarly, a security fix shipped with a new features will certainly cause majority of stake to take the new update, even if they do not want the new feature activation.
  • This poses a particularly hard problem for adoption of Firedancer: If we start to deploy to mainnet beta, we will most definitely start out with less than 5% of stake. This would that in order to even have a chance of being stable on mainnet we would need more than 5% of stake. Its a major bootstrapping problem.
  • We need to see a specification for the structure of the new feature accounts.

We propose that there is 3 phase system for feature gate ratification:

  • Creation: anyone can create a new feature gate account, which is owned by the feature program. Creating a new feature gate must not have any additional runtime overhead.
  • Version availability: Each leader at the end of each slot, increments a counter in the feature accounts for all features they support, but are not activated. They also update a field in the feature account with the slot they are currently on. If this "last_updated_slot" is from the previous epoch during this process, store the previous counter and last updated field in the account in "previous epoch fields" and set the counter to 0. The feature becomes "available" or "armed for activation" when: 1) that previous epoch counter is very nearly 100% of slots_per_epoch and, 2) the previous epoch last_updated_slot field is in the immediately previous epoch. A feature is disarmed when the conditions are not met.
  • Activation: This should be an explicit, manual, on chain governance vote based on validator stake weight with some quorum. The properties of this vote (its duration, quorum, delay to activation after passage, participants) are to be determined but a majority vote for the new feature must result in the automated activation of the feature at an epoch boundary. This gives all stakeholders time to vie for, or protest the new feature. The feature must remain armed until the completion of the vote, if it becomes disarmed and then later rearmed, the vote should probably restart. An on-chain program should be devised for this process prior to the feature gate change.
  • After activation, a feature becomes active forever and leaders can stop updating the feature account. Clients can remove the gates in the code.

We would be happy to collaborate on proposal(s)/specification(s) which go in the above direction. As it stands, it introduces the aforementioned bootstrapping issue, which would give no recourse for the community to dissent to new features. This could prolong the development of clients with less resources.

More importantly than our own concerns though (which are admittedly self-interested) this proposal would further Solana Labs' hegemony on the chain via their control of client features. This may be against the wants of validators who actually control the blockspace of Solana. It should not be the version of the Solana Labs client or the clients implicit capabilities which control the blockspace, it must be the people.

@CriesofCarrots
Copy link
Contributor

We propose that there is 3 phase system for feature gate ratification:

@lheeger-jump , thanks for your thoughts. To clarify, the system we envision is 2 phases:

  • Feature queuing: currently done manually by a single person, but to transition to an explicit governance vote. This creates the feature account (but this could be split into two steps -- account creation by "anyone" and some update of account state by the governance signer to indicate queuing -- if this division is needed).
  • Version availability: the proposal in this SIMD

To me, these two steps are synonymous to the Version Availability and Activation steps you describe, with the same essential stake and participation requirements, just in the opposite order. Can you please expand on how changing the order of these steps changes the amount of control any particular client wields?

@ripatel-fd
Copy link
Contributor

ripatel-fd commented Oct 19, 2023

Thanks for clarifying @CriesofCarrots. I should have read the SIMD more carefully. I think we are ideologically aligned in that this is only a "feature support discovery" mechanism, and that the actual activation logic can be specified in a separate PR. This is definitely a step in the right direction. The above criticism regarding risk to another clients that me and @lheeger-jump posted is therefore invalid.

We had some thoughts about how the permissioned part can be removed (adding new features to the queue).
These changes would also allow keeping the logic entirely as eBPF, such that zero protocol-level complexity is introduced.

It would involve changing the negotiation API like so:

  • Feature support advertisement transactions are only posted by leaders in every rotation
  • Support is accumulated in a variable-size window that is a multiple in an epoch
  • During each window, each feature account tracks the number of leader rotations during which support for this feature has been advertised
  • Instead of writing a bitmap, explicitly call a feature program instruction and write to the corresponding feature account
  • Feature gate support can be calculated by either advertised_rotation_cnt*4/slots_per_epoch (share of total stake) or advertised_rotation_cnt*4/blocks_produced_in_epoch (share of active stake)
  • A feature can be added to the queue by the feature account authority

Example API

enum FeatureInstruction {
    /// Initializes a feature account
    /// # Account references
    ///   0. `[writable]` The feature account
    ///   1. `[signer]` The feature authority
    Initialize,

    /// Closes a feature account without nominating it
    /// # Account references
    ///   0. `[writable]` The feature account
    ///   1. `[signer]` The feature authority
    ClosesFeature,

    /// Advertise support for feature. Can only be called by current leader, once per rotation.
    /// # Account references
    ///   0. `[writable]` The feature account
    ///   1. `[signer]` Current leader
    LeaderAdvertise,

    /// Nominates a feature for a governance vote.
    /// # Account references
    ///   0. `[writable]` The feature account
    ///   1. `[signer]` The feature authority
    Nominate,
}

The disadvantage of this change is that it increases the amount of transactions and data required to activate a feature.
(Also relevant comment by @CriesofCarrots: #72 (comment))

@lheeger-jump
Copy link
Contributor

@lheeger-jump , thanks for your thoughts. To clarify, the system we envision is 2 phases:

  • Feature queuing: currently done manually by a single person, but to transition to an explicit governance vote. This creates the feature account (but this could be split into two steps -- account creation by "anyone" and some update of account state by the governance signer to indicate queuing -- if this division is needed).
  • Version availability: the proposal in this SIMD

To me, these two steps are synonymous to the Version Availability and Activation steps you describe, with the same essential stake and participation requirements, just in the opposite order. Can you please expand on how changing the order of these steps changes the amount of control any particular client wields?

This should be more clear in the SIMD and should be very explicit. Thanks for the clarification. Nonetheless, the arming mechanism described is relevant to the discussion.

@lheeger-jump
Copy link
Contributor

To me, these two steps are synonymous to the Version Availability and Activation steps you describe, with the same essential stake and participation requirements, just in the opposite order. Can you please expand on how changing the order of these steps changes the amount of control any particular client wields?

The 3 steps I described are to allow anyone to create a new feature, but not to allow them to be able to be the activator (that would be governance). I think what I am saying is that anyone can do creation (i.e. anyone can propose new features) and that is separate from activation which should not be voted on. I think you basically say that here:

(but this could be split into two steps -- account creation by "anyone" and some update of account state by the governance signer to indicate queuing -- if this division is needed).

Can you please expand on how changing the order of these steps changes the amount of control any particular client wields?

By splitting up the first step into creation and then activation after version availability is satisfied, you disallow the feature owner from ramming the feature on and the choice to enable is explicit. Less about client control. The client control is more about a lack of governance on feature activation.

@buffalojoec buffalojoec changed the title 0072: Feature Gate Threshold Automation SIMD 0072: Feature Gate Threshold Automation Oct 27, 2023
@buffalojoec buffalojoec force-pushed the simd-feature-gate-threshold-automation branch from 546b3ab to d1817fa Compare January 26, 2024 03:49
@buffalojoec buffalojoec force-pushed the simd-feature-gate-threshold-automation branch from ca3d674 to 2497aee Compare February 5, 2024 21:51
@buffalojoec
Copy link
Contributor Author

buffalojoec commented Feb 5, 2024

Hey everyone! It's time to dust this one off again now that we have #88 and #89 merged!

We've introduced a completely revised version of this proposal, re-written from scratch. It should assess all of the discussion points raised above in previous threads, as well as numerous concerns raised in various offline discussions.

I'd love it if we can open back up the review process for this one, and start working toward a new feature activation process!

@CriesofCarrots @lheeger-jump @ripatel-fd

P.S. I've included a link to the Feature Gate program's reference implementation in the PR description.

@lheeger-jump
Copy link
Contributor

@ripatel-fd and I can take another look in the next few days

Copy link

@topointon-jump topointon-jump left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the general approach!

proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
Copy link

@topointon-jump topointon-jump left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good, added some clarifying comments/questions, the answers to which would be good to have in the SIMD text.

Basically trying to be explicit about all assumptions/details, so that a client can implement this SIMD using just this SIMD.

Nice one!

proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
@buffalojoec buffalojoec force-pushed the simd-feature-gate-threshold-automation branch from a643bdf to 177b45a Compare May 13, 2024 14:31
Copy link
Contributor

@CriesofCarrots CriesofCarrots left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One substantive question, otherwise mostly suggestions for clarity.

proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
Comment on lines 179 to 189
Transactions sent too late (< 4500 slots before the epoch end) will be rejected
by the Feature Gate program.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need any deadline/restriction? I see the program cache being alluded to, but I don't see any specification that changes to the Staged Features PDA must begin or conclude at some point earlier than the epoch boundary. It's not obvious why the Feature Gate program can't continue to modify the Staged Features PDA up to the epoch boundary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you suggesting to explicitly explain the reasoning for this constraint in the SIMD, or are you asking why we actually need such a constraint? In either case, I agree it should be more explicit in the proposal.

It seems to me that @Lichtso intends to increase the cache preparation phase's initiation to be sooner in the epoch (comment), so we probably need a more concrete explanation as to why the Feature Gate program should have this constraint.

In this proposal, it was initially set to 128 slots remaining in the epoch because the cache preparation stage begins at a maximum of 128 slots before the end of the epoch, as computed here.

It's worth calling out that if we configure this constraint in the Feature Gate program and then later need to adjust it for an earlier cache preparation start, we'll be bound to using a feature gate for that change to the program.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm asking why we need a constraint at all. The cache being referenced is supposedly an agave-specific implementation, not part of the runtime protocol. If feature activation needs to work around any aspect of cache behavior, that seems like (a) a problem; and (b) a thing that definitely needs to be specified for all clients.

Copy link
Contributor Author

@buffalojoec buffalojoec May 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's really nothing stopping the cache from being able to anticipate the upcoming feature environment by querying the PDA under the Feature Gate program that would tell it what the current stake support is for each feature. It's actually arguably equally or more deterministic than the way it's done now.

The cache could run the calculation at its current 128 slots before epoch boundary and probably get a pretty good understanding of which features have enough stake support. Increasing that slot buffer beyond 128 could just mean increasing the likelihood that not enough stake has been signaled for a feature. Perhaps this is a compromise the cache implementation should consider? After all, I believe recompilation is an optimization.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a commit removing the "slots remaining" constraint, since I think the above is a sensible direction to head in. If there's explicit opposition, it can be added back in.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for my imprecise formulation, I meant scheduled for activation.
If a feature is scheduled for activation in some slot close to the boundary, then it is possible that the TX is not yet rooted. Thus, some validator nodes will not have it scheduled, and will not activate it on the boundary.

Copy link
Contributor

@CriesofCarrots CriesofCarrots Jun 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please explain how this is different from the cluster coming to consensus on any other tx/block on which other logic depends.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For features which only affect their own fork this is irrelevant. But for features which change validator global behavior this would break. And I am not sure if there is much to be gained from making exceptions or harving different feature types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just adding an additional point of clarity here, if it helps.

Right now, in Agave, the cache is calling compute_active_feature_set, which will compute the feature set based on the on-chain accounts that have been created under Feature11111... (ie. "activated"). The cache does this ~128 slots before the epoch boundary to prepare the cache entries for the anticipated upcoming environment.

With the proposed system in place, the cache can still call compute_active_feature_set the same way it does now. The function will just compute the anticipated feature set from the proposed PDA (with stake support) rather than just the presence of on-chain accounts.

In almost every way it's the same as what we have now. The subtle difference is that someone can no longer activate a feature after the cache has done its feature-set computation, but nodes could still cast votes (signals) on a feature, bringing it to "eligible" status. This edge case exists today and just causes cache preparation/recompilation to be moot.

What @CriesofCarrots is suggesting is that we don't explicitly architect this mechanism around the cache, which is sensible, especially considering the cache preparation may change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Lichtso do you still have concerns here? I'd like to move this SIMD along.

@buffalojoec buffalojoec force-pushed the simd-feature-gate-threshold-automation branch from 4d2b500 to 5ca6da6 Compare May 22, 2024 18:32
@buffalojoec buffalojoec force-pushed the simd-feature-gate-threshold-automation branch from 5ca6da6 to 58cb436 Compare May 22, 2024 18:38
Copy link
Contributor

@CriesofCarrots CriesofCarrots left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple small comments on the new PDA wording. Just the question about "slots remaining" constraint to be dealt with, as far as I'm concerned.

proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
proposals/0072-feature-gate-threshold-automation.md Outdated Show resolved Hide resolved
CriesofCarrots
CriesofCarrots previously approved these changes Dec 30, 2024
@buffalojoec
Copy link
Contributor Author

@jstarry as per our conversation, I've added a commit to update the design, which does appear to simplify things a bit.

For everyone participating, here's a breakdown of what we chatted about.


Before commit b14cad2 (at commit 4591f55), the design had a few limitations.

  • The use of a global multisig imposed a cat-herding burden on all feature activations.
  • There existed no capability for configuring different staging authorities per feature.
  • Node support signals were lost after four consecutive epochs.
  • An arbitrary limit of 8 activations per epoch was imposed.

With this latest commit, we can address these issues. TL/DR:

  • Move “staged feature” state into the feature accounts themselves, and remove the central “stage” PDA.
  • Require nodes to signal support for all pending features, minus any they intentionally omit.
  • Node signals land in a PDA derived from their vote address, the feature ID, and the epoch. One account per node, per feature, per epoch.

Pros:

  • No global multisig
  • Configurable staging authorities
  • Perpetual node signal history
  • No arbitrary stage size limit
  • Staged features remain pending across epochs (not automatically withdrawn)

Cons:

  • Slightly larger transaction for node operators to signal
  • Rent cost incurred per epoch for node operators

cc @topointon-jump

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants