
Partition proactive refresh #504

Merged: JesseAbram merged 11 commits into master from partition-proactive-refresh on Nov 20, 2023

Conversation

Member

@JesseAbram JesseAbram commented Nov 15, 2023

Creates an algorithm for partitioning the network into chunks for the proactive refresh.

  • This allows us to set a constant REFRESHES_PRE_SESSION, and the network will grab a rotating batch of registered keys to refresh
  • The network will "loop around" when all keys have been refreshed, starting the process again
  • This does not take into account that the registered accounts will change over time (new accounts are inserted in sorted order rather than appended onto the registered key struct); will make an issue to handle this: Fix Partition all keys function #510
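The rotating-batch idea described above can be sketched roughly like this (a minimal illustration, not the PR's actual code; names and types are simplified):

```rust
/// A minimal sketch of the rotating-batch partition. `refreshes_done` stands
/// in for the on-chain counter and `all_keys` for the ordered registered keys.
const REFRESHES_PRE_SESSION: usize = 10;

fn partition_all_keys(refreshes_done: usize, all_keys: &[String]) -> Vec<String> {
    if all_keys.is_empty() {
        return Vec::new();
    }
    // Start where the previous session left off, wrapping around the list.
    let start = refreshes_done % all_keys.len();
    all_keys
        .iter()
        .cycle()
        .skip(start)
        .take(REFRESHES_PRE_SESSION.min(all_keys.len()))
        .cloned()
        .collect()
}
```

Because the starting offset is derived from the shared counter, every server that sees the same counter value computes the same batch.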


@JesseAbram JesseAbram force-pushed the partition-proactive-refresh branch from eff91fe to f10877c Compare November 16, 2023 18:51
Collaborator

@HCastano HCastano left a comment

I still need to read through your algorithm code and test a bit. I left some comments and questions for now

@@ -350,6 +356,7 @@ pub mod pallet {
pub fn new_session_handler(
validators: &[<T as pallet_session::Config>::ValidatorId],
) -> Result<(), DispatchError> {
// TODO add back in refresh trigger and refreshed counter https://github.com/entropyxyz/entropy-core/issues/511
Collaborator

Don't think we need this TODO here since we have #511 open.

Suggested change
// TODO add back in refresh trigger and refreshed counter https://github.com/entropyxyz/entropy-core/issues/511

/// Total amount of refreshes done on the network
#[pallet::storage]
#[pallet::getter(fn refreshes_done)]
pub type RefreshesDone<T: Config> = StorageValue<_, u128, ValueQuery>;
Collaborator

So the partition algorithm uses this as a sort of nonce to do its work with? Would it be safe to use a pseudo-random hash or something instead?

Member Author

Yes to the first question. And I want it to go in order, like do the first 10 accounts, then the next 10. If it were pseudorandom, some accounts would get refreshed more times than needed while others wouldn't be refreshed at all.
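To illustrate the fairness argument, here is a toy simulation (not project code): sequential, wrapping batches touch every key exactly once per full pass when the key count is a multiple of the batch size, which pseudo-random selection would not guarantee:

```rust
use std::collections::HashMap;

/// Toy simulation: count how often each key index is refreshed over one
/// full pass of sequential, wrapping batches.
fn simulate_pass(num_keys: usize, batch: usize) -> HashMap<usize, usize> {
    let mut counts: HashMap<usize, usize> = HashMap::new();
    // Ceiling division: enough sessions to cover every key once.
    let sessions = (num_keys + batch - 1) / batch;
    for session in 0..sessions {
        for i in 0..batch {
            let key = (session * batch + i) % num_keys;
            *counts.entry(key).or_insert(0) += 1;
        }
    }
    counts
}
```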

Collaborator

Gotcha thanks.

Is there any risk to validators knowing they were "next up" on being refreshed?

@@ -18,3 +18,6 @@ pub const PRUNE_BLOCK: u32 = 14400;

/// Timeout for validators to wait for other validators to join protocol committees
pub const SETUP_TIMEOUT_SECONDS: u64 = 20;

/// The amount of proactive refreshes we do per session
pub const REFRESHES_PRE_SESSION: u128 = 10;
Collaborator

Suggested change
pub const REFRESHES_PRE_SESSION: u128 = 10;
pub const REFRESHES_PER_SESSION: u32 = 10;

We can get away with using something smaller here for the type, maybe even a u8.

Collaborator

I'm a bit confused by this type. Is this referring to the number of times per session we do the whole proactive refresh algorithm, or the number of validators per session that are split according to the proactive refresh algorithm?

Member Author

This type is u128 so I don't have to do extra conversions; I guess the question is what we optimize for, a smaller type or fewer conversions in the partition function.

It is how many accounts we do a proactive refresh on per session, so if we have 1000 registered accounts we would take them x accounts per session to not overstress our validators.
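As a worked example (illustrative numbers only, with the batching expressed as a hypothetical helper): 1000 registered accounts refreshed 10 at a time take 100 sessions for a full pass:

```rust
/// Illustrative arithmetic only: sessions needed for one full pass over all
/// registered accounts, using ceiling division so a short final batch still
/// counts as a session.
fn sessions_per_full_pass(registered_accounts: u32, refreshes_per_session: u32) -> u32 {
    (registered_accounts + refreshes_per_session - 1) / refreshes_per_session
}
```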

Collaborator

Left #513 as a suggestion for changing things to a u32

/// Partitions all registered keys into a subset of the network (REFRESHES_PRE_SESSION)
/// Currently rotates between a moving batch of all keys
/// https://github.com/entropyxyz/entropy-core/issues/510
pub async fn partition_all_keys(
Collaborator

We should be able to have this as a non-async function


/// Partitions all registered keys into a subset of the network (REFRESHES_PRE_SESSION)
/// Currently rotates between a moving batch of all keys
/// https://github.com/entropyxyz/entropy-core/issues/510
Collaborator

Suggested change
/// https://github.com/entropyxyz/entropy-core/issues/510

Same thing here, we've got the issue so we're good

Member Author

idk man personally I like to keep these in the code base

Collaborator

Okay, made a different suggestion in another comment then

Comment on lines 269 to 270
let all_keys_length = all_keys.len() as u128;
let usized_refreshed_pre_session = REFRESHES_PRE_SESSION as usize;
Collaborator

There's a lot of type casting going on in this function. Maybe we should try and change the types so they work together a bit better

Member Author

ya so the issue here is that I need usizes for the len and the slicing, but I was not able to get Substrate to compile with usizes, so I opted for u128 and converting to usize where needed
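The tension can be seen in a small sketch (a hypothetical helper, not the PR's code): `len()` and slice bounds want `usize`, and a `u32` constant converts to `usize` infallibly on 32- and 64-bit targets, avoiding the `u128` round-trips:

```rust
const REFRESHES_PER_SESSION: u32 = 10; // the reviewer-suggested u32 form

/// Hypothetical helper showing the conversion path: `u32 -> usize` is
/// infallible on 32- and 64-bit targets, so no `u128` casts are needed
/// for the length comparison or the slice bound.
fn first_batch(all_keys: &[String]) -> &[String] {
    let batch = usize::try_from(REFRESHES_PER_SESSION).expect("u32 fits in usize");
    &all_keys[..batch.min(all_keys.len())]
}
```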


#[tokio::test]
async fn test_partition_all_keys() {
initialize_test_logger();
Collaborator

This isn't strictly necessary since there's no logging in partition_all_keys. I guess there could be in the future though

Member Author

ya I mean it just seemed like the thing to do, idk, your call: should I remove or keep it?

Collaborator

Let's remove it then. I've done that in my u32 PR

/// Partitions all registered keys into a subset of the network (REFRESHES_PRE_SESSION)
/// Currently rotates between a moving batch of all keys
/// https://github.com/entropyxyz/entropy-core/issues/510
pub async fn partition_all_keys(
Collaborator

How come this is being done by the server instead of on-chain?

Member Author

To avoid undue stress being put on chain; having a subset of validators do this instead of every node in the network seems like a better call

Collaborator

Can there not be problems if validators don't end up coming to agreement on how the network has been partitioned? That's why I'm thinking it might be better to have the chain be the canonical source of truth here

@ameba23
Contributor

ameba23 commented Nov 16, 2023

I'm trying to understand this - I think I don't totally understand the process by which proactive refresh gets activated in the first place.

It looks like the propagation pallet sends out OcwMessageProactiveRefresh messages to all TSS servers:

pallet_relayer::Pallet::<T>::get_validator_info().unwrap_or_default();

and then in this PR, those TSS servers ask the staking pallet for the RefreshesDone value, by which they decide which accounts should be proactively refreshed this time around. But I'm missing where RefreshesDone gets updated.

Also, could we put RefreshesDone into OcwMessageProactiveRefresh?

The algorithm itself looks great! But from what you said on the dev call I have the impression you're not totally happy with this setup. And I still don't understand what the issue is relating to how rocks/paritydb orders keys.

I'd be up for having a call sometime to look at this in more detail.

@JesseAbram
Member Author

> I'm trying to understand this - I think I don't totally understand the process by which proactive refresh gets activated in the first place.
>
> It looks like the propagation pallet sends out OcwMessageProactiveRefresh messages to all TSS servers:
>
> pallet_relayer::Pallet::<T>::get_validator_info().unwrap_or_default();
>
> and then in this PR, those TSS servers ask the staking pallet for the RefreshesDone value, by which they decide which accounts should be proactively refreshed this time around. But I'm missing where RefreshesDone gets updated.
>
> Also, could we put RefreshesDone into OcwMessageProactiveRefresh?
>
> The algorithm itself looks great! But from what you said on the dev call I have the impression you're not totally happy with this setup. And I still don't understand what the issue is relating to how rocks/paritydb orders keys.
>
> I'd be up for having a call sometime to look at this in more detail.

ya we can have a call tomorrow

the activation of proactive refresh can be seen in #511, which includes the incrementing of the refreshed counter

and yes, that can def be sent in the OCW message
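If the counter were carried in the OCW message, it could look roughly like this (a hypothetical sketch only; the real OcwMessageProactiveRefresh and ValidatorInfo in entropy-shared differ in fields and derives):

```rust
/// Hypothetical sketch only; the real `ValidatorInfo` and
/// `OcwMessageProactiveRefresh` live in entropy-shared and differ in detail.
#[derive(Clone, Debug, PartialEq)]
pub struct ValidatorInfo {
    pub x25519_public_key: [u8; 32],
    pub tss_account: Vec<u8>,
    pub endpoint: Vec<u8>,
}

#[derive(Clone, Debug, PartialEq)]
pub struct OcwMessageProactiveRefresh {
    pub validators_info: Vec<ValidatorInfo>,
    /// Carrying the counter in the message spares TSS servers a storage query.
    pub refreshes_done: u128,
}
```

Bundling the counter into the message would also mean every recipient partitions against the same snapshot of RefreshesDone, rather than racing a separate chain query.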



Comment on lines 262 to 264
/// Partitions all registered keys into a subset of the network (REFRESHES_PRE_SESSION)
/// Currently rotates between a moving batch of all keys
/// https://github.com/entropyxyz/entropy-core/issues/510
Collaborator

Okay no worries. Let's add a bit more context here then. How about something like:

Suggested change
/// Partitions all registered keys into a subset of the network (REFRESHES_PRE_SESSION)
/// Currently rotates between a moving batch of all keys
/// https://github.com/entropyxyz/entropy-core/issues/510
/// Partitions all registered keys into a subset of the network (REFRESHES_PRE_SESSION)
/// Currently rotates between a moving batch of all keys.
///
/// See https://github.com/entropyxyz/entropy-core/issues/510 for some issues which exist
/// around the scaling of this function.

* Change refresh session type to `u32`

* Fix typo

* Remove `Result` return type

* Remove `unwrap`s from test
@JesseAbram
Member Author

#504 (comment) @HCastano sorry, outdated so I couldn't respond, but that's possible only in a DDoS scenario; if that is the case it would have to be fixed in DKG too, so this should be a standalone issue

Contributor

@ameba23 ameba23 left a comment

Approving this, after getting clear on the ordering issues on a call. Looks great!

@@ -86,6 +86,12 @@ pub mod pallet {
pub x25519_public_key: X25519PublicKey,
pub endpoint: TssServerURL,
}
/// Info that is required to do a proactive refresh
#[derive(Clone, Encode, Decode, Eq, PartialEq, RuntimeDebug, TypeInfo, Default)]
pub struct RefreshInfo {
Contributor

This struct is very similar to OcwMessageProactiveRefresh. I'm guessing the reason you didn't use the same type is because of compilation issues with using the RuntimeDebug and TypeInfo traits in entropy-shared. Possibly we could get that to work by fiddling around with feature flags, but I think this is fine as it is.

Member Author

yes lol this is exactly what happened

@JesseAbram JesseAbram merged commit 7b7aa10 into master Nov 20, 2023
5 checks passed
@JesseAbram JesseAbram deleted the partition-proactive-refresh branch November 20, 2023 15:02