Proposed attnets revamp #2749
Comments
This solution is really neat; having an enforceable condition is a great improvement. However, it must be noted that for small stakers with a single key this requirement doubles their attestation processing cost, while for large stakers / pools it'll represent a tiny increase, going against
Post sharding, will single-key stakers be expected to fully validate two shards if their random subnet != current duty subnet? |
@djrtwo Should the 'random' word be removed here? As I understand |
Good proposal from my perspective 👍 It also has an extra benefit: it would make validator nodes not so trivially discoverable (currently it is even easy to estimate the number of validators per node from the attnets advertised via discovery). However we need to consider a potential attack on a single attestation subnet: an attacker could craft specific node-ids that all map onto the target subnet. It might make sense to think of mixing a randao value into the subnet computation. |
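For illustration only, a minimal sketch of what mixing a randao value into the mapping could look like (the function name, parameters, constant values, and use of sha256 below are assumptions, not something proposed in this thread verbatim):

```python
from hashlib import sha256

ATTESTATION_SUBNET_COUNT = 64
SUBNET_DURATION_IN_EPOCHS = 256  # illustrative value only

def compute_subnet_with_randao(node_id: int, epoch: int, randao_mix: bytes) -> int:
    # Mixing an unpredictable chain value (randao) into the node-id -> subnet
    # mapping means an attacker cannot mine node-ids for a target subnet far in
    # advance; the mapping only becomes known once the period's randao is revealed.
    period = epoch // SUBNET_DURATION_IN_EPOCHS
    seed = sha256(node_id.to_bytes(32, "big") + period.to_bytes(8, "little") + randao_mix).digest()
    return int.from_bytes(seed[:8], "little") % ATTESTATION_SUBNET_COUNT
```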
yeah, I guess I mean that you get an effective pseudo-random distribution. Will update |
right! an incredible amount of information is leaked today.
I see. I suppose you can attempt to flood the neighbor DHTs with peer-ids of a particular type so that when a node searches for a subnet, they find these "fake" subnet peers. I would argue that the damage of eclipsing a small set of nodes wrt an attestation subnet is likely limited -- a few validators might not get their individual attestations included. A targeted DoS could do the same thing with likely even fewer resources. Additionally, gossipsub's IWANT/IHAVE and other mechanisms are primarily for these types of cases -- to aid dissemination of data even under various attacks on subnets.
Right, but this is likely better in the long run than our status quo. In the event that you have a 10:1 (or even 100:1) user_node:staking_node ratio, attacking subnets becomes harder and harder as a function of total network nodes.
Ah, okay. So you can't pre-compute a node-id table that will be useful for all time. This seems reasonable if you had short subscription times, but I think you likely want subscription times on the order of, at a bare minimum, 1 hour (or at least multiple epochs) to reduce subnet backbone churn. Once you get to 1 hour or 1 day, an attacker would be able to easily "mine" node-ids with a given randao. I suppose it makes it harder to get your peers positioned in the DHT though |
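To make that "mining" cost concrete, a rough sketch (assumptions: `subnet_for` is a stand-in for whatever node-id-to-subnet function ends up being used, and key generation is simplified to hashing a random secret rather than deriving a real discv5 identity). With 64 subnets, about 64 trials are expected per node-id landing on the target subnet, which is cheap over a 1-hour or 1-day persistence period:

```python
import os
from hashlib import sha256

ATTESTATION_SUBNET_COUNT = 64

def subnet_for(node_id: int) -> int:
    # Stand-in mapping: top 6 bits of a 256-bit node-id select one of 64 subnets.
    return node_id >> 250

def mine_node_id(target_subnet: int):
    # Keep generating candidate identities until one maps to the target subnet.
    # Expected number of trials is ATTESTATION_SUBNET_COUNT (= 64).
    while True:
        secret = os.urandom(32)
        node_id = int.from_bytes(sha256(secret).digest(), "big")
        if subnet_for(node_id) == target_subnet:
            return secret, node_id
```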
Yep, the present solution looks even more vulnerable to me.
Yes, that's totally not a silver bullet, but it limits an attacker's lead time to 1 hour / 1 day. |
Additional argument pro this solution: I'm now gathering mainnet network statistics and it looks like large validator nodes (with many att subnet subscriptions) tend to restrict inbound connections: i.e. they either immediately drop the connection or keep you connected for just several seconds and then disconnect. That behavior seems reasonable when you have much at stake, since every inbound connection adds more risk of being attacked. So I believe this tendency could evolve further with time and it would become even harder to join a subnet. UPD: some connection statistics:
Probability of being connected to a node (connected here means "negotiating on the Eth2 level and then staying connected for at least 5 slots"), grouped into 'small' validator nodes (from 1 to 31 att subnets) and 'large' validator nodes (more than 31 att subnets): So large validator nodes are indeed harder to connect to, but the overall statistics don't look too dramatic |
I would argue that, due to our attestation subnet search requirements, all nodes should have both a Otherwise, node connections can become far too rigid and peer discovery for particular purposes (e.g. finding a subnet) can become more prone to error. Lighthouse uses this mechanism and has since genesis. We might even consider specifying this as honest behaviour (e.g. |
I was probably trying to say 'behavior rational from a validator node's perspective' rather than 'reasonable from the network's point of view'. I agree that dropping inbound connections is not an honest behavior, but if I were maintaining a node with a lot of validators and acting rationally, I would stick to some kind of conservative peer management policy, like peering with just a limited known subset of nodes or something similar |
Teku does this in a similar way to Lighthouse, but because there are so many nodes with closed ports, any node with open ports gets far more incoming connection attempts than it needs (or can reasonably be expected to accept). So even with a large difference between We've been recommending to users for a while that the best way to improve attestation performance is to ensure incoming connections work, because that not only lets you find peers faster, it also gives you a much wider selection of peers to choose from, since you will inevitably have more incoming requests than your node could possibly handle. |
Also in favour of the change. I'm just thinking about the attacks mentioned and also discovery efficiency. Just to clarify, anyone can still spin up a bunch of nodes, subscribe to a single subnet, and try to censor messages / cause havoc on a subnet. The change here (as I understand it) is that we would actively search for specific peer-ids rather than ENRs with subnet fields. The attack as it stands today would consist of just spinning up a bunch of nodes with ENRs saying you are subscribed to a subnet, and discovery would try to find those nodes. I think the change suggested here is strictly safer in that now you can't just modify an ENR field; you have to pre-generate a bunch of node-ids. I think the next interesting thing is whether you can eclipse a target node via discovery with your malicious nodes (that you've pre-generated). We usually search for random node-ids, which makes eclipsing an individual node hard provided there are sufficient good nodes in the DHT. Another thought I had which would make searching easier/more efficient but could potentially increase the risk of these eclipses. Currently, we select a random node-id and trawl through the closest nodes to this random id until we find enough that have an ENR matching a subnet. Instead, we could potentially do the following: Instead of having it based on |
One advantage of the current strategy is that while subnets are more concentrated, an attestation that is propagated there has a very high likelihood of making it into all the aggregates eventually broadcast. This is simply because only nodes with active validators are going to be subscribed to a subnet, so propagating an attestation to all nodes in the particular subnet requires a smaller number of hops. Shifting this to be dependent on a node's id would bring both staking and non-staking nodes into each subnet. If we expect non-staking nodes to eventually be 10 times the number of staking nodes, this is a non-trivial increase. An increase in the number of hops would affect whether a particular individual attestation can make it into the aggregate in time. This would definitely require more involved testing/simulations to see how it would affect the overall participation of the network. Things we would have to look out for: Also on another note to add, the |
One thing to note is that this will degrade on small-node-count networks (e.g. testnets and devnets). To counteract this -- I propose adding a configuration value (only updated at network upgrades) called So concretely:
|
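A back-of-the-envelope sketch of the degradation being described (constant names and values below are hypothetical stand-ins; the actual configuration value proposed above is not captured in this excerpt):

```python
ATTESTATION_SUBNET_COUNT = 64
SUBNETS_PER_NODE = 1  # hypothetical per-network knob that could be raised on small networks

def expected_backbone_size(total_nodes: int) -> float:
    # With node-id based subscriptions, each subnet is backed on average by
    # total_nodes * SUBNETS_PER_NODE / ATTESTATION_SUBNET_COUNT nodes, so a
    # 200-node devnet yields only ~3 backbone nodes per subnet unless the
    # per-node subscription count is raised for that network.
    return total_nodes * SUBNETS_PER_NODE / ATTESTATION_SUBNET_COUNT
```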
fortunately, gossip hops generally scale with log(N), where N is the size of the network. In expectation, this likely won't affect delivery times in a well-structured/distributed network
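As a rough illustration: with a gossipsub mesh degree of D = 8, flooding a subnet of N = 10,000 nodes takes on the order of log_8(10,000) ≈ 4.4 hops, and growing N tenfold adds only about one more hop.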
Agreed |
Why not adjust |
Is this such a terrible thing? Larger staking nodes are often highly specced, and will commonly have aggregating validators, so are both required to and capable of generating aggregate attestations in a timely and (rest of the network willing) relatively complete fashion. Given the makeup of aggregates in the average block at current, anything that would potentially reduce the effectiveness of aggregators would seem to be a retrograde step with potentially significant impact. If nodes could still join more subnets, as per the current This would potentially weaken some of the ideas above, such as being able to generate the set of subnets a given node subscribes to given its node ID, but I think would be overall net positive. |
I think this is a client implementation detail. Clients should be fine to manage their peers for the subnets they need ahead of time, and if there are enough peers on the subnets, should manage this gracefully. I would consider As you've pointed out, there are potential downsides to changing the network structure, but it's hard to estimate what these actually are in practice without large-scale tests. I think this change is valuable enough that we test how it performs on a devnet, then testnets etc. Just to make it explicit, there is quite a large burden on large validating nodes. Being subscribed to every subnet all of the time defeats the purpose of having subnets at all. They have to validate and propagate everything. It could be beneficial having a smaller number of nodes on each subnet, but having each node less restricted by CPU/bandwidth. In Lighthouse we'll keep the |
Agreed, Teku will also keep the Also given that validators are spread over 32 slots and 64 subnets, if you're running 2048 validators or more then you're probably subscribed to every subnet to support your aggregation duties anyway and that's probably true with even fewer validators since you'd likely subscribe a few slots before you need to aggregate. For small numbers of validators this likely has a bigger impact and reduces the load. Currently for each of the first 64 or so validators you wind up subscribed to an additional subnet, plus a potentially temporary subscription to aggregate. Whereas with this change that burden is reduced significantly. Potentially that makes it more feasible for people to stake from home with a few validators rather than having to move to the cloud for bandwidth. |
How so? There's a cap on the number of attestations / epoch in the system, and we've reached peak-attestation on mainnet already. |
Every validator attests every epoch so if you're subscribed to all individual subnets the more validators there are the more individual attestation gossip messages you have to handle. There is no peak-attestation, only peak-aggregate. |
Right, I got the two confused it seems. |
This idea can be extended to sync committee subnets too, but with some modification such that only 1 in every N validators has to subscribe. |
@djrtwo not actually sure I understand how DHT could be utilized here (with just discV5 in mind)? |
I'd recommend checking existing nodes first, but going to the DHT (discv5) to find appropriate peers would be the backup when existing/known peers don't meet the requirement. That said, it'd be good to pre-populate known peers with a diverse set of attnets. I might be confused by your comment @Ant, can you elaborate? |
I think this is something worth considering and is a way to use the discv5 DHT (or rather the kademlia XOR metric) to find nodes on subnets faster. i.e minimize the search overhead. |
Yeah, that makes sense to me. Is there any reason peer-id is beneficial over node-id? |
I'd like to test and model this. Anyone have the bandwidth to do either? |
I can write up a simulation at some point, verifying that the idea suggested above saves us discovery hops when traversing the DHT compared to randomly searching. I'll post results here. I imagine we probably need to adjust our peer management a bit. If each node we connect to is only subscribed to one long-lived subnet, we might want to increase our default peer count to handle all 64 subnets and not have too much peer churn. We probably need to agree on a function, perhaps something like:

```python
def compute_subnet_from_node_id(node_id, epoch):
    return ((node_id >> 250) + epoch // SUBNET_DURATION_IN_EPOCHS) % ATTESTATION_SUBNET_COUNT
```

Where the node_id is the 32-byte discv5 node id treated as a 256-bit integer. Anyway, I'll start a branch on Lighthouse, to make Lighthouse ready if someone has bandwidth to start a testnet. |
Was doing some experiments with discv5 recently (live, not simulated), and DHT searching is indeed much more effective than a random search. |
Just ran a number of experiments with DHT searching for nodes with an 8-bit prefix:
|
However, strictly binding a subnet to a node-id prefix looks more vulnerable to a sybil attack, i.e. an adversary may allocate a number of nodes with a specific prefix and would then always have a majority of nodes on some subnet across all persistence periods. The hybrid approach could be using a larger |
Thanks for the work @Nashatyrev. In my original comment, I was out by 2 bits. We only need to prefix the first 6 bits for 64 subnets. In regards to the sybil attack: currently, an attacker can create a bunch of nodes and set their ENRs to a specific subnet. We could slowly search through the DHT and find their nodes amongst others for a given subnet. In this new design, an attacker can create a bunch of node-ids that match a specific subnet, and discovery now finds them faster (but should also find honest nodes in the process). If we just have a I was imagining that when we do the search for peers, we fix the first 6 bits of the target node-id but randomize the remaining 250 bits. The search then should behave normally, in that the 250-bit random target makes us somewhat sybil resistant because we still find nodes closest to the random target (which is not known by an attacker). Essentially we've split the DHT into 64 sub-DHTs to search for peers on subnets. If we add entropy (I assume it's local) do we then forgo this property?:
The fast DHT lookup works by finding the first matching bits. In your hybrid approach I assume we're adding some extra random bits in there which change every persistence period or so right? So Sybils could still work just within the persistence period? |
I suppose the new design with just even a simple The slightly better version could be The original proposal with mapping
Isn't it equivalent to
Yep. We could use the original mapping (
UPD: just realized that an attacker may craft nodes with the same 12-bit prefix and always fall into a single subnet, but that could potentially be mitigated by forcing clients to randomly select nodes from all 64 prefixes corresponding to the target subnet |
Yes, but what I'm saying is: we want to make it easy for nodes to advertise what subnet they are on and make them easy to find, but we don't want an attacker to make a bunch of nodes such that a peer always connects to the same bunch of attacker nodes. So, assuming that on any given subnet there are some malicious nodes but also some honest nodes, if we randomize the 250 bits of the target we are likely to get a mix of honest/malicious nodes; i.e. if we didn't randomize the target, an attacker could just create a bunch of nodes closest to the fixed target and all nodes would always connect to all malicious nodes. This is how it currently works when we find peers now. We search a random target and try to find eth2 nodes. A malicious user can spin up many eth2 nodes and try to eclipse us, but the random target makes this difficult. I'm proposing we leave this logic as is, and by prefixing the node-id we sub-divide the DHT into peers that should be subscribed to individual subnets. If we need to find a peer on any given subnet, it's easy to do. If an attacker wants to try to eclipse a subnet, they can create lots of nodes on a prefixed id; that makes us more vulnerable only in that we've reduced the DHT to 1/64th of its size and there may be fewer honest nodes with that prefix, otherwise the sybil resistance of discovery remains intact. The epoch shuffling will make the malicious nodes change subnets every persistence period; the difference with throwing in a PRF is just that it's not predictable, right? I imagine it's not hard to generate a bunch of keys for each subnet and switch between them every persistence period. I think my preference is to keep it relatively simple and just prefix the first 6 bits. I could be misunderstanding the benefit of adding the PRF apart from making the mapping unknown in advance of a persistence period. I assume all nodes need to have the same PRF, so a local node, given its peer-id, knows which subnet it is supposed to subscribe to |
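A minimal sketch of that target-generation step, assuming a 256-bit discv5 node-id space and a 6-bit subnet prefix (the function name is illustrative, not an existing discv5 API):

```python
import secrets

ATTESTATION_SUBNET_COUNT = 64  # 2**6 subnets
PREFIX_BITS = 6                # bits needed to encode the subnet

def random_target_for_subnet(subnet_id: int) -> int:
    # 256-bit lookup target: fixed 6-bit subnet prefix, randomized 250-bit tail.
    assert 0 <= subnet_id < ATTESTATION_SUBNET_COUNT
    random_tail = secrets.randbits(256 - PREFIX_BITS)
    return (subnet_id << (256 - PREFIX_BITS)) | random_tail

# An ordinary Kademlia FINDNODE walk toward this target then lands among
# node-ids sharing the subnet's 6-bit prefix, while the random tail keeps the
# exact lookup path unpredictable to an attacker.
```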
To move this along I propose the following function for computing subnets:

```python
def compute_subnet(node_id, epoch):
    return [
        ((node_id >> (256 - ceil(log2(ATTESTATION_SUBNET_COUNT))))
         + epoch // SUBNET_DURATION_IN_EPOCHS + x) % ATTESTATION_SUBNET_COUNT
        for x in range(SUBNETS_PER_NODE)
    ]
```

I had a look at a number of functions that attempt to stagger or slow the transition, and also ones that mix in randao. In the end, I'm proposing the simplest possible function for now, that we may iterate on in the future if we want additional features. Some reasoning:
|
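A quick worked example of the rotation behaviour of the proposed function (a sketch assuming ATTESTATION_SUBNET_COUNT = 64, SUBNETS_PER_NODE = 2, and an illustrative SUBNET_DURATION_IN_EPOCHS):

```python
from math import ceil, log2

ATTESTATION_SUBNET_COUNT = 64
SUBNETS_PER_NODE = 2             # assumed for illustration
SUBNET_DURATION_IN_EPOCHS = 256  # assumed for illustration

def compute_subnet(node_id, epoch):
    prefix = node_id >> (256 - ceil(log2(ATTESTATION_SUBNET_COUNT)))
    return [(prefix + epoch // SUBNET_DURATION_IN_EPOCHS + x) % ATTESTATION_SUBNET_COUNT
            for x in range(SUBNETS_PER_NODE)]

node_id = 42 << 250  # a node whose top 6 node-id bits equal 42
assert compute_subnet(node_id, 3 * SUBNET_DURATION_IN_EPOCHS) == [45, 46]
assert compute_subnet(node_id, 4 * SUBNET_DURATION_IN_EPOCHS) == [46, 47]
# Subscriptions drift forward by one subnet per period instead of reshuffling at random.
```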
@AgeManning thanks for starting the work on the function! Below I'm trying to modify it to add:

- using the existing permutation function

Staggering needs more thinking. I didn't come up with any simple solution yet.

```python
ATTESTATION_SUBNET_EXTRA_BITS = 6
ATTESTATION_SUBNET_PREFIX_BITS = ceil(log2(ATTESTATION_SUBNET_COUNT)) + ATTESTATION_SUBNET_EXTRA_BITS

def compute_subnet(node_id, epoch, index):
    node_id_prefix = node_id >> (256 - ATTESTATION_SUBNET_PREFIX_BITS)
    permutation_seed = hash(uint_to_bytes(epoch // SUBNET_DURATION_IN_EPOCHS))
    permutated_prefix = compute_shuffled_index(node_id_prefix, 1 << ATTESTATION_SUBNET_PREFIX_BITS, permutation_seed)
    return (permutated_prefix + index) % ATTESTATION_SUBNET_COUNT

def compute_subnets(node_id, epoch):
    return [compute_subnet(node_id, epoch, idx) for idx in range(SUBNETS_PER_NODE)]
```
|
This looks like an improvement to me. I'll implement this in Lighthouse so that we can do some initial testing to see how this looks. As we discussed, for simplicity in the early testing, I'll set |
Attnets revamp
Since the launch of the beacon chain, the "backbone" of attestation subnets (attnets) relies upon staking nodes connecting to a random set of subnets. The size of this random set is dictated by the quantity of validators attached to the node, up to the maximum of 64 (`ATTESTATION_SUBNET_COUNT`, which maps to the expected `SHARD_COUNT`). The general idea at genesis was that a node's validator requirements would scale linearly when sharding is released, thus we could treat this linear subnet requirement for subnet backbones as an "honesty" assumption until sharding comes around.

An attestation subnet backbone is required so that at each epoch, validators can quickly find and publish to their assigned subnet. If there were no notion of persistence in these subnets, then there would be no subnet to "find" in the ~6 minute window and thus no clear place to publish individual attestations before they are aggregated.
Backbone requirements:
Problems
There are a few issues with the current structure:
Proposed solution
In an effort to solve the above issues, we propose subscribing to subnets as a function of `node_id, epoch`.
Rather than putting the backbone requirement on a brittle validator-honesty assumption, this puts the backbone requirement on the entire network of full nodes such that, on average, one out of every `ATTESTATION_SUBNET_COUNT` nodes will be on a particular subnet.

This means that the size of subnets becomes a function of the total number of nodes on the network, rather than of the staking node count combined with validator-node density. We thus simultaneously reduce the over-population of attnets by a small number of staking nodes and ensure that even if the Ethereum network (and DHT) grows by orders of magnitude, attnets will still be able to be found within a few hops.
Additionally, because the requisite subnet subscription is a function of a node's node-id, honesty/dishonesty wrt the attnet backbone can be deterministically assessed, allowing for peer downscoring and disconnects of dishonest peers.

The downside to this approach is that it puts a minimum of one attnet of load on every node rather than just on staking nodes, but in our estimation this is not a very high burden wrt home node resources, and it provides far more benefit in meeting the backbone requirements than it costs.
Concrete spec mods
Remove random subscriptions

- Remove `RANDOM_SUBNETS_PER_VALIDATOR` from the validator guide
- Remove `attnets` from the ENR and MetaData in the p2p spec

Add node-id subscription

- Add `compute_subnet_from_node_id(node_id) -> uint64` that takes in a node_id and returns a value on `[0, ATTESTATION_SUBNET_COUNT)`. Consider an `epoch` param that causes these subscriptions to slowly rotate on the order of ~1 day
- Subscribe to the subnet given by `compute_subnet_from_node_id`
- Replace the `attnets` DHT search with a `compute_subnet_from_node_id` DHT search

Strategy
We'd likely want to simulate and test this change in strategy in a controlled environment before pushing this to testnets and then mainnet.
Such a controlled environment to test gossipsub at scale seems critical to a number of the network optimization investigations underway.
Protocol Labs' Testground could be a good candidate. Alternatively, another simulation framework, or even spinning up 1k+ node distributed networks for ~1 day tests, could also be a viable path.
EDITED TO USE `node_id` instead of `peer_id` per discussions below