Initial version of the polite-grandpa networking protocol #2110
Conversation
Have you considered replacing the sharing of the validator state between grandpa and network with messaging, which could fit into how the other capabilities of the network service are made available (for example `messages_for`)?
```rust
/// Bridge between the underlying network service, gossiping consensus messages and Grandpa
pub(crate) struct NetworkBridge<B: BlockT, N: Network<B>> {
	service: N,
	validator: Arc<GossipValidator<B>>,
```
If I understand it correctly, one could avoid sharing the validator itself with grandpa via the network bridge; instead, all the operations below on `self.validator` could be forwarded to the validator held by the network via a message sent by the service.

For example, `global_communication` consists of:
1. `self.validator.note_set`
2. `let incoming = self.service.messages_for`
3. doing something with `incoming`.

Here `messages_for` is implemented via messaging to the network thread, so one could also do `validator.note_set` via a message on the same channel, and the ordering would be guaranteed.
One could, for example, do something similar to what is done with `register_validator`, like:

```rust
self.with_gossip(
	move |gossip, context| {
		let validator = gossip.get_validator(GRANDPA_ENGINE_ID);
		validator.note_set(set_id, &context); // passing context to do the multicast
	}
)
```
Main benefit would be being able to get rid of the `inner: parking_lot::RwLock<Inner<Block>>`, instead just putting the various members currently on `Inner` directly on `GossipValidator`, and only mutating those in response to messages received by the network/protocol thread.
> Main benefit would be being able to get rid of the `inner: parking_lot::RwLock<Inner<Block>>`

I'm not convinced that is a benefit, if the alternative is more costly.
@demimarie-parity the communication to the network (things like `with_gossip`) is implemented with a crossbeam-channel.

I wonder if benchmarks are required to make this call; I think we can reason about the problem logically and come to the conclusion that we are unnecessarily creating contention on shared resources (which don't have to be shared).
An example: when grandpa does a `round_communication`, it will start with `self.validator.note_round`. `note_round` will lock `inner`, do some setup in the current thread (or rather the grandpa task) and then result in sending a message to the network thread (via `with_gossip`, which is what is used as part of `multicast_neighbor_packet`). Later, still as part of `round_communication`, grandpa will do a `messages_for`, which again involves sending a message to the network thread.

Logically, grandpa is communicating with the network, and with the validator that lives in the network. It just so happens that one part of that communication is implemented via shared state. Sharing the state of the validator with grandpa does not appear inherent to the logical operations that are performed. In fact, the state remains hidden from grandpa, which just calls methods on the network bridge. What we're doing now is essentially "outsourcing" state transitions to occur on the grandpa thread/task that logically would appear to belong in the network thread.
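The alternative described above can be sketched with a plain std channel. This is a toy model, not the real substrate API: `GossipCommand`, `ValidatorState`, and the field names are all hypothetical stand-ins for `note_round`/`note_set` and the state currently behind the `RwLock`. The point is that the network thread becomes the sole owner and mutator of the state, so no lock is needed.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical commands standing in for `note_round` / `note_set` calls;
// the names mirror the discussion but are not the real API.
enum GossipCommand {
    NoteRound(u64),
    NoteSet(u64),
}

// Validator state owned exclusively by the network thread: no RwLock,
// because every mutation arrives as a message on a single channel.
struct ValidatorState {
    round: u64,
    set_id: u64,
}

impl ValidatorState {
    fn apply(&mut self, cmd: GossipCommand) {
        match cmd {
            GossipCommand::NoteRound(r) => self.round = r,
            GossipCommand::NoteSet(s) => self.set_id = s,
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();

    // "Network thread": sole owner and mutator of the validator state.
    let network = thread::spawn(move || {
        let mut state = ValidatorState { round: 0, set_id: 0 };
        for cmd in rx {
            state.apply(cmd);
        }
        (state.round, state.set_id)
    });

    // "Grandpa task": sends state transitions instead of locking shared
    // state. Ordering is guaranteed because both commands travel on the
    // same channel.
    tx.send(GossipCommand::NoteSet(5)).unwrap();
    tx.send(GossipCommand::NoteRound(42)).unwrap();
    drop(tx); // close the channel so the network thread's loop ends

    let (round, set_id) = network.join().unwrap();
    println!("round={} set={}", round, set_id);
}
```

Because `note_set` and `messages_for`-style requests would travel over the same channel, their relative ordering on the network thread matches the order grandpa issued them, which is the ordering guarantee mentioned above.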
Another example: `note_set` will attempt to acquire a write lock on `inner`, and if the network is doing a `propagate` at the same time, holding a read lock while propagating a number of gossip messages, you risk contention (and if it's grandpa that ends up having to wait for the lock to become available, you run into the issue that this can block other tasks running on the same thread).

Any `propagate` that happens concurrently with a `note_set` should not be influenced by it, so what we have now is unnecessary contention. With message passing, the `note_set` will result in a message sent to network/gossip, which will handle it some time later, without influencing an ongoing `propagate` (although it will have an effect on a `propagate` happening after that message has been handled, which gives us, in my opinion, an easy-to-reason-about transition of state).

As a counter-example, I can imagine a piece of data being key to logical operations on both sides. In that case, if the data was not shared, the component without access to it would probably have to send messages containing "acknowledgement channels" and then wait on the acknowledgement or response. That would point to the need for sharing the data and locking it.
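The "acknowledgement channel" pattern in the counter-example can be sketched as a request that carries its own reply channel. All names here (`Query`, `query_network`, `best_seen`) are illustrative, not real substrate types; the sketch shows the synchronous round trip that would argue for sharing the data instead.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical request carrying its own reply channel -- the
// "acknowledgement channel" pattern mentioned above.
struct Query {
    reply: mpsc::Sender<u64>,
}

fn query_network(tx: &mpsc::Sender<Query>) -> u64 {
    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send(Query { reply: reply_tx }).unwrap();
    // The caller blocks on the round trip: this is the cost that would
    // point to sharing (and locking) the data instead.
    reply_rx.recv().unwrap()
}

fn main() {
    let (tx, rx) = mpsc::channel::<Query>();

    // Network thread owns the piece of data both sides need.
    let network = thread::spawn(move || {
        let best_seen = 99u64;
        for q in rx {
            q.reply.send(best_seen).unwrap();
        }
    });

    let answer = query_network(&tx);
    println!("answer={}", answer);

    drop(tx); // let the network thread's loop terminate
    network.join().unwrap();
}
```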
Locks make it a lot easier to avoid races -- I have a feeling we already have some subtle races in the networking code that's been ported to use channels (for instance, the test network doesn't always sync anymore...) and I'm not too eager to introduce more. For code that's not in a bottlenecking path, that kind of rewrite adds much more code complexity (hours of dev time in design, review, debugging) for a pretty nebulous gain.
Unless there's an argument other than "locks are bad, avoid at all costs", then I don't see any reason to make that change.
The general policy in this codebase is to not block PRs that are strictly better than what's currently implemented on improvements that could be made to them. Those improvements should be filed as separate issues, and this one needs a substantial argument as to why it improves performance in a meaningful way that justifies the dev time spent on it. There are probably 20 other optimizations we've already filed that give 10-100x bigger performance wins.
Parking lot RwLocks are only a single atomic operation in the uncontended case, and I don't see why this lock would be contended unless under a sustained, multi-peer DoS. Crossbeam channels are fast, but even in the "minimal" case they still use more atomic operations (which cause cache flushes and other performance implications) than uncontended locks. The additional round trips, cloning to deal with `Send + 'static`, etc. probably add more of a cost in the end.
> Locks make it a lot easier to avoid races

The re-design I describe above would mean that state would not be shared among threads anymore, which can play a role in avoiding races as well. Weird ordering of messages is, in my opinion, much easier to debug: one can just add a `println` at the point of message handling. In my experience it can be harder to figure out when something in an `Arc` is unexpectedly mutated.
> we already have some subtle races in the networking code that's been ported to use channels

I don't think we should rely on locks across threads in the implementation of business logic. If removing locks exposes races, we can fix those races and still end up with async code; it's an opportunity to improve the situation.
> Parking lot RwLocks are only a single atomic operation in the uncontended case, and I don't see why this lock would be contended unless under a sustained, multi-peer DoS.

Ok, then let me provide an example of something I don't think requires DoS for it to result in contention: in `propagate` a read lock is acquired and held for the duration of the iteration over messages to be propagated. Regarding the fast uncontended case (or the efficient spinning for a few rounds and trying to acquire again before parking) for `parking_lot`, I think you're more likely to hit it when you do quick "one operation" acquire/release of the lock, as opposed to making that entire loop a critical section.
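The difference between the two critical-section shapes can be sketched as follows. This uses `std::sync::RwLock` in place of `parking_lot` and `String` payloads in place of real gossip messages; `Gossip` and both method names are made up for illustration.

```rust
use std::sync::RwLock;

// Toy stand-in for the gossip state; not the real type.
struct Gossip {
    messages: RwLock<Vec<String>>,
}

impl Gossip {
    // Long critical section: the read lock is held across the whole
    // iteration, so a writer (e.g. a concurrent `note_set`) must wait
    // for the entire loop to finish.
    fn propagate_locked(&self) -> usize {
        let guard = self.messages.read().unwrap();
        guard.iter().map(|m| m.len()).sum()
    }

    // Short critical section: snapshot under the lock, release it
    // immediately, then iterate. Trades one clone for far less time
    // holding the lock -- the quick "one operation" acquire/release
    // discussed above.
    fn propagate_snapshot(&self) -> usize {
        let snapshot: Vec<String> = self.messages.read().unwrap().clone();
        snapshot.iter().map(|m| m.len()).sum()
    }
}

fn main() {
    let gossip = Gossip {
        messages: RwLock::new(vec!["vote".into(), "commit".into()]),
    };
    println!("{} {}", gossip.propagate_locked(), gossip.propagate_snapshot());
}
```

Whether the clone is worth it depends on message sizes and how often writers actually collide with `propagate`; the sketch only illustrates the shape of the trade-off.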
> There are probably 20 other optimizations we've already filed that give 10-100x bigger performance wins.

None of my arguments above relate strongly to performance; rather, it's about the design of the system.

> The general policy in this codebase is to not block PRs that are strictly better than what's currently implemented on improvements that could be made to them.

I don't intend to "block" this PR, I would just like to point out an alternative design. We can discuss further on another occasion.
lgtm
```rust
// not with lock held!
for topic in broadcast_topics {
	// TODO: only to this peer.
```
You can use `send_topic` now.
(added looksgood because it has been reviewed well -- it should have been
* Consensus status packet
* Allow for repropagation after status
* More generic gossip
* add a basic view struct and gossip module
* move gossip stuff to the gossip module
* integrate view into gossip
* some reshuffling
* alter rules for keeping one commit at a time in view
* Allow sending addressed messages
* don't cast outgoing votes if we know that we voted before
* Handle one hop messages
* initial run at polite grandpa
* build WASM
* handle neighbor messages
* refactor validator's internals into an Inner struct
* gossip only knows to keep or discard messages. optimize should_send_to
* Periodic rebroadcast
* implement `should_send_to` and message_expired
* track peers' best received commit height
* Pass peer id to topic stream
* kill rebroadcasting network
* Notify about existing peers
* clean up network APIs a bunch
* implement gossip::send_message for direct messages
* refactor network trait
* implement gossip::send_message for direct messages
* get all non set-change tests passing
* treat unknown rebroadcasts as broadcasts
* get all other main tests passing
* remove unimplemented test
* everything compiles
* treat unknown rebroadcasts as broadcasts
* Rebroadcast interval
* Apply suggestions from code review (Style; Co-Authored-By: arkpar <arkady.paronyan@gmail.com>)
* Style
* some module docs
* address some grumbles + docs
* allow rebroadcast every few minutes
* send_topic && generic context
* some tests for view change
* more grumbles & tests
* use send_peer
Based on #1988
Closes #366

Adds a `GossipValidator` for GRANDPA messages to implement a gossip protocol where peers only send each other messages that are needed to reach consensus. Docs on how it works are in `communication/mod.rs` and `communication/gossip.rs`.

TODO: