Merge partitioned clusters #42
Comments
I'd posit that such logic is very similar to airlock/prober style logic. On Tue, Mar 3, 2015 at 10:33 PM, Jeff Wolski notifications@github.com
|
I don't see how this is related to airlock behavior. This problem occurs at the gossip/membership level. If the network partitions, members of the original cluster will eventually converge on their own side of the partition. Once the partition heals, nothing within ringpop will attempt to merge both halves. Serf has a mechanism by which nodes attempt to rejoin faulty members. We'll need similar behavior. Excuse the brevity of the original description.
|
What I meant was:
|
This makes more sense now. Yes, the approach may very well be similar. Ringpop currently chooses whom to ping during its protocol period based on status alone, but it could easily take into account status + last health probe and flip a faulty member back to alive as the result of a valid response.
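To make the "status + last health probe" idea concrete, here is a minimal sketch in Python. It is not ringpop's code: `Member`, `probe_faulty_members`, `probe_interval`, and the `ping` callback are all hypothetical names made up for illustration, and the only assumption is that each member record carries a status and the time it was last probed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

ALIVE = "alive"
FAULTY = "faulty"


@dataclass
class Member:
    # Hypothetical membership record, not ringpop's actual structure.
    address: str
    status: str = ALIVE
    last_probe: float = 0.0  # time of the most recent health probe


def probe_faulty_members(
    members: Dict[str, Member],
    ping: Callable[[str], bool],
    now: float,
    probe_interval: float,
) -> List[str]:
    """Re-ping faulty members that have not been probed recently.

    A member that answers the ping is flipped back to alive.
    Returns the addresses that were revived.
    """
    revived = []
    for member in members.values():
        if member.status != FAULTY:
            continue  # only faulty members are candidates for revival
        if now - member.last_probe < probe_interval:
            continue  # probed too recently, skip this round
        member.last_probe = now
        if ping(member.address):
            member.status = ALIVE
            revived.append(member.address)
    return revived
```

Calling this with a `ping` that only succeeds for reachable hosts leaves unreachable members faulty and simply updates their `last_probe`, so they are retried after the next `probe_interval`.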
|
One potential issue with using the primary protocol for this is that hosts that are down might always run up against the timeout. So if you lose 10 out of 100 nodes, you might end up waiting for 10 ping timeouts before advancing. I think it would be better to have a second protocol with a slower period to attempt to revive any previously up nodes. |
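A rough back-of-the-envelope version of that cost, as a sketch. The model is an assumption: one member is pinged per protocol period, round-robin, and every down member costs a full ping timeout instead of a normal period. The function name and the 200 ms / 1500 ms figures are illustrative only, not ringpop's actual settings.

```python
def round_duration_ms(total: int, down: int, period_ms: int, timeout_ms: int) -> int:
    """Time for one round-robin pass over all members.

    Assumes each healthy ping fits in one protocol period and each ping to
    a down member blocks for the full timeout (or a period, if longer).
    """
    up = total - down
    return up * period_ms + down * max(period_ms, timeout_ms)


# Illustrative numbers only: 100 members, 200 ms period, 1500 ms timeout.
healthy = round_duration_ms(100, 0, 200, 1500)
degraded = round_duration_ms(100, 10, 200, 1500)
```

Under these assumed numbers, losing 10 of 100 nodes stretches a full pass from 20000 ms to 33000 ms if the down nodes stay in the primary rotation.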
@mranney Yep, good point. From the sounds of Hashicorp's presentation, they also maintain a separate faulty member loop. I thought it'd be nice/clean to fit this into the membership iterator used during normal protocol period operation, but delaying the pings because of faulty/slow members would suck. |
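A minimal sketch of the separate-loop idea, simulated with ticks rather than real timers so it stays deterministic. Everything here is hypothetical (`run_ticks`, `revive_every`, `is_reachable`); it only illustrates that the primary iterator never touches faulty members, while a slower pass retries them one at a time.

```python
from typing import Callable, List, Tuple


def run_ticks(
    alive: List[str],
    faulty: List[str],
    ticks: int,
    revive_every: int,
    is_reachable: Callable[[str], bool],
) -> Tuple[List[str], List[str], List[str], List[str]]:
    """Simulate `ticks` protocol periods.

    Every tick, the primary loop pings one alive member round-robin.
    Every `revive_every` ticks, a separate pass re-pings one faulty member
    and moves it back to the alive list if it responds.
    """
    alive, faulty = list(alive), list(faulty)
    main_pings, revive_pings = [], []
    for tick in range(1, ticks + 1):
        if alive:
            # Primary protocol period: faulty members are never selected here.
            main_pings.append(alive[(tick - 1) % len(alive)])
        if faulty and tick % revive_every == 0:
            # Slower secondary loop: retry the oldest faulty member.
            candidate = faulty.pop(0)
            revive_pings.append(candidate)
            if is_reachable(candidate):
                alive.append(candidate)
            else:
                faulty.append(candidate)  # requeue for a later attempt
    return alive, faulty, main_pings, revive_pings
```

In a real implementation the secondary pass would run on its own timer, so a ping timeout there never delays the primary protocol period.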
Ringpop should be able to handle network partitions that temporarily cause two or more clusters to form. A ringpop instance never attempts to rejoin faulty members, but it could, since we keep faulty members in the membership list.