Heal raft cluster #4685

corylanou · 2015-11-05T22:09:42Z

This PR allows a node to be promoted to a raft node when needed to fully heal the raft cluster.

Add tests
Add Changelog

corylanou · 2015-11-06T00:33:24Z

meta/rpc.go

+	}
+
+	if sz == 0 {
+		return internal.RPCType_Error, nil, fmt.Errorf("invalid message size: %d", sz)


@jwilder this originally returned 0, which didn't seem valid, so I updated it to return internal.RPCType_Error.

jwilder · 2015-11-06T16:09:11Z

Needs a changelog and just some minor naming comments, but 👍.

otoolep · 2015-11-10T00:12:58Z

meta/rpc.go

@@ -63,6 +65,8 @@ func (r *rpc) proxyLeader(conn *net.TCPConn) {
 	defer leaderConn.Close()

 	leaderConn.Write([]byte{MuxRPCHeader})
+	// re-write the original message to the leader
+	leaderConn.Write(buf)


What's going on here? Was the red blob removed below moved into this function?

We used to always forward everything to the leader. Now, depending on the message, it may not be intended for the leader. When promoting a new raft node, we explicitly don't want to talk to the leader; we want to talk to the node that needs to be promoted into the raft cluster. The read was refactored into two other methods to make it easier to inspect the incoming message. Then, if we do want to talk to the leader, we have to re-write the original message, which is what we are doing above.

otoolep · 2015-11-10T00:19:21Z

This is a pretty significant feature that some people may not want. I can envision unstable clusters where nodes come and go, and where auto-promotion makes it harder to understand what is happening.

So can I suggest a config option to allow this to be disabled? In the meta section?

https://github.com/influxdb/influxdb/blob/master/etc/config.sample.toml#L27

raft-promotion-disable=false

for example.

corylanou · 2015-11-10T00:41:57Z

@otoolep I agree about the config setting to disable this feature. I'll add that.

I'll do it as raft-promotion-enabled and make it on by default.
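As a rough sketch of what an on-by-default setting could look like in Go, assuming a trimmed-down config struct: the type `MetaConfig`, the constructor `NewMetaConfig`, and the helper `shouldPromote` are hypothetical names for illustration, and only the key name `raft-promotion-enabled` comes from the discussion above.

```go
package main

import "fmt"

// MetaConfig is a hypothetical, reduced stand-in for the [meta] config section.
type MetaConfig struct {
	// RaftPromotionEnabled gates automatic promotion of a node to a raft peer.
	// The toml key follows the name proposed in this thread.
	RaftPromotionEnabled bool `toml:"raft-promotion-enabled"`
}

// NewMetaConfig returns the defaults; promotion is on unless explicitly disabled.
func NewMetaConfig() MetaConfig {
	return MetaConfig{RaftPromotionEnabled: true}
}

// shouldPromote combines the setting with whether a peer gap was detected.
func shouldPromote(c MetaConfig, gapInPeers bool) bool {
	return c.RaftPromotionEnabled && gapInPeers
}

func main() {
	c := NewMetaConfig()
	fmt.Println("default, gap present:", shouldPromote(c, true))

	c.RaftPromotionEnabled = false // as if set to false in the config file
	fmt.Println("disabled, gap present:", shouldPromote(c, true))
}
```

Naming the option positively ("enabled") and defaulting it to true avoids the double negative of `raft-promotion-disable=false`.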

corylanou · 2015-11-10T15:23:18Z

meta/store.go

+	if err := s.rpc.enableRaft(n.Host, peers); err != nil {
+		return fmt.Errorf("error notifying raft peer: %s", err)
+	}
+	s.Logger.Printf("promoted nodeID %d, host %s to raft peer", n.ID, n.Host)


@otoolep added log entry for successful promotion of node to raft peer.

otoolep · 2015-11-11T00:43:08Z

CHANGELOG.md

@@ -35,6 +35,7 @@
 - [#4721](https://github.com/influxdb/influxdb/pull/4721): Export tsdb.InterfaceValues
 - [#4681](https://github.com/influxdb/influxdb/pull/4681): Increase default buffer size for collectd and graphite listeners
 - [#4659](https://github.com/influxdb/influxdb/pull/4659): Support IF EXISTS for DROP DATABASE
+- [#4685](https://github.com/influxdb/influxdb/pull/4685): Heal Raft Cluster


FWIW, I think this could be a bit more descriptive. E.g. "Cluster auto-promotes data node to Raft nodes on loss of Raft node". Just a suggestion.

This only promotes a raft node when it sees a gap in the peers. It can only see a gap in the peers if someone issues a DROP SERVER.

The type of healing you are referring to is beyond the scope of this PR, and I'm not sure whether we would actually want to take those steps.
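The "gap in the peers" check can be sketched as follows. This is an illustration under stated assumptions, not the logic in meta/store.go: the `node` type, the `promotionCandidate` helper, the target peer count passed as `want`, and the "first non-peer node wins" selection policy are all hypothetical.

```go
package main

import "fmt"

// node is a hypothetical, reduced view of a cluster member.
type node struct {
	ID   uint64
	Host string
}

// promotionCandidate returns a node to promote when the raft peer list is
// shorter than the desired size. It returns false when there is no gap or
// when every known node is already a peer.
func promotionCandidate(nodes []node, peers []string, want int) (node, bool) {
	if len(peers) >= want {
		return node{}, false // no gap, nothing to heal
	}
	isPeer := make(map[string]bool, len(peers))
	for _, p := range peers {
		isPeer[p] = true
	}
	for _, n := range nodes {
		if !isPeer[n.Host] {
			return n, true // assumed policy: pick the first non-peer node
		}
	}
	return node{}, false
}

func main() {
	nodes := []node{{1, "a:8088"}, {2, "b:8088"}, {4, "d:8088"}}
	// Peer "c:8088" was removed with DROP SERVER, leaving two of three peers.
	peers := []string{"a:8088", "b:8088"}

	if n, ok := promotionCandidate(nodes, peers, 3); ok {
		fmt.Printf("promote nodeID %d, host %s\n", n.ID, n.Host)
	}
}
```

Because the peer list only shrinks when a server is dropped explicitly, a network partition alone leaves `len(peers)` unchanged and no candidate is selected.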

otoolep · 2015-11-11T00:49:18Z

@corylanou -- what will happen in the following scenario?

1. 3 Raft nodes, 1 data node.
2. A network partition causes 1 of the Raft nodes to be disconnected from the cluster.
3. The data node is promoted.
4. The partition heals.
5. The disconnected Raft node can now contact the cluster again.

What is the status of the Raft node that comes back? What if it has data on it, that was solely replicated on that node? Will that data be queryable?

corylanou · 2015-11-11T02:47:39Z

If a partition happens, we don't change any of the peers. You have to manually issue a DROP SERVER before another node will promote itself.

otoolep · 2015-11-11T02:52:12Z

Ah, OK, this only happens in response to DROP NODE? OK, makes sense.

otoolep · 2015-11-11T16:55:40Z

+1


corylanou added the 2 - Working label Nov 5, 2015

corylanou reviewed Nov 6, 2015

corylanou force-pushed the heal-raft-cluster branch from 2e2766c to cc74eed Compare November 6, 2015 18:14

otoolep reviewed Nov 10, 2015

corylanou force-pushed the heal-raft-cluster branch from cc74eed to 9f8e014 Compare November 10, 2015 14:39

corylanou reviewed Nov 10, 2015

corylanou added review-pair review and removed 2 - Working labels Nov 10, 2015

corylanou force-pushed the heal-raft-cluster branch from 115e368 to da32c82 Compare November 10, 2015 21:30

otoolep reviewed Nov 11, 2015

corylanou added 6 commits November 11, 2015 10:04

Automatically promote node to raft if needed

70e1a83

make raft self healing

8a8564e

update changelog

7fb0f90

tweaks based on pr feedback

c91c6c9

tweaks based on pr review. added config for raft self healing

3912c71

test auto replace raft node

b2ed141

better changelog description

4187fbb

corylanou force-pushed the heal-raft-cluster branch from da32c82 to 4187fbb Compare November 11, 2015 16:05

corylanou added 2 commits November 11, 2015 10:58

update sample config (was missing cluster-tracing as well).

615024a

clarify config comment

a4c54cb

corylanou added a commit that referenced this pull request Nov 11, 2015

Merge pull request #4685 from influxdb/heal-raft-cluster

8ec4d04

Heal raft cluster

corylanou merged commit 8ec4d04 into master Nov 11, 2015

corylanou removed the review label Nov 11, 2015

corylanou deleted the heal-raft-cluster branch November 11, 2015 17:44
