Distributed Query/Clustering Fixes #2353
Conversation
2e5e708 to 2a03ac5
t.Parallel()
testName := "3-node server integration partial replication"
testName := "6-node server integration partial replication"
6 is not really a valid number for a cluster. Best practice is to use odd numbers. Why 6?
Why change it at all? Would 5 or 7 be better?
What we are trying to do is get the server to create more than one shard per shard group, to test the full range of possibilities across writing and reading series data.
From what we could tell, the number of shards a shard group creates is calculated as the number of data nodes divided by the replication factor, so we needed a configuration that exercises that. We should probably make this a 5-node cluster with a replication factor of 2.
Yes, a cluster of 5 nodes with a replication factor of 2 would be better. That should result in shard groups of 2 shards each, which will then result in 4 actual shards on the cluster.
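For illustration, here is a minimal sketch of the sizing arithmetic described above (`shardsPerGroup` is a hypothetical helper for this comment, not InfluxDB's actual code):

```go
package main

import "fmt"

// shardsPerGroup sketches the sizing rule discussed above: the number of
// shards in a shard group is roughly data nodes / replication factor.
// Illustrative arithmetic only, not the server's actual implementation.
func shardsPerGroup(dataNodes, replicationFactor int) int {
	return dataNodes / replicationFactor
}

func main() {
	nodes, rf := 5, 2
	perGroup := shardsPerGroup(nodes, rf) // 5 / 2 = 2 shards per shard group
	total := perGroup * rf                // each shard has rf copies, so 4 shards on the cluster
	fmt.Printf("%d shards per group, %d shards total\n", perGroup, total)
}
```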
Some of these changes look really important. I'm not ready to +1 it yet, however, as I have some questions.
@@ -642,19 +642,8 @@ func (cmd *RunCommand) openServer(joinURLs []url.URL) *influxdb.Server {
// Give brokers time to elect a leader if entire cluster is being restarted.
time.Sleep(1 * time.Second)

if s.ID() == 0 && s.Index() == 0 {
So why wasn't this condition correct? Why was also requiring `Index()` to be 0 wrong?
From what I understand, after you open your own meta store you read your ID from it. If your ID isn't 0, your index certainly shouldn't be either, since you have been part of a cluster before.
If I understand this correctly, sometimes `s.ID()` was 0 but `s.Index()` was not, hence the bug?
Yes. Index was non-zero.
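For clarity, here is a rough sketch of the join decision being discussed. `shouldJoin` is a hypothetical stand-in, not the exact patch in `openServer`; it only illustrates keying the decision off the ID and dropping the `Index()` requirement:

```go
package main

import "fmt"

// shouldJoin sketches the join decision from this thread: a node that has
// never been assigned a server ID should attempt to join an existing
// cluster, even if its local meta-store index is non-zero.
func shouldJoin(id, index uint64) bool {
	// The buggy check was effectively id == 0 && index == 0: a node with a
	// non-zero index but no assigned ID would skip the join and end up
	// posing as server ID 1.
	return id == 0
}

func main() {
	fmt.Println(shouldJoin(0, 7)) // true: no ID assigned yet, join despite the non-zero index
	fmt.Println(shouldJoin(3, 7)) // false: already a member of a cluster
}
```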
Reviewing this now.
t.Parallel()
testName := "3-node server integration partial replication"
testName := "6-node server integration partial replication"
`testName` is not quite right now: 6 -> 5.
Changes look good. I have some minor comments you might wish to address, but I don't need to re-review. Now that I understand the why, +1. Nice work.
Don't forget the changelog.
New data nodes would never actually join the cluster. They would pose as server ID 1 in a cluster.
Drop database did not close any open shard files or close any topic readers/heartbeats. In the tests, we create and drop new databases during each test run, so these open files and connections slowed things down and consumed a lot of RAM as the tests progressed.
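A hedged sketch of the cleanup this describes. The `dropDatabase`, `shard`, and `topicReader` types below are illustrative stand-ins, not the server's real internals:

```go
package main

import "fmt"

// Illustrative stand-ins for the server internals mentioned above.
type shard struct{ path string }

func (sh *shard) close() error { fmt.Println("closing shard file", sh.path); return nil }

type topicReader struct{ topic string }

func (tr *topicReader) Close() { fmt.Println("closing topic reader", tr.topic) }

type database struct {
	shards       []*shard
	topicReaders []*topicReader
}

type server struct{ databases map[string]*database }

// dropDatabase sketches the fix discussed above: release shard file handles
// and stop topic readers/heartbeats before forgetting the database, so
// dropped databases don't leak files and connections.
func (s *server) dropDatabase(name string) error {
	db, ok := s.databases[name]
	if !ok {
		return fmt.Errorf("database not found: %s", name)
	}
	for _, sh := range db.shards {
		if err := sh.close(); err != nil {
			return err
		}
	}
	for _, tr := range db.topicReaders {
		tr.Close()
	}
	delete(s.databases, name)
	return nil
}

func main() {
	s := &server{databases: map[string]*database{
		"testdb": {
			shards:       []*shard{{path: "/tmp/testdb/1"}},
			topicReaders: []*topicReader{{topic: "testdb"}},
		},
	}}
	_ = s.dropDatabase("testdb")
}
```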
Distributed Query/Clustering Fixes
Yeah, here's hoping! It is probable.
This PR fixes cluster joins and distributed queries.
New data nodes would never join an existing cluster properly. They would start up but would not attempt to join the cluster so they all reported themselves as Server ID 1.
The distributed queries issue was that previously sent data was not being cleared out on each iteration of the mapper output processing. Subsequent iterations would send out duplicate data causing incorrect results.
Fixes #2348 #2343 #2334 #2272
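As a rough illustration of the mapper fix described above (the `emitChunks` loop is hypothetical, not the actual query engine code), the key point is clearing previously sent data before each iteration so duplicates are never re-sent:

```go
package main

import "fmt"

// emitChunks sketches the duplicate-data bug: if the output buffer is not
// reset on each iteration, values sent in earlier chunks are sent again and
// the query results are wrong.
func emitChunks(chunks [][]int, send func([]int)) {
	out := make([]int, 0)
	for _, chunk := range chunks {
		out = out[:0] // the fix: clear previously sent data before appending
		out = append(out, chunk...)
		send(out)
	}
}

func main() {
	emitChunks([][]int{{1, 2}, {3}}, func(v []int) { fmt.Println(v) })
	// With the reset: [1 2] then [3]. Without it: [1 2] then [1 2 3].
}
```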