Distributed Query/Clustering Fixes #2353
Conversation
2e5e708 to 2a03ac5
t.Parallel()
testName := "3-node server integration partial replication"
testName := "6-node server integration partial replication"
6 is not really a valid number for a cluster. Best practice is to use odd numbers. Why 6?
Why change it at all? Would 5 or 7 be better?
What we are trying to do is get the server to create more than one shard per shard group, to test the full range of possibilities across writing and reading series data.
From what we could tell, the number of shards a shard group creates is calculated as the number of data nodes divided by the replication factor, so we needed a configuration that exercises that. We should probably make this a 5-node cluster with a replication factor of 2.
Yes, a cluster of 5 nodes with a replication factor of 2 would be better. That should result in shard groups of 2 shards each, which will then result in 4 actual shards on the cluster.
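For illustration, here is a minimal sketch of the sizing arithmetic described above (`shardsPerGroup` is a hypothetical helper for this comment, not InfluxDB's actual code):

```go
package main

import "fmt"

// shardsPerGroup sketches the sizing rule discussed above: the number of
// shards in a shard group is roughly data nodes / replication factor.
// Illustrative arithmetic only, not the server's actual implementation.
func shardsPerGroup(dataNodes, replicationFactor int) int {
	return dataNodes / replicationFactor
}

func main() {
	nodes, rf := 5, 2
	perGroup := shardsPerGroup(nodes, rf) // 5 / 2 = 2 shards per shard group
	total := perGroup * rf                // each shard has rf copies, so 4 shards on the cluster
	fmt.Printf("%d shards per group, %d shards total\n", perGroup, total)
}
```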
Some of these changes look really important. I'm not ready to +1 it yet, however, as I have some questions.
@@ -642,19 +642,8 @@ func (cmd *RunCommand) openServer(joinURLs []url.URL) *influxdb.Server {
// Give brokers time to elect a leader if entire cluster is being restarted.
time.Sleep(1 * time.Second)

if s.ID() == 0 && s.Index() == 0 {
So why wasn't this condition correct? Why was also requiring `Index()` to be 0 wrong?
From what I understand, after you open your own meta store you read your ID from it. If your ID isn't 0, your index certainly shouldn't be either, since you have been part of a cluster before.
If I understand this correctly, sometimes `s.ID()` was 0 but `s.Index()` was not, hence the bug?
Yes. Index was non-zero.
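For clarity, here is a rough sketch of the join decision being discussed. `shouldJoin` is a hypothetical stand-in, not the exact patch in `openServer`; it only illustrates keying the decision off the ID and dropping the `Index()` requirement:

```go
package main

import "fmt"

// shouldJoin sketches the join decision from this thread: a node that has
// never been assigned a server ID should attempt to join an existing
// cluster, even if its local meta-store index is non-zero.
func shouldJoin(id, index uint64) bool {
	// The buggy check was effectively id == 0 && index == 0: a node with a
	// non-zero index but no assigned ID would skip the join and end up
	// posing as server ID 1.
	return id == 0
}

func main() {
	fmt.Println(shouldJoin(0, 7)) // true: no ID assigned yet, join despite the non-zero index
	fmt.Println(shouldJoin(3, 7)) // false: already a member of a cluster
}
```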
Reviewing this now.
t.Parallel()
testName := "3-node server integration partial replication"
testName := "6-node server integration partial replication"
`testName` is not quite right now: 6 -> 5.
Changes look good. I have some minor comments you might wish to address, but I don't need to re-review. Now that I understand the why, +1. Nice work.
Don't forget the changelog.
New data nodes would never actually join the cluster. They would pose as server ID 1 in a cluster.
Drop database did not close any open shard files or close any topic readers/heartbeats. In the tests, we create and drop new databases during each test run, so these open files and connections slowed things down and consumed a lot of RAM as the tests progressed.
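A hedged sketch of the cleanup this describes. The `dropDatabase`, `shard`, and `topicReader` types below are illustrative stand-ins, not the server's real internals:

```go
package main

import "fmt"

// Illustrative stand-ins for the server internals mentioned above.
type shard struct{ path string }

func (sh *shard) close() error { fmt.Println("closing shard file", sh.path); return nil }

type topicReader struct{ topic string }

func (tr *topicReader) Close() { fmt.Println("closing topic reader", tr.topic) }

type database struct {
	shards       []*shard
	topicReaders []*topicReader
}

type server struct{ databases map[string]*database }

// dropDatabase sketches the fix discussed above: release shard file handles
// and stop topic readers/heartbeats before forgetting the database, so
// dropped databases don't leak files and connections.
func (s *server) dropDatabase(name string) error {
	db, ok := s.databases[name]
	if !ok {
		return fmt.Errorf("database not found: %s", name)
	}
	for _, sh := range db.shards {
		if err := sh.close(); err != nil {
			return err
		}
	}
	for _, tr := range db.topicReaders {
		tr.Close()
	}
	delete(s.databases, name)
	return nil
}

func main() {
	s := &server{databases: map[string]*database{
		"testdb": {
			shards:       []*shard{{path: "/tmp/testdb/1"}},
			topicReaders: []*topicReader{{topic: "testdb"}},
		},
	}}
	_ = s.dropDatabase("testdb")
}
```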
Distributed Query/Clustering Fixes
Yeah, here's hoping! It is probable.
This PR fixes cluster joins and distributed queries.
New data nodes would never join an existing cluster properly. They would start up but would not attempt to join the cluster so they all reported themselves as Server ID 1.
The distributed queries issue was that previously sent data was not being cleared out on each iteration of the mapper output processing. Subsequent iterations would send out duplicate data causing incorrect results.
Fixes #2348 #2343 #2334 #2272
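As a rough illustration of the mapper fix described above (the `emitChunks` loop is hypothetical, not the actual query engine code), the key point is clearing previously sent data before each iteration so duplicates are never re-sent:

```go
package main

import "fmt"

// emitChunks sketches the duplicate-data bug: if the output buffer is not
// reset on each iteration, values sent in earlier chunks are sent again and
// the query results are wrong.
func emitChunks(chunks [][]int, send func([]int)) {
	out := make([]int, 0)
	for _, chunk := range chunks {
		out = out[:0] // the fix: clear previously sent data before appending
		out = append(out, chunk...)
		send(out)
	}
}

func main() {
	emitChunks([][]int{{1, 2}, {3}}, func(v []int) { fmt.Println(v) })
	// With the reset: [1 2] then [3]. Without it: [1 2] then [1 2 3].
}
```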