clustering: influxdb 0.9.0-rc23 panics when doing a GET with merge_metrics in a 3 node cluster #2272

agreentree · 2015-04-13T20:58:57Z

influxdb.log: panic: distributed queries not implemented yet and there are too many shards in this group
the command I use to get the panic is: monasca measurement-list cpu.idle_perc 1970 --merge_metrics

agreentree · 2015-04-13T22:13:21Z

I understand if this is a feature that hasn't been implemented yet, but the same commands worked with RC19, so we have written many regression tests with merge_metrics in the GET that now fail.

otoolep · 2015-04-13T22:17:37Z

OK, this may be a regression since we did change the query engine.

Can you supply a sequence of curl commands (writing data, then reading it) which brings out this issue?

https://github.com/influxdb/influxdb/blob/master/CONTRIBUTING.md#bug-reports

agreentree · 2015-04-13T22:43:00Z

this is the curl command I use to reproduce it:
curl -i -X GET -H 'X-Auth-User: mini-mon' -H 'X-Auth-Token: 46968286c92a423da87b2eae570c422a' -H 'X-Auth-Key: password' -H 'Accept: application/json' -H 'User-Agent: python-monascaclient' -H 'Content-Type: application/json' --cacert /usr/local/share/ca-certificates/monasca_test_ca.crt https://mon-ae1test-monasca01.useast.hpcloud.net:8080/v2.0/metrics/measurements?start_time=1970&merge_metrics=True&name=cpu.idle_perc

agreentree · 2015-04-13T23:25:40Z

I just reproduced it with influxdb stand-alone as follows:

installed rc23 on 3 nodes, modified the join-urls on the 2 worker nodes to point to the master
ran the following curl commands (from the influxdb 0.9 doc. page):

curl -G 'http://localhost:8086/query' --data-urlencode "db=mydb" --data-urlencode "q=SELECT value FROM cpu_load_short WHERE region='us-west'"

curl -XPOST 'http://localhost:8086/write' -d ' {
"database": "mydb",
"retentionPolicy": "default",
"points": [
{
"name": "cpu_load_short",
"tags": {
"host": "server01",
"region": "us-west"
},
"timestamp": "2009-11-10T23:00:00Z",
"fields": {
"value": 0.64
}
}
]
}
'

curl -G 'http://localhost:8086/query' --data-urlencode "db=mydb" --data-urlencode "q=SELECT value FROM cpu_load_short WHERE region='us-west'"

influxdb fails with panic message from original comment

jwilder · 2015-04-13T23:43:16Z

From: https://github.com/influxdb/influxdb/blob/master/tx.go#L144, it looks like this is triggered because the replication factor is less than the number of servers in the cluster. I believe the default replication factor is 1 for the default retention policy and you are using a 3 node cluster. You might try creating a retention policy w/ a replication factor of 3 and specifying that RP in your writes as a work around.

@otoolep is this panic still needed though?

otoolep · 2015-04-13T23:50:15Z

Yeah, that could be an issue. Let me look into it, we may not need the explicit panic any longer.

agreentree · 2015-04-14T16:12:43Z

I resolved the issue in my enviironment by adding DEFAULT to the database creation so that the correct retention policy and replication factor get set; no opinion on whether that panic should still occur if the replication factor doesn't match the clustered env, perhaps a more informative message would help.

otoolep · 2015-04-14T16:51:31Z

Here is the requested documentation.

http://influxdb.com/docs/v0.9/query_language/database_administration.html#modifying-a-retention-policy

agreentree · 2015-04-16T18:57:54Z

this works for me now

svscorp · 2015-04-17T14:00:57Z

@beckettsean @levicook

Facing the same issue on rc25. 3 server setup, all 3 are both broker and data-node.

Then I put some data

curl -XPOST 'http://influxdb:8086/write' -d '
{
    "database": "ilia",
    "retentionPolicy": "default",
    "points": [
        {
            "name": "cpu",
            "tags": {
                "host": "server1",
                "region": "nl"
            },
            "timestamp": "2015-04-10T15:00:00Z",
            "fields": {
                "value": 10.64
            }
        },
        {
            "name": "cpu",
            "tags": {
                "host": "server1",
                "region": "nl"
            },
            "timestamp": "2015-04-10T15:05:00Z",
            "fields": {
                "value": 20.00
            }
        },
        {
            "name": "cpu",
            "tags": {
                "host": "server1",
                "region": "nl"
            },
            "timestamp": "2015-04-10T15:10:00Z",
            "fields": {
                "value": 25.00
            }
        },
        {
            "name": "cpu",
            "tags": {
                "host": "server1",
                "region": "nl"
            },
            "timestamp": "2015-04-10T15:15:00Z",
            "fields": {
                "value": 35.01
            }
        }
    ]
}'

Then I run

curl -G http://influxdb:8086/query?pretty=true --data-urlencode "q=SELECT * FROM cpu" --data-urlencode "db=ilia"

And it crashed the node.

Then I found this issue. Created the policy with replicaN = 2 and it started to work. But that's weak, default retention policy has replicaN = 1 (which is not more than the number of servers in the cluster). Why it is running into the panic?

Although, In <0.9.0 there was an option to set the default replica factor and in 0.9.0 there is not.

agreentree · 2015-04-17T16:20:49Z

even though this works when the replication factor matches the number of nodes in the cluster, it sounds like there is still a question around whether the panic should occur if the replication factor is less than the number of clustered nodes.

Fixes #2272 There was previously a explict panic put in the query engine to prevent queries where the number of shards was not equal to the number of data nodes in the cluster. This was waiting for the distributed queries branch to land but was not removed when that landed. There may be a more efficient way to do fix this but this fix simply queries all the shards and merges their outputs. Previously, the code assumed that only one shard would be hit. Querying multiple shards ended up producing duplicate values during the map phase so the map output needed to be merged as opposed to appended to avoid the dups.

Fixes #2272 There was previously a explict panic put in the query engine to prevent queries where the number of shards was not equal to the number of data nodes in the cluster. This was waiting for the distributed queries branch to land but was not removed when that landed.

svscorp · 2015-04-18T08:36:15Z

@jwilder how do I configure default amount of shards?

Fixes #2272 There was previously a explict panic put in the query engine to prevent queries where the number of shards was not equal to the number of data nodes in the cluster. This was waiting for the distributed queries branch to land but was not removed when that landed.

jwilder · 2015-04-19T05:43:45Z

@svscorp You should be able to change the replication factor using the CLI

$ influx
> alter retention policy "default" on "mydb" replication 3 default

You could also create new retention policy and mark is as default

$ influx
> create retention policy "myrp" on "mydb" duration 1d replication 2 default

Fixes #2272 There was previously a explict panic put in the query engine to prevent queries where the number of shards was not equal to the number of data nodes in the cluster. This was waiting for the distributed queries branch to land but was not removed when that landed.

svscorp · 2015-04-20T06:09:09Z

@jwilder I get it, my question was about some kind of configuration, how it was done in <0.9. But that's okay, means I need to execute the ALTER query after I created a database.

But I think it would be wise to have a configuration for applying the default replication factor to all the new databases created with 'default' key. Because it's easy to lost that bit, you need to always make 1 query more to have the replication factor you need (or that you need to include it to the create db query).

What do you think?

svscorp · 2015-04-20T06:26:31Z

Also, here it is hardcoded:

https://github.com/influxdb/influxdb/blob/master/server.go#L1104

maybe it make sense to define it as constant or something (or again, the configuration)?

Fixes #2272 There was previously a explict panic put in the query engine to prevent queries where the number of shards was not equal to the number of data nodes in the cluster. This was waiting for the distributed queries branch to land but was not removed when that landed.

Fix #2353 #2348 #2343 #2334 #2272

beckettsean added this to the 0.9.0 milestone Apr 14, 2015

beckettsean added the 1 - Ready label Apr 14, 2015

agreentree closed this as completed Apr 16, 2015

levicook removed the 1 - Ready label Apr 16, 2015

agreentree reopened this Apr 17, 2015

jwilder self-assigned this Apr 17, 2015

jwilder closed this as completed Apr 17, 2015

jwilder reopened this Apr 17, 2015

jwilder added the 2 - Working label Apr 17, 2015

jwilder mentioned this issue Apr 17, 2015

Handle distributed queries when shards != data nodes #2327

Closed

jwilder mentioned this issue Apr 18, 2015

Handle distributed queries when shards != data nodes #2336

Merged

svscorp mentioned this issue Apr 20, 2015

[0.9.0-rc28] Issue with number values being inserted as strings. Select 'mean(value)' crashes the node. #2346

Closed

toddboom closed this as completed in #2336 Apr 20, 2015

toddboom removed the 2 - Working label Apr 20, 2015

jwilder mentioned this issue Apr 20, 2015

Distributed Query/Clustering Fixes #2353

Merged

jwilder added a commit that referenced this issue Apr 21, 2015

Update changelog

053387d

Fix #2353 #2348 #2343 #2334 #2272

jwilder added a commit that referenced this issue Apr 21, 2015

Update changelog

2975a9d

Fix #2353 #2348 #2343 #2334 #2272

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

clustering: influxdb 0.9.0-rc23 panics when doing a GET with merge_metrics in a 3 node cluster #2272

clustering: influxdb 0.9.0-rc23 panics when doing a GET with merge_metrics in a 3 node cluster #2272

agreentree commented Apr 13, 2015

agreentree commented Apr 13, 2015

otoolep commented Apr 13, 2015

agreentree commented Apr 13, 2015

agreentree commented Apr 13, 2015

jwilder commented Apr 13, 2015

otoolep commented Apr 13, 2015

agreentree commented Apr 14, 2015

otoolep commented Apr 14, 2015

agreentree commented Apr 16, 2015

svscorp commented Apr 17, 2015

agreentree commented Apr 17, 2015

svscorp commented Apr 18, 2015

jwilder commented Apr 19, 2015

svscorp commented Apr 20, 2015

svscorp commented Apr 20, 2015

clustering: influxdb 0.9.0-rc23 panics when doing a GET with merge_metrics in a 3 node cluster #2272

clustering: influxdb 0.9.0-rc23 panics when doing a GET with merge_metrics in a 3 node cluster #2272

Comments

agreentree commented Apr 13, 2015

agreentree commented Apr 13, 2015

otoolep commented Apr 13, 2015

agreentree commented Apr 13, 2015

agreentree commented Apr 13, 2015

jwilder commented Apr 13, 2015

otoolep commented Apr 13, 2015

agreentree commented Apr 14, 2015

otoolep commented Apr 14, 2015

agreentree commented Apr 16, 2015

svscorp commented Apr 17, 2015

agreentree commented Apr 17, 2015

svscorp commented Apr 18, 2015

jwilder commented Apr 19, 2015

svscorp commented Apr 20, 2015

svscorp commented Apr 20, 2015