Batch CQ writes to avoid timeouts #3517

Merged on Aug 14, 2015 (1 commit)

dim commented Jul 30, 2015

Currently, continuous query timeouts cause the server to reject writes. Here's how to reproduce it:

  1. Start the InfluxDB server:

    (master)$ go build ./... && go install ./... && $GOPATH/bin/influxd run
  2. In a second console, run the following Ruby script:

    require 'influxdb'
    
    N  = 1000
    DB = "cq_bug"
    
    influxdb = InfluxDB::Client.new DB, host: 'localhost', time_precision: "s"
    influxdb.create_database DB
    influxdb.create_continuous_query "test_1m", DB, "SELECT sum(visits) as visits INTO test_1m FROM test GROUP BY time(1m), host"
    
    count = 0
    loop do
      data = []
      N.times do |i|
        data.push series: 'test', tags: { host: "web-00#{rand(16)}" }, values: { visits: 1 }
      end
    
      begin
        res = influxdb.write_points(data)
        puts "written #{count += N} points"
      rescue => e
        puts "ERROR #{e.message}"
      end
    
      sleep(0.5)
    end

Within a few seconds, the server crashes after the first CQ run and never recovers:

[continuous_querier] 2015/07/30 15:33:03 starting continuous query service
[metastore] 2015/07/30 15:33:03 [INFO] raft: Node at 127.0.0.1:8088 [Follower] entering Follower state
[metastore] 2015/07/30 15:33:05 [WARN] raft: Heartbeat timeout reached, starting election
[metastore] 2015/07/30 15:33:05 [INFO] raft: Node at 127.0.0.1:8088 [Candidate] entering Candidate state
[metastore] 2015/07/30 15:33:05 [DEBUG] raft: Votes needed: 1
[metastore] 2015/07/30 15:33:05 [DEBUG] raft: Vote granted. Tally: 1
[metastore] 2015/07/30 15:33:05 [INFO] raft: Election won. Tally: 1
[metastore] 2015/07/30 15:33:05 [INFO] raft: Node at 127.0.0.1:8088 [Leader] entering Leader state
[metastore] 2015/07/30 15:33:05 [INFO] raft: Disabling EnableSingleNode (bootstrap)
[metastore] 2015/07/30 15:33:05 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/30 15:33:05 created local node: id=1, host=127.0.0.1:8088
[admin] 2015/07/30 15:33:05 listening on HTTP: [::]:8083
[httpd] 2015/07/30 15:33:05 authentication enabled: false
[httpd] 2015/07/30 15:33:05 listening on HTTP: [::]:8086
2015/07/30 15:33:05 InfluxDB starting, version 0.9, commit unknown
2015/07/30 15:33:05 GOMAXPROCS set to 4
[run] 2015/07/30 15:33:05 listening for signals
2015/07/30 15:33:05 Sending anonymous usage statistics to m.influxdb.com
[http] 2015/07/30 15:33:07 127.0.0.1 - root [30/Jul/2015:15:33:07 +0100] GET /query?q=CREATE+DATABASE+cq_bug&u=root&p=root HTTP/1.1 200 40 - Ruby e522ccf1-36c7-11e5-8001-000000000000 14.280937ms
[http] 2015/07/30 15:33:07 127.0.0.1 - root [30/Jul/2015:15:33:07 +0100] GET /query?q=CREATE+CONTINUOUS+QUERY+test_1m+ON+cq_bug+BEGIN%0ASELECT+sum%28visits%29+as+visits+INTO+test_1m+FROM+test+GROUP+BY+time%281m%29%2C+host%0AEND&u=root&p=root HTTP/1.1 200 40 - Ruby e5251e1e-36c7-11e5-8002-000000000000 4.359335ms
[http] 2015/07/30 15:33:07 127.0.0.1 - root [30/Jul/2015:15:33:07 +0100] POST /write?db=cq_bug&precision=s&u=root&p=root HTTP/1.1 204 0 - Ruby e528340d-36c7-11e5-8003-000000000000 74.981611ms
[http] 2015/07/30 15:33:08 127.0.0.1 - root [30/Jul/2015:15:33:08 +0100] POST /write?db=cq_bug&precision=s&u=root&p=root HTTP/1.1 204 0 - Ruby e585261b-36c7-11e5-8004-000000000000 43.971901ms
[continuous_querier] 2015/07/30 15:33:08 executing continuous query test_1m
[continuous_querier] 2015/07/30 15:33:08 wrote 1 point(s) to cq_bug.default.test_1m
[continuous_querier] 2015/07/30 15:33:08 wrote 1 point(s) to cq_bug.default.test_1m
[continuous_querier] 2015/07/30 15:33:08 wrote 1 point(s) to cq_bug.default.test_1m
[continuous_querier] 2015/07/30 15:33:08 wrote 1 point(s) to cq_bug.default.test_1m
[continuous_querier] 2015/07/30 15:33:08 wrote 1 point(s) to cq_bug.default.test_1m
[continuous_querier] 2015/07/30 15:33:13 timeout
[continuous_querier] 2015/07/30 15:33:13 error: timeout. running: SELECT sum(visits) AS "visits" INTO "cq_bug"."default".test_1m FROM "cq_bug"."default".test WHERE time >= '2015-07-30 14:33:00' AND time < '2015-07-30 14:34:00' GROUP BY time(1m), host
[continuous_querier] 2015/07/30 15:33:13 error executing query: CREATE CONTINUOUS QUERY test_1m ON cq_bug BEGIN SELECT sum(visits) AS "visits" INTO "cq_bug"."default".test_1m FROM "cq_bug"."default".test GROUP BY time(1m), host END: err = timeout
[http] 2015/07/30 15:33:13 127.0.0.1 - root [30/Jul/2015:15:33:08 +0100] POST /write?db=cq_bug&precision=s&u=root&p=root HTTP/1.1 500 32 - Ruby e5dcb763-36c7-11e5-8005-000000000000 5.003839263s
[http] 2015/07/30 15:33:19 127.0.0.1 - root [30/Jul/2015:15:33:14 +0100] POST /write?db=cq_bug&precision=s&u=root&p=root HTTP/1.1 500 32 - Ruby e928ebf8-36c7-11e5-8006-000000000000 5.003988581s
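
One way to pick the failed writes out of output like the above is to filter the `[http]` lines by status code and request duration. The following is only a sketch: the regular expression and the `parse_http_line` helper are assumptions tailored to the sample lines in this report, not a documented InfluxDB log format.

```ruby
# Hypothetical pattern: method, path, status, and the trailing duration
# ("ms" or "s") as they appear in the [http] lines quoted in this report.
HTTP_LINE = /(GET|POST) (\S+) HTTP\/1\.1 (\d{3}) \d+ .* (\d+(?:\.\d+)?)(ms|s)\z/

# Returns a hash for an [http] request line, or nil for any other line.
def parse_http_line(line)
  m = HTTP_LINE.match(line.strip)
  return nil unless m

  seconds = m[4].to_f
  seconds /= 1000.0 if m[5] == "ms" # normalise milliseconds to seconds
  { method: m[1], path: m[2], status: m[3].to_i, seconds: seconds }
end

# Three lines copied from the log in this report.
log = <<~'LOG'
  [http] 2015/07/30 15:33:08 127.0.0.1 - root [30/Jul/2015:15:33:08 +0100] POST /write?db=cq_bug&precision=s&u=root&p=root HTTP/1.1 204 0 - Ruby e585261b-36c7-11e5-8004-000000000000 43.971901ms
  [continuous_querier] 2015/07/30 15:33:13 timeout
  [http] 2015/07/30 15:33:13 127.0.0.1 - root [30/Jul/2015:15:33:08 +0100] POST /write?db=cq_bug&precision=s&u=root&p=root HTTP/1.1 500 32 - Ruby e5dcb763-36c7-11e5-8005-000000000000 5.003839263s
LOG

# Keep only requests that the server answered with a 5xx status.
failed = log.each_line.map { |l| parse_http_line(l) }.compact
            .select { |r| r[:status] >= 500 }

failed.each do |r|
  puts format("%s %d %.3fs", r[:method], r[:status], r[:seconds])
end
```

In the sample, the only failing request is the write that started at 15:33:08 and returned 500 after roughly five seconds, at the same moment the continuous querier logged its timeout.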

The output of the ruby script is:

$ ruby client.rb 
written 1000 points
written 2000 points
ERROR timeout
ERROR timeout

This pull request reduces the time needed to run the CQ and avoids the timeouts, but unfortunately it doesn't fix the underlying problem, which I traced to cluster.PointsWriter but was unable to solve myself.
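
To illustrate the batching idea in isolation, here is a minimal Ruby sketch. It is not the Go change in this pull request: `RecordingWriter` and `write_in_batches` are hypothetical names, and the writer only records calls instead of talking to a server. It contrasts one write call per point (the pattern suggested by the repeated `wrote 1 point(s)` log lines) with a few calls that each carry a slice of points.

```ruby
# Hypothetical stand-in for a points writer; it only records each call.
class RecordingWriter
  attr_reader :calls

  def initialize
    @calls = []
  end

  def write_points(points)
    @calls << points
  end
end

# Writes points in slices of at most batch_size instead of one call per point.
def write_in_batches(writer, points, batch_size)
  raise ArgumentError, "batch_size must be positive" unless batch_size > 0

  points.each_slice(batch_size) { |batch| writer.write_points(batch) }
end

# One aggregated point per host, similar in shape to the CQ results
# from the reproduction script (the values here are made up).
points = 16.times.map do |i|
  { series: "test_1m", tags: { host: "web-00#{i}" }, values: { visits: 10 } }
end

# Unbatched: every point becomes its own write call.
unbatched = RecordingWriter.new
points.each { |point| unbatched.write_points([point]) }

# Batched: the same points go out in slices of up to 10.
batched = RecordingWriter.new
write_in_batches(batched, points, 10)

puts "unbatched calls: #{unbatched.calls.size}"
puts "batched calls:   #{batched.calls.size}"
```

With 16 points and a batch size of 10, the unbatched path makes 16 write calls and the batched path makes 2. Fewer calls means less per-write overhead, which is the effect described above; it does not address the underlying writer problem.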


jwilder commented Aug 13, 2015

@dim 👍 This change looks good. Can you sign the CLA? https://influxdb.com/community/cla.html


otoolep commented Aug 13, 2015

+1, thanks @dim


dim commented Aug 14, 2015

Done, signed

jwilder added a commit that referenced this pull request Aug 14, 2015
Test script from #3517 that reproduces a CQ deadlock.  This is related
to #3522 as well.
jwilder added a commit that referenced this pull request Aug 14, 2015
Batch CQ writes to avoid timeouts
jwilder merged commit e5e782d into influxdata:master on Aug 14, 2015