
[0.9.1] 500 errors, then out of memory, then crash #3362

Closed
nicksellen opened this issue Jul 17, 2015 · 11 comments

@nicksellen

| Thing | Value |
| --- | --- |
| version | InfluxDB v0.9.1 (git: 8b3219e74fcc3843a6f4901bdf00e905642b6bd6) |
| OS | Ubuntu 14.04.2 LTS |
| uname -a | Linux stats02.ec.everycity.co.uk 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| installed from | influxdb_0.9.1_amd64.deb |
| spec | 32gb memory, 8 cores, SSD disks, KVM vm |

InfluxDB crashed and I'm not sure why. Starting it again caused it to recover, and everything is ok again now. It would be nice to know why.

I am running a default install, except for extending the continuous query recompute previous/time settings. I have 32 continuous queries (one for each measurement). I write about
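The actual continuous queries aren't shown in this thread, but one of the 32 per-measurement CQs is presumably of this general 0.9-era shape; the CQ, measurement, and field names below are made up for illustration, and only the database name metrics_production appears in the request log further down:

CREATE CONTINUOUS QUERY cq_cpu_1m ON metrics_production BEGIN
  SELECT mean(value) INTO cpu_1m FROM cpu GROUP BY time(1m)
END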

Here's what happened:

  1. 2015/07/15 02:43:33 - starts returning 500 errors to all "POST /write" requests (previously all 204s), with no error messages in the log alongside them.
  2. 2015/07/16 17:37:59 - fatal error: runtime: out of memory
  3. long list of errors mostly relating to RWMutex
  4. Jul 16 17:38 (last log access) - presumably it died at this point
  5. it didn't get restarted automatically (but I don't see anything that is trying to do this either, is that correct?)

The last error stack trace was this (the goroutine number looks very high...):

goroutine 504419193 [semacquire]:
sync.(*RWMutex).RLock(0xc20b49cca0)
        /usr/local/go/src/sync/rwmutex.go:36 +0x5f
github.com/influxdb/influxdb/tsdb.(*Shard).validateSeriesAndFields(0xc20b49cc60, 0xca68ffc000, 0x2a7, 0x400, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /home/philip/build/src/github.com/influxdb/influxdb/tsdb/shard.go:638 +0x143
github.com/influxdb/influxdb/tsdb.(*Shard).WritePoints(0xc20b49cc60, 0xca68ffc000, 0x2a7, 0x400, 0x0, 0x0)
        /home/philip/build/src/github.com/influxdb/influxdb/tsdb/shard.go:220 +0x75
github.com/influxdb/influxdb/tsdb.(*Store).WriteToShard(0xc20800c720, 0x8, 0xca68ffc000, 0x2a7, 0x400, 0x0, 0x0)
        /home/philip/build/src/github.com/influxdb/influxdb/tsdb/store.go:281 +0xf7
github.com/influxdb/influxdb/cluster.func·002(0x8, 0x1, 0xca68ffc000, 0x2a7, 0x400)
        /home/philip/build/src/github.com/influxdb/influxdb/cluster/points_writer.go:241 +0xfc
created by github.com/influxdb/influxdb/cluster.(*PointsWriter).writeToShard
        /home/philip/build/src/github.com/influxdb/influxdb/cluster/points_writer.go:271 +0x30b

Here is a little sample of the latest flush log messages (after the restart) to give an idea of write volume (although this is probably only about 15-20% of the intended write volume):

[shard] 2015/07/17 11:07:16 flush 14977 points in 0.083s
[shard] 2015/07/17 11:07:18 flush 14391 points in 0.074s
[shard] 2015/07/17 11:07:21 flush 14307 points in 0.091s
[shard] 2015/07/17 11:07:23 flush 15176 points in 0.073s
[shard] 2015/07/17 11:07:25 flush 13710 points in 0.077s
[shard] 2015/07/17 11:07:27 flush 14902 points in 0.108s
[shard] 2015/07/17 11:07:29 flush 15219 points in 0.083s
[shard] 2015/07/17 11:07:31 flush 13811 points in 0.072s

Thanks :)

beckettsean changed the title from "500 errors, then out of memory, then crash" to "[0.9.1] 500 errors, then out of memory, then crash" on Jul 17, 2015
@beckettsean
Contributor

@nicksellen what volume of writes were you pushing in when the database crashed? Roughly how many points per second, through what protocol, and with what batch size?

@nicksellen
Author

@beckettsean There are 5 servers writing points every 5 seconds; this is how many points each writes per request (and thus the batch sizes): 219 + 1943 + 1926 + 783 + 735 = 5606 (these are slightly higher per server than at the time of the crash, but not too far off). So around 1000 points/second overall, and I'm using the line protocol with ordered tags.
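For context on the format: each batch is plain line protocol POSTed to /write?db=metrics_production&precision=ms (the endpoint visible in the request log further down), one point per line, with tag keys written in sorted order and a millisecond timestamp to match precision=ms. The measurement, tag, and field names in this single example line are made up:

cpu_load,host=app01,region=ec value=0.64 1437134400000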

I'm pushing in about double that now, htop tells me it's writing to disk at 90mb/s which seems high (it's only reading 135kb/s from the network).

@nicksellen
Author

(sorry, make that ~400kb/s of network read, I was reading the wrong figure)

@nicksellen
Author

here are stats graphs of the server running influxdb (ignore the memory one):

[screenshot: server stats graphs, 2015-07-18 13:39]

~3k/s seems like a lot of disk ops too?

@nicksellen
Author

I suspect it's on its way to doing the same thing again. Here is a simplified log https://gist.github.com/nicksellen/f5f1bd2ff0f94ba70cb5 showing "<timestamp> <status> <last 2 chars of request time>" - the interesting points are:

10:38:20 204 3s - first multi second request
10:38:50 500 3s - first 500 error
10:39:21 500 7s - everything 500 and multi second after here

And here's a screengrab of htop.

[screenshot: htop, 2015-07-23 13:05]

I'm going to restart it now and hopefully it comes back up ok.

@nicksellen
Author

It took around 5 minutes after restarting before it was ready to accept writes again.

GOMAXPROCS set to 8
2015/07/23 11:07:14 [INFO] raft: Node at 127.0.0.1:8088 [Follower] entering Follower state
2015/07/23 11:07:14 read local node id: 1
2015/07/23 11:07:15 [WARN] raft: Heartbeat timeout reached, starting election
2015/07/23 11:07:15 [INFO] raft: Node at 127.0.0.1:8088 [Candidate] entering Candidate state
2015/07/23 11:07:15 [DEBUG] raft: Votes needed: 1
2015/07/23 11:07:15 [DEBUG] raft: Vote granted. Tally: 1
2015/07/23 11:07:15 [INFO] raft: Election won. Tally: 1
2015/07/23 11:07:15 [INFO] raft: Node at 127.0.0.1:8088 [Leader] entering Leader state
2015/07/23 11:07:15 [INFO] raft: Disabling EnableSingleNode (bootstrap)
2015/07/23 11:07:15 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
2015/07/23 11:07:15 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
2015/07/23 11:07:15 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
2015/07/23 11:07:15 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
2015/07/23 11:07:15 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
2015/07/23 11:07:15 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
2015/07/23 11:07:15 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
2015/07/23 11:07:15 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
2015/07/23 11:07:15 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[shard] 2015/07/23 11:11:33 flush 11621 points in 0.481s
[shard] 2015/07/23 11:11:35 flush 10341 points in 0.557s
[shard] 2015/07/23 11:11:38 flush 8647 points in 0.511s
[shard] 2015/07/23 11:11:40 flush 6534 points in 0.331s
[shard] 2015/07/23 11:11:43 flush 4893 points in 0.494s
[shard] 2015/07/23 11:11:45 flush 3507 points in 0.435s
[shard] 2015/07/23 11:11:47 flush 1852 points in 0.290s
[shard] 2015/07/23 11:11:50 flush 86413 points in 0.916s
[httpd] 2015/07/23 11:12:09 listening on HTTP: [::]:8086
2015/07/23 11:12:09 Sending anonymous usage statistics to m.influxdb.com
[http] 2015/07/23 11:12:09 10.6.25.1 - - [23/Jul/2015:11:12:09 +0000] POST /write?db=metrics_production&precision=ms HTTP/1.1 204 0 - - a95e6bb8-312b-11e5-8001-000000000000 38.722637ms

Would 0.9.2-rc1 be worth upgrading to at this point? Or is the release close?

@nicksellen
Author

I suspect I should remove my continuous queries too (re: #3368).

@jhedlund

Nick,

Since removing the continuous queries (per #3368) I haven't had any problems. In fact, I am running my own downsampling queries every 15s without issue, so it would appear that the problem is entirely within the continuous query routines...
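The actual queries aren't shown here, but the manual equivalent of a continuous query is just its inner SELECT run on a schedule by an external script, with the results written back as new points. A sketch of the kind of statement involved, run every 15s (the measurement and field names are made up):

SELECT mean(value) FROM cpu WHERE time > now() - 30s GROUP BY time(15s)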

@nicksellen
Author

It looked good for a moment :)

disk writes went from ~80mb/s to ~3mb/s
iops from 1000s per second to 10s per second
cpu load plummeting (from 2 down to 0.2)

14:16:10 - I deleted all the continuous queries (31 of them), the measurements they had created, and the retention policy for them at ... then the 500 errors started again less than a minute later:

14:16:56 204 135.102192ms
14:16:56 204 97.832963ms
14:16:57 204 94.919236ms
14:16:57 204 81.784885ms
14:16:58 204 113.605918ms
14:16:58 204 127.928435ms
14:16:59 204 110.895281ms
14:16:59 204 144.970289ms
14:17:00 204 94.75633ms
14:17:00 204 122.744927ms
14:17:01 204 137.778577ms
14:17:01 204 93.077679ms
14:17:07 500 5.003747598s
14:17:07 500 5.002539646s
14:17:08 500 5.023381179s
14:17:08 500 5.012413516s
14:17:09 500 5.008010318s
14:17:09 500 5.013264289s
14:17:10 500 5.005421954s
14:17:10 500 5.0177413s
14:17:11 500 5.024279723s
14:17:11 500 5.012032194s
14:17:13 500 5.005368704s
14:17:17 500 5.007134619s
14:17:17 500 5.00205985s
14:17:17 500 5.023649674s
14:17:19 500 5.010935744s
14:17:19 500 5.024027929s
14:17:20 500 5.004635773s
14:17:20 500 5.023014609s
14:17:21 500 5.022248712s
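For reference, the cleanup described above amounts to statements along these lines (all names are made up; cq_cpu_1m and cpu_1m match the earlier continuous query sketch, and the thread doesn't show the actual names):

DROP CONTINUOUS QUERY cq_cpu_1m ON metrics_production
DROP MEASUREMENT cpu_1m
DROP RETENTION POLICY downsample ON metrics_production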

I restarted again and after another 5 minute startup time it's happy again... so far so good... and an order of magnitude better resource usage:

[screenshot: stats graphs]

@dim
Contributor

dim commented Jul 30, 2015

I managed to recreate the problem in #3517

@pauldix
Member

pauldix commented Aug 18, 2015

I'm pretty sure this is going to be fixed by #3569, which will be in the 0.9.3 release. It should be in a nightly build tonight. I'm closing for now, but if you can repro with that build please reopen. Thanks!

pauldix closed this as completed on Aug 18, 2015