
[0.9.1] 500 errors, then out of memory, then crash #3362

Closed
nicksellen opened this issue Jul 17, 2015 · 11 comments

@nicksellen

| Thing | Value |
| --- | --- |
| version | InfluxDB v0.9.1 (git: 8b3219e74fcc3843a6f4901bdf00e905642b6bd6) |
| OS | Ubuntu 14.04.2 LTS |
| uname -a | Linux stats02.ec.everycity.co.uk 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| installed from | influxdb_0.9.1_amd64.deb |
| spec | 32gb memory, 8 cores, SSD disks, KVM vm |

InfluxDB crashed and I'm not sure why. Starting it again caused it to recover, and everything is ok again now. It would be nice to know why.

I am running a default install, except for extending the continuous query recompute previous/time settings. I have 32 continuous queries (one for each measurement). I write about
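The actual continuous queries aren't shown in this thread, but one of the 32 per-measurement CQs is presumably of this general 0.9-era shape; the CQ, measurement, and field names below are made up for illustration, and only the database name metrics_production appears in the request log further down:

CREATE CONTINUOUS QUERY cq_cpu_1m ON metrics_production BEGIN
  SELECT mean(value) INTO cpu_1m FROM cpu GROUP BY time(1m)
END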

Here's what happened:

  1. 2015/07/15 02:43:33 - starts returning 500 errors to all "POST /write" requests (previously all 204s), with no error messages in the log alongside them.
  2. 2015/07/16 17:37:59 - fatal error: runtime: out of memory
  3. long list of errors mostly relating to RWMutex
  4. Jul 16 17:38 (last log access) - presumably it died at this point
  5. it didn't get restarted automatically (but I don't see anything that is trying to do this either, is that correct?)

The last error stack trace was this (the goroutine number looks very high...):

goroutine 504419193 [semacquire]:
sync.(*RWMutex).RLock(0xc20b49cca0)
        /usr/local/go/src/sync/rwmutex.go:36 +0x5f
github.com/influxdb/influxdb/tsdb.(*Shard).validateSeriesAndFields(0xc20b49cc60, 0xca68ffc000, 0x2a7, 0x400, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /home/philip/build/src/github.com/influxdb/influxdb/tsdb/shard.go:638 +0x143
github.com/influxdb/influxdb/tsdb.(*Shard).WritePoints(0xc20b49cc60, 0xca68ffc000, 0x2a7, 0x400, 0x0, 0x0)
        /home/philip/build/src/github.com/influxdb/influxdb/tsdb/shard.go:220 +0x75
github.com/influxdb/influxdb/tsdb.(*Store).WriteToShard(0xc20800c720, 0x8, 0xca68ffc000, 0x2a7, 0x400, 0x0, 0x0)
        /home/philip/build/src/github.com/influxdb/influxdb/tsdb/store.go:281 +0xf7
github.com/influxdb/influxdb/cluster.func·002(0x8, 0x1, 0xca68ffc000, 0x2a7, 0x400)
        /home/philip/build/src/github.com/influxdb/influxdb/cluster/points_writer.go:241 +0xfc
created by github.com/influxdb/influxdb/cluster.(*PointsWriter).writeToShard
        /home/philip/build/src/github.com/influxdb/influxdb/cluster/points_writer.go:271 +0x30b

Here is a little sample of the latest flush log messages (after the restart) to give an idea of write volume (although this is probably only about 15-20% of the intended write volume):

[shard] 2015/07/17 11:07:16 flush 14977 points in 0.083s
[shard] 2015/07/17 11:07:18 flush 14391 points in 0.074s
[shard] 2015/07/17 11:07:21 flush 14307 points in 0.091s
[shard] 2015/07/17 11:07:23 flush 15176 points in 0.073s
[shard] 2015/07/17 11:07:25 flush 13710 points in 0.077s
[shard] 2015/07/17 11:07:27 flush 14902 points in 0.108s
[shard] 2015/07/17 11:07:29 flush 15219 points in 0.083s
[shard] 2015/07/17 11:07:31 flush 13811 points in 0.072s

Thanks :)

beckettsean changed the title from "500 errors, then out of memory, then crash" to "[0.9.1] 500 errors, then out of memory, then crash" on Jul 17, 2015
@beckettsean
Contributor

@nicksellen what volume of writes were you pushing in when the database crashed? Roughly how many points per second, through what protocol, and with what batch size?

@nicksellen
Author

@beckettsean There are 5 servers writing points every 5 seconds; this is how many points each writes per request (and thus the batch sizes): 219 + 1943 + 1926 + 783 + 735 = 5606 (these are slightly higher per server than at the time of the crash, but not too far off). So around 1000 points/second overall, and I'm using the line protocol with ordered tags.
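For context on the format: each batch is plain line protocol POSTed to /write?db=metrics_production&precision=ms (the endpoint visible in the request log further down), one point per line, with tag keys written in sorted order and a millisecond timestamp to match precision=ms. The measurement, tag, and field names in this single example line are made up:

cpu_load,host=app01,region=ec value=0.64 1437134400000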

I'm pushing in about double that now, htop tells me it's writing to disk at 90mb/s which seems high (it's only reading 135kb/s from the network).

@nicksellen
Author

(sorry, make that ~400kb/s of network read, I was reading the wrong figure)

@nicksellen
Author

here are stats graphs of the server running influxdb (ignore the memory one):

[screenshot: server stats graphs, 2015-07-18 13:39]

~3k/s seems like a lot of disk ops too?

@nicksellen
Author

I suspect it's on its way to doing the same thing again. Here is a simplified log https://gist.github.com/nicksellen/f5f1bd2ff0f94ba70cb5 showing "<timestamp> <status> <last 2 chars of request time>" - the interesting points are:

10:38:20 204 3s - first multi second request
10:38:50 500 3s - first 500 error
10:39:21 500 7s - everything 500 and multi second after here

And here's a screengrab of htop.

[screenshot: htop, 2015-07-23 13:05]

I'm going to restart it now and hopefully it comes back up ok.

@nicksellen
Author

It took around 5 minutes after restarting before it was ready to accept writes again.

GOMAXPROCS set to 8
2015/07/23 11:07:14 [INFO] raft: Node at 127.0.0.1:8088 [Follower] entering Follower state
2015/07/23 11:07:14 read local node id: 1
2015/07/23 11:07:15 [WARN] raft: Heartbeat timeout reached, starting election
2015/07/23 11:07:15 [INFO] raft: Node at 127.0.0.1:8088 [Candidate] entering Candidate state
2015/07/23 11:07:15 [DEBUG] raft: Votes needed: 1
2015/07/23 11:07:15 [DEBUG] raft: Vote granted. Tally: 1
2015/07/23 11:07:15 [INFO] raft: Election won. Tally: 1
2015/07/23 11:07:15 [INFO] raft: Node at 127.0.0.1:8088 [Leader] entering Leader state
2015/07/23 11:07:15 [INFO] raft: Disabling EnableSingleNode (bootstrap)
2015/07/23 11:07:15 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
2015/07/23 11:07:15 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
2015/07/23 11:07:15 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
2015/07/23 11:07:15 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
2015/07/23 11:07:15 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
2015/07/23 11:07:15 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
2015/07/23 11:07:15 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
2015/07/23 11:07:15 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
2015/07/23 11:07:15 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[shard] 2015/07/23 11:11:33 flush 11621 points in 0.481s
[shard] 2015/07/23 11:11:35 flush 10341 points in 0.557s
[shard] 2015/07/23 11:11:38 flush 8647 points in 0.511s
[shard] 2015/07/23 11:11:40 flush 6534 points in 0.331s
[shard] 2015/07/23 11:11:43 flush 4893 points in 0.494s
[shard] 2015/07/23 11:11:45 flush 3507 points in 0.435s
[shard] 2015/07/23 11:11:47 flush 1852 points in 0.290s
[shard] 2015/07/23 11:11:50 flush 86413 points in 0.916s
[httpd] 2015/07/23 11:12:09 listening on HTTP: [::]:8086
2015/07/23 11:12:09 Sending anonymous usage statistics to m.influxdb.com
[http] 2015/07/23 11:12:09 10.6.25.1 - - [23/Jul/2015:11:12:09 +0000] POST /write?db=metrics_production&precision=ms HTTP/1.1 204 0 - - a95e6bb8-312b-11e5-8001-000000000000 38.722637ms

Would 0.9.2-rc1 be worth upgrading to at this point? Or is the release close?

@nicksellen
Author

I suspect I should remove my continuous queries too (re: #3368).

@jhedlund

Nick,

Since removing the continuous queries (per #3368) I haven't had any problems. In fact, I am running my own downsampling queries every 15s without issue, so it would appear that the problem is entirely within the continuous query routines...
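The actual queries aren't shown here, but the manual equivalent of a continuous query is just its inner SELECT run on a schedule by an external script, with the results written back as new points. A sketch of the kind of statement involved, run every 15s (the measurement and field names are made up):

SELECT mean(value) FROM cpu WHERE time > now() - 30s GROUP BY time(15s)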

@nicksellen
Author

It looked good for a moment :)

disk writes went from ~80mb/s to ~3mb/s
iops from 1000s per second to 10s per second
cpu load plummeting (from 2 down to 0.2)

14:16:10 - I deleted all the continuous queries (31 of them), the measurements they had created, and the retention policy for them at ... then the 500 errors started again less than a minute later:

14:16:56 204 135.102192ms
14:16:56 204 97.832963ms
14:16:57 204 94.919236ms
14:16:57 204 81.784885ms
14:16:58 204 113.605918ms
14:16:58 204 127.928435ms
14:16:59 204 110.895281ms
14:16:59 204 144.970289ms
14:17:00 204 94.75633ms
14:17:00 204 122.744927ms
14:17:01 204 137.778577ms
14:17:01 204 93.077679ms
14:17:07 500 5.003747598s
14:17:07 500 5.002539646s
14:17:08 500 5.023381179s
14:17:08 500 5.012413516s
14:17:09 500 5.008010318s
14:17:09 500 5.013264289s
14:17:10 500 5.005421954s
14:17:10 500 5.0177413s
14:17:11 500 5.024279723s
14:17:11 500 5.012032194s
14:17:13 500 5.005368704s
14:17:17 500 5.007134619s
14:17:17 500 5.00205985s
14:17:17 500 5.023649674s
14:17:19 500 5.010935744s
14:17:19 500 5.024027929s
14:17:20 500 5.004635773s
14:17:20 500 5.023014609s
14:17:21 500 5.022248712s
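For reference, the cleanup described above amounts to statements along these lines (all names are made up; cq_cpu_1m and cpu_1m match the earlier continuous query sketch, and the thread doesn't show the actual names):

DROP CONTINUOUS QUERY cq_cpu_1m ON metrics_production
DROP MEASUREMENT cpu_1m
DROP RETENTION POLICY downsample ON metrics_production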

I restarted again and after another 5 minute startup time it's happy again... so far so good... and an order of magnitude better resource usage:

[screenshot: stats graphs]

@dim
Contributor

dim commented Jul 30, 2015

I managed to recreate the problem in #3517

@pauldix
Member

pauldix commented Aug 18, 2015

I'm pretty sure this is going to be fixed by #3569, which will be in the 0.9.3 release. It should be in a nightly build tonight. I'm closing for now, but if you can repro with that build please reopen. Thanks!

pauldix closed this as completed on Aug 18, 2015