
memory leak crashing influxdb #7810

Closed
rybit opened this issue Jan 10, 2017 · 20 comments

rybit commented Jan 10, 2017

I am seeing InfluxDB crash every day or two because it consumes all the memory on the box. I moved it from a 4GB to a 16GB box, and that just pushed out how long it can go. Below is the chart. I find it odd that Alloc and HeapAlloc are identical, but I looked here and it seems right. Still, the growth and eventual OOM is a problem.

[chart: memory usage climbing until the OOM]

name: database
tags: database=metrics
numMeasurements	numSeries
---------------	---------
37		373362
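For reference, per-database counts like these come from SHOW STATS, and (assuming the default _internal monitoring database is enabled) their history is also recorded there; a rough sketch with the stock influx CLI:

# Current in-memory per-database stats (the "database" section carries numMeasurements/numSeries)
influx -execute 'SHOW STATS' | grep -A 4 'name: database'
# Historical values, assuming the default "monitor" retention policy in _internal
influx -database _internal -execute 'SELECT last(numSeries), last(numMeasurements) FROM "monitor"."database" GROUP BY "database"'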

InfluxDB version: InfluxDB v1.1.1 (git: master e47cf1f)
Linux version: Ubuntu 14.04

[general]
bind-address = ":8088"

[meta]
dir = "/mnt/db/meta"

[data]
dir = "/mnt/db/data"
wal-dir = "/mnt/influx"

[admin]
enabled = true
port = ":8083"
https-enabled = true
https-certificate = "/etc/cfssl/certs/combined.pem"

[hinted-handoff]
dir = "/mnt/db/hh"

[http]
bind-address = ":8086"
https-enabled = true
https-certificate = "/etc/cfssl/certs/combined.pem"
auth-enabled = true
mark-rushakoff (Contributor) commented:

Many (most?) users are not experiencing this problem. If there's a memory leak, we want to fix it, but we need more detail to track down the source of the issue.

Please collect the following diagnostic information that was requested in the issue template. It will be most helpful if you gather this multiple times, each an hour or two apart, as you observe the memory growing.

influx -execute "show shards" > "shards-$(date +%s).txt"
influx -execute "show stats" > "stats-$(date +%s).txt"
influx -execute "show diagnostics" > "diagnostics-$(date +%s).txt"
curl -o "goroutine-$(date +%s).txt" "http://localhost:8086/debug/pprof/goroutine?debug=1" 
curl -o "heap-$(date +%s).txt" "http://localhost:8086/debug/pprof/heap?debug=1" 

rybit (Author) commented Jan 10, 2017

Yeah, I thought the behaviour was very strange. As requested, here is the gist with the results.

Is there any way for me to see the number of series per measurement? I could then dig into whether one of them is acting up (like having an always-increasing field/tag).

rybit (Author) commented Jan 10, 2017

I looked at the series cardinality, and most of our metrics are around a few hundred series, but we have 2 that are pretty large.

  51943 name: kpis.site_count
 270534 name: trafficserver.request_count

The largest one I might split up into 2 measurements. It currently has a size field and a domain tag, which commonly don't match up with any other points. That would significantly decrease the cardinality but add a whole other measurement. I am not sure if that is super relevant?
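For reference, per-measurement counts like these can be approximated with the stock influx CLI (add the -ssl/-unsafeSsl/-username/-password flags used later in this thread as needed); SHOW SERIES prints one series key per line, so a line count is roughly the cardinality:

influx -database metrics -execute 'SHOW MEASUREMENTS'
influx -database metrics -execute 'SHOW SERIES FROM "kpis.site_count"' | wc -l
influx -database metrics -execute 'SHOW SERIES FROM "trafficserver.request_count"' | wc -l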

jwilder (Contributor) commented Jan 10, 2017

Alloc and HeapAlloc should be the same according to the go docs. Can you graph HeapInUse and HeapSys?
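Assuming the default _internal monitoring database and its "monitor" retention policy are enabled, those fields can be pulled with something along these lines (the field names follow Go's MemStats, so verify them with SHOW FIELD KEYS if the query comes back empty):

influx -database _internal -execute 'SELECT HeapInUse, HeapSys FROM "monitor"."runtime" WHERE time > now() - 6h'
influx -database _internal -execute 'SHOW FIELD KEYS FROM "runtime"'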

Is the process panicking with a memory allocation error, or is it being killed by the OS? Do you have a stack trace you could attach?

Also, how are you writing data? Telegraf? The Go client? How big are your batches, and what kind of data are you writing? The heap profile looks like there might be a connection leak from the client side, but the goroutine profile attached doesn't correlate with that idea. Would you be able to get a goroutine profile when the heap is high?

rybit (Author) commented Jan 10, 2017

We are using the Go client.
The batches are 3000 points (or every 10 seconds).
Here is a stack trace I could find (screenshot below).
The data is our own metric data: simple little JSON payloads our different services produce, on the order of 10 fields/tags, a name, a timestamp, and a value.

[screenshot: stack trace]
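For context, the Go client ultimately POSTs each batch as newline-separated line protocol to the /write endpoint; a rough curl equivalent of a single-point batch (using one of the sample points shared later in this thread, with a placeholder password variable):

curl -sS -k -XPOST "https://localhost:8086/write?db=metrics&precision=ns" \
  -u "netlify:$INFLUX_PASSWORD" \
  --data-binary 'doppler.batches_inflight,hostname=ryan-mbp.local value=1i 1484256497078568400'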

Thanks for all your help @jwilder!

mark-rushakoff (Contributor) commented:

There are a number of query-related goroutines in that stack trace that are waiting for about a minute. If you can check the server logs around a minute before the OOM crash, you may be able to find the query or queries that triggered the OOM. Common "bad" queries may have a GROUP BY * or an aggregate function across a large time range or across a large number of series. You can try setting some of the limits in the coordinator section of the config to prevent out-of-control queries in some cases.

rybit (Author) commented Jan 10, 2017

Awesome - I'll dig through those for sure

jwilder (Contributor) commented Jan 10, 2017

@rybit Also, another thing you could try is setting:

[http]
max-connection-limit = 10

in your config, to help rule out that write connections are leaking.

rybit (Author) commented Jan 10, 2017

OK - there should only be 1 service writing, and I can take a look at how many in-flight batches it has around this time... I think I can, at least :)

rybit (Author) commented Jan 11, 2017

I disabled a query I suspected of being a bad actor, and put some limits in place:

[coordinator]
log-queries-after = "10s"
max-concurrent-queries = 10
query-timeout = "30s"

One thing I know for sure is that I have a single measurement with a really high cardinality. It also generates the most points in the system and is queried a lot too.

Is there any documentation on what each of the measurements in the _internal database is? Most are self-explanatory, but not all.
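For what it's worth, the measurement names (and their field keys) can be listed straight from the _internal database itself; a quick sketch with the stock CLI:

influx -database _internal -execute 'SHOW MEASUREMENTS'
influx -database _internal -execute 'SHOW FIELD KEYS FROM "httpd"'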

mark-rushakoff (Contributor) commented:

There's an open docs PR documenting the internal metrics: influxdata/docs.influxdata.com-ARCHIVE#777

We're still investigating what may be causing the memory leak.

rybit (Author) commented Jan 11, 2017

I am still digging in a bit too. Let me know if there is some more info I can provide.

mark-rushakoff (Contributor) commented:

@rybit can you run this short script to gather multiple goroutine and heap profiles, one per hour for 6 hours? The single profiles from earlier didn't have any smoking guns, but these profiles will be more likely to show us what's leaking over time.

It's important that these profiles are all from a single run of influxd. If the server does happen to crash while this script is running, and we have at least 3 profiles, that should be enough to work with. I picked 6 arbitrarily for a decent sample size.

for i in $(seq 6); do
  [ "$i" != "1" ] && sleep 3600
  curl -o "goroutine-$(date +%s).txt" "http://localhost:8086/debug/pprof/goroutine?debug=1" 
  curl -o "heap-$(date +%s).txt" "http://localhost:8086/debug/pprof/heap?debug=1" 
done

rybit (Author) commented Jan 12, 2017

will do!

rybit (Author) commented Jan 12, 2017

OK, samples run - the DB went on to crash around 08:00 this morning, but these samples show some growth. Here is the gist with the files. It seems that really long files make gist unhappy :)

[chart: memory usage during the sampling window]

jwilder (Contributor) commented Jan 12, 2017

@rybit Are you able to share some sample write payloads? We're trying to reproduce this, but have not been successful.

rybit (Author) commented Jan 12, 2017

@jwilder yeah I can do that - lemme snag a few.

rybit (Author) commented Jan 12, 2017

I pulled a little sample onto my local box. I can't get all the traffic we have (I am still going to try for more...), but the data looks like this:

trafficserver.request_count,content_type=text/html,domain=http://site1.netlify.com,hostname=localhost,method=GET,result=TCP_MEM_HIT,version=0.0.1 size=123,status=200,testing=true,timing=0,value=1i 1483142459000000000
trafficserver.request_count,content_type=text/html,domain=http://site2.netlify.com,hostname=localhost,method=GET,result=TCP_MEM_HIT,version=0.0.1 size=34,status=200,testing=true,timing=0,value=1i 1483142460000000000
trafficserver.request_count,content_type=text/html,domain=http://site3.netlify.com,hostname=localhost,method=GET,result=TCP_MEM_HIT,version=0.0.1 size=121254,status=200,testing=true,timing=0,value=1i 1483142460000000000
trafficserver.request_count,content_type=-,domain=http://site4.com,hostname=localhost,method=GET,result=TCP_IMS_HIT,version=0.0.1 size=23423,status=304,testing=true,timing=6,value=1i 1483142461000000000
trafficserver.request_count,content_type=text/html,domain=http://site1.netlify.com,hostname=localhost,method=GET,result=TCP_MEM_HIT,version=0.0.1 size=13124323,status=200,testing=true,timing=0,value=1i 1483142459000000000
trafficserver.request_count,content_type=application/json,domain=https://site33.com,hostname=localhost,method=GET,result=TCP_MISS,version=0.0.1 size=1230812,status=200,testing=true,timing=283,value=1i 1483142460000000000
doppler.batches_inflight,hostname=ryan-mbp.local value=1i 1484256497078568400

We have about 40 measurements, and one of them, trafficserver.request_count, has a huge cardinality (I think): about 483371.

$ influx -database metrics -password $pass -username netlify -ssl --unsafeSsl -execute 'show series from "trafficserver.request_count"' > ats-request-count-series

I think a problem is that we have the size field and the domain tag and those rarely line up (though they could). I am going to split that into 2 measurements to try and bring the cardinality down.
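For reference: only tags (not fields) contribute to series cardinality, so it can help to check which tag key has the most distinct values. A rough sketch using the same CLI flags as above (tag keys taken from the sample points):

for key in content_type domain hostname method result version; do
  echo -n "$key: "
  influx -database metrics -ssl -unsafeSsl -username netlify -password "$pass" \
    -execute "SHOW TAG VALUES FROM \"trafficserver.request_count\" WITH KEY = \"$key\"" | wc -l
done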

Another point: it seems that if I start querying this measurement, memory consumption shoots up.

> select count(*) from "trafficserver.request_count"
name: trafficserver.request_count
time	count_size	count_status	count_timing	count_value
----	----------	------------	------------	-----------
0	3213		3213		2646		2646

[chart: memory spike while running the count query]

I am not too sure how to get the number of points corresponding to that measurement, but it is by far going to be the largest we have (probably by 2 orders of magnitude, I'd expect). But here is what the internal DB says as of now:

[chart: numSeries and numMeasurements from the _internal database]

rybit (Author) commented Jan 12, 2017

And for perspective on scale, here are measurements from the sender to InfluxDB. It seems to be keeping up very well:

[chart: sender-side write throughput]

mark-rushakoff added a commit that referenced this issue Jan 13, 2017
This leak seems to have been introduced in 8aa224b,
present in 1.1.0 and 1.1.1.

When points were parsed from HTTP payloads, their tags and fields
referred to subslices of the request body; if any tag set introduced a
new series, then those tags were stored in the in-memory series
index objects, preventing the HTTP body from being garbage collected. If
there were no new series in the payload, then the request body would be
garbage collected as usual.

Now, we clone the tags before we store them in the index. This is an
imperfect fix because the Point still holds references to the original
tags, and the Point's field iterator also refers to the payload buffer.
However, the current write code path does not retain references to the
Point or its fields; and this change will likely be obsoleted when TSI
is introduced.

This change likely fixes #7827, #7810, #7778, and perhaps others.
jwilder added this to the 1.2.0 milestone Jan 23, 2017
jwilder (Contributor) commented Jan 23, 2017

Fixed by #7832

jwilder closed this as completed Jan 23, 2017
gunnaraasen pushed a commit that referenced this issue Feb 16, 2017