
memory leak crashing influxdb #7810

Closed
rybit opened this issue Jan 10, 2017 · 20 comments

rybit commented Jan 10, 2017

I am seeing InfluxDB crash every day or two because it consumes all the memory on the box. I moved it from a 4GB to a 16GB box, and that just pushed out how long it can go. Below is the chart. I find it odd that Alloc and HeapAlloc are identical, but I looked here and it seems right. Still, the growth and eventual OOM is a problem.

[chart: memory usage climbing until the OOM]

name: database
tags: database=metrics
numMeasurements	numSeries
---------------	---------
37		373362
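For reference, per-database counts like these come from SHOW STATS, and (assuming the default _internal monitoring database is enabled) their history is also recorded there; a rough sketch with the stock influx CLI:

# Current in-memory per-database stats (the "database" section carries numMeasurements/numSeries)
influx -execute 'SHOW STATS' | grep -A 4 'name: database'
# Historical values, assuming the default "monitor" retention policy in _internal
influx -database _internal -execute 'SELECT last(numSeries), last(numMeasurements) FROM "monitor"."database" GROUP BY "database"'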

InfluxDB version: InfluxDB v1.1.1 (git: master e47cf1f)
Linux version: Ubuntu 14.04

[general]
bind-address = ":8088"

[meta]
dir = "/mnt/db/meta"

[data]
dir = "/mnt/db/data"
wal-dir = "/mnt/influx"

[admin]
enabled = true
port = ":8083"
https-enabled = true
https-certificate = "/etc/cfssl/certs/combined.pem"

[hinted-handoff]
dir = "/mnt/db/hh"

[http]
bind-address = ":8086"
https-enabled = true
https-certificate = "/etc/cfssl/certs/combined.pem"
auth-enabled = true
mark-rushakoff (Contributor) commented:

Many (most?) users are not experiencing this problem. If there's a memory leak, we want to fix it, but we need more detail to track down the source of the issue.

Please collect the following diagnostic information that was requested in the issue template. It will be most helpful if you gather this multiple times, each an hour or two apart, as you observe the memory growing.

influx -execute "show shards" > "shards-$(date +%s).txt"
influx -execute "show stats" > "stats-$(date +%s).txt"
influx -execute "show diagnostics" > "diagnostics-$(date +%s).txt"
curl -o "goroutine-$(date +%s).txt" "http://localhost:8086/debug/pprof/goroutine?debug=1" 
curl -o "heap-$(date +%s).txt" "http://localhost:8086/debug/pprof/heap?debug=1" 

rybit (Author) commented Jan 10, 2017

Yeah, I thought the behaviour was very strange. As requested, here is the gist with the results.

Is there any way for me to see the number of series per measurement? I could then dig into whether one of them is acting up (like having an always-increasing field/tag).

rybit (Author) commented Jan 10, 2017

I looked at the series cardinality, and most of our metrics are around a few hundred series, but we have 2 that are pretty large.

  51943 name: kpis.site_count
 270534 name: trafficserver.request_count

The largest one I might split up into 2 measurements. It currently has a size field and a domain tag, which commonly don't match up with any other points. That would significantly decrease the cardinality but add a whole other measurement. I am not sure if that is super relevant?
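For reference, per-measurement counts like these can be approximated with the stock influx CLI (add the -ssl/-unsafeSsl/-username/-password flags used later in this thread as needed); SHOW SERIES prints one series key per line, so a line count is roughly the cardinality:

influx -database metrics -execute 'SHOW MEASUREMENTS'
influx -database metrics -execute 'SHOW SERIES FROM "kpis.site_count"' | wc -l
influx -database metrics -execute 'SHOW SERIES FROM "trafficserver.request_count"' | wc -l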

jwilder (Contributor) commented Jan 10, 2017

Alloc and HeapAlloc should be the same according to the go docs. Can you graph HeapInUse and HeapSys?
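Assuming the default _internal monitoring database and its "monitor" retention policy are enabled, those fields can be pulled with something along these lines (the field names follow Go's MemStats, so verify them with SHOW FIELD KEYS if the query comes back empty):

influx -database _internal -execute 'SELECT HeapInUse, HeapSys FROM "monitor"."runtime" WHERE time > now() - 6h'
influx -database _internal -execute 'SHOW FIELD KEYS FROM "runtime"'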

Is the process panicking with a memory allocation error, or is it being killed by the OS? Do you have a stack trace you could attach?

Also, how are you writing data? Telegraf? The Go client? How big are your batches, and what kind of data are you writing? The heap profile looks like there might be a connection leak from the client side, but the goroutine profile attached doesn't correlate with that idea. Would you be able to get a goroutine profile when the heap is high?

rybit (Author) commented Jan 10, 2017

We are using the Go client.
The batches are 3000 points (or every 10 seconds).
Here is a stack trace I could find (screenshot below).
The data is our own metric data: simple little JSON payloads our different services produce, on the order of 10 fields/tags, a name, a timestamp, and a value.

[screenshot: stack trace]
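For context, the Go client ultimately POSTs each batch as newline-separated line protocol to the /write endpoint; a rough curl equivalent of a single-point batch (using one of the sample points shared later in this thread, with a placeholder password variable):

curl -sS -k -XPOST "https://localhost:8086/write?db=metrics&precision=ns" \
  -u "netlify:$INFLUX_PASSWORD" \
  --data-binary 'doppler.batches_inflight,hostname=ryan-mbp.local value=1i 1484256497078568400'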

Thanks for all your help @jwilder!

mark-rushakoff (Contributor) commented:

There are a number of query-related goroutines in that stack trace that are waiting for about a minute. If you can check the server logs around a minute before the OOM crash, you may be able to find the query or queries that triggered the OOM. Common "bad" queries may have a GROUP BY * or an aggregate function across a large time range or across a large number of series. You can try setting some of the limits in the coordinator section of the config to prevent out-of-control queries in some cases.

rybit (Author) commented Jan 10, 2017

Awesome - I'll dig through those for sure

jwilder (Contributor) commented Jan 10, 2017

@rybit Also, another thing you could try is setting:

[http]
max-connection-limit = 10

in your config, to help rule out that write connections are leaking.

rybit (Author) commented Jan 10, 2017

OK - there should only be 1 service writing, and I can take a look at how many in-flight batches it has around this time... I think I can, at least :)

rybit (Author) commented Jan 11, 2017

I disabled a query I suspected of being a bad actor, and put some limits in place:

[coordinator]
log-queries-after = "10s"
max-concurrent-queries = 10
query-timeout = "30s"

One thing I know for sure is that I have a single measurement with a really high cardinality. It also generates the most points in the system and is queried a lot too.

Is there any documentation on what each of the measurements in the _internal database is? Most are self-explanatory, but not all.
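For what it's worth, the measurement names (and their field keys) can be listed straight from the _internal database itself; a quick sketch with the stock CLI:

influx -database _internal -execute 'SHOW MEASUREMENTS'
influx -database _internal -execute 'SHOW FIELD KEYS FROM "httpd"'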

mark-rushakoff (Contributor) commented:

There's an open docs PR documenting the internal metrics: influxdata/docs.influxdata.com-ARCHIVE#777

We're still investigating what may be causing the memory leak.

rybit (Author) commented Jan 11, 2017

I am still digging in a bit too. Let me know if there is some more info I can provide.

mark-rushakoff (Contributor) commented:

@rybit can you run this short script to gather multiple goroutine and heap profiles, one per hour for 6 hours? The single profiles from earlier didn't have any smoking guns, but these profiles will be more likely to show us what's leaking over time.

It's important that these profiles are all from a single run of influxd. If the server does happen to crash while this script is running, and we have at least 3 profiles, that should be enough to work with. I picked 6 arbitrarily for a decent sample size.

for i in $(seq 6); do
  [ "$i" != "1" ] && sleep 3600
  curl -o "goroutine-$(date +%s).txt" "http://localhost:8086/debug/pprof/goroutine?debug=1" 
  curl -o "heap-$(date +%s).txt" "http://localhost:8086/debug/pprof/heap?debug=1" 
done

rybit (Author) commented Jan 12, 2017

will do!

rybit (Author) commented Jan 12, 2017

OK, samples run - the DB went on to crash around 08:00 this morning, but these samples show some growth. Here is the gist with the files. It seems that really long files make gist unhappy :)

[chart: memory usage during the sampling window]

jwilder (Contributor) commented Jan 12, 2017

@rybit Are you able to share some sample write payloads? We're trying to reproduce this, but have not been successful.

rybit (Author) commented Jan 12, 2017

@jwilder yeah I can do that - lemme snag a few.

rybit (Author) commented Jan 12, 2017

I pulled a little sample onto my local box. I can't get all the traffic we have (I am still going to try for more...), but the data looks like this:

trafficserver.request_count,content_type=text/html,domain=http://site1.netlify.com,hostname=localhost,method=GET,result=TCP_MEM_HIT,version=0.0.1 size=123,status=200,testing=true,timing=0,value=1i 1483142459000000000
trafficserver.request_count,content_type=text/html,domain=http://site2.netlify.com,hostname=localhost,method=GET,result=TCP_MEM_HIT,version=0.0.1 size=34,status=200,testing=true,timing=0,value=1i 1483142460000000000
trafficserver.request_count,content_type=text/html,domain=http://site3.netlify.com,hostname=localhost,method=GET,result=TCP_MEM_HIT,version=0.0.1 size=121254,status=200,testing=true,timing=0,value=1i 1483142460000000000
trafficserver.request_count,content_type=-,domain=http://site4.com,hostname=localhost,method=GET,result=TCP_IMS_HIT,version=0.0.1 size=23423,status=304,testing=true,timing=6,value=1i 1483142461000000000
trafficserver.request_count,content_type=text/html,domain=http://site1.netlify.com,hostname=localhost,method=GET,result=TCP_MEM_HIT,version=0.0.1 size=13124323,status=200,testing=true,timing=0,value=1i 1483142459000000000
trafficserver.request_count,content_type=application/json,domain=https://site33.com,hostname=localhost,method=GET,result=TCP_MISS,version=0.0.1 size=1230812,status=200,testing=true,timing=283,value=1i 1483142460000000000
doppler.batches_inflight,hostname=ryan-mbp.local value=1i 1484256497078568400

We have about 40 measurements, and one of them, trafficserver.request_count, has a huge cardinality (I think): about 483371.

$ influx -database metrics -password $pass -username netlify -ssl --unsafeSsl -execute 'show series from "trafficserver.request_count"' > ats-request-count-series

I think a problem is that we have the size field and the domain tag and those rarely line up (though they could). I am going to split that into 2 measurements to try and bring the cardinality down.
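For reference: only tags (not fields) contribute to series cardinality, so it can help to check which tag key has the most distinct values. A rough sketch using the same CLI flags as above (tag keys taken from the sample points):

for key in content_type domain hostname method result version; do
  echo -n "$key: "
  influx -database metrics -ssl -unsafeSsl -username netlify -password "$pass" \
    -execute "SHOW TAG VALUES FROM \"trafficserver.request_count\" WITH KEY = \"$key\"" | wc -l
done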

Another point: it seems that if I start querying this measurement, memory consumption shoots up.

> select count(*) from "trafficserver.request_count"
name: trafficserver.request_count
time	count_size	count_status	count_timing	count_value
----	----------	------------	------------	-----------
0	3213		3213		2646		2646

[chart: memory spike while running the count query]

I am not too sure how to get the number of points corresponding to that measurement, but it is by far going to be the largest we have (probably by 2 orders of magnitude, I'd expect). But here is what the internal DB says as of now:

[chart: numSeries and numMeasurements from the _internal database]

rybit (Author) commented Jan 12, 2017

And for perspective on scale, here are measurements from the sender to InfluxDB. It seems to be keeping up very well:

[chart: sender-side write throughput]

mark-rushakoff added a commit that referenced this issue Jan 13, 2017
This leak seems to have been introduced in 8aa224b,
present in 1.1.0 and 1.1.1.

When points were parsed from HTTP payloads, their tags and fields
referred to subslices of the request body; if any tag set introduced a
new series, then those tags were stored in the in-memory series
index objects, preventing the HTTP body from being garbage collected. If
there were no new series in the payload, then the request body would be
garbage collected as usual.

Now, we clone the tags before we store them in the index. This is an
imperfect fix because the Point still holds references to the original
tags, and the Point's field iterator also refers to the payload buffer.
However, the current write code path does not retain references to the
Point or its fields; and this change will likely be obsoleted when TSI
is introduced.

This change likely fixes #7827, #7810, #7778, and perhaps others.
jwilder added this to the 1.2.0 milestone Jan 23, 2017
jwilder (Contributor) commented Jan 23, 2017

Fixed by #7832

jwilder closed this as completed Jan 23, 2017
gunnaraasen pushed a commit that referenced this issue Feb 16, 2017