memory leak crashing influxdb #7810
Many (most?) users are not experiencing this problem. If there's a memory leak, we want to fix it, but we need more detail to track down the source of the issue. Please collect the following diagnostic information that was requested in the issue template. It will be most helpful if you gather this multiple times, each an hour or two apart, as you observe the memory growing.
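For reference, the kind of diagnostic output the template typically asks for can be pulled from a running instance with the influx CLI and the built-in pprof endpoints; the commands below assume the default bind address of localhost:8086 and may need adjusting for your setup.
# Server-reported diagnostics and runtime stats (repeat every hour or two)
influx -execute 'SHOW DIAGNOSTICS'
influx -execute 'SHOW STATS'
# Text-format heap and goroutine profiles from the pprof handler
curl -o heap.txt "http://localhost:8086/debug/pprof/heap?debug=1"
curl -o goroutine.txt "http://localhost:8086/debug/pprof/goroutine?debug=1"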
Yeah, I thought the behaviour was very strange. As requested, here is the gist with the results. Is there any way for me to see the number of series per measurement? I could then dig into whether one of them is acting up (like an always-increasing field/tag).
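(One rough way to check this, assuming the influx CLI can reach the server - the database and measurement names below are placeholders - is to count the lines of SHOW SERIES output per measurement.)
# Approximate series count for a single measurement; the header line inflates the count slightly.
influx -database mydb -execute 'SHOW SERIES FROM my_measurement' | wc -l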
I looked at the series cardinality and I think most of our metrics are around a few hundred series, but we have 2 that are pretty large.
The largest one I might split into 2 measurements. It currently has a size and a domain field, which commonly don't match up with any other points. That would significantly decrease the cardinality but add a whole other measurement. I am not sure if that is super relevant?
Is the process panicking with a memory allocation error, or is it being killed by the OS? Do you have a stack trace you could attach? Also, how are you writing data? telegraf? Go client? How big are your batches and what kind of data are you writing? The heap profile looks like there might be a connection leak from the client side, but the goroutine profile attached doesn't correlate with that idea. Would you be able to get a goroutine profile when the heap is high?
We are using the Go client. Thanks for all your help @jwilder!
There are a number of query-related goroutines in that stack trace that are waiting for about a minute. If you can check the server logs around a minute before the OOM crash, you may be able to find the query or queries that triggered the OOM. Common "bad" queries may have a
Awesome - I'll dig through those for sure
@rybit Also, another thing you could try is setting:
in your config to help rule out that there are write connections leaking? |
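(The actual setting got lost from this copy of the thread. One plausible candidate, shown purely as an illustration, is the connection limit in the [http] section - verify the option name against the config reference for your release before relying on it.)
[http]
  # Illustrative only: cap concurrent HTTP connections so a client-side
  # connection leak fails fast instead of piling up; 0 means unlimited.
  max-connection-limit = 500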
ok - there should only be 1 service writing, and I can take a look at how many in-flight batches it has around this time... I think I can, at least :)
I disabled a query I suspected was being a bad actor, and put some limits in place
One thing I know for sure is that I have a single measurement with a really high cardinality. It also generates the most points in the system and is queried a lot too. Is there any documentation on what each of the measurements in the _internal database means?
There's an open docs PR documenting the internal metrics: influxdata/docs.influxdata.com-ARCHIVE#777. We're still investigating what may be causing the memory leak.
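(In the meantime, the internal metrics can be browsed directly with the influx CLI, for example:)
# List the internal measurements, then look at recent Go runtime stats (Alloc, HeapAlloc, etc.).
influx -database _internal -execute 'SHOW MEASUREMENTS'
influx -database _internal -execute 'SELECT * FROM "runtime" WHERE time > now() - 5m'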
I am still digging in a bit too. Let me know if there is some more info I can provide.
@rybit can you run this short script to gather multiple goroutine and heap profiles, one per hour for 6 hours? The single profiles from earlier didn't have any smoking guns, but these profiles will be more likely to show us what's leaking over time. It's important that these profiles are all from a single run of influxd.
for i in $(seq 6); do
[ "$i" != "1" ] && sleep 3600
curl -o "goroutine-$(date +%s).txt" "http://localhost:8086/debug/pprof/goroutine?debug=1"
curl -o "heap-$(date +%s).txt" "http://localhost:8086/debug/pprof/heap?debug=1"
done
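(Side note: the debug=1 files above are plain text and can be diffed between samples by eye; for an interactive view, go tool pprof can fetch a profile straight from the same handler - this requires a Go toolchain on the machine running the command.)
go tool pprof http://localhost:8086/debug/pprof/heap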
will do! |
OK, samples run - the DB went on to crash around 08:00 this morning, but these samples show some growth. Here is the gist with the files. It seems that long files make gist unhappy :)
@rybit Are you able to share some sample write payloads? We're trying to reproduce this, but have not been successful.
@jwilder yeah I can do that - lemme snag a few.
I pulled a little sample onto my local box. I can't get all the traffic we have (I am still going to try for more...) but the data looks like this
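(The pasted sample did not survive in this copy of the thread. The lines below are purely illustrative line protocol with made-up measurement, tag, and field names, just to show the shape of the writes being discussed - notably a size and a domain value on each point.)
requests,host=web01,region=us-east status=200i,size=5123i,domain="example.com",duration=0.042 1480000000000000000
requests,host=web02,region=us-east status=404i,size=312i,domain="example.org",duration=0.003 1480000000001000000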
we have about 40 measurements, and the one
I think a problem is that we have the
Another point - it seems that if I start asking about this measurement, it shoots the mem consumption up.
I am not too sure how to get the number of points corresponding to that measurement, but it is by far going to be the largest we have (prob 2 orders of magnitude larger, I'd expect). But here is what the internal db says as of now. ^^ those are numSeries and numMeasurements respectively
This leak seems to have been introduced in 8aa224b, present in 1.1.0 and 1.1.1. When points were parsed from HTTP payloads, their tags and fields referred to subslices of the request body; if any tag set introduced a new series, those tags were then stored in the in-memory series index objects, preventing the HTTP body from being garbage collected. If there were no new series in the payload, then the request body would be garbage collected as usual. Now, we clone the tags before we store them in the index. This is an imperfect fix because the Point still holds references to the original tags, and the Point's field iterator also refers to the payload buffer. However, the current write code path does not retain references to the Point or its fields; and this change will likely be obsoleted when TSI is introduced. This change likely fixes #7827, #7810, #7778, and perhaps others.
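A minimal sketch of the mechanism described above (this is not the actual influxdb code; all names are made up): a tag kept as a subslice pins the request body's whole backing array, while copying the tag bytes lets the body be collected.
package main

// seriesIndex stands in for the long-lived in-memory series index.
var seriesIndex = map[string][]byte{}

// indexTagLeaky stores the tag value as a subslice of the request body.
// The slice header still points into body's backing array, so the whole
// payload stays reachable for as long as the index entry does.
func indexTagLeaky(seriesKey string, body []byte, start, end int) {
	seriesIndex[seriesKey] = body[start:end]
}

// indexTagCloned copies the tag bytes into a fresh allocation, so the
// request body can be garbage collected once the write finishes.
func indexTagCloned(seriesKey string, body []byte, start, end int) {
	tag := make([]byte, end-start)
	copy(tag, body[start:end])
	seriesIndex[seriesKey] = tag
}

func main() {
	body := []byte("cpu,host=server01 value=0.5") // stand-in for a multi-MB HTTP payload
	indexTagLeaky("series-A", body, 9, 17)  // index entry pins the entire payload
	indexTagCloned("series-B", body, 9, 17) // index entry pins only the 8 copied bytes
}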
Fixed by #7832
I am seeing influxdb crash every day or two because it consumes all the memory on the box. I bumped it from a 4GB box to a 16GB one, and that just pushed out how long it can go before crashing. Below is the chart. I find it odd that the Alloc and HeapAlloc are identical, but I looked here and it seems right. Still, the growth and eventual OOM is a problem.
influxdb version: InfluxDB v1.1.1 (git: master e47cf1f)
linux version: ubuntu 14.04
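On the Alloc/HeapAlloc observation in the description: in Go's runtime.MemStats those two fields report the same quantity - bytes of live heap objects - so identical lines on the chart are expected. A trivial check:
package main

import (
	"fmt"
	"runtime"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	// Alloc and HeapAlloc are documented as the same value:
	// bytes of allocated (live) heap objects.
	fmt.Println(m.Alloc == m.HeapAlloc) // true
	fmt.Println(m.Alloc, m.HeapAlloc)
}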