
[0.10.0] Unbounded memory usage when inserting data #5685

Closed
SaganBolliger opened this issue Feb 14, 2016 · 21 comments

Comments

@SaganBolliger

I recently upgraded from 0.9.0 to 0.10.0 and have since been running into memory issues that I'm guessing are related to the new TSM engine. My dataset consists of 6 series, each with 4 fields and about 75M records per series; I'm not using tags. I'm inserting this data into InfluxDB using the Python Pandas client, in chronological order, in chunks of approximately 4k records at a rate of about 6 chunks per second (one chunk per series per second). When I did this with 0.9.0 the memory usage stayed moderate, never going much over 2 GB, but with 0.10.0 memory usage seems to increase linearly with the amount of data inserted. After around a third of the data, memory usage hits 15 GB, at which point I'm forced to kill the process. I've seen this same issue on OS X El Capitan and Ubuntu 14.04.

@jonseymour
Contributor

What kind of time period is covered by the timestamps in the total data set? If the timestamps cover a wide period of time (say several weeks, months or years), then InfluxDB will be creating multiple caches and WAL logs, one for each shard that covers the time periods of interest, so this might be causing the memory requirements to be larger than you might otherwise expect.

If that is the case, you can try to reduce the size of each cache (use the cache-max-memory-size parameter inside the [data] section of the configuration), which will encourage the WAL to start compacting earlier. This should reduce the total memory consumed by influx during your initial load.
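For reference, a minimal sketch of what that change might look like in influxdb.conf, assuming the integer-byte form these settings take in 0.10-era configs; the 100 MB figure is an illustrative assumption, not a value recommended in this thread:

```toml
[data]
  # Per-shard cache size limit referred to above. Lowering it during a bulk
  # backfill keeps each shard's in-memory cache smaller.
  # 104857600 bytes = 100 MB (illustrative value only).
  cache-max-memory-size = 104857600
```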

If you then switch to consuming data in real time, you can revert to the default configuration, which might be better optimised for that workload.

@julienmathevet

I have the same problem. InfluxDB used to run in 2 GB; now I need 16 GB, and I am only writing data. If I give it less than 16 GB I get HTTP 500 responses, and even restarting InfluxDB doesn't fix it. The hardware recommendations on your website seem wrong for 0.10.

I think there is something wrong with 0.10.

I will try your recommendation tomorrow, but I would like to know whether this is a bug or expected behaviour. The difference is huge.

@jonseymour
Contributor

I don't speak for influx, and my comments may only apply to the hypothetical case where a large amount of relatively sparsely distributed historical data is being bulk loaded; they don't apply to cases of memory issues where data is being loaded in real time or in cases where the historical data is dense enough to ensure that most writes are concentrated in a handful of active shards.

@julienmathevet

OK, I am only writing continuous data: between 2,000 and 5,000 series per second spread across 20 measurements, with only 3 tags. I write in batches of 1,000; if I set the batch size higher, performance drops and memory consumption goes up.

@jonseymour
Contributor

How big and how many .wal files are in the database directory?

@SaganBolliger
Author

Thanks for the suggestion @jonseymour. The data I'm loading covers a period of about 3 years, so it definitely qualifies. I managed to finish loading the data before I saw your comment by periodically restarting influxdb whenever the memory usage grew too large, so I didn't have a chance to see if the cache-max-memory-size parameter made a difference.

If cache-max-memory-size does fix the issue, should this perhaps be better documented? I imagine loading historical data at non-real-time rates is a fairly common use case. Or else having the default value set more conservatively with instructions to increase the limit in performance-critical situations?

@julienmathevet

There are 17 WAL files totalling around 120 MB; the biggest is about 6 MB.
The database is about 10 GB after 3 days of running.

@jonseymour
Contributor

@SaganBolliger Personally, I think there is a case for dividing a total cache budget across the active shards, so that the system can dynamically adjust to different load scenarios.

@jonseymour
Contributor

@easyrasta based on this it seems unlikely that your memory issues are related to the caches of a large number of active shards. It might be worth raising a separate issue detailing your case so that you can solicit feedback about your particular problem from the influx support team.

@julienmathevet

Thanks @jonseymour, I've done so.

@jwilder
Contributor

jwilder commented Feb 14, 2016

@SaganBolliger Another setting you could adjust is cache-snapshot-write-cold-duration. If you are backfilling data ordered by time, setting it lower should force the older shards' caches to snapshot more aggressively and free up memory. It defaults to 1h, so you could try 1m perhaps. The value is how long a shard can go without accepting a write before we consider it cold for writes and snapshot its cache to disk.
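A sketch of that tweak in the [data] section of influxdb.conf; the "1m" value is just the example given above:

```toml
[data]
  # How long a shard can go without accepting a write before it is considered
  # cold and its cache is snapshotted to disk. The default is "1h"; "1m" makes
  # backfilled (older) shards release their cache memory much sooner.
  cache-snapshot-write-cold-duration = "1m"
```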

@jonseymour
Contributor

@SaganBolliger The size parameter that increases flushing frequency when its value is decreased is actually cache-snapshot-memory-size, not cache-max-memory-size as I stated earlier. But, of course, @jwilder's suggestion to use cache-snapshot-write-cold-duration is a better bet anyway.
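For completeness, a sketch contrasting the two size settings in the [data] section; both numbers are illustrative assumptions rather than values taken from this thread:

```toml
[data]
  # Snapshot threshold: once a shard's cache grows past this size it is
  # snapshotted to a TSM file on disk, freeing memory, so lowering it causes
  # more frequent flushes during a bulk load.
  cache-snapshot-memory-size = 10485760    # 10 MB, illustrative

  # Hard upper bound on a single shard's cache, distinct from the snapshot
  # threshold above.
  cache-max-memory-size = 524288000        # 500 MB, illustrative
```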

@julienmathevet

It would be great if someone wrote a blog post on this kind of tuning. Isn't 1h too high a value for the default?

@francisdb

It would be great if Influx did not need this kind of tuning; we did not have this problem on 0.9.x.

@jwilder
Contributor

jwilder commented Feb 17, 2016

It's quite possible the value is too high and the default should be lower. If you are able to test with lower values that would be useful data to share.

@jonseymour
Contributor

@francisdb I am sure this is on the Influx roadmap somewhere; there comes a time when you just have to ship. I also have some ideas about how this might be done.

First things first, though. It would be really useful, I think, if the shard stats that are published to the _internal database were extended with the following metrics:

  • cacheSize - the current size of the in-memory cache
  • cacheAge - the time since the last snapshot
  • snapShotSize - the current total size of the snapshots
  • snapShots - the current number of snapshots

This would make it much easier to reason about where the big memory usage is and whether there are any problems in the compaction path (for example, snapShots > 1). It would also help confirm issues such as those caused by backfilling, as hypothesised in this issue.

Having such stats would also allow the before and after benefits of any later change aimed at dynamically optimising caching behaviour to be measured quantitatively.

I am quite keen to have a crack at adding such stats. @jwilder are you happy for me to propose changes in this area? If influxdata staff are already working on such changes let me know and I'll find some other things to do :-)

@jwilder
Contributor

jwilder commented Feb 18, 2016

@jonseymour #5499 is still open. Any stats and diagnostics would be useful and no one is working on that currently.

@clongbottom

clongbottom commented May 2, 2016

Just hit this myself, with about 640 series. Even if I massively delay my writes (1,000 per block with a 10 s delay between writes), I run out of memory on a VM with 16 GB of RAM. I have about 3 years of data for 10 devices that I'm inserting into one measurement.

I'm just about to try messing with the other config options, but lowering cache-snapshot-write-cold-duration to "1m" didn't help.

@francisdb

We have seen this when uploading data that caused a lot of new shards to be created, e.g. you have data for January 2016 to April 2016 and then do a batch upload of sparse historical data going all the way back to January 2010.
See #6635.

@joshughes

Also hit this when importing 2 years of data; I was only able to work around it by restarting the database whenever the memory limit was hit.

@mark-rushakoff
Contributor

Closing this because the write path has been changed significantly since 0.10. Please open a new issue if you're running into problems on a current release.

@joshughes if you had to restart v1.1 after writing many new series you likely ran into #7832, fixed in 1.2rc1 but not yet backported to the 1.1 release.
