
[0.10.0] Unbounded memory usage when inserting data #5685

Closed
SaganBolliger opened this issue Feb 14, 2016 · 21 comments

Comments

@SaganBolliger

I recently upgraded from 0.9.0 to 0.10.0 and have since been running into memory issues that I'm guessing are related to the new TSM engine. My dataset consists of 6 series, each with 4 fields and about 75M records per series; I'm not using tags. I'm inserting this data into InfluxDB using the Python Pandas client, in chronological order, in chunks of approximately 4k records at a rate of about 6 chunks per second (one chunk per series per second). When I did this with 0.9.0 the memory usage stayed moderate, never going much over 2 GB, but with 0.10.0 memory usage seems to increase linearly with the amount of data inserted. After around a third of the data, memory usage hits 15 GB, at which point I'm forced to kill the process. I've seen this same issue on OS X El Capitan and Ubuntu 14.04.

@jonseymour
Contributor

What kind of time period is covered by the timestamps in the total data set? If the timestamps cover a wide period of time (say several weeks, months or years), then InfluxDB will be creating multiple caches and WAL logs, one for each shard that covers the time periods of interest, so this might be causing the memory requirements to be larger than you might otherwise expect.

If that is the case, you can try to reduce the size of each cache (use the cache-max-memory-size parameter inside the [data] section of the configuration), which will encourage the WAL to start compacting earlier. This should reduce the total memory consumed by influx during your initial load.
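For reference, a minimal sketch of what that change might look like in influxdb.conf, assuming the integer-byte form these settings take in 0.10-era configs; the 100 MB figure is an illustrative assumption, not a value recommended in this thread:

```toml
[data]
  # Per-shard cache size limit referred to above. Lowering it during a bulk
  # backfill keeps each shard's in-memory cache smaller.
  # 104857600 bytes = 100 MB (illustrative value only).
  cache-max-memory-size = 104857600
```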

If you then switch to consuming data in real time, you can revert to the default configuration, which might be better optimised for that workload.

@julienmathevet

I have the same problem. InfluxDB used to run in 2 GB; now I need 16 GB, and I am only writing data. If I give it less than 16 GB I get HTTP 500 responses, and even restarting InfluxDB doesn't fix it. The hardware recommendations on your website seem wrong for 0.10.

I think there is something wrong with 0.10.

I will try your recommendation tomorrow, but I would like to know whether this is a bug or expected behaviour. The difference is huge.

@jonseymour
Contributor

I don't speak for influx, and my comments may only apply to the hypothetical case where a large amount of relatively sparsely distributed historical data is being bulk loaded; they don't apply to cases of memory issues where data is being loaded in real time or in cases where the historical data is dense enough to ensure that most writes are concentrated in a handful of active shards.

@julienmathevet

OK, I am only writing continuous data: between 2,000 and 5,000 series per second spread across 20 measurements, with only 3 tags. I write in batches of 1,000; if I set the batch size higher, performance drops and memory consumption goes up.

@jonseymour
Contributor

How big and how many .wal files are in the database directory?

@SaganBolliger
Author

Thanks for the suggestion @jonseymour. The data I'm loading covers a period of about 3 years, so it definitely qualifies. I managed to finish loading the data before I saw your comment by periodically restarting influxdb whenever the memory usage grew too large, so I didn't have a chance to see if the cache-max-memory-size parameter made a difference.

If cache-max-memory-size does fix the issue, should this perhaps be better documented? I imagine loading historical data at non-real-time rates is a fairly common use case. Or else having the default value set more conservatively with instructions to increase the limit in performance-critical situations?

@julienmathevet

There are 17 WAL files totalling around 120 MB; the biggest is about 6 MB.
The database is about 10 GB after 3 days of running.

@jonseymour
Contributor

@SaganBolliger Personally, I think there is a case for dividing a total cache budget across the active shards, so that the system can dynamically adjust to different load scenarios.

@jonseymour
Contributor

@easyrasta based on this it seems unlikely that your memory issues are related to the caches of a large number of active shards. It might be worth raising a separate issue detailing your case so that you can solicit feedback about your particular problem from the influx support team.

@julienmathevet

Thanks @jonseymour, I've done so.

@jwilder
Contributor

jwilder commented Feb 14, 2016

@SaganBolliger Another setting you could adjust is cache-snapshot-write-cold-duration. If you are backfilling data ordered by time, setting it lower should force the older shards' caches to snapshot more aggressively and free up memory. It defaults to 1h, so you could try 1m perhaps. The value is how long a shard can go without accepting a write before we consider it cold for writes and snapshot its cache to disk.
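A sketch of that tweak in the [data] section of influxdb.conf; the "1m" value is just the example given above:

```toml
[data]
  # How long a shard can go without accepting a write before it is considered
  # cold and its cache is snapshotted to disk. The default is "1h"; "1m" makes
  # backfilled (older) shards release their cache memory much sooner.
  cache-snapshot-write-cold-duration = "1m"
```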

@jonseymour
Contributor

@SaganBolliger The size parameter that increases flushing frequency when its value is decreased is actually cache-snapshot-memory-size, not cache-max-memory-size as I stated earlier. But, of course, @jwilder's suggestion to use cache-snapshot-write-cold-duration is a better bet anyway.
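For completeness, a sketch contrasting the two size settings in the [data] section; both numbers are illustrative assumptions rather than values taken from this thread:

```toml
[data]
  # Snapshot threshold: once a shard's cache grows past this size it is
  # snapshotted to a TSM file on disk, freeing memory, so lowering it causes
  # more frequent flushes during a bulk load.
  cache-snapshot-memory-size = 10485760    # 10 MB, illustrative

  # Hard upper bound on a single shard's cache, distinct from the snapshot
  # threshold above.
  cache-max-memory-size = 524288000        # 500 MB, illustrative
```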

@julienmathevet

It would be great if someone wrote a blog post on this kind of tuning. Isn't 1h too high a value for the default?

@francisdb

It would be great if Influx did not need this kind of tuning; we did not have this problem on 0.9.x.

@jwilder
Contributor

jwilder commented Feb 17, 2016

It's quite possible the value is too high and the default should be lower. If you are able to test with lower values that would be useful data to share.

@jonseymour
Contributor

@francisdb I am sure this is on the Influx roadmap somewhere; there comes a time when you just have to ship. I also have some ideas about how this might be done.

First things first, though. It would be really useful, I think, if the shard stats that are published to the _internal database were extended with the following metrics:

  • cacheSize - the current size of the in-memory cache
  • cacheAge - the time since the last snapshot
  • snapShotSize - the current total size of the snapshots
  • snapShots - the current number of snapshots

This would make it much easier to reason about where the big memory usage is and whether there are any problems in the compaction path (for example, snapShots > 1). It would also help confirm issues such as those caused by backfilling, as hypothesised in this issue.

Having such stats would also allow the before and after benefits of any later change aimed at dynamically optimising caching behaviour to be measured quantitatively.

I am quite keen to have a crack at adding such stats. @jwilder are you happy for me to propose changes in this area? If influxdata staff are already working on such changes let me know and I'll find some other things to do :-)

@jwilder
Contributor

jwilder commented Feb 18, 2016

@jonseymour #5499 is still open. Any stats and diagnostics would be useful and no one is working on that currently.

@clongbottom

clongbottom commented May 2, 2016

Just hit this myself, with about 640 series. Even if I massively delay my writes (1,000 per block with a 10 s delay between writes), I run out of memory on a VM with 16 GB of RAM. I have about 3 years of data for 10 devices that I'm inserting into one measurement.

I'm just about to try messing with the other config options, but lowering cache-snapshot-write-cold-duration to "1m" didn't help.

@francisdb

We have seen this when uploading data that caused a lot of new shards to be created, e.g. you have data for January 2016 to April 2016 and then do a batch upload of sparse historical data going all the way back to January 2010.
See #6635.

@joshughes

Also hit this when importing 2 years of data; I was only able to work around it by restarting the database whenever the memory limit was hit.

@mark-rushakoff
Contributor

Closing this because the write path has been changed significantly since 0.10. Please open a new issue if you're running into problems on a current release.

@joshughes if you had to restart v1.1 after writing many new series you likely ran into #7832, fixed in 1.2rc1 but not yet backported to the 1.1 release.
