cache maximum memory size exceeded #6109

Closed
yangfan876 opened this issue Mar 24, 2016 · 22 comments

@yangfan876

Hi,
when I restart InfluxDB and there are too many WAL files, InfluxDB exits. Logs as follows:

[cacheloader] 2016/03/24 14:51:50 reading file /data1/influxdb/wal/sysnoc/default/2/_00692.wal, size 10502077
[cacheloader] 2016/03/24 14:51:52 reading file /data1/influxdb/wal/sysnoc/default/2/_00693.wal, size 10489020
[cacheloader] 2016/03/24 14:51:53 reading file /data1/influxdb/wal/sysnoc/default/2/_00694.wal, size 10498512
[cacheloader] 2016/03/24 14:51:55 reading file /data1/influxdb/wal/sysnoc/default/2/_00695.wal, size 10501253
[cacheloader] 2016/03/24 14:51:56 reading file /data1/influxdb/wal/sysnoc/default/2/_00696.wal, size 10498537
[cacheloader] 2016/03/24 14:51:58 reading file /data1/influxdb/wal/sysnoc/default/2/_00697.wal, size 10503642
[cacheloader] 2016/03/24 14:51:59 reading file /data1/influxdb/wal/sysnoc/default/2/_00698.wal, size 10504904
[cacheloader] 2016/03/24 14:52:01 reading file /data1/influxdb/wal/sysnoc/default/2/_00699.wal, size 10488883
[cacheloader] 2016/03/24 14:52:03 reading file /data1/influxdb/wal/sysnoc/default/2/_00700.wal, size 10495194
[cacheloader] 2016/03/24 14:52:04 reading file /data1/influxdb/wal/sysnoc/default/2/_00701.wal, size 10505617
[cacheloader] 2016/03/24 14:52:06 reading file /data1/influxdb/wal/sysnoc/default/2/_00702.wal, size 10506820
[cacheloader] 2016/03/24 14:52:07 reading file /data1/influxdb/wal/sysnoc/default/2/_00703.wal, size 10504926
[cacheloader] 2016/03/24 14:52:09 reading file /data1/influxdb/wal/sysnoc/default/2/_00704.wal, size 10494123
run: open server: open tsdb store: [shard 2] cache maximum memory size exceeded

I think that if there are too many WAL files, the service should stop reading and flush some of the cache to disk; it should not exit.

@rossmcdonald
Contributor

@yangfan876 Did you convert your shard to the TSM storage engine recently? This may be due to incorrect permissions on the shards (preventing them from flushing at all). Can you run the following (assuming InfluxDB is running as the influxdb user):

sudo chown -R influxdb:influxdb /data1/influxdb

to see if that fixes it?

@mark-rushakoff
Contributor

You can also try increasing the value of cache-max-memory-size in your config file, at least as a temporary workaround. It defaults to 500MB (524288000).
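For reference, this setting lives in the [data] section of influxdb.conf; a minimal sketch (the 1 GB value here is only an example, size it to your available memory):

[data]
  # Maximum size the shard cache may reach before writes are rejected.
  # Defaults to 500MB (524288000 bytes).
  cache-max-memory-size = 1048576000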

@yangfan876
Author

@rossmcdonald Yes, the version of InfluxDB I'm testing is 0.11, where TSM is the default. And I'm sure '/data1/influxdb' belongs to the influxdb user:
drwxr-xr-x 6 influxdb influxdb 47 Mar 23 18:36 influxdb

@yangfan876
Author

@mark-rushakoff Yes, I can increase the value of cache-max-memory-size, but in my view, when the cache reaches its maximum memory size we should flush the cache data to disk and then keep reading the WAL files.

@mark-rushakoff
Contributor

@yangfan876 try lowering cache-snapshot-memory-size. The code looks like it should check the cache's size against that value once per second and, if the current size is greater than the configured value, write the cache to disk as a snapshot file.
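To illustrate, here is a simplified sketch of that periodic check (not the actual InfluxDB source; Cache, Size, and writeSnapshot are hypothetical stand-ins):

package cachesketch

import (
    "log"
    "time"
)

// Cache is a hypothetical stand-in for the engine's in-memory write cache.
type Cache struct {
    size uint64
}

// Size reports the current in-memory size of the cache in bytes.
func (c *Cache) Size() uint64 { return c.size }

// writeSnapshot is a hypothetical helper that persists the cache contents
// to a snapshot (TSM) file and frees the snapshotted entries.
func writeSnapshot(c *Cache) error { return nil }

// compactCacheLoop checks the cache size once per second and writes a
// snapshot whenever it exceeds the configured snapshot threshold
// (cache-snapshot-memory-size).
func compactCacheLoop(c *Cache, snapshotThreshold uint64, done <-chan struct{}) {
    t := time.NewTicker(time.Second)
    defer t.Stop()
    for {
        select {
        case <-done:
            return
        case <-t.C:
            if c.Size() > snapshotThreshold {
                if err := writeSnapshot(c); err != nil {
                    log.Printf("error snapshotting cache: %v", err)
                }
            }
        }
    }
}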

@yangfan876
Author

@mark-rushakoff I read the InfluxDB code; the storage engine does start a goroutine to check the cache's size, but look at the engine's Open method:
file:tsdb/engine/tsm1/engine.go
func (e *Engine) Open() error {
    e.done = make(chan struct{})
    e.Compactor.Cancel = e.done

    if err := os.MkdirAll(e.path, 0777); err != nil {
        return err
    }

    if err := e.cleanup(); err != nil {
        return err
    }

    if err := e.WAL.Open(); err != nil {
        return err
    }

    if err := e.FileStore.Open(); err != nil {
        return err
    }

    if err := e.reloadCache(); err != nil { // the WAL is reloaded before the cache-check goroutine starts
        return err
    }

    e.wg.Add(5)
    go e.compactCache() // goroutine that checks the cache size starts here
    go e.compactTSMFull()
    go e.compactTSMLevel(true, 1)
    go e.compactTSMLevel(true, 2)
    go e.compactTSMLevel(false, 3)

    return nil
}
So, as you can see, when I have a lot of WAL files to reload, compactCache has not started yet and the cache can reach its maximum memory size. I made a pull request for this from another account.

@yangfan876
Author

@mark-rushakoff BTW, I don't think sleeping for one second between cache-size checks is a good idea. Maybe the interval could depend on the load of the service.

@sstarcher

Same issue on InfluxDB 0.10, randomly. After running a fresh install of InfluxDB 0.10.0 for about 2 weeks, the issue came up on a restart. I increased cache-max-memory-size to 1 GB and it is working now.

@rossmcdonald
Contributor

@sstarcher Did you upgrade from a previous version of InfluxDB? Or did you start with v0.10?

@sstarcher

@rossmcdonald No, I have only used 0.10.0 for this data and no other version.

@sstarcher

@rossmcdonald I upgraded from 0.10.0 to 0.11.0 and I still get the error, but it no longer crashes:

March 30th 2016, 09:12:47.262   [store] 2016/03/30 13:12:47 Failed to open shard: 231: [shard 231] cache maximum memory size exceeded
March 30th 2016, 09:12:46.893   [cacheloader] 2016/03/30 13:12:46 reading file /data/db/wal/telegraf/default/231/_01465.wal, size 7906114
March 30th 2016, 09:12:46.502   [cacheloader] 2016/03/30 13:12:46 reading file /data/db/wal/telegraf/default/231/_01464.wal, size 10638834
March 30th 2016, 09:12:46.080   [cacheloader] 2016/03/30 13:12:46 reading file /data/db/wal/telegraf/default/231/_01463.wal, size 11191183
March 30th 2016, 09:12:45.654   [cacheloader] 2016/03/30 13:12:45 reading file /data/db/wal/telegraf/default/231/_01462.wal, size 10949076
March 30th 2016, 09:12:45.239   [cacheloader] 2016/03/30 13:12:45 reading file /data/db/wal/telegraf/default/231/_01461.wal, size 11130405
March 30th 2016, 09:12:44.729   [cacheloader] 2016/03/30 13:12:44 reading file /data/db/wal/telegraf/default/231/_01460.wal, size 11082699
March 30th 2016, 09:12:44.237   [cacheloader] 2016/03/30 13:12:44 reading file /data/db/wal/telegraf/default/231/_01459.wal, size 10987478
March 30th 2016, 09:12:43.835   [cacheloader] 2016/03/30 13:12:43 reading file /data/db/wal/telegraf/default/231/_01458.wal, size 10588884
March 30th 2016, 09:12:43.436   [cacheloader] 2016/03/30 13:12:43 reading file /data/db/wal/telegraf/default/231/_01457.wal, size 11138151
March 30th 2016, 09:12:42.997   [cacheloader] 2016/03/30 13:12:42 reading file /data/db/wal/telegraf/default/231/_01456.wal, size 11160262
March 30th 2016, 09:12:42.509   [cacheloader] 2016/03/30 13:12:42 reading file /data/db/wal/telegraf/default/231/_01455.wal, size 11168045
March 30th 2016, 09:12:42.076   [cacheloader] 2016/03/30 13:12:42 reading file /data/db/wal/telegraf/default/231/_01454.wal, size 10495132
March 30th 2016, 09:12:41.657   [cacheloader] 2016/03/30 13:12:41 reading file /data/db/wal/telegraf/default/231/_01453.wal, size 10688138
March 30th 2016, 09:12:41.171   [cacheloader] 2016/03/30 13:12:41 reading file /data/db/wal/telegraf/default/231/_01452.wal, size 11241832
March 30th 2016, 09:12:40.814   [cacheloader] 2016/03/30 13:12:40 reading file /data/db/wal/telegraf/default/231/_01451.wal, size 10552126
March 30th 2016, 09:12:40.431   [cacheloader] 2016/03/30 13:12:40 reading file /data/db/wal/telegraf/default/231/_01450.wal, size 10961726
March 30th 2016, 09:12:40.079   [cacheloader] 2016/03/30 13:12:40 reading file /data/db/wal/telegraf/default/231/_01449.wal, size 10660695
March 30th 2016, 09:12:39.632   [cacheloader] 2016/03/30 13:12:39 reading file /data/db/wal/telegraf/default/231/_01448.wal, size 10796586

jwilder added a commit that referenced this issue Apr 20, 2016
The cache max memory size is an approximate size and can prevent a
shard from loading at startup. This change disables the max size
at startup to prevent this problem and sets the limit back after
reloading.

Fixes #6109
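A rough sketch of the approach described in that commit message (not the actual patch; MaxSize and SetMaxSize are hypothetical accessors on the tsm1 cache):

// reloadCacheWithLimitDisabled replays the WAL into the cache with the size
// limit disabled, then restores the configured limit once loading is done.
func (e *Engine) reloadCacheWithLimitDisabled() error {
    limit := e.Cache.MaxSize() // remember the configured limit
    e.Cache.SetMaxSize(0)      // treat 0 as "no limit" while the WAL is replayed
    defer e.Cache.SetMaxSize(limit)
    return e.reloadCache()
}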
jwilder added this to the 0.13.0 milestone Apr 20, 2016
mohit pushed a commit to graveyard/influxdb-service that referenced this issue May 12, 2016
@vilinski

vilinski commented Mar 13, 2022

The cache size increases until it hits the max size (within a couple of days), and then InfluxDB accepts no more points. What could be the cause?
InfluxDB 1.8.10-1 on Debian.
{"error":"engine: cache-max-memory-size exceeded: (1073884708/1073741824)"}

@samhld
Contributor

samhld commented Mar 14, 2022

@vilinski does your write volume increase over that time? That's usually an error wrt write volume being too high at some point in time.

If not that, any other errors in your logs leading up to this?

@vilinski

vilinski commented Mar 15, 2022

TBH I don't know what wrt write volume is 😄

DB size on disk is growing, cache as well
[screenshot of a Grafana graph showing DB and cache size]

The only logs other than POST and GET requests look like this:

Mar 15 09:09:11 suite-ipc-10 influxd-systemd-start.sh[6698]: ts=2022-03-15T08:09:11.464301Z lvl=info msg="Cache snapshot (start)" log_id=0_EfZnhG000 engine=tsm1 trace_id=0_F9pQA0000 op_name=tsm1_cache_snapshot op_event=start
Mar 15 09:09:11 suite-ipc-10 influxd-systemd-start.sh[6698]: ts=2022-03-15T08:09:11.817628Z lvl=info msg="Snapshot for path written" log_id=0_EfZnhG000 engine=tsm1 trace_id=0_F9pQA0000 op_name=tsm1_cache_snapshot path=/var/lib/influxdb/data/8dd2dd7a-9790-4212-b840-85813b547ea6-ProcessData/autogen/129 duration=353.343ms
Mar 15 09:09:11 suite-ipc-10 influxd-systemd-start.sh[6698]: ts=2022-03-15T08:09:11.817661Z lvl=info msg="Cache snapshot (end)" log_id=0_EfZnhG000 engine=tsm1 trace_id=0_F9pQA0000 op_name=tsm1_cache_snapshot op_event=end op_elapsed=353.373ms

@samhld
Contributor

samhld commented Mar 15, 2022

@vilinski "wrt" means "with regard to". This cache fills up when you are writing more data in (which means the cache) than you can effectively snapshot (take out of the cache). This means you are either writing too much data for your cache size (you can configure it to be larger) at some point in your workload...or your snapshots are too slow because your disk is too slow.

Make sure your cache is big enough for your workload. Make sure the memory on the node is larger enough to increase your cache size if needed. Most importantly, make sure your disk is fast. SSDs!
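If you do raise the limits, both knobs live in the [data] section of the 1.8 config; an illustrative sketch only (the values are examples, not recommendations):

[data]
  # Hard limit at which the shard starts rejecting writes (default 1g).
  cache-max-memory-size = "2g"
  # Size at which the engine snapshots the cache to a TSM file (default 25m).
  cache-snapshot-memory-size = "50m"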

@vilinski

Thanks for the explanations.
Well, this is already an SSD, and it has handled even higher data rates; writes are batched as recommended, flushed every 1 s or at 10000 points max. Shouldn't that fit into the default 25 MB?
Maybe the more than 100 downsampling continuous queries play a role? One per measurement.

@samhld
Contributor

samhld commented Mar 17, 2022

@vilinski try to find the tsm1_cache_snapshot logs and see if the time they're taking is increasing.
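For example, something like this should pull them out of the journal (assuming the service unit is named influxdb, as the systemd wrapper in your logs suggests):

journalctl -u influxdb | grep tsm1_cache_snapshot | grep op_event=end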

Also, can you share an example continuous query?

@samhld
Contributor

samhld commented Mar 17, 2022

@vilinski another thing to look at is the snapshotCount metric in the _internal database. If that's not enabled, you can find it at the /debug/vars endpoint. Telegraf can collect all of these metrics for you, and you can visualize them in Grafana.
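For a quick spot check against the endpoint directly (assuming the default HTTP bind address of localhost:8086):

curl -s http://localhost:8086/debug/vars | grep -i snapshotCount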

@vilinski

We are already collecting the metrics with Telegraf and have Grafana dashboards, like the one I posted above.
The influxdb_tsm1_cache_snapshotCount metric always has a value of 0 for all databases and paths.
The continuous queries precalculate values to show in a software UI:

16ce202f-1a22-497d-a840-1e15fe5156fe_1m CREATE CONTINUOUS QUERY "16ce202f-1a22-497d-a840-1e15fe5156fe_1m" ON "8dd2dd7a-9790-4212-b840-85813b547ea6-ProcessData" BEGIN SELECT mean(value) AS value, min(value) AS min, max(value) AS max, stddev(value) AS stddev INTO "8dd2dd7a-9790-4212-b840-85813b547ea6-ProcessData".autogen."16ce202f-1a22-497d-a840-1e15fe5156fe_1m" FROM "8dd2dd7a-9790-4212-b840-85813b547ea6-ProcessData".autogen."16ce202f-1a22-497d-a840-1e15fe5156fe" GROUP BY time(1m), * END
16ce202f-1a22-497d-a840-1e15fe5156fe_1h CREATE CONTINUOUS QUERY "16ce202f-1a22-497d-a840-1e15fe5156fe_1h" ON "8dd2dd7a-9790-4212-b840-85813b547ea6-ProcessData" BEGIN SELECT mean(value) AS value, min(value) AS min, max(value) AS max, stddev(value) AS stddev INTO "8dd2dd7a-9790-4212-b840-85813b547ea6-ProcessData".autogen."16ce202f-1a22-497d-a840-1e15fe5156fe_1h" FROM "8dd2dd7a-9790-4212-b840-85813b547ea6-ProcessData".autogen."16ce202f-1a22-497d-a840-1e15fe5156fe_1m" GROUP BY time(1h), * END

Each such measurement has one tag with up to 40 different tag values, so about 40 data points/s.

@vilinski

The logged snapshot writes show durations varying between 300 and 500 ms, but not increasing.

@samhld
Contributor

samhld commented Mar 17, 2022

@vilinski well, that's good; it indicates the cache snapshotting is keeping up. I'm not sure how else to identify this issue without reproducible steps. Perhaps starting a thread in the community Slack would help.
