cache maximum memory size exceeded #6109
Comments
@yangfan876 Did you convert your shard to the TSM storage engine recently? This may be due to the permissions being incorrect on the shards (preventing them from flushing at all). Can you run a … (assuming InfluxDB is running as the … user) to see if that fixes it?
You can also try increasing the value of cache-max-memory-size.
@rossmcdonald yes, the version of InfluxDB I am testing is 0.11, where TSM is the default. And I'm sure '/data1/influxdb' belongs to the influxdb user:
@mark-rushakoff yes, I can increase the value of cache-max-memory-size.
@yangfan876 try lowering …
@mark-rushakoff I read the InfluxDB code; the storage engine starts a goroutine to check the cache's size, but in the engine's Open method: …
@mark-rushakoff BTW, I don't think sleeping for 1 second between checks of the cache's size is a good idea. Maybe the interval could depend on the load on the service.
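For illustration, here is a minimal Go sketch of the mechanism being discussed: a size-limited cache that rejects writes once the configured maximum is exceeded, which is what WAL replay at startup can trigger before any snapshot has had a chance to run. The names and types are invented for the sketch; they are not the actual tsm1 implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// Invented for this sketch; mirrors the error text seen in the logs.
var errCacheMemoryExceeded = errors.New("cache maximum memory size exceeded")

// cache is a toy stand-in for the TSM in-memory cache.
type cache struct {
	size    uint64 // bytes currently held in memory
	maxSize uint64 // configured cache-max-memory-size
}

// write rejects an entry that would push the cache past its limit.
func (c *cache) write(entryBytes uint64) error {
	if c.maxSize > 0 && c.size+entryBytes > c.maxSize {
		return errCacheMemoryExceeded
	}
	c.size += entryBytes
	return nil
}

func main() {
	// At startup the WAL loader replays every segment into the cache, and no
	// snapshot drains it while that happens, so a large WAL backlog can trip
	// the limit and abort the open.
	c := &cache{maxSize: 512 << 20} // e.g. a 512 MB limit
	walSegments := []uint64{300 << 20, 300 << 20}

	for i, seg := range walSegments {
		if err := c.write(seg); err != nil {
			fmt.Printf("replaying segment %d: %v\n", i, err)
			return
		}
	}
	fmt.Println("replay finished")
}
```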
Same issue on InfluxDB 0.10, randomly. After running a fresh install of InfluxDB 0.10.0 for about two weeks, the issue came up randomly on restart. I increased cache-max-memory-size to 1 GB and it is working now.
@sstarcher Did you upgrade from a previous version of InfluxDB? Or did you start with v0.10?
@rossmcdonald no, I have only used 0.10.0 for this data and no other version.
@rossmcdonald I upgraded from 0.10.0 to 0.11.0 and I still get the error, but it no longer crashes.
The cache max memory size is an approximate size and can prevent a shard from loading at startup. This change disables the max size at startup to prevent this problem and sets the limit back after reloading. Fixes #6109
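A minimal sketch of the approach that change describes, assuming an invented cache type with a settable limit (the names here are illustrative, not the actual tsm1 API): the limit is turned off while the WAL is replayed and restored once loading completes.

```go
package main

import "fmt"

// Invented types for the sketch; not the actual tsm1 API.
type cache struct {
	size    uint64
	maxSize uint64 // 0 means "no limit"
}

func (c *cache) setMaxSize(n uint64) { c.maxSize = n }

func (c *cache) write(n uint64) error {
	if c.maxSize > 0 && c.size+n > c.maxSize {
		return fmt.Errorf("cache maximum memory size exceeded")
	}
	c.size += n
	return nil
}

// openShard replays WAL segments with the limit disabled, then puts the
// configured limit back so normal writes are bounded again.
func openShard(c *cache, configuredMax uint64, walSegments []uint64) error {
	c.setMaxSize(0) // disable the check while reloading
	defer c.setMaxSize(configuredMax)

	for i, seg := range walSegments {
		if err := c.write(seg); err != nil {
			return fmt.Errorf("replaying segment %d: %w", i, err)
		}
	}
	return nil
}

func main() {
	c := &cache{}
	// A WAL backlog larger than the configured limit now loads cleanly.
	err := openShard(c, 512<<20, []uint64{300 << 20, 300 << 20, 300 << 20})
	fmt.Println("replay error:", err, "- cache size after load:", c.size>>20, "MB")
}
```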
The cache size increases until it reaches the maximum size (within a couple of days), and then InfluxDB accepts no points. What could be the cause?
@vilinski does your write volume increase over that time? That's usually an error wrt the write volume being too high at some point in time. If not that, are there any other errors in your logs leading up to this?
@vilinski "wrt" means "with regard to". This cache fills up when you are writing more data in (which means the cache) than you can effectively snapshot (take out of the cache). This means you are either writing too much data for your cache size (you can configure it to be larger) at some point in your workload...or your snapshots are too slow because your disk is too slow. Make sure your cache is big enough for your workload. Make sure the memory on the node is larger enough to increase your cache size if needed. Most importantly, make sure your disk is fast. SSDs! |
Thanks for the explanations.
@vilinski try to find the … Also, can you share an example continuous query?
@vilinski another thing to look at is the snapshotCount metric in the …
We are already collecting the metrics with Telegraf and have Grafana dashboards, like the one I posted above. The continuous queries are:

16ce202f-1a22-497d-a840-1e15fe5156fe_1m: CREATE CONTINUOUS QUERY "16ce202f-1a22-497d-a840-1e15fe5156fe_1m" ON "8dd2dd7a-9790-4212-b840-85813b547ea6-ProcessData" BEGIN SELECT mean(value) AS value, min(value) AS min, max(value) AS max, stddev(value) AS stddev INTO "8dd2dd7a-9790-4212-b840-85813b547ea6-ProcessData".autogen."16ce202f-1a22-497d-a840-1e15fe5156fe_1m" FROM "8dd2dd7a-9790-4212-b840-85813b547ea6-ProcessData".autogen."16ce202f-1a22-497d-a840-1e15fe5156fe" GROUP BY time(1m), * END

16ce202f-1a22-497d-a840-1e15fe5156fe_1h: CREATE CONTINUOUS QUERY "16ce202f-1a22-497d-a840-1e15fe5156fe_1h" ON "8dd2dd7a-9790-4212-b840-85813b547ea6-ProcessData" BEGIN SELECT mean(value) AS value, min(value) AS min, max(value) AS max, stddev(value) AS stddev INTO "8dd2dd7a-9790-4212-b840-85813b547ea6-ProcessData".autogen."16ce202f-1a22-497d-a840-1e15fe5156fe_1h" FROM "8dd2dd7a-9790-4212-b840-85813b547ea6-ProcessData".autogen."16ce202f-1a22-497d-a840-1e15fe5156fe_1m" GROUP BY time(1h), * END
Each such measurement has one tag with up to 40 different tag values, so it produces about 40 data points/s.
The logged snapshot writes take a varying 300 to 500 ms, but the duration is not increasing.
@vilinski well that's good -- it indicates the cache snapshotting is keeping up. I'm not sure how else to identify this issue without reproducible steps. Perhaps starting a thread in the community Slack would help.
also already done ^^ https://influxcommunity.slack.com/archives/CH8TV3LJG/p1647216283324089 |
Hi,
When I restart InfluxDB and there are too many WAL files, InfluxDB will exit. Logs as follows:
[cacheloader] 2016/03/24 14:51:50 reading file /data1/influxdb/wal/sysnoc/default/2/_00692.wal, size 10502077
[cacheloader] 2016/03/24 14:51:52 reading file /data1/influxdb/wal/sysnoc/default/2/_00693.wal, size 10489020
[cacheloader] 2016/03/24 14:51:53 reading file /data1/influxdb/wal/sysnoc/default/2/_00694.wal, size 10498512
[cacheloader] 2016/03/24 14:51:55 reading file /data1/influxdb/wal/sysnoc/default/2/_00695.wal, size 10501253
[cacheloader] 2016/03/24 14:51:56 reading file /data1/influxdb/wal/sysnoc/default/2/_00696.wal, size 10498537
[cacheloader] 2016/03/24 14:51:58 reading file /data1/influxdb/wal/sysnoc/default/2/_00697.wal, size 10503642
[cacheloader] 2016/03/24 14:51:59 reading file /data1/influxdb/wal/sysnoc/default/2/_00698.wal, size 10504904
[cacheloader] 2016/03/24 14:52:01 reading file /data1/influxdb/wal/sysnoc/default/2/_00699.wal, size 10488883
[cacheloader] 2016/03/24 14:52:03 reading file /data1/influxdb/wal/sysnoc/default/2/_00700.wal, size 10495194
[cacheloader] 2016/03/24 14:52:04 reading file /data1/influxdb/wal/sysnoc/default/2/_00701.wal, size 10505617
[cacheloader] 2016/03/24 14:52:06 reading file /data1/influxdb/wal/sysnoc/default/2/_00702.wal, size 10506820
[cacheloader] 2016/03/24 14:52:07 reading file /data1/influxdb/wal/sysnoc/default/2/_00703.wal, size 10504926
[cacheloader] 2016/03/24 14:52:09 reading file /data1/influxdb/wal/sysnoc/default/2/_00704.wal, size 10494123
run: open server: open tsdb store: [shard 2] cache maximum memory size exceeded
I think that if there are too many WAL files, the service should pause reading and flush some of the cache to disk; it should not exit.
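A minimal sketch of the behaviour suggested here, assuming a hypothetical snapshot step that drains the cache to disk (none of these names come from the InfluxDB code): instead of aborting, the loader flushes the cache whenever the next WAL segment would push it past the limit.

```go
package main

import "fmt"

// Hypothetical cache with a snapshot step that drains it to disk; the names
// are invented for this sketch.
type cache struct {
	size    uint64
	maxSize uint64
}

// snapshot stands in for writing the cached points out as a TSM file.
func (c *cache) snapshot() {
	fmt.Printf("flushing %d MB of cache to disk\n", c.size>>20)
	c.size = 0
}

// replayWAL loads segments, flushing the cache instead of giving up whenever
// the next segment would push it past the limit.
func replayWAL(c *cache, segments []uint64) {
	for _, seg := range segments {
		if c.size+seg > c.maxSize {
			c.snapshot() // flush instead of exiting with an error
		}
		c.size += seg
	}
}

func main() {
	c := &cache{maxSize: 512 << 20}
	// Three 300 MB segments would normally exceed a 512 MB limit; with the
	// flush-when-nearly-full loop they replay without aborting.
	replayWAL(c, []uint64{300 << 20, 300 << 20, 300 << 20})
	fmt.Printf("replay finished with %d MB still in the cache\n", c.size>>20)
}
```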