Key space is not cleaned up by retention policy #8819

Closed
ahermspacketwerk opened this issue Sep 12, 2017 · 9 comments · Fixed by #9017

Comments

@ahermspacketwerk

affected version: influx 1.3.1

We are trying to use InfluxDB in a memory-limited environment. Our data scheme requires us to assign a moderate number (1000 to 10000) of tag values.

We have already found that this is not optimal for InfluxDB's memory usage.

Our current attempt is to limit the number of series with max-series-per-database = 10000.
So far, this works as expected: whenever we hit the limit, new tag values are dropped during import, which is what we are trying to achieve.
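
(For reference, a minimal sketch of the relevant influxdb.conf setting; the limit is the option under the [data] section, with the value from above:)

[data]
    max-series-per-database = 10000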

But the key space of the tag values does not seem to be released by the retention policies. Even when the series should have been dropped, we cannot insert new entries.
As a result the database looks empty (a select * query returns nothing), but we cannot insert new elements.

This issue can be solved by restarting influxd. After a restart we can insert new tag values again.

Any idea how we can overcome this?

@e-dard
Contributor

e-dard commented Sep 15, 2017

Thanks for the report, @ahermspacketwerk.

Could you just check that this issue occurs on 1.3.5? It sounds like the index isn't being updated appropriately when the retention policy service drops a shard. We should be checking if we have dropped the last existing point for a series, but possibly that's being missed somewhere.

@ahermspacketwerk
Author

Sorry for the late response. I was out of office, and it took some time until we had collected enough data in our system to hit the limit.

Anyway - yes, the same problem occurs with InfluxDB v1.3.5 (git: HEAD 9d90010)

@e-dard
Contributor

e-dard commented Sep 28, 2017

Thanks @ahermspacketwerk. I'm treating this as a bug and we'll work on a fix for the next release.

@e-dard
Contributor

e-dard commented Oct 3, 2017

Hi @ahermspacketwerk

I'm having trouble reproducing this, and the code path followed when the retention policy drops a shard seems to be the correct one.

Can you confirm in your logs that you're seeing lines like retention policy shard deletion check commencing, and also, are you seeing any of the following lines?

failed to delete shard group ?? from database ??, retention policy ??
deleted shard group ?? from database ??, retention policy ??
failed to delete shard ID ?? from database ??, retention policy ??: ??
shard ID ?? from database ??, retention policy ??, deleted
error pruning shard groups: ??

If you see any of those, would you be able to provide your logs? You can email them to me privately if you don't want to post them on the ticket. My email is edd@<name_of_database>.com

@ahermspacketwerk
Author

Hi @e-dard

I will reproduce the problem on our setup and look into the log files. Once the problem is reproduced, we can take a closer look at the situation. We have seen weird effects, like shards that are present while queries on the tables do not return any entries. Without insider knowledge it's hard to figure out what this means.

Unfortunately my previous setup had to be reinstalled, which means I will need another week to get to the point where we see something.

@e-dard
Contributor

e-dard commented Oct 10, 2017

Just an update: I was able to reproduce this quite trivially with a 1h retention policy and the max series limit set to 1 or 2. Insert a couple of points, wait an hour or so, and then SHOW SERIES returns nothing (the shard has been dropped), but the limit is still in place and it's not possible to add any new series.
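
(A minimal sketch of that reproduction using the influx CLI; the database, retention policy, and measurement names are illustrative, and max-series-per-database is assumed to have been lowered to 2 in influxdb.conf beforehand:)

> CREATE DATABASE repro
> CREATE RETENTION POLICY "one_hour" ON "repro" DURATION 1h REPLICATION 1 DEFAULT
> USE repro
> INSERT cpu,host=a value=1
> INSERT cpu,host=b value=2
-- wait for the retention policy duration (plus the retention check-interval) to pass
> SHOW SERIES
-- returns nothing, yet a third series is still rejected by the limit:
> INSERT cpu,host=c value=3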

@ahermspacketwerk
Author

Great, so you don't need my huge data example. If you need anything else, let me know.

e-dard added this to the 1.4.0 milestone Oct 20, 2017
@e-dard
Contributor

e-dard commented Oct 26, 2017

@ahermspacketwerk OK, I think I've narrowed this down. I think it's a race inside the service that manages the labelling of expired shards and their subsequent removal. If I'm right, then I would expect the max series limit to be reset after:

[retention]
    check-interval = "30m0s"

To clarify:

  • After check-interval elapses, the retention service checks for expired shards and marks any as such.
  • At the same interval, another part of the service should remove them from disk (and therefore from the index).
  • Due to a race it's possible that the second part doesn't complete, so while it looks like the shards have gone if you use, say, SHOW SHARDS, they haven't actually been removed from disk or the index yet.
  • Wait for check-interval to roll around again, and the shards that were missed will be removed from the index.

I would expect that when you experience this issue, if you waited another 30m the limit would sort itself out. Hope that makes sense? In the meantime I'm going to put a fix together for this...
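
(A minimal Go sketch of that race, using hypothetical names rather than the actual retention service code; two goroutines wake on the same interval, and whether the local deletion happens in a given tick depends on which one runs first:)

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // shardState stands in for one expired shard: the meta data flag marking
    // its shard group as deleted, and whether the shard (and its series in
    // the index) is still held locally.
    type shardState struct {
        mu            sync.Mutex
        markedDeleted bool
        onDisk        bool
    }

    func main() {
        s := &shardState{onDisk: true}
        interval := 100 * time.Millisecond // stands in for check-interval

        // Goroutine 1: mark expired shard groups as deleted in the meta data.
        go func() {
            for range time.Tick(interval) {
                s.mu.Lock()
                s.markedDeleted = true
                s.mu.Unlock()
            }
        }()

        // Goroutine 2: remove local shards whose groups are marked deleted.
        // If this runs before goroutine 1 within a tick, nothing is removed
        // until the next tick, even though the shard group soon looks deleted.
        go func() {
            for range time.Tick(interval) {
                s.mu.Lock()
                if s.markedDeleted && s.onDisk {
                    s.onDisk = false
                }
                s.mu.Unlock()
            }
        }()

        time.Sleep(interval * 3 / 2) // let one tick pass
        s.mu.Lock()
        fmt.Println("shard still on disk (series still in the index)?", s.onDisk)
        s.mu.Unlock()
    }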

@ahermspacketwerk
Author

When I observed this, the key space was not cleaned up for days. Our software is continuously trying to write data, and even after days there are no entries: "select * ..." no longer returns any values.

If this were only a delayed clean-up, I would expect to see at least some values.

ghost assigned e-dard Oct 26, 2017
ghost added review and removed proposed labels Oct 26, 2017
e-dard added a commit that referenced this issue Oct 26, 2017
Fixes #8819.

Previously, the process of dropping expired shards according to the
retention policy duration was managed by two independent goroutines in
the retention policy service. This behaviour was introduced in #2776,
at a time when there were both data and meta nodes in the OSS codebase.
The idea was that only the leader meta node would run the meta data
deletions in the first goroutine, and all other nodes would run the
local deletions in the second goroutine.

InfluxDB no longer operates in that way and so we ended up with two
independent goroutines that were carrying out an action that was really
dependent on each other.

If the second goroutine runs before the first then it may not see the
meta data changes indicating shards should be deleted and it won't
delete any shards locally. Shortly after this the first goroutine will
run and remove the meta data for the shard groups.

This results in a situation where it looks like the shards have gone,
but in fact they remain on disk (and importantly, their series within
the index) until the next time the second goroutine runs. By default
that's 30 minutes.

In the case where the shards to be removed contained the last
occurrences of some series, and the database was already at its
maximum series limit (or tag limit for that matter), it's possible that
no further new series can be inserted.
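
(A minimal Go sketch of the shape of the fix described above, with hypothetical names rather than the actual patch: the meta data deletion and the local shard deletion run in sequence on each tick of a single goroutine, so the index is cleaned up in the same pass that marks the shard groups as deleted:)

    package retention

    import "time"

    // run performs both halves of the retention work in order on every tick,
    // instead of leaving them to two independent goroutines.
    func run(checkInterval time.Duration, markExpiredShardGroups, deleteLocalShards func()) {
        for range time.Tick(checkInterval) {
            markExpiredShardGroups() // update the meta data first...
            deleteLocalShards()      // ...then drop the shards (and their series) locally
        }
    }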
ghost removed the review label Oct 26, 2017