Key space is not cleaned up by retention policy #8819

Closed
ahermspacketwerk opened this issue Sep 12, 2017 · 9 comments · Fixed by #9017

Comments

@ahermspacketwerk

affected version: influx 1.3.1

We are trying to use InfluxDB in a memory-limited environment. Our data scheme requires us to assign a moderate number (1000 to 10000) of tag values.

We have already found that this is not optimal for InfluxDB's memory usage.

Our current attempt is to limit the number of series with max-series-per-database = 10000.
So far, this works as expected: whenever we hit the limit, new tag values are dropped during import, which is what we are trying to achieve.
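
(For reference, a minimal sketch of the relevant influxdb.conf setting; the limit is the option under the [data] section, with the value from above:)

[data]
    max-series-per-database = 10000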

But the key space of the tag values does not seem to be released by the retention policies. Even when the series should have been dropped, we cannot insert new entries.
As a result the database looks empty (a select * query returns nothing), but we cannot insert new elements.

This issue can be solved by restarting influxd. After a restart we can insert new tag values again.

Any idea how we can overcome this?

@e-dard
Contributor

e-dard commented Sep 15, 2017

Thanks for the report, @ahermspacketwerk.

Could you just check that this issue occurs on 1.3.5? It sounds like the index isn't being updated appropriately when the retention policy service drops a shard. We should be checking if we have dropped the last existing point for a series, but possibly that's being missed somewhere.

@ahermspacketwerk
Author

Sorry for the late response. I was out of office, and it took some time until we had collected enough data in our system to hit the limit.

Anyway - yes, the same problem occurs with InfluxDB v1.3.5 (git: HEAD 9d90010)

@e-dard
Contributor

e-dard commented Sep 28, 2017

Thanks @ahermspacketwerk. I'm treating this as a bug and we'll work on a fix for the next release.

@e-dard
Contributor

e-dard commented Oct 3, 2017

Hi @ahermspacketwerk

I'm having trouble reproducing this, and the code path followed when the retention policy drops a shard seems to be the correct one.

Can you confirm in your logs that you're seeing lines like retention policy shard deletion check commencing, and also, are you seeing any of the following lines?

failed to delete shard group ?? from database ??, retention policy ??
deleted shard group ?? from database ??, retention policy ??
failed to delete shard ID ?? from database ??, retention policy ??: ??
shard ID ?? from database ??, retention policy ??, deleted
error pruning shard groups: ??

If you see any of those, would you be able to provide your logs? You can email them to me privately if you don't want to post them on the ticket. My email is edd@<name_of_database>.com

@ahermspacketwerk
Author

Hi @e-dard

I will reproduce the problem on our setup and look into the log files. Once the problem is reproduced, we can take a closer look at the situation. We have seen weird effects, like shards that are present while queries on the tables do not return any entries. Without insider knowledge it's hard to figure out what this means.

Unfortunately my previous setup had to be reinstalled, which means I will need another week to get to the point where we see something.

@e-dard
Contributor

e-dard commented Oct 10, 2017

Just an update: I was able to reproduce this quite trivially with a 1h retention policy and the max series limit set to 1 or 2. Insert a couple of points, wait an hour or so, and then SHOW SERIES returns nothing (the shard has been dropped), but the limit is still in place and it's not possible to add any new series.
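
(A minimal sketch of that reproduction using the influx CLI; the database, retention policy, and measurement names are illustrative, and max-series-per-database is assumed to have been lowered to 2 in influxdb.conf beforehand:)

> CREATE DATABASE repro
> CREATE RETENTION POLICY "one_hour" ON "repro" DURATION 1h REPLICATION 1 DEFAULT
> USE repro
> INSERT cpu,host=a value=1
> INSERT cpu,host=b value=2
-- wait for the retention policy duration (plus the retention check-interval) to pass
> SHOW SERIES
-- returns nothing, yet a third series is still rejected by the limit:
> INSERT cpu,host=c value=3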

@ahermspacketwerk
Author

Great, so you don't need my huge data example. If you need anything else, let me know.

e-dard added this to the 1.4.0 milestone Oct 20, 2017
@e-dard
Contributor

e-dard commented Oct 26, 2017

@ahermspacketwerk OK, I think I've narrowed this down. I think it's a race inside the service that manages the labelling of expired shards and their subsequent removal. If I'm right, then I would expect the max series limit to be reset after:

[retention]
    check-interval = "30m0s"

To clarify:

  • After check-interval elapses, the retention service checks for expired shards and marks any as such.
  • At the same interval, another part of the service should remove them from disk (and therefore from the index).
  • Due to a race it's possible that the second part doesn't complete, so while it looks like the shards have gone if you use, say, SHOW SHARDS, they haven't actually been removed from disk or the index yet.
  • Wait for check-interval to roll around again, and the shards that were missed will be removed from the index.

I would expect that when you experience this issue, if you waited another 30m the limit would sort itself out. Hope that makes sense? In the meantime I'm going to put a fix together for this...
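
(A minimal Go sketch of that race, using hypothetical names rather than the actual retention service code; two goroutines wake on the same interval, and whether the local deletion happens in a given tick depends on which one runs first:)

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // shardState stands in for one expired shard: the meta data flag marking
    // its shard group as deleted, and whether the shard (and its series in
    // the index) is still held locally.
    type shardState struct {
        mu            sync.Mutex
        markedDeleted bool
        onDisk        bool
    }

    func main() {
        s := &shardState{onDisk: true}
        interval := 100 * time.Millisecond // stands in for check-interval

        // Goroutine 1: mark expired shard groups as deleted in the meta data.
        go func() {
            for range time.Tick(interval) {
                s.mu.Lock()
                s.markedDeleted = true
                s.mu.Unlock()
            }
        }()

        // Goroutine 2: remove local shards whose groups are marked deleted.
        // If this runs before goroutine 1 within a tick, nothing is removed
        // until the next tick, even though the shard group soon looks deleted.
        go func() {
            for range time.Tick(interval) {
                s.mu.Lock()
                if s.markedDeleted && s.onDisk {
                    s.onDisk = false
                }
                s.mu.Unlock()
            }
        }()

        time.Sleep(interval * 3 / 2) // let one tick pass
        s.mu.Lock()
        fmt.Println("shard still on disk (series still in the index)?", s.onDisk)
        s.mu.Unlock()
    }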

@ahermspacketwerk
Author

When I observed this, the key space was not cleaned up for days. Our software is continuously trying to write data, and even after days there are no entries: "select * ..." no longer returns any values.

If this were only a delayed clean-up, I would expect to see at least some values.

ghost assigned e-dard Oct 26, 2017
ghost added review and removed proposed labels Oct 26, 2017
e-dard added a commit that referenced this issue Oct 26, 2017
Fixes #8819.

Previously, the process of dropping expired shards according to the
retention policy duration was managed by two independent goroutines in
the retention policy service. This behaviour was introduced in #2776,
at a time when there were both data and meta nodes in the OSS codebase.
The idea was that only the leader meta node would run the meta data
deletions in the first goroutine, and all other nodes would run the
local deletions in the second goroutine.

InfluxDB no longer operates in that way and so we ended up with two
independent goroutines that were carrying out an action that was really
dependent on each other.

If the second goroutine runs before the first then it may not see the
meta data changes indicating shards should be deleted and it won't
delete any shards locally. Shortly after this the first goroutine will
run and remove the meta data for the shard groups.

This results in a situation where it looks like the shards have gone,
but in fact they remain on disk (and importantly, their series within
the index) until the next time the second goroutine runs. By default
that's 30 minutes.

In the case where the shards to be removed contained the last
occurrences of some series, and the database was already at its
maximum series limit (or tag limit for that matter), it's possible that
no further new series can be inserted.
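
(A minimal Go sketch of the shape of the fix described above, with hypothetical names rather than the actual patch: the meta data deletion and the local shard deletion run in sequence on each tick of a single goroutine, so the index is cleaned up in the same pass that marks the shard groups as deleted:)

    package retention

    import "time"

    // run performs both halves of the retention work in order on every tick,
    // instead of leaving them to two independent goroutines.
    func run(checkInterval time.Duration, markExpiredShardGroups, deleteLocalShards func()) {
        for range time.Tick(checkInterval) {
            markExpiredShardGroups() // update the meta data first...
            deleteLocalShards()      // ...then drop the shards (and their series) locally
        }
    }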
ghost removed the review label Oct 26, 2017