fix(tsm1): "snapshot in progress" error during backup #20033

Merged
merged 5 commits into from
Nov 17, 2020

davidby-influx

@davidby-influx davidby-influx commented Nov 13, 2020

fix(tsm1): "snapshot in progress" error during backup

Cherry-pick of commits
This requires a matching change to the Plutonium RPC protobuf structures in order to link with Plutonium.

Fixes InfluxDB side of issue

When an InfluxDB database is very busy writing new points, the backup
process can fail because it cannot write a new snapshot.
The error is: operation timed out with error: create snapshot: snapshot in progress.
This happens because the high rate of ingested points causes InfluxDB
to snapshot the cache almost continuously.
The earlier fix for this was #16627,
but it was for OSS only and was not in the code path for backups
in clusters.
This fix adds a skipCacheOk flag to tsdb.Engine.CreateSnapshot().
A value of true allows the backup to proceed even if a cache snapshot
cannot be taken.
This flag is set to true in tsm1.Engine.Backup(), the OSS backup code path,
and in tsdb.Shard.CreateSnapshot(), the cluster backup code path.
This flag is set to false in tsm1.Engine.Export().

influxdata/plutonium#3227
(cherry picked from commit 23be20b)
This fix adds a skipCacheOk flag to
tsdb.Store.CreateShardSnapshot() and tsdb.Shard.CreateSnapshot()
to pass to tsdb.Engine.CreateSnapshot().
A value of true allows the backup to proceed even if a cache snapshot
cannot be taken.
This flag is set to true in tsm1.Engine.Backup(), the OSS backup code path.
This flag is set to false in tsm1.Engine.Export().

influxdata/plutonium#3227
(cherry picked from commit 6ec446f)
Test the skipCacheOk flag to tsdb.Shard.CreateSnapshot() and
tsdb.Engine.CreateSnapshot()
A value of true allows the backup to proceed even if a cache
snapshot cannot be taken.

influxdata/plutonium#3227
(cherry picked from commit 0dcff81)
@davidby-influx davidby-influx changed the title from "Dsb influxdb 1.8 3238" to "fix(tsm1): "snapshot in progress" error during backup" Nov 13, 2020
@davidby-influx davidby-influx marked this pull request as draft November 14, 2020 17:34
@davidby-influx davidby-influx self-assigned this Nov 14, 2020
Loop with backoff in (*Engine).CreateSnapshot() to retry
(*Engine).WriteSnapshot() up to 3 times if
ErrSnapshotInProgress is returned. Then continue
on no error, or on ErrSnapshotInProgress if skipCacheOk is
true.

influxdata/plutonium#3227
Failed to format code before commit.
2 participants