
The thinner might delete snapshots that will be used as the source snapshot #84

Closed
Scrin opened this issue Jul 3, 2021 · 11 comments

Scrin commented Jul 3, 2021

In the output I get:

  #### Synchronising
  [Source] docker/data/cortex@offsite-20210627030513: Destroying
  [Source] docker/data/cortex@offsite-20210628030429: Destroying
  [Target] backups/<redacted>/docker/data/cortex@offsite-20210627030513: receiving full
! [Source] STDERR > cannot open 'docker/data/cortex@offsite-20210627030513': dataset does not exist
! [Target] STDERR > cannot receive: failed to read from stream
! [Source] Command "ssh <redacted> 'zfs send --large-block docker/data/cortex@offsite-20210627030513'" returned exit code 1 (valid codes: [0])
! [Target] Command "zfs recv -u -x quota -x refreservation -x keylocation -x pbkdf2iters -x keyformat -x encryption -v -s backups/<redacted>/docker/data/cortex" returned exit code 1 (valid codes: [0])
! [Source] docker/data/cortex: FAILED: Last command returned error
  [Source] docker/data/grafana@offsite-20210627030513: Destroying
  [Source] docker/data/grafana@offsite-20210628030429: Destroying
  [Target] backups/<redacted>/docker/data/grafana@offsite-20210627030513: receiving full
! [Source] STDERR > cannot open 'docker/data/grafana@offsite-20210627030513': dataset does not exist
! [Target] STDERR > cannot receive: failed to read from stream
! [Source] Command "ssh <redacted> 'zfs send --large-block docker/data/grafana@offsite-20210627030513'" returned exit code 1 (valid codes: [0])
! [Target] Command "zfs recv -u -x quota -x refreservation -x keylocation -x pbkdf2iters -x keyformat -x encryption -v -s backups/<redacted>/docker/data/grafana" returned exit code 1 (valid codes: [0])
! [Source] docker/data/grafana: FAILED: Last command returned error

Note how the first of the two destroyed snapshots is the one it tries to use in zfs send.

In my nightly backup script I have the following:

zfs-autobackup --ssh-source $HOST --keep-source 5 --clear-refreservation --filter-properties quota --rollback --decrypt --encrypt --verbose offsite $LOCAL_DATASET
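
With the variables filled in, the invocation looks roughly like this (the host and dataset names below are made up; the real ones are redacted in the log above, where the target shows up as backups/<redacted>/...):

  # hypothetical values only; the real ones are redacted above
  HOST=sourcehost.example.com
  LOCAL_DATASET=backups/sourcehost
  zfs-autobackup --ssh-source $HOST --keep-source 5 --clear-refreservation --filter-properties quota --rollback --decrypt --encrypt --verbose offsite $LOCAL_DATASET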

This is not the first time I've encountered this. The first time was when the backup for a dataset had failed for many days, leading to new snapshots (and thinnings) on the source with no successful transfers (the cause was something modifying one specific dataset on the target, unrelated to this issue, and fixed by mounting the target datasets read-only). The thinner ended up thinning away all mutual snapshots on this dataset, which was bad.

To fix this I had to destroy/rename the target dataset to force a full send, which then failed for the same reason: the thinner deleted the oldest snapshot and zfs-autobackup tried to do a full send of the just-deleted snapshot. To resolve this I had to run zfs-autobackup once with a bigger --keep-source.

The second time I ran into this was just now, after I had swapped the backup machine's storage medium for a new one, expecting zfs-autobackup to do full sends (which it tried to do, with the just-deleted snapshots); essentially the same scenario as the previous one. Once again the workaround should be to run it once with a bigger --keep-source, which I've changed in my backup script that runs again tonight. I'll update tomorrow if that fails for some reason.
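
As an illustration, that one-off workaround run is just the nightly command with a temporarily larger source retention (the value 30 below is arbitrary, picked only for the example):

  # one-off run: keep more snapshots on the source so the thinner leaves the ones that still need to be sent
  zfs-autobackup --ssh-source $HOST --keep-source 30 --clear-refreservation --filter-properties quota --rollback --decrypt --encrypt --verbose offsite $LOCAL_DATASET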

(Before someone comments on it: the first time I ran into this and didn't notice the failing backups for a week, zfs-autobackup was being evaluated as a new backup solution to replace my previous one and thus had no monitoring, so the issue went unnoticed for so long. With zfs-autobackup now in production and proper monitoring in place this wouldn't go unnoticed, but that's not a fix for the issue.)

psy0rz (Owner) commented Jul 3, 2021

This is indeed a serious issue and should not happen. zfs-autobackup should never destroy source snapshots that the target still wants. Even if --keep-source=1.

Which version are you using?

Scrin (Author) commented Jul 3, 2021

Version 3.1rc3 (installed with pip) at the moment, though during the first occurrence of this issue I was on some earlier version which I can't remember.

psy0rz added the bug label Jul 3, 2021
psy0rz added this to the 3.1 milestone Jul 3, 2021
psy0rz (Owner) commented Jul 3, 2021

I've confirmed the bug: it only happens if you have a bunch of unsent snapshots that still need to be sent (e.g. you used --no-send or snapshot-only mode).

I'll figure out why this one slipped by and fix it immediately.

psy0rz (Owner) commented Jul 3, 2021

Ah, it's a bit more obscure to trigger; it only happens when:

  • There are multiple snapshots on the source.
  • There is no common snapshot yet (e.g. nothing has been sent over yet).
  • The source schedule wants to obsolete snapshots that the target schedule still wants.

So that explains why I've never seen it until now.
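
For reference, a rough way to end up in exactly that state (the dataset and host names are made up, and this is only a sketch, not a verified reproduction script) is:

  # 1. run in snapshot-only mode a few times (no target given), so snapshots pile up on the source unsent
  zfs-autobackup --keep-source 2 offsite
  # 2. there is still no common snapshot, since nothing has been sent to the target yet
  # 3. first run with a target: the source schedule already wants to drop the oldest snapshots,
  #    while the target schedule still wants them, so the thinner can destroy the very snapshot
  #    the full send is about to use
  zfs-autobackup --ssh-source sourcehost --keep-source 2 offsite backups/sourcehost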

psy0rz closed this as completed in 77f1c16 Jul 3, 2021
psy0rz (Owner) commented Jul 3, 2021

It seems I fixed this issue:

  The second time I ran into this was just now, after I had swapped the backup machine's storage medium for a new one, expecting zfs-autobackup to do full sends (which it tried to do, with the just-deleted snapshots); essentially the same scenario as the previous one. Once again the workaround should be to run it once with a bigger --keep-source, which I've changed in my backup script that runs again tonight. I'll update tomorrow if that fails for some reason.

I'm not sure about the first problem you had. Did you make snapshots another way, e.g. by running in snapshot-only mode?

psy0rz (Owner) commented Jul 3, 2021

Note: it could also have been triggered by destroying the target dataset and letting zfs-autobackup do a full resync.

psy0rz (Owner) commented Jul 3, 2021

I just released rc5 for you to try. Perhaps you can do some tests to make sure there isn't a second issue here.

I also added a regression test to prevent this from ever happening again.

Scrin (Author) commented Jul 3, 2021

Thanks, I'll update to rc5 and revert --keep-source back to 5, so that it's back in the "problematic situation", and see if it resolves itself. I'll report back tomorrow.

About the "first problem": I did not use any other way to create the snapshots. To be more precise about what happened:

  • zfs-autobackup had been running fine for a while, with no updates or configuration changes
  • something had changed in one of the datasets in such a way that it would appear to make changes on its own (probably some very special file or something), which caused the dataset to change on its own on the target
  • despite zfs-autobackup being configured with the --rollback option, the target dataset managed to change before the new snapshot was received, resulting in a failure (I later mitigated this by mounting the target datasets read-only; a sketch of that mitigation is at the end of this comment)
  • this seemingly did not prevent the thinner from operating; it kept thinning snapshots on source and target (I think), ultimately leading to a situation with no mutual snapshots
  • at this point there were no mutual snapshots, zfs-autobackup tried to do the "full send" in the specific situation you mentioned, and the thinner destroyed the snapshot (this wouldn't have worked anyway, since the target dataset already existed)

Unfortunately I no longer have the logs of the first occurrence, so I don't have the exact details of what was thinned, when, and where, but it's entirely possible it was another very obscure edge case; perhaps the five snapshots in --keep-source happened to fall just between two "weekly snapshots" in the default thinner schedule or something.
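
The read-only mitigation mentioned in the list above is plain ZFS; assuming the backups are received under a parent dataset like backups/sourcehost (name made up here), it was roughly:

  # make the target side read-only so nothing can modify the received datasets between runs
  zfs set readonly=on backups/sourcehost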

psy0rz (Owner) commented Jul 3, 2021

Then I think that issue was triggered by the same bug: if there is no common snapshot yet (or anymore), zfs-autobackup made a wrong assumption and cleaned up too much.
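
A quick way to check whether a dataset still has a common snapshot on both sides (the names below are placeholders matching the redacted log) is to compare the snapshot lists, for example:

  # prints the snapshot names present on BOTH source and target; empty output means no common snapshot
  comm -12 \
    <(ssh $HOST zfs list -H -o name -t snapshot -d 1 docker/data/cortex | sed 's/.*@//' | sort) \
    <(zfs list -H -o name -t snapshot -d 1 backups/myhost/docker/data/cortex | sed 's/.*@//' | sort)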

thanks!

Scrin (Author) commented Jul 4, 2021

I can confirm the fix works; rc5 doesn't destroy the snapshots too early:

  [Target] backups/<redacted>/docker/data/grafana@offsite-20210629030457: receiving full
  [Target] backups/<redacted>/docker/data/grafana@offsite-20210630030509: receiving incremental
  [Source] docker/data/grafana@offsite-20210629030457: Destroying
  [Target] backups/<redacted>/docker/data/grafana@offsite-20210701030426: receiving incremental
  [Source] docker/data/grafana@offsite-20210630030509: Destroying
  [Target] backups/<redacted>/docker/data/grafana@offsite-20210702041148: receiving incremental
  [Target] backups/<redacted>/docker/data/grafana@offsite-20210703030340: receiving incremental
! [Source] STDERR > cannot hold snapshot 'docker/data/grafana@offsite-20210703030340': tag already exists on this dataset
  [Target] backups/<redacted>/docker/data/grafana@offsite-20210704030243: receiving incremental

I do get the "cannot hold snapshot" error printed, but I believe that's expected in this case, and it doesn't trigger a "failed" status, since I still get:

  #### All operations completed successfully

at the end, so it's all good now if that's the expected behavior.

psy0rz (Owner) commented Jul 5, 2021

Thanks!

Yes, the "cannot hold" is correct: there is already a hold on that snapshot.

The reason we don't check whether a hold already exists is performance: we don't want the extra checks and zfs calls for this edge case.
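
Done by hand, that check would amount to one extra zfs call per snapshot, something like the following (the hold tag name here is only an assumed example, not necessarily the exact tag zfs-autobackup uses):

  # only place the hold if a hold with our tag isn't already there (costs an extra `zfs holds` call)
  if ! zfs holds -H docker/data/grafana@offsite-20210703030340 | awk '{print $2}' | grep -qx 'zfs_autobackup:offsite'; then
      zfs hold zfs_autobackup:offsite docker/data/grafana@offsite-20210703030340
  fi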
