
The thinner might delete snapshots that will be used as the source snapshot #84

Closed
Scrin opened this issue Jul 3, 2021 · 11 comments

Scrin commented Jul 3, 2021

In the output I get:

  #### Synchronising
  [Source] docker/data/cortex@offsite-20210627030513: Destroying
  [Source] docker/data/cortex@offsite-20210628030429: Destroying
  [Target] backups/<redacted>/docker/data/cortex@offsite-20210627030513: receiving full
! [Source] STDERR > cannot open 'docker/data/cortex@offsite-20210627030513': dataset does not exist
! [Target] STDERR > cannot receive: failed to read from stream
! [Source] Command "ssh <redacted> 'zfs send --large-block docker/data/cortex@offsite-20210627030513'" returned exit code 1 (valid codes: [0])
! [Target] Command "zfs recv -u -x quota -x refreservation -x keylocation -x pbkdf2iters -x keyformat -x encryption -v -s backups/<redacted>/docker/data/cortex" returned exit code 1 (valid codes: [0])
! [Source] docker/data/cortex: FAILED: Last command returned error
  [Source] docker/data/grafana@offsite-20210627030513: Destroying
  [Source] docker/data/grafana@offsite-20210628030429: Destroying
  [Target] backups/<redacted>/docker/data/grafana@offsite-20210627030513: receiving full
! [Source] STDERR > cannot open 'docker/data/grafana@offsite-20210627030513': dataset does not exist
! [Target] STDERR > cannot receive: failed to read from stream
! [Source] Command "ssh <redacted> 'zfs send --large-block docker/data/grafana@offsite-20210627030513'" returned exit code 1 (valid codes: [0])
! [Target] Command "zfs recv -u -x quota -x refreservation -x keylocation -x pbkdf2iters -x keyformat -x encryption -v -s backups/<redacted>/docker/data/grafana" returned exit code 1 (valid codes: [0])
! [Source] docker/data/grafana: FAILED: Last command returned error

Note how the first of the two destroyed snapshots is the one it tries to use in zfs send.

In my nightly backup script I have the following:

zfs-autobackup --ssh-source $HOST --keep-source 5 --clear-refreservation --filter-properties quota --rollback --decrypt --encrypt --verbose offsite $LOCAL_DATASET
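
With the variables filled in, the invocation looks roughly like this (the host and dataset names below are made up; the real ones are redacted in the log above, where the target shows up as backups/<redacted>/...):

  # hypothetical values only; the real ones are redacted above
  HOST=sourcehost.example.com
  LOCAL_DATASET=backups/sourcehost
  zfs-autobackup --ssh-source $HOST --keep-source 5 --clear-refreservation --filter-properties quota --rollback --decrypt --encrypt --verbose offsite $LOCAL_DATASET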

This is not the first time I've encountered this. The first time was when the backup for a dataset had failed for many days, leading to new snapshots (and thinnings) on the source with no successful transfers (the cause was something modifying one specific dataset on the target, unrelated to this issue, and fixed by mounting the target datasets read-only). The thinner ended up thinning away all mutual snapshots on this dataset, which was bad.

To fix this I had to destroy/rename the target dataset to force a full send, which then failed for the same reason: the thinner deleted the oldest snapshot and zfs-autobackup tried to do a full send of the just-deleted snapshot. To resolve this I had to run zfs-autobackup once with a bigger --keep-source.

The second time I ran into this was just now, after I had swapped the backup machine's storage medium for a new one, expecting zfs-autobackup to do full sends (which it tried to do, with the just-deleted snapshots); essentially the same scenario as the previous one. Once again the workaround should be to run it once with a bigger --keep-source, which I've changed in my backup script that runs again tonight. I'll update tomorrow if that fails for some reason.
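
As an illustration, that one-off workaround run is just the nightly command with a temporarily larger source retention (the value 30 below is arbitrary, picked only for the example):

  # one-off run: keep more snapshots on the source so the thinner leaves the ones that still need to be sent
  zfs-autobackup --ssh-source $HOST --keep-source 30 --clear-refreservation --filter-properties quota --rollback --decrypt --encrypt --verbose offsite $LOCAL_DATASET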

(Before someone comments on it: the first time I ran into this and didn't notice the failing backups for a week, zfs-autobackup was being evaluated as a new backup solution to replace my previous one and thus had no monitoring, so the issue went unnoticed for so long. With zfs-autobackup now in production and proper monitoring in place this wouldn't go unnoticed, but that's not a fix for the issue.)

psy0rz (Owner) commented Jul 3, 2021

This is indeed a serious issue and should not happen. zfs-autobackup should never destroy source snapshots that the target still wants. Even if --keep-source=1.

Which version are you using?

Scrin (Author) commented Jul 3, 2021

Version 3.1rc3 (installed with pip) at the moment, though during the first occurrence of this issue I was on some earlier version which I can't remember.

psy0rz added the bug label Jul 3, 2021
psy0rz added this to the 3.1 milestone Jul 3, 2021
psy0rz (Owner) commented Jul 3, 2021

I've confirmed the bug: it only happens if you have a bunch of unsent snapshots that still need to be sent (e.g. you used --no-send or snapshot-only mode).

I'll figure out why this one slipped by and fix it immediately.

psy0rz (Owner) commented Jul 3, 2021

Ah, it's a bit more obscure to trigger; it only happens when:

  • There are multiple snapshots on the source.
  • There is no common snapshot yet (e.g. nothing has been sent over yet).
  • The source schedule wants to obsolete snapshots that the target schedule still wants.

So that explains why I've never seen it until now.
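
For reference, a rough way to end up in exactly that state (the dataset and host names are made up, and this is only a sketch, not a verified reproduction script) is:

  # 1. run in snapshot-only mode a few times (no target given), so snapshots pile up on the source unsent
  zfs-autobackup --keep-source 2 offsite
  # 2. there is still no common snapshot, since nothing has been sent to the target yet
  # 3. first run with a target: the source schedule already wants to drop the oldest snapshots,
  #    while the target schedule still wants them, so the thinner can destroy the very snapshot
  #    the full send is about to use
  zfs-autobackup --ssh-source sourcehost --keep-source 2 offsite backups/sourcehost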

psy0rz closed this as completed in 77f1c16 Jul 3, 2021
psy0rz (Owner) commented Jul 3, 2021

It seems I fixed this issue:

  The second time I ran into this was just now, after I had swapped the backup machine's storage medium for a new one, expecting zfs-autobackup to do full sends (which it tried to do, with the just-deleted snapshots); essentially the same scenario as the previous one. Once again the workaround should be to run it once with a bigger --keep-source, which I've changed in my backup script that runs again tonight. I'll update tomorrow if that fails for some reason.

I'm not sure about the first problem you had. Did you make snapshots another way, e.g. by running in snapshot-only mode?

psy0rz (Owner) commented Jul 3, 2021

Note: it could also have been triggered by destroying the target dataset and letting zfs-autobackup do a full resync.

psy0rz (Owner) commented Jul 3, 2021

I just released rc5 for you to try. Perhaps you can do some tests to make sure there isn't a second issue here.

I also added a regression test to prevent this from ever happening again.

Scrin (Author) commented Jul 3, 2021

Thanks, I'll update to rc5 and revert --keep-source back to 5, so that it's back in the "problematic situation", and see if it resolves itself. I'll report back tomorrow.

About the "first problem": I did not use any other way to create the snapshots. To be more precise about what happened:

  • zfs-autobackup had been running fine for a while, with no updates or configuration changes
  • something had changed in one of the datasets in such a way that it would appear to make changes on its own (probably some very special file or something), which caused the dataset to change on its own on the target
  • despite zfs-autobackup being configured with the --rollback option, the target dataset managed to change before the new snapshot was received, resulting in a failure (I later mitigated this by mounting the target datasets read-only; a sketch of that mitigation is at the end of this comment)
  • this seemingly did not prevent the thinner from operating; it kept thinning snapshots on source and target (I think), ultimately leading to a situation with no mutual snapshots
  • at this point there were no mutual snapshots, zfs-autobackup tried to do the "full send" in the specific situation you mentioned, and the thinner destroyed the snapshot (this wouldn't have worked anyway, since the target dataset already existed)

Unfortunately I no longer have the logs of the first occurrence, so I don't have the exact details of what was thinned, when, and where, but it's entirely possible it was another very obscure edge case; perhaps the five snapshots in --keep-source happened to fall just between two "weekly snapshots" in the default thinner schedule or something.
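
The read-only mitigation mentioned in the list above is plain ZFS; assuming the backups are received under a parent dataset like backups/sourcehost (name made up here), it was roughly:

  # make the target side read-only so nothing can modify the received datasets between runs
  zfs set readonly=on backups/sourcehost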

psy0rz (Owner) commented Jul 3, 2021

Then I think that issue was triggered by the same bug: if there is no common snapshot yet (or anymore), zfs-autobackup made a wrong assumption and cleaned up too much.
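
A quick way to check whether a dataset still has a common snapshot on both sides (the names below are placeholders matching the redacted log) is to compare the snapshot lists, for example:

  # prints the snapshot names present on BOTH source and target; empty output means no common snapshot
  comm -12 \
    <(ssh $HOST zfs list -H -o name -t snapshot -d 1 docker/data/cortex | sed 's/.*@//' | sort) \
    <(zfs list -H -o name -t snapshot -d 1 backups/myhost/docker/data/cortex | sed 's/.*@//' | sort)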

thanks!

Scrin (Author) commented Jul 4, 2021

I can confirm the fix works; rc5 doesn't destroy the snapshots too early:

  [Target] backups/<redacted>/docker/data/grafana@offsite-20210629030457: receiving full
  [Target] backups/<redacted>/docker/data/grafana@offsite-20210630030509: receiving incremental
  [Source] docker/data/grafana@offsite-20210629030457: Destroying
  [Target] backups/<redacted>/docker/data/grafana@offsite-20210701030426: receiving incremental
  [Source] docker/data/grafana@offsite-20210630030509: Destroying
  [Target] backups/<redacted>/docker/data/grafana@offsite-20210702041148: receiving incremental
  [Target] backups/<redacted>/docker/data/grafana@offsite-20210703030340: receiving incremental
! [Source] STDERR > cannot hold snapshot 'docker/data/grafana@offsite-20210703030340': tag already exists on this dataset
  [Target] backups/<redacted>/docker/data/grafana@offsite-20210704030243: receiving incremental

I do get the "cannot hold snapshot" error printed, but I believe that's expected in this case, and it doesn't trigger a "failed" status, since I still get:

  #### All operations completed successfully

at the end, so it's all good now if that's the expected behavior.

psy0rz (Owner) commented Jul 5, 2021

Thanks!

Yes, the "cannot hold" is correct: there is already a hold on that snapshot.

The reason we don't check whether a hold already exists is performance: we don't want the extra checks and zfs calls for this edge case.
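
Done by hand, that check would amount to one extra zfs call per snapshot, something like the following (the hold tag name here is only an assumed example, not necessarily the exact tag zfs-autobackup uses):

  # only place the hold if a hold with our tag isn't already there (costs an extra `zfs holds` call)
  if ! zfs holds -H docker/data/grafana@offsite-20210703030340 | awk '{print $2}' | grep -qx 'zfs_autobackup:offsite'; then
      zfs hold zfs_autobackup:offsite docker/data/grafana@offsite-20210703030340
  fi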
