borg info not ignoring part files? #3522

Closed
alejandro-perez opened this issue Jan 9, 2018 · 22 comments · Fixed by #4286

@alejandro-perez

alejandro-perez commented Jan 9, 2018

Yes, I know recreate is EXPERIMENTAL. But you need feedback, don't you? :)

Since zstd is now available, I wanted to run some tests on a copy of my repository. In particular, I wanted to see the impact on 1) size and 2) extraction speed.

So I decided to recompress the oldest archive of my repository:
borg info .::2017-03-31_09:01:06

Archive name: 2017-03-31_09:01:06
Archive fingerprint: 44a9792df44a1477c1a613a161e82ac77b6ec631c0aad73a01db36435d7b1492
Comment: 
Hostname: XXXXXXXXXXXX
Username: root
Time (start): Fri, 2017-03-31 09:01:06
Time (end): Fri, 2017-03-31 09:32:40
Duration: 31 minutes 33.56 seconds
Number of files: 209353
Command line: XXXXXXXXXXXXXX
Utilization of maximum supported archive size: 0%
------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:               89.95 GB             70.82 GB            255.95 MB
All archives:                8.08 TB              5.76 TB            131.07 GB

                       Unique chunks         Total chunks
Chunk index:                  614777             35652254

Now, recompress.
borg recreate .::2017-03-31_09:01:06 --recompress -C zstd -p -v

recreate is an experimental feature.
Type 'YES' if you understand this and want to continue: YES
/usr/lib/python3.6/site-packages/borg/archive.py:225: FutureWarning: use_bin_type option is not specified. Default value of the option will be changed in future version.
/usr/lib/python3.6/site-packages/msgpack/__init__.py:47: FutureWarning: use_bin_type option is not specified. Default value of the option will be changed in future version.                                    
/usr/lib/python3.6/site-packages/msgpack/__init__.py:37: FutureWarning: use_bin_type option is not specified. Default value of the option will be changed in future version. 

And then check the result.
borg info .::2017-03-31_09:01:06

Archive name: 2017-03-31_09:01:06
Archive fingerprint: 4f52e385277a5ab84abbec49dea66bc2f3cebd705ed8379d41738660628e6795
Comment: 
Hostname: XXXXXXXXXXXXXXXXX
Username: root
Time (start): Fri, 2017-03-31 09:01:06
Time (end): Fri, 2017-03-31 09:32:40
Duration: 31 minutes 33.56 seconds
Number of files: 209355
Command line: XXXXXXXXXXXXXXXx
Utilization of maximum supported archive size: 0%
------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:               94.51 GB             74.07 GB            251.10 MB
All archives:                8.08 TB              5.69 TB            129.89 GB

                       Unique chunks         Total chunks
Chunk index:                  614783             35654007

I fail to understand how there can be 2 more files and roughly 4.5 GB more original size. Note that the total deduplicated size also decreased, by about 1.2 GB.
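For reference, the differences between the two `borg info` outputs can be worked out with a few lines of Python. This is only a sketch: the numbers are copied from the outputs above, and the dictionary keys are made-up labels, not borg field names.

```python
# Values copied from the two `borg info` outputs above.
# The key names are illustrative labels, not borg identifiers.
before = {"nfiles": 209353, "original_gb": 89.95, "compressed_gb": 70.82,
          "all_dedup_gb": 131.07}
after = {"nfiles": 209355, "original_gb": 94.51, "compressed_gb": 74.07,
         "all_dedup_gb": 129.89}

# Difference per column, rounded to the precision borg prints.
delta = {key: round(after[key] - before[key], 2) for key in before}
for key, value in delta.items():
    print(f"{key}: {value:+}")
```

So the recreated archive reports 2 more files and 4.56 GB more original size, while the deduplicated size over all archives went down by 1.18 GB.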

bounty: https://www.bountysource.com/issues/53581808-borg-info-not-ignoring-part-files

@alejandro-perez
Author

Additional note 1: A compaction ran in the middle of the recreate process.
Additional note 2: I had previously run recreate on the same archive to "prune" some files I no longer wanted. If something failed in that earlier recreate, the "wrong" original size may be the 89 GB figure rather than the 94 GB one.

@ThomasWaldmann ThomasWaldmann changed the title from "Need help to understand whether substantial changes after recreate are normal" to "are substantial changes after recreate normal?" Jan 10, 2018
@ThomasWaldmann
Member

Suspicion: count/space accounting seems incorrect because part files are not ignored.

I need some more info:

  • borg --version (likely 1.1.4 release, but who knows :)
  • borg list archive1 | wc -l
  • borg list archive2 | wc -l
  • borg list --consider-part-files archive1 | wc -l
  • borg list --consider-part-files archive2 | wc -l

@alejandro-perez
Author

Sorry about the missing information, I was in a hurry and wrote it too fast.

borg --version (likely 1.1.4 release, but who knows :)

borg 1.1.4, Arch Linux

borg list archive1 | wc -l

Can't since the archive was recreated

borg list archive2 | wc -l

253821

borg list --consider-part-files archive1 | wc -l

Can't since the archive was recreated

borg list --consider-part-files archive2 | wc -l

253823

It does indeed seem to be the part files. But why weren't they reported for the old archive? Is it because it was created with an older borg version?

@ThomasWaldmann
Member

The part files get created by the checkpointing mechanism to support in-file checkpoints.
They can be different in archive1 and 2.

@ThomasWaldmann ThomasWaldmann added this to the 1.1.x milestone Jan 10, 2018
@alejandro-perez
Author

alejandro-perez commented Jan 10, 2018

But it is the same archive, just after recreation. Can a recreate generate new checkpoints?

In the original archive (the one before pruning the files, which I still have as this one is just for testing) there are no part files. But after the "recreate" processes, this is what I have:

-rw-r--r--   1000    100 2506646232 Wed, 2015-07-01 15:55:18 home/alex/VirtualBox VMs/Old VM/GSS preauth.7z.borg_part_1
-rw-r--r--   1000    100 2046030963 Wed, 2015-07-01 15:55:18 home/alex/VirtualBox VMs/Old VM/GSS preauth.7z.borg_part_2
-rw-r--r--   1000    100 4552677195 Wed, 2015-07-01 15:55:18 home/alex/VirtualBox VMs/Old VM/GSS preauth.7z

There I have the full file and two part files, which I guess should have been removed, shouldn't they?
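A quick way to check how the part files relate to the full file is to sum their sizes. The sketch below parses the three listing lines quoted above; the `.borg_part_N` suffix pattern is inferred from that listing, and the parsing assumes exactly this column layout.

```python
import re

# The three listing lines quoted above (mode, uid, gid, size, date, path).
listing = """\
-rw-r--r--   1000    100 2506646232 Wed, 2015-07-01 15:55:18 home/alex/VirtualBox VMs/Old VM/GSS preauth.7z.borg_part_1
-rw-r--r--   1000    100 2046030963 Wed, 2015-07-01 15:55:18 home/alex/VirtualBox VMs/Old VM/GSS preauth.7z.borg_part_2
-rw-r--r--   1000    100 4552677195 Wed, 2015-07-01 15:55:18 home/alex/VirtualBox VMs/Old VM/GSS preauth.7z
"""

# Suffix pattern inferred from the listing, not taken from borg's source.
PART_SUFFIX = re.compile(r"\.borg_part_\d+$")


def parse(line):
    # Split off the first six whitespace-separated fields; the remainder is
    # "<time> <path>", and the path may itself contain spaces.
    mode, uid, gid, size, weekday, date, rest = line.split(None, 6)
    time, path = rest.split(" ", 1)
    return int(size), path


entries = [parse(line) for line in listing.splitlines()]
part_bytes = sum(size for size, path in entries if PART_SUFFIX.search(path))
full_bytes = sum(size for size, path in entries if not PART_SUFFIX.search(path))
print(part_bytes, full_bytes, part_bytes == full_bytes)
```

In this listing the two part files add up to exactly the size of the full file (4552677195 bytes, about 4.55 GB), which is close to the extra original size reported by `borg info`, and two part files match the two extra files.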

@ThomasWaldmann
Member

The new archive is created by running the same code as for create. The only difference is that the data is sourced from the original archive instead of the filesystem.

So yes, it will create new checkpoints (new part files) according to --checkpoint-interval.

If the original archive has part files, they will get dropped as usual (when not using --consider-part-files) - they are not needed because there is also the full file, assuming that the archive is a normally completed archive.

So the part files you are seeing are NEW part files, made when checkpointing the new archive.

@alejandro-perez
Author

But, shouldn't part files be removed if the archive is successfully created?

@ThomasWaldmann
Member

The archive's item metadata stream is append-only. Once an item metadata entry has been written to it, it cannot easily be removed again (except with recreate).

We need to create the part file to contain the chunks of the partial file at checkpoint time. After the checkpoint, more part file(s) will be created and finally the full file. After the full file item is written, the part files are not needed anymore, but we can't undo them - thus they are just ignored.
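The behaviour described above can be modelled in a few lines. This is a toy sketch, not borg's actual classes: `ItemStream`, the dict layout and the `part` key are all hypothetical names chosen for illustration.

```python
class ItemStream:
    """Toy model of an append-only item metadata stream (not borg's real classes)."""

    def __init__(self):
        self._items = []

    def append(self, item):
        # Items can only be added; there is deliberately no remove().
        self._items.append(item)

    def iter_items(self, consider_part_files=False):
        for item in self._items:
            if "part" in item and not consider_part_files:
                continue  # checkpoint artifact, hidden by default
            yield item


stream = ItemStream()
# Two checkpoints happen while one big file is being processed ...
stream.append({"path": "big.7z.borg_part_1", "part": 1, "size": 25})
stream.append({"path": "big.7z.borg_part_2", "part": 2, "size": 20})
# ... and finally the complete file item is written.
stream.append({"path": "big.7z", "size": 45})

visible = [item["path"] for item in stream.iter_items()]
everything = [item["path"] for item in stream.iter_items(consider_part_files=True)]
print(visible)
print(everything)
```

The part items stay in the stream forever; readers simply skip them unless asked not to.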

The exception is borg info: it looks like it does not ignore the part files when computing count and size, which is likely the bug here.

@alejandro-perez
Author

It is clear now. Thanks!

@ThomasWaldmann ThomasWaldmann changed the title from "are substantial changes after recreate normal?" to "borg info not ignoring part files?" Jan 10, 2018
@alejandro-perez
Author

So a workaround for me here would be to set a really long checkpoint interval (e.g. days) during recreation, to make sure no checkpoint is created.

@ThomasWaldmann
Member

@enkore This looks like an issue in cache_sync: it does not ignore part files.

@enkore
Contributor

enkore commented Jan 10, 2018

I don't think it ever ignored them. If it did, wouldn't that cause the reference counts to be too low?

@ThomasWaldmann
Member

The usual python code to iterate over archive items ignores part files by default, except when --consider-part-files is given on the command line.
So my guess is that this worked before it was refactored/tuned with that msgpack state machine in C.
And I'm only talking about the statistics here, so no problem with refcounts.

@enkore
Contributor

enkore commented Jan 13, 2018

But wouldn't the stats be wrong if you ignored the part files? The references and the data are there.

@ThomasWaldmann
Member

Well, there are 2 views:

  • consider_part_files = True considers everything (like now).
  • consider_part_files = False (default) only considers what a user usually is interested in. Checkpoint implementation details like part files are not shown / not counted.

The latter is important to not confuse users with stats inconsistencies caused by such implementation details.
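The two views can be sketched as one stats function with a flag. This is an illustration only: the `Item` type, its `part` field and `archive_stats` are hypothetical, and the sizes are the ones from the listing earlier in this thread.

```python
from typing import NamedTuple, Optional


class Item(NamedTuple):
    path: str
    size: int
    part: Optional[int] = None  # hypothetical marker: set only on part files


def archive_stats(items, consider_part_files=False):
    """Return (nfiles, original_size) for the given items."""
    nfiles = 0
    original_size = 0
    for item in items:
        if item.part is not None and not consider_part_files:
            continue  # user-facing view: checkpoint details are not counted
        nfiles += 1
        original_size += item.size
    return nfiles, original_size


items = [
    Item("GSS preauth.7z.borg_part_1", 2506646232, part=1),
    Item("GSS preauth.7z.borg_part_2", 2046030963, part=2),
    Item("GSS preauth.7z", 4552677195),
]

user_view = archive_stats(items)
full_view = archive_stats(items, consider_part_files=True)
print(user_view)
print(full_view)
```

With the default view the file is counted once; with part files considered, the same data is counted twice and the file count goes up by two.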

@enkore
Contributor

enkore commented Jan 15, 2018

I think either way is inconsistent, but in different ways:

  • consider_part_files = True (no filtering): accurate representation of how much space the archive uses. borg delete would show the exact same number, for example.
  • consider_part_files = False (filters part files): (mostly) accurate representation of what data was backed up. But that number would often be different (less) from how much space the archive actually uses.

In a way these files made the metadata somewhat self-contradictory; in a perfect world they would not have been necessary, but I really wouldn't want to see how messy it would be to achieve this in the borg code base.

@ThomasWaldmann
Member

ThomasWaldmann commented Jan 15, 2018

Isn't "how much space the archive uses" the same for both? The part files completely dedup with the final full file (assuming that there actually is a final full file).

There's also the file count, which obviously shows unexpectedly higher numbers; see the bug report.
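To illustrate the deduplication point: if the part files reference only chunks that the full file also references, the unique chunk data is the same with or without them, while the summed original size is not. A simplified sketch with made-up chunk ids and sizes ("deduplicated" here just means unique chunks within this one set of items, which is simpler than what borg reports):

```python
def sizes(items, chunk_sizes):
    """Return (original_size, deduplicated_size) for a list of items.

    Each item is a list of chunk ids; chunk_sizes maps chunk id -> bytes.
    """
    original = sum(chunk_sizes[c] for item in items for c in item)
    unique = {c for item in items for c in item}
    deduplicated = sum(chunk_sizes[c] for c in unique)
    return original, deduplicated


# Made-up chunk ids and sizes, for illustration only.
chunk_sizes = {"c1": 10, "c2": 15, "c3": 20}
part_1 = ["c1", "c2"]
part_2 = ["c3"]
full_file = ["c1", "c2", "c3"]

with_parts = sizes([part_1, part_2, full_file], chunk_sizes)
without_parts = sizes([full_file], chunk_sizes)
print(with_parts)
print(without_parts)
```

The original size doubles when the part files are counted, but the deduplicated size stays the same.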

@enkore
Contributor

enkore commented Jan 16, 2018

Ah right. So it's only a question of original/compressed size and nfiles then? If one wanted to implement this, it shouldn't be too difficult. One would want to introduce a second state in expect_chunks_map_key to look for the is_partfile flag (or whatever it's called exactly). A complication is of course that msgpack maps are unordered, so the chunks could come before that flag. If one only wants to modify the counting of nfiles, that's no big deal, but changing all stats may require bigger changes. On Python 3.6 this may appear to work when it does not (due to implicit dict ordering and msgpack-python implementation details).
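The ordering complication can be shown with a small model. The real code is a C state machine over msgpack data; the sketch below is plain Python over lists of (key, value) pairs, and the `part` and `chunks` key names are assumptions. The idea is to hold back an item's size until the whole item has been seen, since the flag may arrive after the chunks.

```python
def count_stats(item_events, consider_part_files=False):
    """Return (nfiles, total_size).

    item_events is an iterable of items, each a list of (key, value) pairs in
    arbitrary order, mimicking keys arriving one by one from a stream parser.
    """
    nfiles = 0
    total_size = 0
    for pairs in item_events:
        is_part = False
        pending_size = 0
        for key, value in pairs:
            if key == "part":
                is_part = True
            elif key == "chunks":
                # Do not add to the totals yet: the part flag may come later.
                pending_size += sum(size for _chunk_id, size in value)
        # Only at the end of the item is it known whether it is a part file.
        if is_part and not consider_part_files:
            continue
        nfiles += 1
        total_size += pending_size
    return nfiles, total_size


items = [
    # Flag arrives AFTER the chunks.
    [("path", "big.borg_part_1"), ("chunks", [("c1", 10)]), ("part", 1)],
    # Flag arrives BEFORE the chunks.
    [("part", 2), ("path", "big.borg_part_2"), ("chunks", [("c2", 15)])],
    # The final full file has no flag at all.
    [("chunks", [("c1", 10), ("c2", 15)]), ("path", "big")],
]

filtered = count_stats(items)
unfiltered = count_stats(items, consider_part_files=True)
print(filtered)
print(unfiltered)
```

Adding sizes to the totals as soon as the chunks are seen would give the right result only when the flag happens to come first.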

Looking at e189a4d I don't think anything was missed when making the transition, since I don't see any partfile filtering in the old code.

@ThomasWaldmann
Member

The "This Archive" stats are fixed by PR #4286.

The "All Archives" stats can not easily be fixed to not consider part files, due to the way they are computed.

@ThomasWaldmann
Member

Hmm, I just noticed that by implementing #3241, the "All archives" stats values for the original and compressed columns could be computed by summing up the stats values from all archive headers, provided all archives have them (old archives do not, but maybe these could get updated by borg recreate).

Only the deduplicated size would come from hashindex stats.
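A rough sketch of that idea, assuming each archive header carries its own stats. The dict layout and the `stats`, `original_size` and `compressed_size` names are hypothetical, and so are the numbers.

```python
def all_archives_stats(archive_headers):
    """Sum per-archive stats; return None if any archive lacks them."""
    totals = {"original_size": 0, "compressed_size": 0}
    for header in archive_headers:
        stats = header.get("stats")
        if stats is None:
            return None  # e.g. an old archive written before stats were stored
        for key in totals:
            totals[key] += stats[key]
    return totals


# Made-up headers for illustration.
headers = [
    {"name": "a1", "stats": {"original_size": 100, "compressed_size": 70}},
    {"name": "a2", "stats": {"original_size": 50, "compressed_size": 30}},
]
old_header = {"name": "a0"}  # no stored stats

complete = all_archives_stats(headers)
incomplete = all_archives_stats(headers + [old_header])
print(complete)
print(incomplete)
```

A single archive without stored stats is enough to make the sum unavailable, which is why old archives would need updating first.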

ThomasWaldmann added a commit that referenced this issue Feb 4, 2019
…4286)

cache_sync: compute size/count stats, borg info: consider part files

fixes #3522
@ThomasWaldmann
Member

Reopening it as the "All archives" fix is still needed.

@ThomasWaldmann
Member

ThomasWaldmann commented Feb 5, 2019

Closing this to claim the bounty.

It only fixes "This archive" as noted above, but considering the bounty isn't much either, I consider it to be ok.

I'll open a new issue for the remaining fix needed for "All archives": #4329
