borg info: "This archive Original size" is incorrect #5408

Closed
sshaikh opened this issue Oct 10, 2020 · 20 comments

Comments

@sshaikh

sshaikh commented Oct 10, 2020

Have you checked borgbackup docs, FAQ, and open Github issues?

Yes, including those in #4654 (comment), I believe I'm running a version of borg that has the fixes referred to in those.

Is this a BUG / ISSUE report or a QUESTION?

A bug

System information. For client/server mode post info for both machines.

Your borg version (borg -V).

borg 1.1.11

Operating system (distribution) and version.

Debian 5.7.10

Hardware / network configuration, and filesystems used.

Running on OMV, ext4 on LUKS on LVM

How much data is handled by borg?

~195GB according to du (and duplicacy)

Full borg commandline that led to the problem (leave out excludes and passwords)

borg info repo::version

Describe the problem you're observing.

"This archive original size" is 204.33GB. "This archive Compressed size" is 194.98GB which is closer to what I expect (although would have hoped that was < real size). The total deduplicated archive size (for around 120 backups) is 190GB which is about right.

Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.

Yes, by running info again.

@ThomasWaldmann
Member

Can you reproduce with a current borg 1.1.x release (like 1.1.14)? You can use either the binary from github releases or get it from debian backports.

@enkore
Contributor

enkore commented Oct 11, 2020

The archive stats include the metadata stored by Borg, not just the file contents. du only accounts for the disk space used by the contents.

@sshaikh
Author

sshaikh commented Oct 11, 2020

I thought it might be a metadata issue, but from what I understand of those previous tickets, the goal was to make "this archive original size" equal to the size of the files being backed up, perhaps to avoid questions like mine :). I believe the previous issue was due to part files, so it would be interesting if metadata isn't excluded too (that would also mean the metadata is 10GB on 200GB of files).

That said, my du math might be incorrect so if there's a way I should measure the actual size of the files as borg sees them, let me know. Judging by how the compressed size is more in line with what I expect, I can't help but think something sparse is being picked up.

Here's the output with the latest release:

# ./borg-linux64 -V
borg-linux64 1.1.14
# ./borg-linux64 info repo::version
...
------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:              204.33 GB            194.98 GB              1.22 MB
All archives:               22.36 TB             21.20 TB            190.39 GB

                       Unique chunks         Total chunks
Chunk index:                  310537             32178323

@ThomasWaldmann
Member

Do you see a difference if you add --consider-part-files?

@ThomasWaldmann
Member

BTW, I remember doing some fixes for stats stuff like this. IIRC, some were too big for 1.1.x and just went into master. Not sure if this issue was also fixed, though.

@sshaikh
Author

sshaikh commented Oct 11, 2020

Unfortunately --consider-part-files makes no difference to the summary.

What I thought was the relevant fix appears in the changelog for 1.1.9 (info: consider part files for "This archive" stats, #3522).

What about deleted files - would they appear in the count for Original size?

@enkore
Contributor

enkore commented Oct 11, 2020

How many files are in this archive?

@ThomasWaldmann
Member

ThomasWaldmann commented Oct 11, 2020

Issue #3522 (about "this archive") was fixed in master branch by #4286 and in 1.1-maint branch by #4326 .

Issue #4329 (about "all archives") was fixed only in master branch by #4515 . No 1.1-maint backport as that depends on new borg 1.2 archive metadata.

master branch will get released at some point in the future as borg 1.2.

@hashbackup

du counts allocated disk space. For example, it reports that a 4-byte file uses 4K bytes:

ms:~ jim$ echo abc>abc
ms:~ jim$ du -ks abc
4	abc

In contrast, most backup programs add up the file sizes shown by ls, from the stat() syscall. If you are backing up a large number of small files, this extra padding to the next 4K boundary will make du usage much higher, but it also reflects the reality of 4K allocations. So you are comparing a disk-block-padded size from du with a borg size that I'm guessing is not padded.
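To make the padding effect concrete, here is a minimal sketch that rounds each file size up to an assumed 4096-byte allocation unit and compares the padded total with the plain sum of sizes. The block size, the `padded_size` helper, and the sample sizes are illustrative assumptions, not values read from du or borg; real filesystems may allocate differently (for example, very small files can be stored inline).

```python
BLOCK = 4096  # assumed allocation unit; real filesystems vary


def padded_size(size_bytes: int, block: int = BLOCK) -> int:
    """Round a file size up to the next multiple of the block size."""
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return -(-size_bytes // block) * block


# Hypothetical file sizes in bytes: a tiny file, an exact block, a bit over one block.
sizes = [4, 4096, 5000]

apparent_total = sum(sizes)                        # what summing stat() sizes gives
padded_total = sum(padded_size(s) for s in sizes)  # roughly what block-based du gives

print(apparent_total)  # 9100
print(padded_total)    # 16384
```

With many small files the padded total can be far above the apparent total; with a few large files the two converge.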

@sshaikh
Author

sshaikh commented Oct 11, 2020

@enkore 219284 files

@hashbackup but then shouldn't the borg backup report a /smaller/ size?

@ThomasWaldmann does that mean the changelog for 1.1.9 is incorrect? If so we can close this issue.

@ThomasWaldmann
Member

ThomasWaldmann commented Oct 11, 2020

Tried with borg 1.1.14+ - the observed inconsistency could be due to a sparse file, look at this:

$ dd if=/dev/zero of=sparse_file bs=1 count=0 seek=10G

$ ls -l sparse_file 
-rw-r--r-- 1 user user 10737418240 Oct 11 19:57 sparse_file  # apparently a 10GB file

$ du -h sparse_file 
0       sparse_file    # file content does not use any disk space!

# starting from empty repo

$ borg create --checkpoint-interval=10 repo::arch sparse_file 

$ borg list repo::arch
-rw-r--r-- user   user   10737418240 Sun, 2020-10-11 19:57:00 sparse_file

$ borg list --consider-part-files repo::arch
-rw-r--r-- user   user   2055208960 Sun, 2020-10-11 19:57:00 sparse_file.borg_part_1
-rw-r--r-- user   user   2097152000 Sun, 2020-10-11 19:57:00 sparse_file.borg_part_2
-rw-r--r-- user   user   2063597568 Sun, 2020-10-11 19:57:00 sparse_file.borg_part_3
-rw-r--r-- user   user   2113929216 Sun, 2020-10-11 19:57:00 sparse_file.borg_part_4
-rw-r--r-- user   user   2088763392 Sun, 2020-10-11 19:57:00 sparse_file.borg_part_5
-rw-r--r-- user   user   318767104 Sun, 2020-10-11 19:57:00 sparse_file.borg_part_6
-rw-r--r-- user   user   10737418240 Sun, 2020-10-11 19:57:00 sparse_file

$ borg info repo::arch
Archive name: arch
Number of files: 1
Command line: /home/user/w/borg-env/bin/borg create --checkpoint-interval=10 repo::arch sparse_file
Utilization of maximum supported archive size: 0%
------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:               10.74 GB             42.12 MB             35.35 kB
All archives:               21.47 GB             84.25 MB             35.35 kB

                       Unique chunks         Total chunks
Chunk index:                       9                 2568

Notable:

  • "this archive": correct (compare to ls -l)
  • du -h is also correct, but 0 due to sparse file.
  • all archives, original/compressed size: not correct (fix for this was not applied to borg 1.1.x)

@ThomasWaldmann
Member

ThomasWaldmann commented Oct 11, 2020

$ borg info --consider-part-files repo::arch
Archive name: arch
Number of files: 7
Command line: /home/user/w/borg-env/bin/borg create --checkpoint-interval=10 repo::arch sparse_file
Utilization of maximum supported archive size: 0%
------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:               21.47 GB             84.25 MB             35.35 kB
All archives:               21.47 GB             84.25 MB             35.35 kB

                       Unique chunks         Total chunks
Chunk index:                       9                 2568

Notable:

  • "this archive": correct
  • "all archives": correct

@ThomasWaldmann
Member

@sshaikh see above. changelog is correct, see also my updated comment above.

@ThomasWaldmann
Member

@sshaikh sparse files can turn up anywhere that long runs of zeros are written to a file efficiently. They are often seen with (VM) disk images.
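One rough way to look for sparse files is to compare each file's apparent size with the space actually allocated for it. The sketch below assumes a Linux-like system where `st_blocks` counts 512-byte units; `looks_sparse` and `find_sparse` are hypothetical helper names, and the check is only a heuristic (filesystem compression or inline storage of tiny files can also make the allocated space smaller than the apparent size).

```python
import os
import stat


def looks_sparse(apparent_bytes: int, allocated_blocks: int, block_unit: int = 512) -> bool:
    """Heuristic: a file looks sparse if it occupies less space than its length."""
    return allocated_blocks * block_unit < apparent_bytes


def find_sparse(root: str):
    """Yield (path, apparent_bytes, allocated_bytes) for regular files that look sparse."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: do not follow symlinks
            if stat.S_ISREG(st.st_mode) and looks_sparse(st.st_size, st.st_blocks):
                yield path, st.st_size, st.st_blocks * 512
```

Calling `list(find_sparse("/dir"))` on the backed-up directory would list candidates. For the 10 GB `sparse_file` in the example above, `looks_sparse(10737418240, 0)` is true because no blocks are allocated.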

@hashbackup

@sshaikh I think it depends on your actual file sizes, how compressible they are, and how much metadata is stored. For example, if you have 1M files containing 4K of incompressible data each, du will report 4GB but borg will report a higher number if metadata is included.

If you are backing up larger files, the padding factor in du will be less important because it is limited to 4k per file, or 2K on average. So with larger files, the number du reports will be closer to the sum of ls, which I'm guessing is what borg uses. But then you have to add the metadata size, which would make the borg size larger.

@enkore
Contributor

enkore commented Oct 11, 2020

IIRC borg creates something like 150-200 bytes of metadata per "small file" (including the full path - long paths need more space), so for just 200k files that overhead would not explain a difference this large. I think Thomas is right here, there's probably a sparse file around.
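As a back-of-the-envelope check of that estimate, the sketch below multiplies the file count reported earlier in this thread (219284) by the rough 150-200 bytes per file. The per-file figures are the recollection quoted above, not a measured value, and `metadata_overhead` is a hypothetical helper.

```python
N_FILES = 219_284  # file count reported in this thread


def metadata_overhead(n_files: int, per_file_bytes: int) -> int:
    """Estimated metadata size in bytes for a given per-file cost."""
    return n_files * per_file_bytes


low = metadata_overhead(N_FILES, 150)
high = metadata_overhead(N_FILES, 200)

print(f"{low / 1e6:.1f} MB to {high / 1e6:.1f} MB")  # 32.9 MB to 43.9 MB
```

Tens of megabytes is orders of magnitude below the roughly 10 GB gap being discussed, which is why metadata alone would not explain it.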

du(1) can consider either actual file size or disk usage. By default it is in "disk blocks mode", but it has an --apparent-size switch (which is triggered by the --bytes option as well).

@sshaikh
Author

sshaikh commented Oct 11, 2020

I tried the approach here:

https://www.thegeekdiary.com/how-to-find-all-the-sparse-file-in-linux/

and found no sparse files in the directories being backed up.

--apparent-size with du actually reduces the total (probably because of the 4k blocks behaviour mentioned earlier).

It's curious. I may try a new backup just to see what happens.

@ThomasWaldmann
Member

You could also use borg list --format=... repo::archive to create a file list of that archive, including individual file sizes.

Then use some script or spreadsheet program to sum up the sizes and compare to original size for "this archive".
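A minimal sketch of that summing step, assuming the list was produced with a format such as `{size} {path}{NL}`, so that each line starts with the size in bytes followed by a space. The `sum_sizes` helper is hypothetical, the sample lines reuse sizes that appear in this thread, and entries other than regular files may need filtering depending on what should be counted.

```python
def sum_sizes(lines) -> int:
    """Sum the leading byte counts of lines shaped like '<size> <path>'."""
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        size_str, _sep, _path = line.partition(" ")
        total += int(size_str)
    return total


# Sample input in the assumed '<size> <path>' shape.
sample = [
    "10737418240 sparse_file\n",
    "4 abc\n",
]
print(sum_sizes(sample))  # 10737418244
```

The resulting byte total can then be compared directly with a byte total computed from the source directory, without any unit conversion in between.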

@sshaikh
Author

sshaikh commented Oct 12, 2020

That was a useful tip, thanks!

After an afternoon of analysis I can confirm that borg list outputs exactly the same sizes as find /dir -type f -printf '"%p" %s\n' does and so the sum of bytes also agree.

It turns out that, embarrassingly, this is the age-old gibi vs giga issue. Sticking to bytes, my total is 2.04361E+11. According to Google this equates to 204 gigabytes and... 190 gibibytes. So the "issue" is with the units du -h presents by default: it uses powers of 1024, whereas borg reports powers of 1000. Using du -BGB gives a total of 204 (or thereabouts), which matches what borg reports.

The moral of the story - always resort to bytes when comparing aggregate sizes. Ironically this issue was distracting me from moving to using json output which I believe uses bytes by default, so I would have gotten there if I had just let it go sooner ;).
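The conversion behind that conclusion can be checked with a few lines. This sketch takes the byte total quoted above (2.04361E+11) and expresses it in decimal gigabytes (powers of 1000) and binary gibibytes (powers of 1024); the helper names are illustrative.

```python
GB = 10**9   # decimal gigabyte
GIB = 2**30  # binary gibibyte


def to_gb(n_bytes: int) -> float:
    return n_bytes / GB


def to_gib(n_bytes: int) -> float:
    return n_bytes / GIB


total = 204_361_000_000  # 2.04361E+11 bytes, as quoted in this thread

print(f"{to_gb(total):.2f} GB")    # 204.36 GB
print(f"{to_gib(total):.2f} GiB")  # 190.33 GiB
```

The same byte count is about 204 in one unit and about 190 in the other, a difference of roughly 7%, which accounts for the whole apparent discrepancy.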

@sshaikh sshaikh closed this as completed Oct 12, 2020
@ThomasWaldmann
Member

Hehe, sometimes it is easier than one thinks. :-)
