size, csize, dsize and dcsize appear to return incorrect values #3736

Closed
BloodBlight opened this issue Mar 28, 2018 · 13 comments

@BloodBlight

There appears to be an issue with the values returned from borg list, possibly for large files.

Let's pick a backup (names have been removed):

$ borg info /borg/ArchivedVMs::'VM_NAME'
Archive name: VM_NAME
Archive fingerprint: 4eea7c86cadf0c2250ba7c3d3f8d009a205d9d9baa877b46631c795e7fb5b15e
Comment: 
Hostname: prod-borg
Username: root
Time (start): Sat, 2018-03-24 07:01:16
Time (end): Sat, 2018-03-24 07:39:51
Duration: 38 minutes 35.18 seconds
Number of files: 15
Command line: /usr/bin/borg create -v --stats --progress --compression auto,zstd,9 '/borg/archive::VM_NAME' '/borg/NFS/VM_NAME'
Utilization of maximum supported archive size: 0%
------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:              214.75 GB             38.31 GB             10.88 GB
All archives:               48.49 TB             10.78 TB              4.02 TB

                       Unique chunks         Total chunks
Chunk index:                 4798344             14296034

Alright, from about 214GBs to 10GBs.

If we list the items for that backup we get:

$ borg list /borg/ArchivedVMs::'VM_NAME' --format="{size} {csize} {dcsize}{NEWLINE}"
0 0 0
544 388 388
221129 17778 17778
8684 1649 0
107374182400 19154677017 0  <----- This guy.
500359 27384 27384
171581 17665 17665
284 218 218
0 0 0
339213 22200 22200
396609 25237 25237
726586 33830 33830
341299 22287 22287
3548 1153 1153

That doesn't look right! And if we do some math:

$ borg list /borg/ArchivedVMs::'VM_NAME' --format="{size} {csize} {dcsize}{NEWLINE}" | awk -F' ' '{ x = x + $1; y = y + $2; z = z + $3 } END { print x/1024^3, y/1024^3, z/1024^3 }'
100.003 17.8393 0.000156593

The size and csize totals come out at about half of what borg info reports (perhaps something to do with it being a sparse file), but the deduplicated size isn't even close.

Is this a bug, or am I misunderstanding the results?

@BloodBlight
Author

Oh yeah, I am on CentOS 7 and borg 1.1.4.

@BloodBlight
Author

When mounting the backup:

$ cd /borg/Restore/borg/NFS/VM_NAME
$ ll -h
total 101G
-rw-------. 1 root root 388K Jun 28  2017 vmware-10.log
-rw-------. 1 root root 216K Jun 28  2017 vmware-5.log
-rw-------. 1 root root 334K Jun 28  2017 vmware-6.log
-rw-------. 1 root root 489K Jun 28  2017 vmware-7.log
-rw-------. 1 root root 332K Jun 28  2017 vmware-8.log
-rw-------. 1 root root 168K Jun 28  2017 vmware-9.log
-rw-------. 1 root root 710K Jun 28  2017 vmware.log
-rw-------. 1 root root 100G Jun 28  2017 VM_NAME-flat.vmdk
-rw-------. 1 root root 8.5K Jun 28  2017 VM_NAME.nvram
-rw-------. 1 root root  544 Jun 28  2017 VM_NAME.vmdk
-rw-r--r--. 1 root root    0 Jun 28  2017 VM_NAME.vmsd
-rwxr-xr-x. 1 root root 3.5K Jun 28  2017 VM_NAME.vmx
-rw-r--r--. 1 root root  284 Jun 28  2017 VM_NAME.vmxf

@ThomasWaldmann
Member

ThomasWaldmann commented Mar 28, 2018

See that other ticket about borg info - same issue? #3522

@BloodBlight
Author

Looking around, seeing a few things with "borg info" but nothing that looks to be this. Do you have a case number?

I did a restore on the archive and it wrote 100GBs of data. I think this is a case of the stats screen showing a different size than the actual size. So there are two issues here. One, the list and info screens do not match (info and progress don't match what is being backed up, so I am assuming the list data is correct). And two, dedup data is not being listed correctly.

The list shows 0 bytes used for this particular disk (the main data drive). This VM has never been backed up before, so it is the only copy in the repo. I highly doubt I got a 100% perfect dedup.

I found this because I was writing a script that would produce a detailed report of usage in the archive and the data it spat out was WAY off for dedup.

Here is a copy, you may need to tweak it, but try running it on one of yours:

Archive=/borg/ArchivedVMs
InfoPath=/borg/Info
mkdir -p "$InfoPath/DATFiles"     # Per-archive listings, kept for debugging.
# One line per archive: name|start time.
borg list "$Archive" --format="{name}|{start}{NEWLINE}" > "$InfoPath/Backups.txt"
echo "Name|Date|Size [GBs]|Compressed [GBs]|Deduplicated [GBs]" > "$InfoPath/Results.txt"
X=1
while IFS= read -r Backup; do
        BackupName=${Backup%%|*}        # Everything before the first '|'.
        echo "Reading data for [$BackupName]..."
        borg list "${Archive}::${BackupName}" --format="{size} {csize} {dcsize}{NEWLINE}" > "$InfoPath/DATFiles/DAT-$X.txt"
        # Sum each column and convert bytes to units of 1024^3.
        Values=$(awk '{ x += $1; y += $2; z += $3 } END { print x/1024^3, y/1024^3, z/1024^3 }' "$InfoPath/DATFiles/DAT-$X.txt")
        Values=${Values// /|}           # Space-separated to '|'-separated.
        echo "$Backup|$Values" >> "$InfoPath/Results.txt"
        ((X++))
done < "$InfoPath/Backups.txt"
# Sort by the deduplicated column and align the columns for reading.
sort -n -k5 -t '|' "$InfoPath/Results.txt" | column -s '|' -t > "$InfoPath/Results-Formated.txt"
echo "########################################################################################"
echo "# Results stored at: [$InfoPath/Results.txt] and [$InfoPath/Results-Formated.txt]... #"
echo "########################################################################################"
cat "$InfoPath/Results-Formated.txt"

@ThomasWaldmann
Member

ThomasWaldmann commented Mar 28, 2018

The dedup size likely looks strange because the chunks are not only referenced by the whole file, but also by the part files created by checkpointing, see also #3522.

@BloodBlight
Author

Using consider-part-files definitely helped! Now the original and compressed numbers line up perfectly if I use 1000 rather than 1024 for my math. No joy on the dedup numbers though.

borg list /borg/ArchivedVMs::"$VM_Name" --format="{size} {csize} {dcsize}{NEWLINE}" --consider-part-files | awk -F' ' '{ x = x + $1; y = y + $2; z = z + $3 } END { print x/1000^3, y/1000^3, z/1000^3 }'
214.751 38.3095 0.00016814
------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:              214.75 GB             38.31 GB             10.88 GB
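
The 1000-versus-1024 difference can be checked with a few lines of awk. This is only a sketch: `to_gb` and `to_gib` are hypothetical helper names, and it assumes, based on the matching totals above, that the "GB" in the borg info output means 1000^3 bytes. The byte count is the size of the flat VMDK from the earlier listing.

```shell
# Hypothetical helpers: convert a byte count to decimal GB and binary GiB.
to_gb()  { awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1000^3 }'; }
to_gib() { awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1024^3 }'; }

to_gb  107374182400   # size of the flat VMDK in GB  (1000^3) -> 107.37
to_gib 107374182400   # the same size in GiB (1024^3)         -> 100.00
```

The same file is 107.37 in decimal units and 100.00 in binary units, which is why `ll -h` on the mounted archive shows 100G.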

Items still showing zero blocks:

borg list /borg/ArchivedVMs::"VM_Name" --format="{path}|{size}|{csize}|{dcsize}{NEWLINE}" --consider-part-files | column -t -s '|'
borg/NFS/VM_Name                                0             0            0
borg/NFS/VM_Name/VM_Name.vmdk                   544           388          388
borg/NFS/VM_Name/vmware-5.log                   221129        17778        17778
borg/NFS/VM_Name/VM_Name.nvram                  8684          1649         0
borg/NFS/VM_Name/VM_Name-flat.vmdk.borg_part_1  72253980394   19151517690  0
borg/NFS/VM_Name/VM_Name-flat.vmdk.borg_part_2  35120202006   3159327      0
borg/NFS/VM_Name/VM_Name-flat.vmdk              107374182400  19154677017  0
borg/NFS/VM_Name/vmware-7.log                   500359        27384        27384
borg/NFS/VM_Name/vmware-9.log                   171581        17665        17665
borg/NFS/VM_Name/VM_Name.vmxf                   284           218          218
borg/NFS/VM_Name/VM_Name.vmsd                   0             0            0
borg/NFS/VM_Name/vmware-8.log                   339213        22200        22200
borg/NFS/VM_Name/vmware-10.log                  396609        25237        25237
borg/NFS/VM_Name/vmware.log                     726586        33830        33830
borg/NFS/VM_Name/vmware-6.log                   341299        22287        22287
borg/NFS/VM_Name/VM_Name.vmx                    3548          1153         1153
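
One thing worth checking in the listing above is whether the two `.borg_part_N` entries simply add up to the full file. A minimal sketch using the numbers copied from the listing (the variable names are made up for illustration):

```shell
# size and csize of the two part files and of the complete file,
# copied from the borg list output.
part1_size=72253980394;  part1_csize=19151517690
part2_size=35120202006;  part2_csize=3159327
whole_size=107374182400; whole_csize=19154677017

echo "size:  $(( part1_size + part2_size )) vs $whole_size"
echo "csize: $(( part1_csize + part2_csize )) vs $whole_csize"
```

Both sums match the full file exactly, so with `--consider-part-files` every byte of the VMDK is counted twice in the size and csize totals.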

@BloodBlight
Author

So... I am noticing that borg extract with stdout is also impacted by "consider-part-files"! That seems odd, was that intended?

@ThomasWaldmann
Member

ThomasWaldmann commented Mar 29, 2018

About the dedup numbers: we only count the unique chunks (chunks with refcount == 1), but if checkpointing happens within a file, there will be a bunch of .part files adding second references to all file chunks. So the chunks are not unique any more and don't count into the dedup size. This is kind of wrong, but the current stats do not differentiate between in-same-backup-references and not-in-same-backup references.
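
To illustrate the refcount rule described above, here is a toy model. It is not borg's actual code: `dcsize_per_item` is a hypothetical helper that reads `<item> <chunk_id> <bytes>` lines, counts the references to each chunk, and credits a chunk's bytes to an item only when that chunk is referenced exactly once. The chunk ids and sizes are invented.

```shell
# Toy model of "only chunks with refcount == 1 count towards dcsize".
dcsize_per_item() {
  awk '
    {
      refs[$2]++                          # references per chunk id
      item[NR] = $1; chunk[NR] = $2; bytes[NR] = $3
      if (!($1 in seen)) { seen[$1] = 1; order[++n] = $1 }
    }
    END {
      for (i = 1; i <= NR; i++)
        if (refs[chunk[i]] == 1) total[item[i]] += bytes[i]
      for (j = 1; j <= n; j++) printf "%s %d\n", order[j], total[order[j]]
    }'
}

# No checkpoint: each chunk is referenced once, so the file gets full credit.
printf '%s\n' 'disk.vmdk c1 100' 'disk.vmdk c2 200' | dcsize_per_item
# prints: disk.vmdk 300

# Checkpoint part files reference the same chunks a second time.
printf '%s\n' \
  'disk.vmdk.borg_part_1 c1 100' \
  'disk.vmdk.borg_part_2 c2 200' \
  'disk.vmdk c1 100' \
  'disk.vmdk c2 200' | dcsize_per_item
# prints 0 for all three items
```

In the second run no chunk is unique any more, which mirrors the zeros reported for the VMDK and its part files.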

About stdout/part files: borg does not prevent the user from doing potential nonsense.

So e.g. if you call borg extract --stdout and you extract multiple files, each of them will be written to stdout, one after the other. borg does not care whether you will be able to actually dissect that again or not.

--consider-part-files on a repo with 1 file will lead to multiple file extraction if there are part files.
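
The effect can be mimicked without borg at all. The sketch below uses plain files and `cat` as a stand-in for items being written to stdout back to back; the file names and contents are invented for the demonstration.

```shell
# Stand-in for an archive holding two part files plus the complete file.
demo_dir=$(mktemp -d)
printf 'AAAA'     > "$demo_dir/disk.borg_part_1"
printf 'BBBB'     > "$demo_dir/disk.borg_part_2"
printf 'AAAABBBB' > "$demo_dir/disk"

# All three items streamed one after the other: twice the real data.
all_bytes=$(cat "$demo_dir/disk.borg_part_1" "$demo_dir/disk.borg_part_2" "$demo_dir/disk" | wc -c)
# Only the complete file: the real data.
file_bytes=$(wc -c < "$demo_dir/disk")

echo "all items: $((all_bytes)) bytes, complete file only: $((file_bytes)) bytes"
rm -r "$demo_dir"
```

The combined stream is 16 bytes for an 8-byte file, so a consumer reading stdout would see the data twice with no marker between the copies.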

@BloodBlight
Author

Alright, I think that clears everything up! I think it makes the list function less viable but…

So, is that why the borg info command and backup status screen report more data than is actually being backed up? It is getting double counted due to the part files?

Thanks for your help on this! It is greatly appreciated.

I think for my purposes I will use the dedup value from the info screen, and the original size/compressed values from the list screen. That SHOULD give me an accurate representation of the archive. I will post the final version here if anyone is interested.

@RonnyPfannschmidt
Contributor

shouldn't part files be left out of continuations ?

@ThomasWaldmann
Member

@RonnyPfannschmidt not sure what you mean.

@enkore
Contributor

enkore commented Mar 31, 2018

shouldn't part files be left out of continuations ?

That would be much cleaner, but even the current part files code is a big mess (and it even creates a user-visible mess). Clean continuations would mean to track and selectively rollback txn state.

@ThomasWaldmann
Member

I guess this is a dupe of #3522 and that one was closed, so closing this one also.
