size, csize, dsize and dcsize appear to return incorrect values #3736

BloodBlight · 2018-03-28T17:55:12Z

There appears to be an issue with the values returned from borg list, possibly for large files.

Let's pick a backup (names have been removed):

$ borg info /borg/ArchivedVMs::'VM_NAME'
Archive name: VM_NAME
Archive fingerprint: 4eea7c86cadf0c2250ba7c3d3f8d009a205d9d9baa877b46631c795e7fb5b15e
Comment: 
Hostname: prod-borg
Username: root
Time (start): Sat, 2018-03-24 07:01:16
Time (end): Sat, 2018-03-24 07:39:51
Duration: 38 minutes 35.18 seconds
Number of files: 15
Command line: /usr/bin/borg create -v --stats --progress --compression auto,zstd,9 '/borg/archive::VM_NAME' '/borg/NFS/VM_NAME'
Utilization of maximum supported archive size: 0%
------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:              214.75 GB             38.31 GB             10.88 GB
All archives:               48.49 TB             10.78 TB              4.02 TB

                       Unique chunks         Total chunks
Chunk index:                 4798344             14296034

Alright, from about 214 GB down to about 10 GB.

If we list the items for that backup we get:

$ borg list /borg/ArchivedVMs::'VM_NAME' --format="{size} {csize} {dcsize}{NEWLINE}"
0 0 0
544 388 388
221129 17778 17778
8684 1649 0
107374182400 19154677017 0  <----- This guy.
500359 27384 27384
171581 17665 17665
284 218 218
0 0 0
339213 22200 22200
396609 25237 25237
726586 33830 33830
341299 22287 22287
3548 1153 1153

That doesn't look right! And if we do some math:

$ borg list /borg/ArchivedVMs::'VM_NAME' --format="{size} {csize} {dcsize}{NEWLINE}" | awk -F' ' '{ x = x + $1; y = y + $2; z = z + $3 } END { print x/1024^3, y/1024^3, z/1024^3 }'
100.003 17.8393 0.000156593

The totals are about 50% lower than what borg info reports. Fine, maybe that has something to do with it being a sparse file, but the deduplicated size isn't even close.

Is this a bug, or am I misunderstanding the results?


BloodBlight · 2018-03-28T17:56:11Z

Oh ya, I am on CentOS 7 and borg 1.1.4.

BloodBlight · 2018-03-28T18:06:18Z

When mounting the backup:

$ cd /borg/Restore/borg/NFS/VM_NAME
$ ll -h
total 101G
-rw-------. 1 root root 388K Jun 28  2017 vmware-10.log
-rw-------. 1 root root 216K Jun 28  2017 vmware-5.log
-rw-------. 1 root root 334K Jun 28  2017 vmware-6.log
-rw-------. 1 root root 489K Jun 28  2017 vmware-7.log
-rw-------. 1 root root 332K Jun 28  2017 vmware-8.log
-rw-------. 1 root root 168K Jun 28  2017 vmware-9.log
-rw-------. 1 root root 710K Jun 28  2017 vmware.log
-rw-------. 1 root root 100G Jun 28  2017 VM_NAME-flat.vmdk
-rw-------. 1 root root 8.5K Jun 28  2017 VM_NAME.nvram
-rw-------. 1 root root  544 Jun 28  2017 VM_NAME.vmdk
-rw-r--r--. 1 root root    0 Jun 28  2017 VM_NAME.vmsd
-rwxr-xr-x. 1 root root 3.5K Jun 28  2017 VM_NAME.vmx
-rw-r--r--. 1 root root  284 Jun 28  2017 VM_NAME.vmxf

ThomasWaldmann · 2018-03-28T18:45:11Z

See that other ticket about borg info - same issue? #3522

BloodBlight · 2018-03-28T20:19:57Z

Looking around, seeing a few things with "borg info" but nothing that looks to be this. Do you have a case number?

I did a restore of the archive and it wrote 100 GB of data. I think this is a case of the stats screen showing a different size than the actual size. So there are two issues here. One, the list and info screens do not match (info and progress don't match what is being backed up, so I am assuming the list data is correct). And two, dedup data is not being listed correctly.

The list shows 0 bytes used for this particular disk (the main data drive). This VM has never been backed up before, so it is the only copy in the repo. I highly doubt I got a 100% perfect dedup.

I found this because I was writing a script that would produce a detailed report of usage in the archive, and the data it spat out was WAY off for dedup.

Here is a copy, you may need to tweak it, but try running it on one of yours:

Archive=/borg/ArchivedVMs
InfoPath=/borg/Info
mkdir -p "$InfoPath/DATFiles"     # Per-archive listings, kept for debugging.
# One line per archive in the repository: name|start time
borg list "$Archive" --format="{name}|{start}{NEWLINE}" > "$InfoPath/Backups.txt"
echo "Name|Date|Size [GiB]|Compressed [GiB]|Deduplicated [GiB]" > "$InfoPath/Results.txt"
X=1
while IFS= read -r Backup; do
        BackupName=$(echo "$Backup" | cut -d '|' -f1)
        echo "Reading data for [$BackupName]..."
        borg list "${Archive}::${BackupName}" --format="{size} {csize} {dcsize}{NEWLINE}" > "$InfoPath/DATFiles/DAT-$X.txt"
        # Sum the three columns and convert bytes to GiB (1024^3).
        Values=$(awk '{ x += $1; y += $2; z += $3 } END { print x/1024^3, y/1024^3, z/1024^3 }' "$InfoPath/DATFiles/DAT-$X.txt")
        echo "$Backup|${Values// /|}" >> "$InfoPath/Results.txt"
        ((X++))
done < "$InfoPath/Backups.txt"
# Sort by the deduplicated column and align the output into a table.
sort -n -k5 -t '|' "$InfoPath/Results.txt" | column -s '|' -t > "$InfoPath/Results-Formated.txt"
echo "########################################################################################"
echo "# Results stored at: [$InfoPath/Results.txt] and [$InfoPath/Results-Formated.txt]... #"
echo "########################################################################################"
cat "$InfoPath/Results-Formated.txt"

ThomasWaldmann · 2018-03-28T21:16:33Z

The dedup size likely looks strange because the chunks are not only referenced by the whole file, but also by the part files created by checkpointing, see also #3522.

BloodBlight · 2018-03-28T21:34:09Z

Using --consider-part-files definitely helped! Now the original and compressed numbers line up perfectly if I use 1000 rather than 1024 for my math. No joy on the dedup numbers though.

borg list /borg/ArchivedVMs::"$VM_Name" --format="{size} {csize} {dcsize}{NEWLINE}" --consider-part-files | awk -F' ' '{ x = x + $1; y = y + $2; z = z + $3 } END { print x/1000^3, y/1000^3, z/1000^3 }'
214.751 38.3095 0.00016814

------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
This archive:              214.75 GB             38.31 GB             10.88 GB
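
The unit mismatch noted above can be checked with a little arithmetic. The sketch below is not part of borg; `to_units` is a hypothetical helper that converts a byte count with both divisors. The borg summary here lines up with decimal units (GB, 1000^3), while dividing by 1024^3 gives GiB, which is what `ls -h` shows.

```shell
#!/bin/sh
# Hypothetical helper: print a byte count in decimal GB and binary GiB.
to_units() {
    awk -v bytes="$1" 'BEGIN {
        printf "%.2f GB, %.2f GiB\n", bytes / 1000^3, bytes / 1024^3
    }'
}

# Size of VM_NAME-flat.vmdk as reported by `borg list` in this thread.
to_units 107374182400
```

This prints `107.37 GB, 100.00 GiB`, so the 100G in the mounted listing and the 107374182400 bytes from borg list describe the same file.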

Items still showing zero blocks:

borg list /borg/ArchivedVMs::"VM_Name" --format="{path}|{size}|{csize}|{dcsize}{NEWLINE}" --consider-part-files | column -t -s '|'
borg/NFS/VM_Name                                0             0            0
borg/NFS/VM_Name/VM_Name.vmdk                   544           388          388
borg/NFS/VM_Name/vmware-5.log                   221129        17778        17778
borg/NFS/VM_Name/VM_Name.nvram                  8684          1649         0
borg/NFS/VM_Name/VM_Name-flat.vmdk.borg_part_1  72253980394   19151517690  0
borg/NFS/VM_Name/VM_Name-flat.vmdk.borg_part_2  35120202006   3159327      0
borg/NFS/VM_Name/VM_Name-flat.vmdk              107374182400  19154677017  0
borg/NFS/VM_Name/vmware-7.log                   500359        27384        27384
borg/NFS/VM_Name/vmware-9.log                   171581        17665        17665
borg/NFS/VM_Name/VM_Name.vmxf                   284           218          218
borg/NFS/VM_Name/VM_Name.vmsd                   0             0            0
borg/NFS/VM_Name/vmware-8.log                   339213        22200        22200
borg/NFS/VM_Name/vmware-10.log                  396609        25237        25237
borg/NFS/VM_Name/vmware.log                     726586        33830        33830
borg/NFS/VM_Name/vmware-6.log                   341299        22287        22287
borg/NFS/VM_Name/VM_Name.vmx                    3548          1153         1153
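
To make the double counting visible, the listing above can be totalled twice: once with every item, and once without the `.borg_part_N` items. This is a minimal sketch, assuming the `{path}|{size}|{csize}|{dcsize}` format used above; `sum_sizes` is a hypothetical helper, and the sample rows are copied from the listing (paths shortened) rather than read from a live repository.

```shell
#!/bin/sh
# Hypothetical helper: total the size and csize columns of
# "path|size|csize|dcsize" rows read from stdin.
#   $1 = "all"   -> count every row
#   $1 = "whole" -> skip checkpoint part files (*.borg_part_N)
sum_sizes() {
    awk -F'|' -v mode="$1" '
        mode == "whole" && $1 ~ /\.borg_part_[0-9]+$/ { next }
        { size += $2; csize += $3 }
        END { printf "%.0f %.0f\n", size, csize }
    '
}

# Three rows from the listing above.
rows='VM_Name-flat.vmdk.borg_part_1|72253980394|19151517690|0
VM_Name-flat.vmdk.borg_part_2|35120202006|3159327|0
VM_Name-flat.vmdk|107374182400|19154677017|0'

printf '%s\n' "$rows" | sum_sizes all
printf '%s\n' "$rows" | sum_sizes whole
```

With these rows the `all` total is `214748364800 38309354034` and the `whole` total is `107374182400 19154677017`. The two part files add up to exactly the size of the whole file, so counting them doubles both columns, which is consistent with the 214.75 GB reported by borg info.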

BloodBlight · 2018-03-28T21:49:58Z

So... I am noticing that borg extract with stdout is also impacted by "consider-part-files"! That seems odd, was that intended?

ThomasWaldmann · 2018-03-29T12:57:28Z

About the dedup numbers: we only count the unique chunks (chunks with refcount == 1), but if checkpointing happens within a file, there will be a bunch of .part files adding second references to all file chunks. So the chunks are not unique any more and don't count into the dedup size. This is kind of wrong, but the current stats do not differentiate between in-same-backup-references and not-in-same-backup references.
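
The effect described above can be reproduced with a toy model. This is only a sketch of the rule as stated in this comment (a chunk counts toward an item's deduplicated size only when its reference count is 1), not borg's actual code; `dedup_sizes`, the item names, chunk ids and sizes are all made up for illustration.

```shell
#!/bin/sh
# Toy model, not borg code: each input row is "item chunk_id csize".
# A chunk counts toward an item's deduplicated size only if it is
# referenced exactly once across all rows.
dedup_sizes() {
    awk '
        { item[NR] = $1; chunk[NR] = $2; csize[NR] = $3; refs[$2]++ }
        END {
            for (i = 1; i <= NR; i++) {
                if (!(item[i] in total)) total[item[i]] = 0
                if (refs[chunk[i]] == 1) total[item[i]] += csize[i]
            }
            for (name in total) print name, total[name]
        }
    ' | LC_ALL=C sort
}

# disk.img was checkpointed, so every one of its chunks is also
# referenced by a part file; small.log has a chunk of its own.
dedup_sizes <<'EOF'
disk.img.borg_part_1 c1 500
disk.img.borg_part_1 c2 300
disk.img.borg_part_2 c3 200
disk.img c1 500
disk.img c2 300
disk.img c3 200
small.log c4 40
EOF
```

In this model `disk.img` and both part files come out with a deduplicated size of 0, while `small.log` keeps its 40 bytes, which mirrors the zeros shown for the large vmdk in the listing.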

About stdout/part files: borg does not prevent the user from doing potential nonsense.

So e.g. if you call borg extract --stdout and you extract multiple files, each of them will be written to stdout, one after the other. borg does not care whether you will be able to actually dissect that again or not.

--consider-part-files on a repo with 1 file will lead to multiple file extraction if there are part files.

BloodBlight · 2018-03-29T14:50:57Z

Alright, I think that clears everything up! I think it makes the list function less viable but…

So, is that why the borg info command and backup status screen report more data than is actually being backed up? It is getting double counted due to the part files?

Thanks for your help on this! It is greatly appreciated.

I think for my purposes I will use the dedup value from the info screen, and the original size/compressed values from the list screen. That SHOULD give me an accurate representation of the archive. I will post the final version here if anyone is interested.

RonnyPfannschmidt · 2018-03-29T15:35:31Z

shouldn't part files be left out of continuations ?

ThomasWaldmann · 2018-03-29T15:50:59Z

@RonnyPfannschmidt not sure what you mean.

enkore · 2018-03-31T08:53:12Z

> shouldn't part files be left out of continuations ?

That would be much cleaner, but even the current part files code is a big mess (and it even creates a user-visible mess). Clean continuations would mean tracking and selectively rolling back transaction state.

ThomasWaldmann · 2019-02-11T10:15:50Z

I guess this is a dupe of #3522, and that one was closed, so closing this one also.

BloodBlight mentioned this issue Jun 21, 2018

Borg Doubling Reported Size (1.1.4) without consider-part-file #3916

Closed

ThomasWaldmann added this to the hydrogen milestone Feb 4, 2019

ThomasWaldmann closed this as completed Feb 11, 2019

ThomasWaldmann mentioned this issue Jun 29, 2019

"Original size" of backup is twice that of the source folder #4654

Closed
