
borg check --repair --verify-data: inconsistent/sporadic chunk id verification failure #5822

Closed
jurgenhaas opened this issue Jun 5, 2021 · 12 comments



jurgenhaas commented Jun 5, 2021

Have you checked borgbackup docs, FAQ, and open Github issues?

Yes

Is this a BUG / ISSUE report or a QUESTION?

ISSUE

System information. For client/server mode post info for both machines.

Your borg version (borg -V).

1.1.16

Operating system (distribution) and version.

Ubuntu 20.04

Hardware / network configuration, and filesystems used.

Intel NUC with ext4

How much data is handled by borg?

400GB on that host, 274GB in the problematic repository

Full borg command line that led to the problem (leave out excludes and passwords)

borg check --repair --verify-data --verbose /var/backups/borg/nextcloud_privat

Describe the problem you're observing.

My monthly check reported a problem with the integrity of some chunks, so I started to analyse the repository. The command above produces the following output:

Starting repository check
Starting repository index check
Completed repository check, no problems found.
Starting archive consistency check...
Enter passphrase for key /var/backups/borg/nextcloud_privat:
Starting cryptographic data integrity verification...
chunk f856bf715978bb1a4acc9d2f4b23bfc319a2452058541ae5c8ee086b75279ff9, integrity error: Data integrity error: Chunk f856bf715978bb1a4acc9d2f4b23bfc319a2452058541ae5c8ee086b75279ff9: id verification failed
Found defect chunks. They will be deleted now, so affected files can get repaired now and maybe healed later.
chunk %s not deleted, did not consistently fail.
Finished cryptographic data integrity verification, verified 298463 chunks with 1 integrity errors.
Analyzing archive ps1-2020-05-31T04:07:42.711754 (1/28)
Analyzing archive ps1-2020-06-30T04:12:33.529670 (2/28)
Analyzing archive ps1-2020-07-31T04:13:08.699011 (3/28)


Analyzing archive ps1-2021-06-03T04:33:13.815852 (27/28)
Analyzing archive ps1-2021-06-04T05:15:33.195727 (28/28)
Writing Manifest.
Committing repo (may take a while, due to compact_segments)...
Finished committing repo.
Archive consistency check complete, problems found.

Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.

Yes, no matter how often I run this. It continues reporting that integrity error but still doesn't delete the reported chunk.


ThomasWaldmann commented Jun 5, 2021

chunk %s not deleted, did not consistently fail.

This is obviously a bug (you should see a chunk id there, not %s); an argument was missing in that log message.

Fixed by #5823.


ThomasWaldmann commented Jun 5, 2021

About your "bug" report: AFAICS now, this is not a bug, but expected borg behaviour:

chunk f856bf715978bb1a4acc9d2f4b23bfc319a2452058541ae5c8ee086b75279ff9, integrity error: Data integrity error: Chunk f856bf715978bb1a4acc9d2f4b23bfc319a2452058541ae5c8ee086b75279ff9: id verification failed
Found defect chunks. They will be deleted now, so affected files can get repaired now and maybe healed later.
chunk <same id as above> not deleted, did not consistently fail.

Before actually deleting stuff with integrity problems, borg is careful:

It just tries a second time to get the chunk from the repo, verifies the MAC, decrypts and decompresses the data, and then checks whether the wanted chunk id matches the computed chunk id:

    encrypted_data = self.repository.get(defect_chunk)
    _chunk_id = None if defect_chunk == Manifest.MANIFEST_ID else defect_chunk
    self.key.decrypt(_chunk_id, encrypted_data)

The main idea with that code is not to delete chunks just because of sporadic failures.

If the chunk does not cause issues again, you get that "did not consistently fail" message and the chunk is not deleted.
If it fails again, the corrupt chunk is deleted.
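
For illustration, here is a minimal sketch of that retry-before-delete decision (simplified, not borg's actual source; repository, key, manifest_id and the placeholder IntegrityError stand in for borg's real objects):

    # Hedged sketch of the "did not consistently fail" decision -- simplified,
    # not borg's actual code. key.decrypt() is assumed to raise IntegrityError
    # when the MAC check, decompression or chunk id verification fails.
    class IntegrityError(Exception):
        """Placeholder for borg's integrity error."""

    def recheck_defect_chunks(defect_chunks, repository, key, manifest_id):
        for defect_chunk in defect_chunks:
            try:
                encrypted_data = repository.get(defect_chunk)   # fetch the chunk again
                chunk_id = None if defect_chunk == manifest_id else defect_chunk
                key.decrypt(chunk_id, encrypted_data)           # MAC, decompress, id check
            except IntegrityError:
                # Failed a second time: treat the chunk as really corrupt and delete
                # it, so affected files can be repaired and maybe healed later.
                repository.delete(defect_chunk)
            else:
                # The second attempt succeeded: sporadic failure, keep the chunk.
                print('chunk %s not deleted, did not consistently fail.' % defect_chunk.hex())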

The bad news is of course:

  • there must be some root cause for the sporadic failures, with "id verification" failing one time and succeeding the next
  • that can't be explained by e.g. an I/O problem, but IMHO points to unreliable CPU, RAM or mainboard.

So I suggest you run some passes of memtest86+ on that system.

ThomasWaldmann changed the title from "borg check --repair --verify-data does not delete chunk with integrity error" to "borg check --repair --verify-data: incosistent/sporadic chunk id verification failure" on Jun 5, 2021
ThomasWaldmann changed the title from "borg check --repair --verify-data: incosistent/sporadic chunk id verification failure" to "borg check --repair --verify-data: inconsistent/sporadic chunk id verification failure" on Jun 5, 2021
@ThomasWaldmann

After you have verified that CPU/RAM is working ok, you could also run the borg repair command again.

It would be interesting to see if it still finds problematic chunks, and if so, whether they have the same or different ids.

@ThomasWaldmann

Theoretically, there could also be a bug in borg's code (incorrect detection of inconsistent behaviour), but let's first assume that there actually is inconsistent behaviour.

@jurgenhaas

Sorry for the delay, I'm still trying to get to the GRUB menu during startup to run memtest from there. Still trying; I'll get back to you.

ThomasWaldmann added a commit that referenced this issue Jun 9, 2021
fix missing parameter in "did not consistently fail" msg, see #5822
@ThomasWaldmann

@jurgenhaas for memtest86+ you need to boot with legacy BIOS, and you can usually get to the GRUB menu by hammering Shift or Esc(?) after the BIOS screen (or by reconfiguring GRUB to always show the menu for a few seconds).

You could also make a USB stick with memtest86+ and boot from that via the BIOS boot order (F12 or so).

@jurgenhaas

OK, I get the GRUB menu displayed during boot, but it doesn't contain memtest86+. I've looked into /etc/grub.d/20-memtest86+ and there is an if clause that exits because /sys/firmware/efi is available. When I switch to legacy boot mode in the BIOS, no bootable device is found. Then I flashed memtest86+ to a USB stick, but that's not recognized as bootable either.
I know this is off-topic here, but I'd still appreciate any help on getting this tested somehow.


ThomasWaldmann commented Jun 9, 2021

Some devices might only be bootable via UEFI, e.g. NVMe PCIe SSDs.

But if your system still supports legacy boot, you should be able to boot from a memtest86+ USB stick.
Some BIOSes might need the USB device added to the boot order, or USB init enabled at boot time, or similar.

If you still have a DVD drive (internal SATA or external via USB), you could also use an Ubuntu 18.04 DVD; it has a working version of memtest86+.

ThomasWaldmann added a commit that referenced this issue Jun 16, 2021
fix missing parameter in "did not consistently fail" msg, see #5822
@jurgenhaas

I think I've tried everything possible. With legacy mode enforced, the NUC won't boot under any circumstances. With a USB device plugged in, it does recognize it but then won't boot from it either.
But I found a command-line tool, memtester, which does almost the same thing, and when I ran it the first time it came back all clear:
memtester version 4.3.0 (64-bit)
Copyright (C) 2001-2012 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).

pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 200MB (209715200 bytes)
got  200MB (209715200 bytes), trying mlock ...locked.
Loop 1/1:
  Stuck Address       : ok         
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok         
  Block Sequential    : ok         
  Checkerboard        : ok         
  Bit Spread          : ok         
  Bit Flip            : ok         
  Walking Ones        : ok         
  Walking Zeroes      : ok         
  8-bit Writes        : ok
  16-bit Writes       : ok

Done.

But I ran it several times, and it reported different problems on most of the runs:

  Bit Flip            : testing 121FAILURE: 0x00008000 != 0x20008000 at offset 0x02603e00.
  Bit Spread          : testing 105FAILURE: 0x01400000 != 0x21400000 at offset 0x01185b00.
  Bit Flip            : testing 193FAILURE: 0x01000000 != 0x21000000 at offset 0x2b456f00.

It's different every time I run it. I'm not sure what that means; you're probably going to tell me I should replace the RAM in that box?

@ThomasWaldmann

Some notable things in that output:

  • it only tested 200 MB? Maybe it can only test free memory?
  • it is always the same data bit failing (see the quick check below).
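
A quick way to see that: XOR the two values printed in each failure line above; every pair differs only in the same single bit (mask 0x20000000, i.e. data bit 29):

    # Illustrative check: each memtester failure line above reports two values
    # that differ only in bit 29 (mask 0x20000000) -- the same data bit every time.
    failures = [
        (0x00008000, 0x20008000),  # Bit Flip
        (0x01400000, 0x21400000),  # Bit Spread
        (0x01000000, 0x21000000),  # Bit Flip
    ]
    for expected, got in failures:
        print(hex(expected ^ got))  # prints 0x20000000 for every pair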

I guess the very first thing I would try is to re-seat the memory modules (and if that does not help: the CPU). Maybe it's just a bad contact.

If that doesn't help, try a new memory module.

@ThomasWaldmann

OK, so the minor borg bug has already been fixed and the main issue is a hardware problem, so I am closing this.

ghost mentioned this issue Aug 26, 2021
@jurgenhaas

I know I'm late with this feedback, but I wanted to let everyone know that re-seating the memory modules resolved the issue. TBH, I was sceptical about this advice, but it really worked.

Thanks again @ThomasWaldmann for helping me out with this.
