More Segment entry checksum mismatch #8473

Open · sshaikh opened this issue Oct 15, 2024 · 20 comments

@sshaikh commented Oct 15, 2024

Have you checked borgbackup docs, FAQ, and open GitHub issues?

Yes. I have seen similar issues, but I believe this one may be different because the hardware has passed its tests.

Is this a BUG / ISSUE report or a QUESTION?

ISSUE

Client: borg 1.4.0
Server: borg 1.4.0

Client: Linux client 6.1.0-25-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux
Server: Linux server 6.1.0-26-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.112-1 (2024-09-30) x86_64 GNU/Linux

Hardware / network configuration, and filesystems used.

Server: ext4

How much data is handled by borg?

~210GB

Full borg commandline that led to the problem (leave away excludes and passwords)

I use borgmatic to create, prune, compact and check twice daily.

Describe the problem you're observing.

Every few months since upgrading to 1.4.0, I get Segment entry checksum mismatch errors on the remote repo. Repairs make them go away. I accept that this points to a hardware issue, but:

  1. The check fails within the same backup run in which the immediately preceding create, prune and compact completed successfully. However, once the repo is in this state, no further actions can be taken on it.
  2. SMART tests and memtests have passed multiple runs. The SMART data looks fine and the "error" counts are within thresholds.
  3. Each time I run a check (without a repair), I get a different (and possibly growing) set of Segment entry checksum mismatch errors (i.e. on different segments).

Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.

Once the checks fail, they always fail. However, it is not clear when the failures start, only that the create, prune and compact immediately preceding them appear to succeed fully.

Include any warning/errors/backtraces from the system logs

$ sudo borg check -v -p  ssh://borg@server/backup/borg
Remote: Starting repository check
Remote: Data integrity error: Segment entry checksum mismatch [segment 6, offset 136653910]
Remote: Data integrity error: Segment entry checksum mismatch [segment 21, offset 343224890]
Remote: Data integrity error: Segment entry checksum mismatch [segment 22, offset 126726781]
Remote: Data integrity error: Segment entry checksum mismatch [segment 28, offset 465106529]
Remote: Data integrity error: Segment entry checksum mismatch [segment 31, offset 471072385]
Remote: Data integrity error: Segment entry checksum mismatch [segment 40, offset 94471655]
Remote: Data integrity error: Segment entry checksum mismatch [segment 78, offset 214925215]
Remote: Data integrity error: Segment entry checksum mismatch [segment 87, offset 57019635]
Remote: Data integrity error: Segment entry checksum mismatch [segment 88, offset 96948053]
Remote: Data integrity error: Segment entry checksum mismatch [segment 112, offset 65653905]

etc

I first reported the issue here (and closed it, thinking it was a one-off):
#8230

I have possibly more comprehensive info for this occurrence here:
https://projects.torsion.org/borgmatic-collective/borgmatic/issues/920

I have not repaired the repo this time in case it needs to be looked at in this state.

@sshaikh (Author) commented Oct 15, 2024

What is the expectation if borg running as a server encounters a hardware issue at the time of a backup (or prune or compact)? Is it possible for those operations to still complete successfully?

@sshaikh (Author) commented Oct 15, 2024

SMART output below. FWIW this disk does spin down when not in use.

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.0-26-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.11
Device Model:     ST3500320AS
Serial Number:    
LU WWN Device Id: 5 000c50 009c05306
Firmware Version: SD1A
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database 7.3/5319
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 1.5 Gb/s
Local Time is:    Tue Oct 15 14:59:40 2024 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.


General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  625) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 113) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x103b) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   116   099   006    Pre-fail  Always       -       114380031
  3 Spin_Up_Time            0x0003   094   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   098   098   020    Old_age   Always       -       2744
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       16
  7 Seek_Error_Rate         0x000f   071   060   030    Pre-fail  Always       -       120722964166
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       22285
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   098   037   020    Old_age   Always       -       2080
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       655373
189 High_Fly_Writes         0x003a   096   096   000    Old_age   Always       -       4
190 Airflow_Temperature_Cel 0x0022   070   044   045    Old_age   Always   In_the_past 30 (0 6 38 20 0)
194 Temperature_Celsius     0x0022   030   056   000    Old_age   Always       -       30 (0 18 0 0 0)
195 Hardware_ECC_Recovered  0x001a   048   022   000    Old_age   Always       -       114380031
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     22281         -
# 2  Short offline       Completed without error       00%     22257         -
# 3  Extended offline    Aborted by host               90%     22233         -
# 4  Short offline       Completed without error       00%     22206         -
# 5  Short offline       Completed without error       00%     22185         -
# 6  Short offline       Completed without error       00%     22161         -
# 7  Short offline       Completed without error       00%     22137         -
# 8  Short offline       Completed without error       00%     22113         -
# 9  Extended offline    Completed without error       00%     22099         -
#10  Short offline       Completed without error       00%     22090         -
#11  Extended offline    Aborted by host               80%     22067         -
#12  Short offline       Completed without error       00%     22039         -
#13  Short offline       Completed without error       00%     22013         -
#14  Short offline       Completed without error       00%     21994         -
#15  Short offline       Completed without error       00%     21970         -
#16  Short offline       Completed without error       00%     21946         -
#17  Short offline       Completed without error       00%     21927         -
#18  Extended offline    Aborted by host               80%     21904         -
#19  Short offline       Completed without error       00%     21876         -
#20  Short offline       Completed without error       00%     21855         -
#21  Short offline       Completed without error       00%     21843         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

@ThomasWaldmann (Member) commented:

About the SMART stats:

  • It has 16 reallocated sectors, so it has some issues. Keep an eye on this: if the count stays stable, it can be tolerated, but if the count is increasing, better get a new disk. I have RMA'd disks with fewer reallocated sectors.
  • There was a good extended offline test at 22099 hours.
  • You can avoid an extended offline test being aborted by keeping the disk a bit busy, e.g. running some command that causes a disk access once a minute; one possible loop is sketched below.
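
For example, one possible keep-alive loop (/dev/sdX is a placeholder for the disk under test, and this assumes the aborts come from the drive going idle and spinning down):

$ while true; do sudo dd if=/dev/sdX of=/dev/null bs=512 count=1 iflag=direct; sleep 60; done

iflag=direct bypasses the page cache, so each read actually reaches the drive rather than being served from memory.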

@ThomasWaldmann (Member) commented:

About borg:

  • If you use an ssh:// repo, the segment checksums (CRC32) are computed server-side (on the machine running borg serve).
  • borg serve receives chunk data from the borg client, then computes the CRC32, then stores checksum and data to disk as a "segment entry".

So if the checksum is incorrect, either the checksum or the data was corrupted after the checksum computation was done:

  • when storing the segment entry to disk
  • while the segment entry rests on disk (HDD media issues)
  • when reading the segment entry from disk
  • while the checksum and the data are in memory, before the checksum is checked

So it is very likely either a RAM or disk issue, but theoretically other hardware components could also be involved.
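
For illustration, the verification amounts to something like this minimal Python sketch (the entry layout is simplified; the real format lives in borg's repository code):

import struct
import zlib

def verify_entry(entry: bytes) -> bool:
    # An entry starts with a stored CRC32 (little-endian), followed by
    # the bytes it covers (size, tag, key and data in the real format).
    stored_crc = struct.unpack("<I", entry[:4])[0]
    computed = zlib.crc32(entry[4:]) & 0xFFFFFFFF
    return stored_crc == computed

A False result here is what gets reported as "Segment entry checksum mismatch".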

@ThomasWaldmann (Member) commented Oct 15, 2024

So, considering you have multiple error-free runs of memtest86+ (?) and the disk definitely has some issues, I'd replace the HDD.

It does not show many runtime hours, but on the other hand such 500GB SATA disks can be quite old (15 years?), so maybe the power-on-hours counter overflowed one or more times and should really show a much higher value?

@sshaikh (Author) commented Oct 15, 2024

I have some alternative disks to try, so I'm happy to try that as an easy first step. I presume there's nothing more to do than just copying over the files? I'll leave this issue open for a month or two.

@ThomasWaldmann (Member) commented:

Yes, maybe use rsync -avH --delete (be careful with --delete) so you can restart in case the transfer gets interrupted. After everything is synced, you could do another rsync run with its checksum option just to make sure.
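
Concretely, something like this (paths are illustrative):

$ rsync -avH --delete /backup/borg/ newdisk:/backup/borg/
$ rsync -avHc --delete /backup/borg/ newdisk:/backup/borg/

The second run adds -c/--checksum, which makes rsync re-read and compare file contents on both sides instead of trusting size and mtime.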

@sshaikh (Author) commented Oct 15, 2024

BTW, just as a sanity check: it's only the disk that the repo resides on that is relevant for hardware testing?

Oh, and presumably I should do a repair-create-repair cycle before moving the data over?

@ThomasWaldmann (Member) commented Oct 15, 2024

Suspect is all hardware between the CPU/RAM running borg serve and the disk where the repo data is stored.

Better run check --repair AFTER the repo is on the new disk.

@sshaikh (Author) commented Oct 15, 2024

Is there a reason why the checks list different segments each time they're run?

@ThomasWaldmann (Member) commented Oct 15, 2024

If you only run borg create, existing segments should not change; just new segments will be created (with higher numbers).

If you run borg compact though, it will compact from old to new segments. When doing that, I guess it will check the CRC32 while reading the old segments.

Iteration over segments might be in hashindex order for some commands, so do not assume a sorted or stable order.

@sshaikh (Author) commented Oct 15, 2024

But these were two non-repairing checks run one after another, with no create, prune or compact in between. Even if the order can differ (although the errors always seem to be listed in ascending order), the segments listed are different, as is the total number of segments listed. I would expect checks to be repeatable?

@ThomasWaldmann (Member) commented:

That would rather indicate that your hardware works in a non-reproducible / unstable way.

@sshaikh (Author) commented Oct 20, 2024

I transferred the repo to a new disk and attempted to run the repair. I get the following:

$ sudo borg check -v -p --repair  ssh://borg@server/backup/borg
This is a potentially dangerous function.
check --repair might lead to data loss (for kinds of corruption it is not
capable of dealing with). BE VERY CAREFUL!

Type 'YES' if you understand this and want to continue: YES
Remote: Starting repository check
Remote: Data integrity error: Segment entry checksum mismatch [segment 123, offset 130189715]
Remote: attempting to recover /backup/borg/data/0/123
Remote: Data integrity error: Segment entry checksum mismatch [segment 132, offset 398640498]
Remote: attempting to recover /backup/borg/data/0/132
Data integrity error: Segment entry checksum mismatch [segment 132, offset 356431574]
Traceback (most recent call last):
  File "borg/archiver.py", line 5391, in main
  File "borg/archiver.py", line 5309, in run
  File "borg/archiver.py", line 191, in wrapper
  File "borg/archiver.py", line 365, in do_check
  File "borg/remote.py", line 491, in do_rpc
  File "borg/remote.py", line 727, in call
  File "borg/remote.py", line 836, in call_many
  File "borg/remote.py", line 772, in handle_error
borg.helpers.errors.IntegrityError: Data integrity error: Segment entry checksum mismatch [segment 132, offset 356431574]

Platform: Linux nas 6.1.0-25-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64
Linux: Unknown Linux
Borg: 1.4.0  Python: CPython 3.11.9 msgpack: 1.0.8 fuse: llfuse 1.5.0 [pyfuse3,llfuse]
PID: 1865711  CWD: /user
sys.argv: ['borg', 'check', '-v', '-p', '--repair', 'ssh://borg@server/backup/borg']
SSH_ORIGINAL_COMMAND: None

It seems a little odd that the repair itself fails?

@ThomasWaldmann (Member) commented:

Yes. It is also odd that one does not see exactly where it fails.

Can you use a completely different machine, make a new copy of the repo and try the repo check again?

@sshaikh (Author) commented Oct 21, 2024

In the meantime, the most recent scheduled backup completed successfully (including prunes, compacts and checks). Very strange!

I'll try the different HW stack next.

@ThomasWaldmann (Member) commented:

Added the hardware issue label. Not 100% sure yet, but that's my best guess.

@sshaikh (Author) commented Oct 21, 2024

Another data point: I tried another check (non-repair) after a backup, and both the client's and the server's borg processes have hung (0% CPU).

As the already-copied repo ("repo1") has since moved on, I will go back to the repo version on the original disk ("repo0") for the hardware test.

@sshaikh (Author) commented Oct 21, 2024

"repo0" on completely new HW checks fine (no errors) which points to a HW issue on the original server, although on the bright side it validates the original HDD. I think that's enough evidence.

I'm testing the RAM again and more thoroughly, but in the meantime can you advise on what borg does during check so I can possibly try to reproduce? Perhaps calculating hashes in a loop or something?

@ThomasWaldmann (Member) commented:

The repository part of borg check reads all data in the segment files and does a CRC32 check for each entry.
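
If you want to approximate that load for stress testing, here is a rough sketch (not borg's actual per-entry check; it just CRC32s whole segment files in a loop, and the repo path is an example):

import zlib
from pathlib import Path

repo_data = Path("/backup/borg/data")  # adjust to your repo layout

def crc32_file(path: Path) -> int:
    # CRC32 a file in 1 MiB chunks to avoid loading it all into memory.
    crc = 0
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF

while True:
    for segment in sorted(p for p in repo_data.glob("*/*") if p.is_file()):
        print(f"{segment}: {crc32_file(segment):08x}")

If successive passes print different values for the same unmodified file, that would reproduce the kind of instability suspected here.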
