Segment entry checksum mismatch #8473
What is the expectation if borg, running as a server, encounters a hardware issue at the time of backup (or prune or compact)? Is it possible for those operations to complete successfully?
SMART output (FWIW, this disk does spin down when not in use):

About the SMART stats:

About borg:
So if the checksum is incorrect, either the checksum or the data was corrupted after the checksum was computed.
So it is very likely a RAM or disk issue, though theoretically other hardware components could be involved as well.
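For intuition: each segment entry carries a CRC32 computed at write time, and on read the checksum is recomputed and compared. A minimal sketch of that idea, assuming a simplified `[crc32][size][data]` entry layout (the real borg segment format differs):

```python
# Simplified sketch of per-entry CRC32 verification; the actual borg
# segment entry layout has more fields than this illustration.
import struct
import zlib

def pack_entry(data: bytes) -> bytes:
    # The checksum covers size + data and is computed once, at write time.
    body = struct.pack("<I", len(data)) + data
    return struct.pack("<I", zlib.crc32(body) & 0xFFFFFFFF) + body

def verify_entry(entry: bytes) -> bytes:
    stored_crc = struct.unpack_from("<I", entry, 0)[0]
    body = entry[4:]
    if zlib.crc32(body) & 0xFFFFFFFF != stored_crc:
        # Either the data or the stored checksum was corrupted somewhere
        # between RAM, bus, controller and platter after the write.
        raise IOError("Segment entry checksum mismatch")
    return body[4:]  # strip the size field, return the data

entry = pack_entry(b"chunk data")
assert verify_entry(entry) == b"chunk data"

# Flipping a single bit anywhere in the entry makes verification fail:
corrupted = entry[:-1] + bytes([entry[-1] ^ 0x01])
```

The point of the sketch: the check cannot tell you *which* component corrupted the bytes, only that data and checksum no longer agree.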
So, considering you have multiple error-free runs of memtest86+ (?) and the disk definitely has some issues, I'd exchange the HDD. It does not show many runtime hours, but on the other hand such 500 GB SATA disks could be quite old overall (15 years?), so maybe the runtime counter has overflowed one or more times and should really show a much higher value?
I have some alternative disks to try, so I'm happy to try that as an easy first step. I presume there's nothing more to do than just copying over the files? I'll leave this task open for a month or two.
Yes, maybe use …
BTW, just for a sanity check: it's only the disk that the repo resides on that is relevant for HW testing? Oh, and presumably I should repair, create, repair before moving the data over?
Suspect is all hardware between the CPU/RAM running `borg serve` and the disk where the repo data is stored. Better to run the `check --repair` AFTER the repo is on the new disk.
Is there a reason why the checks list different segments each time they're run?
If you only run `borg create`, existing segments should not change; new segments will just be created (with higher numbers). If you run `borg compact`, though, it will compact from old segments into new ones, and when doing that I guess it will check the CRC32 while reading the old segments. Iteration over segments might be in hashindex order for some commands, so do not assume a sorted or stable order.
But these would be two non-repairing `check` runs.
That would rather indicate that your hardware works in a non-reproducible / unstable way.
I transferred the repo to a new disk and attempted to run the repair. I get the following:

It seems a little odd that the repair fails?
Yes. It is also odd that one cannot see exactly where it fails. Can you use a completely different machine, make a new copy of the repo, and try the repo check again?
In the meantime, the most recent scheduled backup completed successfully (including prunes, compacts and checks). Very strange! I'll try the different HW stack next.
Added the hardware issue label. Not 100% sure yet, but that's my best guess.
Another data point: tried another … As the already-copied repo set has moved on ("repo1"), I will go back to the repo version on the original disk ("repo0") for the HW test.
"repo0" on completely new HW checks fine (no errors) which points to a HW issue on the original server, although on the bright side it validates the original HDD. I think that's enough evidence. I'm testing the RAM again and more thoroughly, but in the meantime can you advise on what borg does during check so I can possibly try to reproduce? Perhaps calculating hashes in a loop or something? |
The repository part of `check` …
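To stress the hardware path outside of borg, one rough approach (my own suggestion, not a borg tool) is to CRC32 every file under the repo twice and compare the two passes: stable hardware always matches, while mismatches point at RAM/disk/bus instability. The `repo` path below is a hypothetical placeholder:

```python
# Checksum every file under a directory tree twice and diff the results.
# A mismatch between passes means the read path returned different bytes
# for the same on-disk data, i.e. a hardware or kernel-level problem.
import os
import zlib

def crc_tree(root: str) -> dict:
    sums = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            crc = 0
            with open(path, "rb") as f:
                # Stream in 1 MiB chunks so large segment files fit in memory.
                while chunk := f.read(1 << 20):
                    crc = zlib.crc32(chunk, crc)
            sums[path] = crc & 0xFFFFFFFF
    return sums

if __name__ == "__main__":
    repo = "/mnt/newdisk/borgrepo/data"  # hypothetical path, adjust to yours
    first, second = crc_tree(repo), crc_tree(repo)
    bad = [p for p in first if first[p] != second[p]]
    print("unstable files:", bad or "none")
```

Running this in a loop for a while (ideally alongside a memory load) exercises roughly the same read-and-checksum pattern as a repository check.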
Have you checked borgbackup docs, FAQ, and open GitHub issues?
Yes, and I have seen similar issues, but I feel this may be different as the HW has passed tests.
Is this a BUG / ISSUE report or a QUESTION?
ISSUE
Client: borg 1.4.0
Server: borg 1.4.0
Client: Linux client 6.1.0-25-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux
Server: Linux server 6.1.0-26-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.112-1 (2024-09-30) x86_64 GNU/Linux
Hardware / network configuration, and filesystems used.
Server: ext4
How much data is handled by borg?
~210GB
Full borg commandline that led to the problem (leave away excludes and passwords)
I use borgmatic to create, prune, compact and check twice daily.
Describe the problem you're observing.
Every few months since upgrading to 1.4.0 I get `Segment entry checksum mismatch` errors on the remote repo. Repairs make them go away. I accept that this points to a HW issue, but:

- `check` in the same backup run fails, when the create, prune and compact immediately preceding it complete successfully. However, once in this state no actions can be taken on that repo.
- If I run `check` again (without a repair), I get a different (but possibly growing) set of `Segment entry checksum mismatch` errors (i.e. on different segments).

Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.
Once the checks fail, they always fail. However, it is not clear when the failures start; only that the create, prune and compact immediately preceding them seem to succeed fully.
Include any warning/errors/backtraces from the system logs
I first reported the issue here (and closed it thinking it was a one off):
#8230
I have possibly more comprehensive info for this occurrence here:
https://projects.torsion.org/borgmatic-collective/borgmatic/issues/920
I have not repaired the repo this time in case it needs to be looked at in this state.