Integrity Error: Segment entry checksum mismatch #5863

Closed
WarglBlargl opened this issue Jun 21, 2021 · 10 comments

Comments

@WarglBlargl

Have you checked borgbackup docs, FAQ, and open Github issues?

Yes

Is this a BUG / ISSUE report or a QUESTION?

BUG

System information. For client/server mode post info for both machines.

Your borg version (borg -V).

borgbackup 1.1.9

Operating system (distribution) and version.

Kernel: 4.19.0-14-amd64 x86_64

Distributor ID: Debian
Description: Debian GNU/Linux 10 (buster)
Release: 10
Codename: buster

Hardware / network configuration, and filesystems used.

NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 465,8G 0 disk
├─sda1 8:1 0 487M 0 part /boot
├─sda2 8:2 0 1K 0 part
└─sda5 8:5 0 465,3G 0 part
└─sda5_crypt 254:0 0 465,3G 0 crypt
├─cloud--vg-root 254:1 0 464,3G 0 lvm /
└─cloud--vg-swap_1 254:2 0 980M 0 lvm
sdb 8:16 0 465,8G 0 disk
└─sdb1 8:17 0 465,8G 0 part /mnt/backup

Borg Backup Repo is saved on sdb1.

How much data is handled by borg?

The disk which is backed up has 500GB and 50% of it in use:
/dev/mapper/cloud--vg-root 478171912 222746256 231066132 50% /

Backups are stored on a second 500GB disk :
/dev/sdb1 479672040 286625248 168627516 63% /mnt/backup

Full borg command line that led to the problem (leave out excludes and passwords)

root@cloud: borg list /mnt/backup/borg-backups/backup.borg/

Data integrity error: Segment entry checksum mismatch [segment 630, offset 60450982]
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 4455, in main
    exit_code = archiver.run(args)
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 4387, in run
    return set_ec(func(args))
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 141, in wrapper
    kwargs['manifest'], kwargs['key'] = Manifest.load(repository, compatibility)
  File "/usr/lib/python3/dist-packages/borg/helpers.py", line 330, in load
    cdata = repository.get(cls.MANIFEST_ID)
  File "/usr/lib/python3/dist-packages/borg/repository.py", line 1070, in get
    self.index = self.open_index(self.get_transaction_id())
  File "/usr/lib/python3/dist-packages/borg/repository.py", line 376, in get_transaction_id
    self.check_transaction()
  File "/usr/lib/python3/dist-packages/borg/repository.py", line 373, in check_transaction
    self.replay_segments(replay_from, segments_transaction_id)
  File "/usr/lib/python3/dist-packages/borg/repository.py", line 812, in replay_segments
    self._update_index(segment, objects)
  File "/usr/lib/python3/dist-packages/borg/repository.py", line 822, in _update_index
    for tag, key, offset, size in objects:
  File "/usr/lib/python3/dist-packages/borg/repository.py", line 1353, in iter_objects
    read_data=read_data)
  File "/usr/lib/python3/dist-packages/borg/repository.py", line 1451, in _read
    segment, offset))
borg.helpers.IntegrityError: Data integrity error: Segment entry checksum mismatch [segment 630, offset 60450982]

Platform: Linux cloud 4.19.0-14-amd64 #1 SMP Debian 4.19.171-2 (2021-01-30) x86_64
Linux: debian 10.9
Borg: 1.1.9  Python: CPython 3.7.3
PID: 2351  CWD: /opt/gitlab
sys.argv: ['/usr/bin/borg', 'list', '/mnt/backup/borg-backups/backup.borg/']
SSH_ORIGINAL_COMMAND: None

root@cloud:borg check --repair /mnt/backup/borg-backups/backup.borg/

'check --repair' is an experimental feature that might result in data loss.
Type 'YES' if you understand this and want to continue: YES
Data integrity error: Segment entry checksum mismatch [segment 166, offset 360036033]
Data integrity error: Segment entry checksum mismatch [segment 235, offset 252144524]
Data integrity error: Segment entry checksum mismatch [segment 256, offset 411389105]
Data integrity error: Segment entry checksum mismatch [segment 257, offset 47798425]

After 5 hours I had to stop the repair. I have now started it again.

Describe the problem you're observing.

I am not able to access the repository or list any backups.
This has happened before, and I deleted the entire repository to get it working again, but it seems that didn't help.

Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.

It can be reproduced on my system by simply running borg list.

Include any warning/errors/backtraces from the system logs

@infectormp
Contributor

Did you check the HDD and filesystem for errors?
Also, please update borg to the latest version.

@WarglBlargl
Author

hdd check
root@cloud:~/.config/borg/keys# sudo smartctl -H /dev/sdb1

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-14-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

Upgrade output:
root@cloud:~/.config/borg/keys# borg upgrade -v /mnt/backup/borg-backups/backup.borg/

making a hardlink copy in /mnt/backup/borg-backups/backup.borg.before-upgrade-2021-06-21-10:29:26
opening attic repository with borg and converting
no key file found for repository
converting repo index /mnt/backup/borg-backups/backup.borg/index.622
converting 597 segments...
converting borg 0.xx to borg current
no key file found for repository

root@cloud:~/.config/borg/keys# ls -l ~/.config/borg/keys

-rw------- 1 root root 553 Feb 15 14:43 mnt_backup_borg_backups_backup_borg

I deleted and reinitialized the borg repo about one week ago, so the key file from 15 Feb is old. Could that be a problem?

@WarglBlargl
Author

WarglBlargl commented Jun 21, 2021

root@cloud: sudo fsck -fv /dev/sdb1

fsck from util-linux 2.33.1
e2fsck 1.44.5 (15-Dec-2018)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

         638 inodes used (0.00%, out of 30531584)
           7 non-contiguous files (1.1%)
           2 non-contiguous directories (0.3%)
             # of inodes with ind/dind/tind blocks: 572/571/0
    74080467 blocks used (60.67%, out of 122096390)
           0 bad blocks
           1 large file

         611 regular files
          18 directories
           0 character device files
           0 block device files
           0 fifos
        2424 links
           0 symbolic links (0 fast symbolic links)
           0 sockets
------------
        3053 files

@WarglBlargl
Author

It is not even possible to delete the repo anymore:

root@cloud:borg delete --force --force /mnt/backup/borg-backups/backup.borg

Data integrity error: Segment entry checksum mismatch [segment 630, offset 60450982]
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 4455, in main
    exit_code = archiver.run(args)
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 4387, in run
    return set_ec(func(args))
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 154, in wrapper
    return method(self, args, repository=repository, **kwargs)
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 1242, in do_delete
    return self._delete_repository(args, repository)
  File "/usr/lib/python3/dist-packages/borg/archiver.py", line 1311, in _delete_repository
    manifest, key = Manifest.load(repository, Manifest.NO_OPERATION_CHECK)
  File "/usr/lib/python3/dist-packages/borg/helpers.py", line 330, in load
    cdata = repository.get(cls.MANIFEST_ID)
  File "/usr/lib/python3/dist-packages/borg/repository.py", line 1070, in get
    self.index = self.open_index(self.get_transaction_id())
  File "/usr/lib/python3/dist-packages/borg/repository.py", line 376, in get_transaction_id
    self.check_transaction()
  File "/usr/lib/python3/dist-packages/borg/repository.py", line 373, in check_transaction
    self.replay_segments(replay_from, segments_transaction_id)
  File "/usr/lib/python3/dist-packages/borg/repository.py", line 812, in replay_segments
    self._update_index(segment, objects)
  File "/usr/lib/python3/dist-packages/borg/repository.py", line 822, in _update_index
    for tag, key, offset, size in objects:
  File "/usr/lib/python3/dist-packages/borg/repository.py", line 1353, in iter_objects
    read_data=read_data)
  File "/usr/lib/python3/dist-packages/borg/repository.py", line 1451, in _read
    segment, offset))
borg.helpers.IntegrityError: Data integrity error: Segment entry checksum mismatch [segment 630, offset 60450982]

Platform: Linux cloud 4.19.0-14-amd64 #1 SMP Debian 4.19.171-2 (2021-01-30) x86_64
Linux: debian 10.10
Borg: 1.1.9  Python: CPython 3.7.3
PID: 15054  CWD: /opt/nextcloud
sys.argv: ['/usr/bin/borg', 'delete', '--force', '--force', '/mnt/backup/borg-backups/backup.borg']
SSH_ORIGINAL_COMMAND: None

@ThomasWaldmann
Member

ThomasWaldmann commented Jun 21, 2021

By upgrading borg I meant going from 1.1.9 to 1.1.16 (not running "borg upgrade"). See the Debian backports repo.

BTW, SMART almost always says "PASSED" unless the device is obviously rather dead. So can you please provide the full smartctl -a output?

"Segment entry checksum mismatch" means that the CRC32 check failed. If borg check --repair says that, it is fairly certain that your on-disk data is corrupted (segment files == stuff in REPO/data/...).

If your disk is OK, bad RAM could also be the culprit, so I suggest running a few passes of memtest86+.
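
To illustrate what a failed CRC32 check on a stored entry means, here is a minimal sketch. The entry layout used (a 4-byte CRC32, a 4-byte total size, then the payload) and the helper names `pack_entry` and `check_entry` are assumptions made for this example and are not taken from borg's repository code. The only point is that the checksum stored at write time no longer matches the bytes read back if even one bit changed in between, whether on disk or in RAM.

```python
import struct
import zlib

# Hypothetical entry layout for illustration only (not borg's real format):
# 4-byte little-endian CRC32, 4-byte little-endian total size, then payload.
HEADER = struct.Struct("<II")


def pack_entry(payload: bytes) -> bytes:
    """Build an entry whose CRC32 covers the size field and the payload."""
    size = HEADER.size + len(payload)
    crc = zlib.crc32(struct.pack("<I", size) + payload)
    return HEADER.pack(crc, size) + payload


def check_entry(entry: bytes) -> bool:
    """Recompute the CRC32 over everything after the CRC field and compare."""
    stored_crc, size = HEADER.unpack_from(entry)
    return zlib.crc32(entry[4:size]) == stored_crc


entry = pack_entry(b"chunk data as written by the backup")

# Simulate a single flipped bit, as a bad disk sector or RAM cell could cause.
damaged = bytearray(entry)
damaged[-1] ^= 0x10

print(check_entry(entry))           # intact entry passes
print(check_entry(bytes(damaged)))  # damaged entry fails the check
```

A check like this can only say that the bytes differ from what was written. It cannot tell whether the disk, the RAM, or something else changed them, which is why the hardware tests are needed.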

@WarglBlargl
Author

Thanks for your answer @ThomasWaldmann, I am new to handling Linux servers so this helps a lot.
I tried upgrading borg via "apt-get install borgbackup=1.1.16" but it only finds version 1.1.9.
I have now downloaded and unpacked "borgbackup_1.1.16.orig.tar.gz" but I am not sure how to proceed with the installation.

This is the complete smartctl output:
root@cloud:~# sudo smartctl -a /dev/sdb

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-14-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.12
Device Model:     ST3500413AS
Serial Number:    Z2ABDHD2
LU WWN Device Id: 5 000c50 03615e772
Firmware Version: JC47
User Capacity:    500.107.862.016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Jun 25 12:08:24 2021 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  600) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  83) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   113   098   006    Pre-fail  Always       -       55304838
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   097   097   020    Old_age   Always       -       3986
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       334469647
  9 Power_On_Hours          0x0032   076   076   000    Old_age   Always       -       21626
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   097   097   020    Old_age   Always       -       3840
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       184
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       3
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   063   056   045    Old_age   Always       -       37 (Min/Max 36/42)
194 Temperature_Celsius     0x0022   037   044   000    Old_age   Always       -       37 (0 12 0 0 0)
195 Hardware_ECC_Recovered  0x001a   032   023   000    Old_age   Always       -       55304838
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       32024 (69 98 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       209112772
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1293707575

SMART Error Log Version: 1
ATA Error Count: 184 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 184 occurred at disk power-on lifetime: 1378 hours (57 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 bc e5 14 02  Error: WP at LBA = 0x0214e5bc = 34923964

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 08 f0 dd 1b 42 00      02:35:53.340  WRITE FPDMA QUEUED
  61 00 08 d0 9a 18 42 00      02:35:53.339  WRITE FPDMA QUEUED
  61 00 08 00 3f 11 42 00      02:35:53.339  WRITE FPDMA QUEUED
  61 00 08 e8 dd 17 42 00      02:35:53.339  WRITE FPDMA QUEUED
  61 00 7b 40 09 cf 41 00      02:35:53.338  WRITE FPDMA QUEUED

Error 183 occurred at disk power-on lifetime: 1378 hours (57 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 bc e5 14 02  Error: WP at LBA = 0x0214e5bc = 34923964

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 20 ff ff ff 4f 00      02:35:50.360  WRITE FPDMA QUEUED
  60 00 40 42 63 14 42 00      02:35:50.358  READ FPDMA QUEUED
  61 00 60 ff ff ff 4f 00      02:35:50.358  WRITE FPDMA QUEUED
  60 00 00 a0 e5 14 42 00      02:35:50.354  READ FPDMA QUEUED
  60 00 08 48 de ba 41 00      02:35:50.354  READ FPDMA QUEUED

Error 182 occurred at disk power-on lifetime: 1378 hours (57 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 bc e5 14 02  Error: WP at LBA = 0x0214e5bc = 34923964

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 08 f0 3e 11 42 00      02:35:47.369  WRITE FPDMA QUEUED
  61 00 08 00 3f 11 42 00      02:35:47.368  WRITE FPDMA QUEUED
  61 00 68 e8 2a 12 42 00      02:35:47.368  WRITE FPDMA QUEUED
  60 00 40 f2 74 14 42 00      02:35:47.368  READ FPDMA QUEUED
  60 00 00 a0 e5 14 42 00      02:35:47.367  READ FPDMA QUEUED

Error 181 occurred at disk power-on lifetime: 1378 hours (57 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 bc e5 14 02  Error: UNC at LBA = 0x0214e5bc = 34923964

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 10 60 13 cf 41 00      02:35:44.387  READ FPDMA QUEUED
  60 00 28 78 b1 bb 41 00      02:35:44.384  READ FPDMA QUEUED
  60 00 08 da dd e0 41 00      02:35:44.361  READ FPDMA QUEUED
  60 00 02 88 3c 4b 45 00      02:35:44.361  READ FPDMA QUEUED
  60 00 38 b8 6f 57 43 00      02:35:44.361  READ FPDMA QUEUED

Error 180 occurred at disk power-on lifetime: 1378 hours (57 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 bc e5 14 02  Error: WP at LBA = 0x0214e5bc = 34923964

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 08 e0 1f d0 41 00      02:35:41.290  WRITE FPDMA QUEUED
  61 00 08 e8 3e 11 42 00      02:35:41.290  WRITE FPDMA QUEUED
  61 00 08 f8 3e 11 42 00      02:35:41.289  WRITE FPDMA QUEUED
  61 00 08 68 ca cd 41 00      02:35:41.289  WRITE FPDMA QUEUED
  61 00 08 c8 1a d1 41 00      02:35:41.289  WRITE FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     21528         -
# 2  Short offline       Completed without error       00%     21461         -
# 3  Short offline       Completed without error       00%     18527         -
# 4  Short offline       Completed without error       00%     18527         -
# 5  Short offline       Completed without error       00%      5495         -
# 6  Short offline       Aborted by host               90%      5495         -
# 7  Short offline       Completed without error       00%      5349         -
# 8  Short offline       Aborted by host               90%      5349         -
# 9  Short offline       Completed without error       00%      5201         -
#10  Short offline       Aborted by host               90%      5201         -
#11  Short offline       Aborted by host               10%      5073         -
#12  Short offline       Aborted by host               90%      5073         -
#13  Short offline       Completed without error       00%      5048         -
#14  Short offline       Aborted by host               90%      5048         -
#15  Short offline       Completed without error       00%      4730         -
#16  Short offline       Aborted by host               90%      4730         -
#17  Short offline       Completed without error       00%      4612         -
#18  Short offline       Aborted by host               90%      4612         -
#19  Short offline       Completed without error       00%      4492         -
#20  Short offline       Aborted by host               90%      4492         -
#21  Short offline       Completed without error       00%      4352         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

memtester output:
I don't know whether this is enough or whether memtest86+ is needed.
root@cloud:~# sudo memtester 4000 1

memtester version 4.3.0 (64-bit)
Copyright (C) 2001-2012 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).

pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 4000MB (4194304000 bytes)
got  4000MB (4194304000 bytes), trying mlock ...locked.
Loop 1/1:
  Stuck Address       : ok
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok
  Block Sequential    : ok
  Checkerboard        : ok
  Bit Spread          : ok
  Bit Flip            : ok
  Walking Ones        : ok
  Walking Zeroes      : ok
  8-bit Writes        : ok
  16-bit Writes       : ok

Done.

@ThomasWaldmann
Member

ThomasWaldmann commented Jun 25, 2021

That disk has issues, see "UNC" (uncorrectable error). Maybe you want to get a new one before it is too late.
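
For readers unfamiliar with the error log format, the following sketch shows how the LBA printed next to "UNC" can be recovered from the register dump on the line above it. It assumes the 28-bit layout (low 4 bits of DH, then CH, CL, SN), which is consistent with the value smartctl printed here. The function name `lba28_from_registers` is made up for this example.

```python
def lba28_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Combine the ATA task-file registers into a 28-bit LBA.

    Assumed layout: bits 27-24 come from the low nibble of DH,
    then CH, CL and SN supply the remaining three bytes.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register line from the log, in the order ER ST SC SN CL CH DH.
er, st, sc, sn, cl, ch, dh = bytes.fromhex("40 51 00 bc e5 14 02")

lba = lba28_from_registers(sn, cl, ch, dh)
print(hex(lba), lba)
```

All five logged errors decode to the same sector, so the log records repeated failures at one location rather than damage spread across the disk.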

The SMART attributes table was truncated on the right (RAW values column missing), so I can't tell much more.

Also, you've never run a SMART long test. smartctl -t long /dev/sdX (can take a while, but good for a full test).

The memtester is not the one I meant, but at least the 4000MB it tested seemed to have worked while it tested them.

@WarglBlargl
Author

When testing with 10 GB, memtester threw a bunch of errors, so I guess I will have to replace the RAM as well:
root@cloud:/var/log/performance# sudo memtester 10000 1

memtester version 4.3.0 (64-bit)
Copyright (C) 2001-2012 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).

pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 10000MB (10485760000 bytes)
got  10000MB (10485760000 bytes), trying mlock ...locked.
Loop 1/1:
  Stuck Address       : testing   1FAILURE: possible bad address line at offset 0x1cfe9d1c0.
Skipping to next test...
  Random Value        : FAILURE: 0xbfbbd3877efb460a != 0xbfbbd3977efb460a at offset 0x9769d9b8.
FAILURE: 0x68c295a21307d6bf != 0x68c295b21307d6bf at offset 0x9769d9b8.
  Compare XOR         : FAILURE: 0x31ed7201940002f9 != 0x31ed7211940002f9 at offset 0x9769d9b8.
  Compare SUB         : FAILURE: 0x4728c4501cdb8ca8 != 0x9ce002d01cdb8ca8 at offset 0x9769d9b8.
  Compare MUL         : FAILURE: 0x00000001 != 0x1000000002 at offset 0x9769d9b8.
  Compare DIV         : FAILURE: 0x7fbfec07ddefaf25 != 0x7fbfec17ddefaf27 at offset 0x9769d9b8.
  Compare OR          : FAILURE: 0x5fbd4001c5a3a305 != 0x5fbd4011c5a3a307 at offset 0x9769d9b8.
  Compare AND         :   Sequential Increment: ok
  Solid Bits          : testing   0FAILURE: 0x00000000 != 0x1000000000 at offset 0x9769d9b8.
  Block Sequential    : testing   0FAILURE: 0x00000000 != 0x1000000000 at offset 0x9769d9b8.
  Checkerboard        : testing   0FAILURE: 0xaaaaaaaaaaaaaaaa != 0xaaaaaabaaaaaaaaa at offset 0x9769d9b8.
  Bit Spread          : testing  34FAILURE: 0xffffffebffffffff != 0xfffffffbffffffff at offset 0x9769d9b8.
  Bit Flip            : testing   0FAILURE: 0x00000001 != 0x1000000001 at offset 0x9769d9b8.
  Walking Ones        : testing  36FAILURE: 0xffffffefffffffff != 0xffffffffffffffff at offset 0x9769d9b8.
  Walking Zeroes      : testing   0FAILURE: 0x00000001 != 0x1000000001 at offset 0x9769d9b8.
  8-bit Writes        : -FAILURE: 0x7cffa5e03eb798fb != 0x7cffa5f03eb798fb at offset 0x9769d9b8.
  16-bit Writes       : -FAILURE: 0xbaf3552ddf88320e != 0xbaf3553ddf88320e at offset 0x9769d9b8.

Done.
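
A quick way to read the failures above is to XOR each expected/actual pair and see which bits differ. The sketch below does that for a handful of pairs copied from the output. The dictionary name `PAIRS` and the helper `differing_bits` are just names chosen for this example. The arithmetic tests (SUB, MUL, DIV) are left out because a single wrong bit propagates through the calculation there.

```python
# Expected/actual value pairs copied from the memtester output above,
# all reported at offset 0x9769d9b8.
PAIRS = {
    "Random Value": (0xBFBBD3877EFB460A, 0xBFBBD3977EFB460A),
    "Compare XOR": (0x31ED7201940002F9, 0x31ED7211940002F9),
    "Checkerboard": (0xAAAAAAAAAAAAAAAA, 0xAAAAAABAAAAAAAAA),
    "Walking Ones": (0xFFFFFFEFFFFFFFFF, 0xFFFFFFFFFFFFFFFF),
    "8-bit Writes": (0x7CFFA5E03EB798FB, 0x7CFFA5F03EB798FB),
}


def differing_bits(a: int, b: int) -> list:
    """Return the positions (0 = least significant) where a and b differ."""
    diff = a ^ b
    return [bit for bit in range(64) if (diff >> bit) & 1]


for test_name, (left, right) in PAIRS.items():
    print(f"{test_name:13s} differs at bit(s) {differing_bits(left, right)}")
```

Every pair differs in bit 36 only, at the same offset, which is consistent with a single faulty bit in one memory location rather than random noise.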

Rest of the truncated smartctl output:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   113   098   006    Pre-fail  Always       -       55360802
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   097   097   020    Old_age   Always       -       3986
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       334473896
  9 Power_On_Hours          0x0032   076   076   000    Old_age   Always       -       21767
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   097   097   020    Old_age   Always       -       3840
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       184
188 Command_Timeout         0x0032   099   099   000    Old_age   Always       -       4295032836
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   063   056   045    Old_age   Always       -       37 (Min/Max 36/42)
194 Temperature_Celsius     0x0022   037   044   000    Old_age   Always       -       37 (0 12 0 0 0)
195 Hardware_ECC_Recovered  0x001a   026   023   000    Old_age   Always       -       55360802
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       32165 (99 12 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       209112772
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1293722474

Thanks for your help

@ThomasWaldmann
Member

ThomasWaldmann commented Jul 1, 2021

Your RAM is defective (or the CPU or board, if the problem persists after replacing the RAM).

Your disk is at least questionable:

187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       **184**

See there: https://www.backblaze.com/blog/hard-drive-smart-stats/

(Also see my comment above about the UNC entries in the SMART log. I would replace that disk as well.)
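
As a rough sketch of how the attribute table can be screened, the snippet below picks out the five attributes the linked Backblaze article focuses on (5, 187, 188, 197, 198) and reports those with a nonzero raw value. The sample rows are copied from the table posted above. The names `WATCHED_IDS` and `nonzero_watched` are invented for this example, and the parsing assumes the ten-column layout smartctl printed here.

```python
# SMART attribute IDs highlighted in the Backblaze article linked above.
WATCHED_IDS = {5, 187, 188, 197, 198}

# Rows copied from the smartctl attribute table posted in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       184
188 Command_Timeout         0x0032   099   099   000    Old_age   Always       -       4295032836
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""


def nonzero_watched(table: str) -> dict:
    """Map attribute name to raw value for watched attributes that are nonzero."""
    flagged = {}
    for line in table.splitlines():
        fields = line.split()
        # Skip anything that is not an attribute row (assumed: 10+ columns).
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], int(fields[9])
        if attr_id in WATCHED_IDS and raw > 0:
            flagged[name] = raw
    return flagged


print(nonzero_watched(SAMPLE))
```

A nonzero raw value is only a prompt to look closer, since vendors encode some raw values in their own way, but Reported_Uncorrect at 184 matches the 184 entries in the ATA error log.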

@ThomasWaldmann
Member

Closing: this is not a borg issue but a hardware issue.
