Corrupt data after recovering from a suspended pool #13879

Open
allanjude opened this issue Sep 12, 2022 · 2 comments
Labels: Status: Stale (No recent activity for issue), Type: Defect (Incorrect behavior, e.g. crash, hang)

Comments

@allanjude (Contributor)

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Linux |
| Distribution Version | Ubuntu 18.04 |
| Kernel Version | 5.3.0-26-generic 28~18.04.1~Ubuntu |
| Architecture | x86_64 |
| OpenZFS Version | 0.8.2-1 (and 2.1.99) |

Describe the problem you're observing

After a pool becomes suspended due to losing too many disks, some files that were written just before the pool was suspended are unrecoverable. ZFS should know whether the write completed successfully, and should not discard the dirty data until it has been written properly.

We suspect that PART of this problem is that zio_flush() sets the ZIO_FLAG_DONT_PROPAGATE flag, so errors are not sent to the parent ZIO. However, even without that flag we still see the problem, so we are investigating further.
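For reference, below is a paraphrased sketch of zio_flush() as it appears in the affected releases (see module/zfs/zio.c for the authoritative code; the exact arguments may differ between versions). The flush is issued as a fire-and-forget ioctl child with ZIO_FLAG_DONT_PROPAGATE set, so a failed cache flush never surfaces an error to the parent ZIO:

```c
/*
 * Paraphrased sketch of zio_flush() (see module/zfs/zio.c for the real
 * code; details may vary between releases).
 *
 * The flush is a fire-and-forget child I/O: ZIO_FLAG_CANFAIL lets it
 * fail without suspending the pool, ZIO_FLAG_DONT_RETRY avoids
 * reissuing it, and ZIO_FLAG_DONT_PROPAGATE keeps any error from being
 * reported up to the parent zio (the behavior questioned above).
 */
void
zio_flush(zio_t *pio, vdev_t *vd)
{
	zio_nowait(zio_ioctl(pio, pio->io_spa, vd, DKIOCFLUSHWRITECACHE,
	    NULL, NULL, ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE |
	    ZIO_FLAG_DONT_RETRY));
}
```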

Describe how to reproduce the problem

We used zinject to FAULT more disks than the RAID-Z configuration can withstand. After removing the zinject handlers and running zpool clear, there are persistent checksum errors or completely unreadable files.
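A minimal sketch of that zinject-based reproduction, assuming a raidz3 pool named pool-name; the device names below are placeholders taken from the status output further down:

```sh
# Fault more member disks than the raidz3 vdev can tolerate (4 here)
# while a write workload is running against the pool; the pool suspends.
zinject -d wwn-0x5000cca2970a9044 -A fault pool-name
zinject -d wwn-0x5000cca2970ac920 -A fault pool-name
zinject -d wwn-0x5000cca2970c1ce4 -A fault pool-name
zinject -d wwn-0x5000cca2970c1fe4 -A fault pool-name

# Remove all injection handlers and resume the suspended pool.
zinject -c all
zpool clear pool-name

# Files written just before the suspension now show persistent
# checksum errors or are completely unreadable.
```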

We were able to reproduce this more reliably on real hardware by using enclosure management tools to power off several of the pool's disks at once, causing it to become suspended.

Include any warning/errors/backtraces from the system logs

Sep 09 18:46:24 ZFS kernel: mpt3sas_cm0: log_info(0x31120101): originator(PL), code(0x12), sub_code(0x0101)
Sep 09 18:46:24 ZFS kernel: scsi_io_completion_action: 67 callbacks suppressed
Sep 09 18:46:24 ZFS kernel: sd 1:0:453:0: [sddg] tag#338 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Sep 09 18:46:24 ZFS kernel: sd 1:0:453:0: [sddg] tag#338 CDB: Write(10) 2a 00 00 00 67 05 00 00 01 00
Sep 09 18:46:24 ZFS kernel: print_req_error: 71 callbacks suppressed
Sep 09 18:46:24 ZFS kernel: blk_update_request: I/O error, dev sddg, sector 210984 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
Sep 09 18:46:24 ZFS kernel: sd 1:0:453:0: [sddg] tag#671 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Sep 09 18:46:24 ZFS kernel: sd 1:0:453:0: [sddg] tag#671 CDB: Write(10) 2a 00 00 13 12 9b 00 00 03 00
Sep 09 18:46:24 ZFS kernel: blk_update_request: I/O error, dev sddg, sector 9999576 op 0x1:(WRITE) flags 0x700 phys_seg 2 prio class 0
Sep 09 18:46:24 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970a9044-part1 error=5 type=2 offset=5111394304 size=12288 flags=180880
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970a9044-part1 error=5 type=2 offset=270336 size=8192 flags=b08c1
Sep 09 18:46:27 ZFS kernel: sd 1:0:453:0: [sddg] tag#178 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970a9044-part1 error=5 type=5 offset=0 size=0 flags=100480
Sep 09 18:46:27 ZFS kernel: sd 1:0:453:0: [sddg] tag#178 CDB: Read(10) 28 00 cb bb ff f0 00 00 01 00
Sep 09 18:46:27 ZFS kernel: blk_update_request: I/O error, dev sddh, sector 27344764800 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
Sep 09 18:46:27 ZFS kernel: Buffer I/O error on dev sddh, logical block 3418095600, async page read
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970a9044-part1 error=5 type=1 offset=14000435503104 size=8192 flags=b08c1
Sep 09 18:46:27 ZFS kernel: Buffer I/O error on dev sddg, logical block 3418095600, async page read
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970c1fe4-part1 error=5 type=5 offset=0 size=0 flags=100480
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970c1fe4-part1 error=5 type=1 offset=14000435503104 size=8192 flags=b08c1
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970c1fe4-part1 error=5 type=1 offset=14000435240960 size=8192 flags=b08c1
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970c1fe4-part1 error=5 type=1 offset=270336 size=8192 flags=b08c1
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970c1ce4-part1 error=5 type=5 offset=0 size=0 flags=100480
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970a9044-part1 error=5 type=5 offset=0 size=0 flags=100480
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970ac920-part1 error=5 type=5 offset=0 size=0 flags=100480
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970c55c8-part1 error=5 type=5 offset=0 size=0 flags=100480
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970c1ce4-part1 error=5 type=2 offset=99635200 size=4096 flags=180880
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970c1ce4-part1 error=5 type=2 offset=1337597952 size=4096 flags=180880
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970c1ce4-part1 error=5 type=1 offset=270336 size=8192 flags=b08c1
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970c1ce4-part1 error=5 type=2 offset=2480975872 size=4096 flags=180880
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970c1ce4-part1 error=5 type=1 offset=14000435240960 size=8192 flags=b08c1
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970c55c8-part1 error=5 type=2 offset=31156936704 size=4096 flags=180880
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970c55c8-part1 error=5 type=2 offset=31909847040 size=24576 flags=40080c80
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970c55c8-part1 error=5 type=2 offset=6168944640 size=4096 flags=180880
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970c55c8-part1 error=5 type=2 offset=5111427072 size=4096 flags=180880
Sep 09 18:46:27 ZFS kernel: mpt3sas_cm0: removing handle(0x0019), sas_addr(0x5000cca2970c1ce5)
Sep 09 18:46:27 ZFS kernel: mpt3sas_cm0: enclosure logical id(0x5000ccab040d5080), slot(42)
Sep 09 18:46:27 ZFS kernel: mpt3sas_cm0: enclosure level(0x0000), connector name( 1   )
Sep 09 18:46:27 ZFS kernel: sd 1:0:453:0: [sddg] Synchronizing SCSI cache
Sep 09 18:46:27 ZFS kernel: sd 1:0:453:0: [sddg] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Sep 09 18:46:27 ZFS kernel: WARNING: Pool 'pool-name' has encountered an uncorrectable I/O failure and has been suspended.

The pool was then resumed with zpool clear once the HDDs powered back up:

# zpool status pool-name
  pool: pool-name
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Sep  3 02:09:21 2022
        823M scanned at 823M/s, 825M issued at 825M/s, 74.3G total
        24.8M resilvered, 1.08% done, 0 days 00:01:31 to go
config:

        NAME                        STATE     READ WRITE CKSUM
        pool-name                   ONLINE       0     0     0
          raidz3-0                  ONLINE       0     0     0
            wwn-0x5000cca2970a9044  ONLINE       0     0   442  (resilvering)
            wwn-0x5000cca2970ac920  ONLINE       0     0     0  (resilvering)
            wwn-0x5000cca2970c1ce4  ONLINE       0     0     0  (resilvering)
            wwn-0x5000cca2970c1fe4  ONLINE       0     0   234  (resilvering)
            wwn-0x5000cca2970c55c8  ONLINE       0     0    84  (resilvering)
            wwn-0x5000cca2970c55fc  ONLINE       0     0     0
            wwn-0x5000cca2970c5600  ONLINE       0     0     0
            wwn-0x5000cca2970c5980  ONLINE       0     0     0
            wwn-0x5000cca2970c5d3c  ONLINE       0     0     0
            wwn-0x5000cca2970c750c  ONLINE       0     0     0
            wwn-0x5000cca2970c9468  ONLINE       0     0     0
            wwn-0x5000cca2970c98a4  ONLINE       0     0     0
            wwn-0x5000cca2970cba40  ONLINE       0     0     0
            wwn-0x5000cca2970cbd90  ONLINE       0     0     0

errors: 788 data errors, use '-v' for a list
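The individual files behind those 788 data errors can be listed with zpool status -v; per the report above, reading an affected file is still expected to fail (the path below is a placeholder):

```sh
# Enumerate the files with permanent (unrecoverable) errors.
zpool status -v pool-name

# Attempting to read one of the listed files still returns an I/O
# error after 'zpool clear' (placeholder path).
dd if=/pool-name/some/affected-file of=/dev/null bs=1M
```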
@alek-p (Contributor) commented Sep 13, 2022

I've run into this on FreeBSD as well.

stale bot commented Sep 16, 2023

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
