
Single HDD offline causes zpool suspend on ZoL 0.7.9 #8981

Closed
homerl opened this issue Jul 2, 2019 · 5 comments

Comments


homerl commented Jul 2, 2019

System information

Type                  Version/Name
Distribution Name     CentOS
Distribution Version  7.6
Linux Kernel          3.10.0-957.el7_lustre.x86_64
Architecture          x86_64
ZFS Version           0.7.9
SPL Version           0.7.9

Describe the problem you're observing

A single HDD reported SCSI errors and then disappeared from the HBA command-line tool (sas3ircu).
After that, MMP writes never succeeded again.

Describe how to reproduce the problem

Only device 1:0:89:0 ([sdck]) was affected, no others.
The HDD had failed and did not come back. After the HDD went offline, the zpool was suspended.
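
For reference, once ZFS suspends a pool after an uncorrectable I/O failure, the usual way to try to bring it back (after the failed device has been restored or replaced) is zpool clear; a minimal sketch, using the pool name ost_29 from the logs below:

# check whether the pool is reported as suspended
zpool status ost_29
# attempt to resume I/O once the device problem has been dealt with
zpool clear ost_29

If the device never returns, as in this report, zpool clear alone may not be enough and a reboot or manual fault handling may still be required.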

Include any warning/errors/backtraces from the system logs

ZFS module versions (from dmesg):
[Mon Jul  1 07:58:43 2019] SPL: Loaded module v0.7.9-1
[Mon Jul  1 07:58:46 2019] ZFS: Loaded module v0.7.9-1, ZFS pool version 5000, ZFS filesystem version 5
Jul  1 01:59:02 oss-server-21 kernel: sd 1:0:89:0: attempting task abort! scmd(ffffa146bc62b800)
Jul  1 01:59:02 oss-server-21 kernel: sd 1:0:89:0: [sdck] tag#5 CDB: Read(16) 88 00 00 00 00 02 43 70 20 18 00 00 00 01 00 00
Jul  1 01:59:02 oss-server-21 kernel: scsi target1:0:89: _scsih_tm_display_info: handle(0x0065), sas_address(0x5000cca25198617d), phy(37)
Jul  1 01:59:02 oss-server-21 kernel: scsi target1:0:89: enclosurelogical id(0x500304800928aebf), slot(36)
Jul  1 01:59:02 oss-server-21 kernel: scsi target1:0:89: enclosure level(0x0001), connector name(     )
Jul  1 01:59:06 oss-server-21 kernel: sd 1:0:89:0: task abort: SUCCESS scmd(ffffa146bc62b800)
Jul  1 01:59:06 oss-server-21 kernel: sd 1:0:89:0: [sdck] tag#5 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
Jul  1 01:59:06 oss-server-21 kernel: sd 1:0:89:0: [sdck] tag#5 CDB: Read(16) 88 00 00 00 00 02 43 70 20 18 00 00 00 01 00 00
Jul  1 01:59:06 oss-server-21 kernel: blk_update_request: I/O error, dev sdck, sector 9721356312
Jul  1 01:59:06 oss-server-21 kernel: sd 1:0:89:0: attempting task abort! scmd(ffffa13c2fe83800)
Jul  1 01:59:06 oss-server-21 kernel: sd 1:0:89:0: [sdck] tag#0 CDB: Write(16) 8a 00 00 00 00 04 8c 3f fd fe 00 00 00 02 00 00
Jul  1 01:59:06 oss-server-21 kernel: scsi target1:0:89: _scsih_tm_display_info: handle(0x0065), sas_address(0x5000cca25198617d), phy(37)
Jul  1 01:59:06 oss-server-21 kernel: scsi target1:0:89: enclosurelogical id(0x500304800928aebf), slot(36)
Jul  1 01:59:06 oss-server-21 kernel: scsi target1:0:89: enclosure level(0x0001), connector name(     )
......
Jul  1 02:00:38 oss-server-21 kernel: scsi target1:0:89: enclosure level(0x0001), connector name(     )
Jul  1 02:00:42 oss-server-21 kernel: sd 1:0:89:0: task abort: SUCCESS scmd(ffffa156b4f5ce00)
Jul  1 02:00:42 oss-server-21 kernel: sd 1:0:89:0: attempting task abort! scmd(ffffa146262ef480)
Jul  1 02:00:42 oss-server-21 kernel: sd 1:0:89:0: [sdck] tag#7 CDB: Write(16) 8a 00 00 00 00 04 3d 97 91 71 00 00 00 01 00 00
Jul  1 02:00:42 oss-server-21 kernel: scsi target1:0:89: _scsih_tm_display_info: handle(0x0065), sas_address(0x5000cca25198617d), phy(37)
Jul  1 02:00:42 oss-server-21 kernel: scsi target1:0:89: enclosurelogical id(0x500304800928aebf), slot(36)
Jul  1 02:00:42 oss-server-21 kernel: scsi target1:0:89: enclosure level(0x0001), connector name(     )
Jul  1 02:00:46 oss-server-21 kernel: sd 1:0:89:0: task abort: SUCCESS scmd(ffffa146262ef480)
Jul  1 02:00:46 oss-server-21 kernel: sd 1:0:89:0: [sdck] tag#7 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
Jul  1 02:00:46 oss-server-21 kernel: sd 1:0:89:0: [sdck] tag#7 CDB: Write(16) 8a 00 00 00 00 04 3d 97 91 71 00 00 00 01 00 00
Jul  1 02:00:46 oss-server-21 kernel: blk_update_request: I/O error, dev sdck, sector 18213212529
Jul  1 02:00:46 oss-server-21 kernel: sd 1:0:89:0: attempting task abort! scmd(ffffa146262ec380)
Jul  1 02:00:46 oss-server-21 kernel: sd 1:0:89:0: tag#3 CDB: Test Unit Ready 00 00 00 00 00 00
Jul  1 02:00:46 oss-server-21 kernel: scsi target1:0:89: _scsih_tm_display_info: handle(0x0065), sas_address(0x5000cca25198617d), phy(37)
Jul  1 02:00:46 oss-server-21 kernel: scsi target1:0:89: enclosurelogical id(0x500304800928aebf), slot(36)
Jul  1 02:00:46 oss-server-21 kernel: scsi target1:0:89: enclosure level(0x0001), connector name(     )
Jul  1 02:00:49 oss-server-21 kernel: WARNING: MMP writes to pool 'ost_29' have not succeeded in over 100s; suspending pool
Jul  1 02:00:49 oss-server-21 kernel: WARNING: Pool 'ost_29' has encountered an uncorrectable I/O failure and has been suspended.

spmfox commented Jul 3, 2019

Hi there, I'm not a developer or an expert on this, so please, anyone else, chime in. However, I think this is a known issue, being discussed in #5242.

The only reason I know this is that it happened to me recently. I have a regular pool of multiple disks and a backup pool with a single drive. For whatever reason the USB drive disconnected (even the system couldn't talk to it any more). Because it was a single-disk pool I could not export it, and it was suspended like yours. I did some research, found that issue, and eventually just rebooted. The drive came back normally after that, and it's been business as usual since.

I hope that helps.


h1z1 commented Jul 10, 2019

By default, ZFS sets failmode to wait (check with zpool get failmode tank). You can set it to continue or panic (panic is more appropriate for a cluster).
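
A minimal sketch of checking and changing that property, assuming a pool named tank as in the comment above:

# show the current failure-mode policy (wait is the default)
zpool get failmode tank
# block only the failing I/O instead of suspending the whole pool
zpool set failmode=continue tank
# or panic the node so an HA peer can take over
zpool set failmode=panic tank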


homerl commented Aug 2, 2019

Hi h1z1, continue mode is too dangerous.


devZer0 commented Sep 16, 2019

Please retry with the latest ZFS version and also post the output of "zpool status".
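
For reference, that information can be gathered with something like the following (pool name ost_29 taken from the logs above; the module version path assumes ZFS on Linux):

cat /sys/module/zfs/version    # loaded ZFS module version
zpool status -v ost_29         # pool health and per-vdev error counts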


adilger commented Oct 15, 2019

I think that this issue is a duplicate of #7709 and #8495. The patch db2af93 should address this problem under normal usage. It doesn't resolve the issue of re-activating the pool after MMP has suspended it, but it should avoid the MMP suspension in the first place.
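
For context, the 100s threshold in the warning above is derived from the multihost (MMP) module parameters, roughly zfs_multihost_fail_intervals × zfs_multihost_interval; a minimal sketch for inspecting them, assuming the standard ZoL parameter names:

# interval (ms) between MMP writes
cat /sys/module/zfs/parameters/zfs_multihost_interval
# number of missed intervals tolerated before the pool is suspended
cat /sys/module/zfs/parameters/zfs_multihost_fail_intervals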

homerl closed this as completed May 20, 2020