
Single HDD offline causes zpool suspend on ZoL 0.7.9 #8981

Closed
homerl opened this issue Jul 2, 2019 · 5 comments

Comments


homerl commented Jul 2, 2019

System information

Type                  Version/Name
Distribution Name     CentOS
Distribution Version  7.6
Linux Kernel          3.10.0-957.el7_lustre.x86_64
Architecture          x86_64
ZFS Version           0.7.9
SPL Version           0.7.9

Describe the problem you're observing

A single HDD reported SCSI errors and then disappeared from the HBA command-line tool (sas3ircu).
After that, MMP writes never succeeded again.

Describe how to reproduce the problem

Only device 1:0:89:0 ([sdck]) was affected, no others.
The HDD had failed and did not come back. After the HDD went offline, the zpool was suspended.
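
For reference, once ZFS suspends a pool after an uncorrectable I/O failure, the usual way to try to bring it back (after the failed device has been restored or replaced) is zpool clear; a minimal sketch, using the pool name ost_29 from the logs below:

# check whether the pool is reported as suspended
zpool status ost_29
# attempt to resume I/O once the device problem has been dealt with
zpool clear ost_29

If the device never returns, as in this report, zpool clear alone may not be enough and a reboot or manual fault handling may still be required.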

Include any warning/errors/backtraces from the system logs

ZFS module versions (from dmesg):
[Mon Jul  1 07:58:43 2019] SPL: Loaded module v0.7.9-1
[Mon Jul  1 07:58:46 2019] ZFS: Loaded module v0.7.9-1, ZFS pool version 5000, ZFS filesystem version 5
Jul  1 01:59:02 oss-server-21 kernel: sd 1:0:89:0: attempting task abort! scmd(ffffa146bc62b800)
Jul  1 01:59:02 oss-server-21 kernel: sd 1:0:89:0: [sdck] tag#5 CDB: Read(16) 88 00 00 00 00 02 43 70 20 18 00 00 00 01 00 00
Jul  1 01:59:02 oss-server-21 kernel: scsi target1:0:89: _scsih_tm_display_info: handle(0x0065), sas_address(0x5000cca25198617d), phy(37)
Jul  1 01:59:02 oss-server-21 kernel: scsi target1:0:89: enclosurelogical id(0x500304800928aebf), slot(36)
Jul  1 01:59:02 oss-server-21 kernel: scsi target1:0:89: enclosure level(0x0001), connector name(     )
Jul  1 01:59:06 oss-server-21 kernel: sd 1:0:89:0: task abort: SUCCESS scmd(ffffa146bc62b800)
Jul  1 01:59:06 oss-server-21 kernel: sd 1:0:89:0: [sdck] tag#5 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
Jul  1 01:59:06 oss-server-21 kernel: sd 1:0:89:0: [sdck] tag#5 CDB: Read(16) 88 00 00 00 00 02 43 70 20 18 00 00 00 01 00 00
Jul  1 01:59:06 oss-server-21 kernel: blk_update_request: I/O error, dev sdck, sector 9721356312
Jul  1 01:59:06 oss-server-21 kernel: sd 1:0:89:0: attempting task abort! scmd(ffffa13c2fe83800)
Jul  1 01:59:06 oss-server-21 kernel: sd 1:0:89:0: [sdck] tag#0 CDB: Write(16) 8a 00 00 00 00 04 8c 3f fd fe 00 00 00 02 00 00
Jul  1 01:59:06 oss-server-21 kernel: scsi target1:0:89: _scsih_tm_display_info: handle(0x0065), sas_address(0x5000cca25198617d), phy(37)
Jul  1 01:59:06 oss-server-21 kernel: scsi target1:0:89: enclosurelogical id(0x500304800928aebf), slot(36)
Jul  1 01:59:06 oss-server-21 kernel: scsi target1:0:89: enclosure level(0x0001), connector name(     )
......
Jul  1 02:00:38 oss-server-21 kernel: scsi target1:0:89: enclosure level(0x0001), connector name(     )
Jul  1 02:00:42 oss-server-21 kernel: sd 1:0:89:0: task abort: SUCCESS scmd(ffffa156b4f5ce00)
Jul  1 02:00:42 oss-server-21 kernel: sd 1:0:89:0: attempting task abort! scmd(ffffa146262ef480)
Jul  1 02:00:42 oss-server-21 kernel: sd 1:0:89:0: [sdck] tag#7 CDB: Write(16) 8a 00 00 00 00 04 3d 97 91 71 00 00 00 01 00 00
Jul  1 02:00:42 oss-server-21 kernel: scsi target1:0:89: _scsih_tm_display_info: handle(0x0065), sas_address(0x5000cca25198617d), phy(37)
Jul  1 02:00:42 oss-server-21 kernel: scsi target1:0:89: enclosurelogical id(0x500304800928aebf), slot(36)
Jul  1 02:00:42 oss-server-21 kernel: scsi target1:0:89: enclosure level(0x0001), connector name(     )
Jul  1 02:00:46 oss-server-21 kernel: sd 1:0:89:0: task abort: SUCCESS scmd(ffffa146262ef480)
Jul  1 02:00:46 oss-server-21 kernel: sd 1:0:89:0: [sdck] tag#7 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
Jul  1 02:00:46 oss-server-21 kernel: sd 1:0:89:0: [sdck] tag#7 CDB: Write(16) 8a 00 00 00 00 04 3d 97 91 71 00 00 00 01 00 00
Jul  1 02:00:46 oss-server-21 kernel: blk_update_request: I/O error, dev sdck, sector 18213212529
Jul  1 02:00:46 oss-server-21 kernel: sd 1:0:89:0: attempting task abort! scmd(ffffa146262ec380)
Jul  1 02:00:46 oss-server-21 kernel: sd 1:0:89:0: tag#3 CDB: Test Unit Ready 00 00 00 00 00 00
Jul  1 02:00:46 oss-server-21 kernel: scsi target1:0:89: _scsih_tm_display_info: handle(0x0065), sas_address(0x5000cca25198617d), phy(37)
Jul  1 02:00:46 oss-server-21 kernel: scsi target1:0:89: enclosurelogical id(0x500304800928aebf), slot(36)
Jul  1 02:00:46 oss-server-21 kernel: scsi target1:0:89: enclosure level(0x0001), connector name(     )
Jul  1 02:00:49 oss-server-21 kernel: WARNING: MMP writes to pool 'ost_29' have not succeeded in over 100s; suspending pool
Jul  1 02:00:49 oss-server-21 kernel: WARNING: Pool 'ost_29' has encountered an uncorrectable I/O failure and has been suspended.

spmfox commented Jul 3, 2019

Hi there, I'm not a developer or an expert on this, so please, anyone else, chime in. However, I think this is a known issue, being discussed in #5242.

The only reason I know this is that it happened to me recently. I have a regular pool of multiple disks and a backup pool with a single drive. For whatever reason the USB drive disconnected (even the system couldn't talk to it any more). Because it was a single-disk pool I could not export it, and it was suspended like yours. I did some research, found that issue, and eventually just rebooted. The drive came back normally after that, and it's been business as usual since.

I hope that helps.


h1z1 commented Jul 10, 2019

By default, ZFS sets failmode to wait (check with zpool get failmode tank). You can set it to continue or panic (panic is more appropriate for a cluster).
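
A minimal sketch of checking and changing that property, assuming a pool named tank as in the comment above:

# show the current failure-mode policy (wait is the default)
zpool get failmode tank
# block only the failing I/O instead of suspending the whole pool
zpool set failmode=continue tank
# or panic the node so an HA peer can take over
zpool set failmode=panic tank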


homerl commented Aug 2, 2019

Hi h1z1, continue mode is too dangerous.


devZer0 commented Sep 16, 2019

Please retry with the latest ZFS version and also post the output of "zpool status".
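
For reference, that information can be gathered with something like the following (pool name ost_29 taken from the logs above; the module version path assumes ZFS on Linux):

cat /sys/module/zfs/version    # loaded ZFS module version
zpool status -v ost_29         # pool health and per-vdev error counts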


adilger commented Oct 15, 2019

I think that this issue is a duplicate of #7709 and #8495. The patch db2af93 should address this problem under normal usage. It doesn't resolve the issue of re-activating the pool after MMP has suspended it, but it should avoid the MMP suspension in the first place.
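
For context, the 100s threshold in the warning above is derived from the multihost (MMP) module parameters, roughly zfs_multihost_fail_intervals × zfs_multihost_interval; a minimal sketch for inspecting them, assuming the standard ZoL parameter names:

# interval (ms) between MMP writes
cat /sys/module/zfs/parameters/zfs_multihost_interval
# number of missed intervals tolerated before the pool is suspended
cat /sys/module/zfs/parameters/zfs_multihost_fail_intervals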

homerl closed this as completed May 20, 2020