'zfs list' hangs after a pool is suspended #5345

Closed
rgmiller opened this issue Oct 27, 2016 · 7 comments

Comments

@rgmiller

I encountered an issue last night where 'zfs list' hangs indefinitely. I have traces from 'dmesg' that I've attached. The issue seems to be related to the fact that a pool was suspended for uncorrectable I/O errors while 'zfs list' was running.

Note that I can still read from and write to my (other) pools normally. However, I can no longer use either the 'zfs' or 'zpool' utilities. My suspicion is that the hung 'zfs' process is holding a mutex, but that's just a guess.

This is on 0.6.5.8, running on CentOS 7.

Some observations about the attached dmesg traces:

  • I think the traces from 10:06 are unrelated. Everything was working just fine until I ran 'zfs list' at around 22:22.
  • If I've done my elapsed-time math correctly, I started 'zfs list' at about 22:22, but the important dmesg traces don't start until 22:41. I don't know whether I did the math wrong or the times from dmesg are off. (The dmesg man page warns that this is a possibility; see the timestamp sketch after this list.)
  • There's a warning about a pool being suspended for uncorrectable I/O errors. I suspect that's the root of the problem: having a pool die while 'zfs list' is running is a corner case that probably hasn't seen much testing. Presumably something didn't get cleaned up properly, and that's what hung the 'zfs list' process.
  • Further down in the trace (line 191) you can see a 'zpool' process that appears to be waiting on a mutex. I'm assuming that the mutex it's waiting for is held by the 'zfs list' process.
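
In case it helps with the timestamp question above, here is roughly how a raw dmesg stamp can be cross-checked against wall-clock time (just a sketch; the 81234 value below is a placeholder for an actual [seconds-since-boot] stamp, and dmesg -T depends on a reasonably recent util-linux):

```sh
# Human-readable timestamps, if supported
dmesg -T | grep -i suspended

# Or convert a raw [81234.xxxxxx] stamp by hand:
# wall time = now - uptime + stamp
date -d "@$(( $(date +%s) - $(cut -d. -f1 /proc/uptime) + 81234 ))"
```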

The system is still up and running, as is the hung 'zfs list' process, so if anyone wants more data, I can try to get it. (Also, we don't need to worry about the suspended pool: it was just for testing and was using an eSATA enclosure that I knew was flaky. I'm not particularly surprised it failed.)
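
For anyone who wants that extra data, this is roughly what I'd plan to grab (a sketch; it assumes root access, that /proc/<pid>/stack and SysRq are available on this kernel, and the pgrep pattern is just my guess at picking out the right PID):

```sh
# Kernel-side stack of the hung 'zfs list' process
cat /proc/$(pgrep -xo zfs)/stack

# Dump all blocked (D-state) tasks into dmesg via SysRq
# (may first need: echo 1 > /proc/sys/kernel/sysrq)
echo w > /proc/sysrq-trigger
dmesg | tail -n 200
```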

dmesg.txt

@GregorKopka
Contributor

Related issues from the zpool side: #3461, #2878

In theory you should be able to zpool clear (or reconnect the drive and zpool online it, in the case of an OFFLINE device) to unsuspend the pool.

In practice, ZFS might have deadlocked on itself; it would be interesting to know which it is in your case.
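
Roughly, the recovery attempt would look like this (a sketch only; 'tank' and 'sdX' are placeholders for your suspended pool and the affected device):

```sh
# Try to resume I/O on the suspended pool
zpool clear tank

# If a device shows as OFFLINE after reattaching the enclosure,
# bring it back online explicitly
zpool online tank sdX

# Then check whether the pool resumed
zpool status tank
```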

@rgmiller
Author

I think zfs is deadlocked. Take a look at line 193 of the dmesg.txt file I attached. That's the call stack for a 'zpool status' process and it's stuck waiting on a mutex.

I can try a 'zpool clear' remotely, but fiddling with the eSATA enclosure will have to wait until I get home this evening.

@rgmiller
Author

Yep: 'zpool clear' hangs, too.

@GregorKopka
Contributor

As there is currently no way to remove a suspended pool, you will have to reboot. All zpool and zfs invocations (even ones not related to the suspended pool) will be blocked by the defunct pool.
#2878 has some more details on this; sadly, no one is known to be actively working on it.

@rgmiller
Author

OK, I'll go ahead and reboot. What should we do about this issue? Call it a duplicate of #2878 and close it?

@kingneutron

kingneutron commented Oct 31, 2016

--Your issue interests me because I experienced pretty much the same thing today. Host: Dell Studio 1550 laptop with 8GB RAM, Ubuntu 16.04-64-LTS, running latest kernel 4.4.0-45-generic and ZFS 0.6.5.6-0ubuntu14.

--I just got a new Marvell 9128 chipset-based eSATA card for my older laptop (StarTech.com 2 Port SATA 6 Gbps ExpressCard eSATA Controller Card - ECESAT32) and am testing a Probox external eSATA/USB3 case (Mediasonic ProBox HF2-SU3S2 4 Bay 3.5" SATA HDD Enclosure - USB 3.0 & eSATA Support) with 4x1TB WD RED NAS drives for use as a local ZFS DAS.

--Long story short, I started a zpool scrub on the 4-drive zRAID10 and after ~35GB (IIRC), all 4 drives dropped off with I/O errors. I rebooted and made sure the cabling was not being jostled, and now the scrub is still going with good I/O (~25-40MB/sec per drive, with the occasional drop-off to ~18MB/sec). The pool overall is averaging ~60MB/sec.

--I should note that this same external case + drive pool passed a scrub with no issues just a couple of days ago on another system running at a slower SATA speed (~1.5Gb/s) with Ubuntu 14.04 and the latest kernel.
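
--For reference, the status below was captured mid-scrub; these are roughly the commands involved (a sketch using this pool's name):

```sh
# Start the scrub and watch its progress
zpool scrub zredtera1
zpool status -v zredtera1

# Per-drive throughput while it runs, refreshed every 5 seconds
zpool iostat -v zredtera1 5
```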

```
  pool: zredtera1
 state: ONLINE
  scan: scrub in progress since Mon Oct 31 15:51:29 2016
        300G scanned out of 654G at 63.2M/s, 1h35m to go
        0 repaired, 45.89% done
config:

        NAME                                          STATE     READ WRITE CKSUM
        zredtera1                                     ONLINE       0     0     0
          mirror-0                                    ONLINE       0     0     0
            ata-WDC_WD10EFRX-68FYTN0_WD-WCC4J1NL656R  ONLINE       0     0     0
            ata-WDC_WD10EFRX-68FYTN0_WD-WCC4J6KTJC0J  ONLINE       0     0     0
          mirror-1                                    ONLINE       0     0     0
            ata-WDC_WD10EFRX-68FYTN0_WD-WCC4J4KD08T6  ONLINE       0     0     0
            ata-WDC_WD10EFRX-68FYTN0_WD-WCC4J3CK81ZP  ONLINE       0     0     0

errors: No known data errors
```

--When the drives dropped off the system, the kernel messages were as captured in the attached file.

--Probably the zfs code needs a better way to recover from hardware failure like this, but understandably ZFS wasn't originally written to run on a laptop with a jackleg eSATA array. I hope somebody does fix this issue.
hard-drive-fail-kernel-messages.txt

@behlendorf
Contributor

@rgmiller let's call it a duplicate of #2878 and close it. I think this would be a great issue for someone interested in ZFS development to tackle.
