'zfs list' hangs after a pool is suspended #5345

Closed
rgmiller opened this issue Oct 27, 2016 · 7 comments

Comments

@rgmiller

I encountered an issue last night where 'zfs list' hangs indefinitely. I have traces from 'dmesg' that I've attached. The issue seems to be related to the fact that a pool was suspended for uncorrectable I/O errors while 'zfs list' was running.

Note that I can still read from and write to my (other) pools normally. However, I can no longer use either the 'zfs' or 'zpool' utilities. My suspicion is that the hung 'zfs' process is holding a mutex, but that's just a guess.

This is on 0.6.5.8, running on CentOS 7.

Some observations about the attached dmesg traces:

  • I think the traces from 10:06 are unrelated. Everything was working just fine until I ran 'zfs list' at around 22:22.
  • If I've done my elapsed-time math correctly, I started 'zfs list' at about 22:22, but the important dmesg traces don't start until 22:41. I don't know whether I did the math wrong or the times from dmesg are off. (The dmesg man page warns that this is a possibility; see the timestamp sketch after this list.)
  • There's a warning about a pool being suspended for uncorrectable I/O errors. I suspect that's the root of the problem: having a pool die while 'zfs list' is running is a corner case that probably hasn't seen much testing. Presumably something didn't get cleaned up properly, and that's what hung the 'zfs list' process.
  • Further down in the trace (line 191) you can see a 'zpool' process that appears to be waiting on a mutex. I'm assuming that the mutex it's waiting for is held by the 'zfs list' process.
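
In case it helps with the timestamp question above, here is roughly how a raw dmesg stamp can be cross-checked against wall-clock time (just a sketch; the 81234 value below is a placeholder for an actual [seconds-since-boot] stamp, and dmesg -T depends on a reasonably recent util-linux):

```sh
# Human-readable timestamps, if supported
dmesg -T | grep -i suspended

# Or convert a raw [81234.xxxxxx] stamp by hand:
# wall time = now - uptime + stamp
date -d "@$(( $(date +%s) - $(cut -d. -f1 /proc/uptime) + 81234 ))"
```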

The system is still up and running, as is the hung 'zfs list' process, so if anyone wants more data, I can try to get it. (Also, we don't need to worry about the suspended pool: it was just for testing and was using an eSATA enclosure that I knew was flaky. I'm not particularly surprised it failed.)
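
For anyone who wants that extra data, this is roughly what I'd plan to grab (a sketch; it assumes root access, that /proc/<pid>/stack and SysRq are available on this kernel, and the pgrep pattern is just my guess at picking out the right PID):

```sh
# Kernel-side stack of the hung 'zfs list' process
cat /proc/$(pgrep -xo zfs)/stack

# Dump all blocked (D-state) tasks into dmesg via SysRq
# (may first need: echo 1 > /proc/sys/kernel/sysrq)
echo w > /proc/sysrq-trigger
dmesg | tail -n 200
```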

dmesg.txt

@GregorKopka
Contributor

Related issues from the zpool side: #3461, #2878

In theory you should be able to zpool clear (or reconnect the drive and zpool online it, in the case of an OFFLINE device) to unsuspend the pool.

In practice, ZFS might have deadlocked on itself; it would be interesting to know which it is in your case.
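
Roughly, the recovery attempt would look like this (a sketch only; 'tank' and 'sdX' are placeholders for your suspended pool and the affected device):

```sh
# Try to resume I/O on the suspended pool
zpool clear tank

# If a device shows as OFFLINE after reattaching the enclosure,
# bring it back online explicitly
zpool online tank sdX

# Then check whether the pool resumed
zpool status tank
```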

@rgmiller
Author

I think zfs is deadlocked. Take a look at line 193 of the dmesg.txt file I attached. That's the call stack for a 'zpool status' process and it's stuck waiting on a mutex.

I can try a 'zpool clear' remotely, but fiddling with the eSATA enclosure will have to wait until I get home this evening.

@rgmiller
Author

Yep: 'zpool clear' hangs, too.

@GregorKopka
Contributor

As there is currently no way to remove a suspended pool, you will have to reboot. All zpool and zfs invocations (even ones not related to the suspended pool) will be blocked by the defunct pool.
#2878 has some more details on this; sadly, no one is known to be actively working on it.

@rgmiller
Author

OK, I'll go ahead and reboot. What should we do about this issue? Call it a duplicate of #2878 and close it?

@kingneutron

kingneutron commented Oct 31, 2016

--Your issue interests me because I experienced pretty much the same thing today. Host: Dell Studio 1550 laptop with 8GB RAM, Ubuntu 16.04-64-LTS, running latest kernel 4.4.0-45-generic and ZFS 0.6.5.6-0ubuntu14.

--I just got a new Marvell 9128 chipset-based eSATA card for my older laptop (StarTech.com 2 Port SATA 6 Gbps ExpressCard eSATA Controller Card - ECESAT32) and am testing a Probox external eSATA/USB3 case (Mediasonic ProBox HF2-SU3S2 4 Bay 3.5" SATA HDD Enclosure - USB 3.0 & eSATA Support) with 4x1TB WD RED NAS drives for use as a local ZFS DAS.

--Long story short, I started a zpool scrub on the 4-drive zRAID10 and after ~35GB (IIRC), all 4 drives dropped off with I/O errors. I rebooted and made sure the cabling was not being jostled, and now the scrub is still going with good I/O (~25-40MB/sec per drive, with the occasional drop-off to ~18MB/sec). The pool overall is averaging ~60MB/sec.

--I should note that this same external case + drive pool passed a scrub with no issues just a couple of days ago on another system running at a slower SATA speed (~1.5Gb/s) with Ubuntu 14.04 and the latest kernel.
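
--For reference, the status below was captured mid-scrub; these are roughly the commands involved (a sketch using this pool's name):

```sh
# Start the scrub and watch its progress
zpool scrub zredtera1
zpool status -v zredtera1

# Per-drive throughput while it runs, refreshed every 5 seconds
zpool iostat -v zredtera1 5
```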

```
  pool: zredtera1
 state: ONLINE
  scan: scrub in progress since Mon Oct 31 15:51:29 2016
        300G scanned out of 654G at 63.2M/s, 1h35m to go
        0 repaired, 45.89% done
config:

        NAME                                          STATE     READ WRITE CKSUM
        zredtera1                                     ONLINE       0     0     0
          mirror-0                                    ONLINE       0     0     0
            ata-WDC_WD10EFRX-68FYTN0_WD-WCC4J1NL656R  ONLINE       0     0     0
            ata-WDC_WD10EFRX-68FYTN0_WD-WCC4J6KTJC0J  ONLINE       0     0     0
          mirror-1                                    ONLINE       0     0     0
            ata-WDC_WD10EFRX-68FYTN0_WD-WCC4J4KD08T6  ONLINE       0     0     0
            ata-WDC_WD10EFRX-68FYTN0_WD-WCC4J3CK81ZP  ONLINE       0     0     0

errors: No known data errors
```

--When the drives dropped off the system, the kernel messages were as captured in the attached file.

--Probably the zfs code needs a better way to recover from hardware failure like this, but understandably ZFS wasn't originally written to run on a laptop with a jackleg eSATA array. I hope somebody does fix this issue.
hard-drive-fail-kernel-messages.txt

@behlendorf
Contributor

@rgmiller let's call it a duplicate of #2878 and close it. I think this would be a great issue for someone interested in ZFS development to tackle.
