
Add slow disk diagnosis to ZED #15469

Merged: 1 commit into openzfs:master on Feb 8, 2024

Conversation

@don-brady (Contributor)

Motivation and Context

Slow disk response times can be indicative of a failing drive. ZFS currently tracks slow I/Os (slower than zio_slow_io_ms) and generates events (ereport.fs.zfs.delay), but no action is taken, as is done for checksum or I/O errors. It would be nice to have the option to degrade a slow disk that is slowing down throughput and allow it to be replaced with a spare.

Description

Add new VDEV properties, VDEV_PROP_SLOW_IO_N and VDEV_PROP_SLOW_IO_T, that ZED can consume as the SERD engine parameters. They default to 10 slow I/Os in 30 seconds.
Add a new SERD engine type, zc_serd_slow_io, that tracks these events and initiates a zpool_vdev_degrade() when the limit is reached.

If multiple VDEVs in a pool are undergoing slow I/Os, then the zpool_vdev_degrade() is skipped.

Also did some minor code cleanup in the zfs diagnosis module.
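For reference, a minimal usage sketch (pool and vdev names are placeholders; the user-facing property names slow_io_n and slow_io_t are assumed to correspond to VDEV_PROP_SLOW_IO_N/T):

    # Assumed example: tune the SERD thresholds on a single leaf vdev.
    # "tank" and "sdb" are hypothetical names, not from this PR.
    zpool set slow_io_n=10 tank sdb     # trip after 10 slow I/Os...
    zpool set slow_io_t=30 tank sdb     # ...within a 30-second window
    zpool get slow_io_n,slow_io_t tank sdb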

Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.

How Has This Been Tested?

  1. Ran ZTS functional/events
  2. Added two new ZTS tests to verify the changes:
  • zed_slow_io: verify that the override properties function as expected and that a VDEV gets degraded
  • zed_slow_io_many_vdevs: confirm that when multiple VDEVs are involved, no degrade occurs

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@tonyhutter (Contributor)

Just some quick questions as I was looking over the code -

  1. Is it correct to say that zfs would only call zpool_vdev_degrade() in response to slow IOs if you have VDEV_PROP_SLOW_IO_[N|T] set? That is, if you just have a vanilla pool without setting any vdev properties, could its vdevs ever get marked degraded on slow IOs?

  2. I'm not really clear on what marking a vdev degraded means from a functional standpoint. I see it defined here:

       VDEV_STATE_DEGRADED, /* Replicated vdev with unhealthy kids    */

...but that makes it sound like a top level vdev state. Does something special happen to leaf vdevs marked DEGRADED? Like we stop allocating data on them or something? Or is it functionally the same as being FAULTED?

  3. Can you add the description from this PR into your commit message for the "Add slow disk diagnosis to ZED" commit?

@behlendorf added the "Status: Code Review Needed" label on Nov 1, 2023
@don-brady (Contributor, Author)

2. I'm not really clear on what marking a vdev degraded means from a functional standpoint.

An individual vdev is degraded when it trips the checksum SERD diagnosis. For this PR, we are extending that to include a "too many slow I/Os" diagnosis. A degraded vdev is described in zpoolconcepts(7):

                   The number of checksum errors exceeds acceptable levels and the
                   device is degraded as an indication that something may be wrong.
                   ZFS continues to use the device as necessary.

I will update the man page to reflect that slow I/Os can also cause a degraded state.

A hot spare will replace a vdev that is degraded or faulted. We are leveraging this behavior so that a slow disk can be replaced with a spare.
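For context, a hedged sketch of that flow (pool and device names are placeholders):

    # With a hot spare attached, ZED's retire agent can swap it in once
    # the slow disk is degraded by the new diagnosis.
    zpool add tank spare /dev/sdd
    zpool status tank     # the spare appears under the "spares" section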

@don-brady (Contributor, Author)

  1. if you just have a vanilla pool without setting any vdev properties, could its vdevs ever get marked degraded on slow IOs?

Yes. With the default SERD slow I/O thresholds, a generic pool would see a degrade if one, and only one, vdev is slow (delay events of 30 seconds or more). The current behavior (without this change) is that the per-vdev slow I/O counters increase and zevents are generated, but no diagnosis occurs.
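For reference, a hedged way to check the relevant knobs on Linux (the sysfs path and default are assumed from the zio_slow_io_ms tunable named in this PR; "tank" is a placeholder):

    # zio_slow_io_ms defines what counts as a "slow" I/O (default 30000 ms).
    cat /sys/module/zfs/parameters/zio_slow_io_ms
    # Per-vdev slow I/O counts are visible with the -s flag.
    zpool status -s tank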

@tonyhutter (Contributor)

if you just have a vanilla pool without setting any vdev properties, could its vdevs ever get marked degraded on slow IOs?

Yes, with the default SERD slow I/O thresholds a generic pool would see a degrade if one, and only one, vdev is slow (30 sec+ delay events).

I think you may only want to enable the behavior if the user has set VDEV_PROP_SLOW_IO_[N|T] to something. I'm just worried that if it's enabled by default, there are going to be a lot of "why are my vdevs getting degraded?" issues opened. I wouldn't be surprised if long IO delays are more common than people are aware of (they're very common at our site).

Resolved review threads: cmd/zed/agents/zfs_diagnosis.c (outdated), man/man7/vdevprops.7
@don-brady (Contributor, Author)

Rebased to resolve merge conflicts. Addressed review feedback.

@behlendorf (Contributor)

Just for reference, I noticed some test failures on CentOS 7.

@pashford (Contributor)

Is this related to the following messages I'm finding in kmsg on my QNAP?

2023-08-09 10:18:13 -07:00 <7> [ 606.733785] ----- [DISK SLOW] Pool zpool1 vd 8851305046343565969 -----
2023-08-09 10:18:13 -07:00 <7> [ 606.733785] 100ms 300ms 500ms 1s 5s 10s
2023-08-09 10:18:13 -07:00 <7> [ 606.733785] write 0 0 0 9 0 0

Thank you.

@don-brady (Contributor, Author)

Fixed commit message to be checkstyle compliant.

@don-brady (Contributor, Author)

Is this related to the following messages I'm finding in kmsg on my QNAP?

2023-08-09 10:18:13 -07:00 <7> [ 606.733785] ----- [DISK SLOW] Pool zpool1 vd 8851305046343565969 ----- 2023-08-09 10:18:13 -07:00 <7> [ 606.733785] 100ms 300ms 500ms 1s 5s 10s 2023-08-09 10:18:13 -07:00 <7> [ 606.733785] write 0 0 0 9 0 0

Thank you.

I believe it is related, in that the above message is likely a summary of a zevent for a slow I/O. If you have CLI (shell) access, you can confirm with zpool status -s <pool>.

@behlendorf (Contributor)

@don-brady when you get a chance, can you rebase this? I also noticed the new test case is failing on the CentOS 7 builder.

    FAIL events/zed_slow_io (expected PASS)

@don-brady (Contributor, Author)

FAIL events/zed_slow_io (expected PASS)

The CentOS 7 test failure was very interesting. On this OS, and only on this OS, a spa txg sync was coming in during/after the test file reads and generating a lot of slow I/Os from the txg sync writes, which in turn was causing a degrade where one was not expected by the test.

The current zinject -D mechanism does not discriminate by I/O type and was ignoring the -T option. So I extended zio_handle_io_delay() to also consider the io_type before injecting a delay. The slow I/O tests now inject delays using "-T read" to get a more predictable outcome.

Also updated the zinject(8) man page to indicate that -T is an optional parameter for -D.

diff --git a/module/zfs/zio_inject.c b/module/zfs/zio_inject.c
index 3598351c4..609182f4a 100644
--- a/module/zfs/zio_inject.c
+++ b/module/zfs/zio_inject.c
@@ -605,6 +605,10 @@ zio_handle_io_delay(zio_t *zio)
                if (vd->vdev_guid != handler->zi_record.zi_guid)
                        continue;

+               if (handler->zi_record.zi_iotype != ZIO_TYPES &&
+                   handler->zi_record.zi_iotype != zio->io_type)
+                               continue;
+
                /*
                 * Defensive; should never happen as the array allocation
                 * occurs prior to inserting this handler on the list.
@@ -96,7 +97,7 @@ function start_slow_io
        log_must set_tunable64 ZIO_SLOW_IO_MS 10
        log_must set_tunable64 SLOW_IO_EVENTS_PER_SECOND 1000

-       log_must zinject -d $VDEV -D10:1 $POOL
+       log_must zinject -d $VDEV -D10:1 -T read $POOL
 }

 function stop_slow_io
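
For reference, a hedged standalone example of the read-only delay injection described above (pool and device names are placeholders):

    # Inject a 10 ms delay with 1 lane, restricted to read I/O only,
    # then clear all injection handlers when done.
    zinject -d /dev/sdb -D 10:1 -T read tank
    zinject -c all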

@don-brady force-pushed the zed-slowio-serd branch 2 times, most recently from f1fe8ce to 258c87e on December 9, 2023
@don-brady (Contributor, Author)

Another fix for ZTS for recent Ubuntu failures.

@don-brady force-pushed the zed-slowio-serd branch 2 times, most recently from 9a447eb to a9d10d4 on December 12, 2023
@pashford (Contributor)

Is this related to the following messages I'm finding in kmsg on my QNAP?
2023-08-09 10:18:13 -07:00 <7> [ 606.733785] ----- [DISK SLOW] Pool zpool1 vd 8851305046343565969

I believe it is related in that the above message is likely a summary of a zevent for a slow I/O. If you have cli (shell) access you can confirm with zpool status -s <pool>.

The ZFS on QNAP is a branch from the original Solaris release and is too old to have the '-s' option.

Thank you for trying.

@don-brady force-pushed the zed-slowio-serd branch 2 times, most recently from 19e552d to dded2f8 on December 18, 2023
@behlendorf (Contributor) left a comment

Looks good. @don-brady could you just rebase this on the latest master branch and force update the PR? We should get a cleaner CI run.

@behlendorf added the "Status: Accepted" label and removed the "Status: Code Review Needed" label on Dec 19, 2023
@don-brady force-pushed the zed-slowio-serd branch 2 times, most recently from 98a1410 to ea98ab7 on January 6, 2024
@allanjude (Contributor) left a comment

Reviewed-by: Allan Jude <allan@klarasystems.com>

@behlendorf (Contributor)

@don-brady it looks like this change is causing at least the following ZTS failures.

https://github.com/openzfs/zfs/actions/runs/7429220144?pr=15469

 Tests with results other than PASS that are unexpected:
    FAIL fault/auto_spare_001_pos (expected PASS)
    FAIL fault/auto_spare_002_pos (expected PASS)
    FAIL fault/auto_spare_multiple (expected PASS)

@don-brady (Contributor, Author)

Investigating ZTS auto_spare test failures...

Updated commit message:

Slow disk response times can be indicative of a failing drive. ZFS
currently tracks slow I/Os (slower than zio_slow_io_ms) and generates
events (ereport.fs.zfs.delay).  However, no action is taken by ZED,
like is done for checksum or I/O errors.  This change adds slow disk
diagnosis to ZED which is opt-in using new VDEV properties:
  VDEV_PROP_SLOW_IO_N
  VDEV_PROP_SLOW_IO_T

If multiple VDEVs in a pool are undergoing slow I/Os, then it skips
the zpool_vdev_degrade().

Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
@don-brady (Contributor, Author)

Update
I was able to root cause why seemingly unrelated ZTS tests involving ZED were failing after our new tests for diagnosing slow I/Os were added.

In the ZTS, there are ZED-related tests in functional/events and also in functional/fault. In the functional/events tests, ZED is started and stopped before each test, and in the functional/fault tests ZED is started in the setup script and stopped in the cleanup script. Both methods are fine and ensure that ZED starts in a known state with whatever zedlets the suite needs installed.

The mystery I encountered was that one of the auto_spare tests was failing to replace a faulted disk with a spare after 60 seconds (it should take less than a second). It was as if ZED was stuck or stalled. And if I removed the new zed_slow_io_xxx tests, the auto_spare test failures went away. I think this confirms that the ZED diagnosis changes in this PR were not the source of the auto_spare tests failing on Ubuntu hosts.

Root Cause
The problem is that when ZED is started for the fault suite of tests, there is a backlog of events, generated by the zed_slow_io_many_vdevs test, still sitting in the kernel. So when ZED is started during setup for the fault test suite, it starts processing those stale events. Note that these are events for a pool that no longer exists! ZED gets so backlogged that it cannot process the latest events in time; it is effectively stuck in the past.

Normally ZED keeps persistent state (zed.state) about which events it has processed. But in the ZTS, that state resides in a temporary directory that is removed each time zed_cleanup is called. So it is important to manually drain the events before starting the ZED process so that it won't reprocess unrelated stale events in the next test run.

To fix this failure, I added a call to zed_events_drain before starting up ZED in the setup scripts. This ensures that ZED doesn't have a backlog of events when it starts.
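A hedged sketch of the resulting setup order (the helper names come from the ZTS conventions mentioned above; the use of 'zpool events -c' for draining is an assumption):

    # In the suite setup, drain stale zevents before starting ZED.
    zed_events_drain    # e.g. clears the kernel event queue via 'zpool events -c'
    zed_start           # ZED now starts without a backlog of stale events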

@behlendorf merged commit cbe8822 into openzfs:master on Feb 8, 2024
24 of 25 checks passed
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Mar 13, 2024
Slow disk response times can be indicative of a failing drive. ZFS
currently tracks slow I/Os (slower than zio_slow_io_ms) and generates
events (ereport.fs.zfs.delay).  However, no action is taken by ZED,
like is done for checksum or I/O errors.  This change adds slow disk
diagnosis to ZED which is opt-in using new VDEV properties:
  VDEV_PROP_SLOW_IO_N
  VDEV_PROP_SLOW_IO_T

If multiple VDEVs in a pool are undergoing slow I/Os, then it skips
the zpool_vdev_degrade().

Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes openzfs#15469
tonyhutter pushed a commit that referenced this pull request May 2, 2024
Slow disk response times can be indicative of a failing drive. ZFS
currently tracks slow I/Os (slower than zio_slow_io_ms) and generates
events (ereport.fs.zfs.delay).  However, no action is taken by ZED,
like is done for checksum or I/O errors.  This change adds slow disk
diagnosis to ZED which is opt-in using new VDEV properties:
  VDEV_PROP_SLOW_IO_N
  VDEV_PROP_SLOW_IO_T

If multiple VDEVs in a pool are undergoing slow I/Os, then it skips
the zpool_vdev_degrade().

Sponsored-By: OpenDrives Inc.
Sponsored-By: Klara Inc.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Closes #15469
Labels: Status: Accepted (Ready to integrate; reviewed, tested)
5 participants