Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added Mark Maybee's changes to allow for Direct IO with read/writes. … #5

Closed
wants to merge 71 commits into from

Conversation

bwatkinson
Copy link

…I also added kernel compatibility checks for get_user_pages_unlocked.

Adding Direct IO support for Read/Writes

This is the first commit of Mark Maybee's code for implementing Direct IO for read and write in ZFS 0.8.x. Direct IO allows for the ARC to be bypassed for read and writes reducing the number of memory copies ZFS.

Motivation and Context

Currently ZFS Zpool's created using NVMe SSD's as the underlying physical VDEV's were experiencing poor performance characteristics for both asynchronous writes and synchronous reads. By adding Direct IO, we are able to bypass the ARC avoiding the overhead of memory copies that were restricting NVMe performance.

Description

This is the first go at getting Direct IO implemented. Other modifications will need to be added in order for this to eventually be pulled into master. The main idea behind the code is splicing the user buffers into abd's and mapping the user pages into kernel space.

How Has This Been Tested?

Currently testing is still ongoing to observe performance gains from using Direct IO with NVMe SSD Zpools, but verification between kernel's 3.10, 4.8, and 4.20 have been tested for compilation.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • [ x] New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

PaulZ-98 and others added 30 commits August 15, 2019 08:27
The call to txg_wait_synced in zfsvfs_teardown should
be made conditional on the objset having dirty data.
This can prevent unnecessary txg_wait_synced during
some unmount operations.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes openzfs#9115
This patch introduces an assertion that can catch pitfalls in
development where there is a mismatch between the size of
reads and writes between a *_phys structure and its respective
in-core structure when bonus buffers are used.

This debugging-aid should be complementary to the verification
done by ztest in ztest_verify_dnode_bt().

A side to this patch is that we now clear out any extra bytes
past a bonus buffer's new size when the buffer is shrinking.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes openzfs#8348
In zfs_log_write(), we can use dmu_read_by_dnode() rather than
dmu_read() thus avoiding unnecessary dnode_hold() calls.

We get a 2-5% performance gain for large sequential_writes tests, >=128K
writes to files with recordsize=8K.

Testing done on Ubuntu 18.04 with 4.15 kernel, 8vCPUs and SSD storage on
VMware ESX.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Nguyen <tony.nguyen@delphix.com>
Closes openzfs#9156
Even though the bug's writeup (Github issue openzfs#9136) is very detailed,
we still don't know exactly how we got to that state, thus I wasn't
able to reproduce the bug. That said, we can make an educated guess
combining the information on filled issue with the code.

From the fact that `dp_dirty_total` was 0 (which is less than
`zfs_dirty_data_max`) we know that there was one thread that set it to
0 and then signaled one of the waiters of `dp_spaceavail_cv` [see
`dsl_pool_dirty_delta()` which is also the only place that
`dp_dirty_total` is changed].  Thus, the only logical explaination
then for the bug being hit is that the waiter that just got awaken
didn't go through `dsl_pool_dirty_data()`. Given that this function
is only called by `dsl_pool_dirty_space()` or `dsl_pool_undirty_space()`
I can only think of two possible ways of the above scenario happening:

[1] The waiter didn't call into any of the two functions - which I
    find highly unlikely (i.e. why wait on `dp_spaceavail_cv` to begin
    with?).
[2] The waiter did call in one of the above function but it passed 0 as
    the space/delta to be dirtied (or undirtied) and then the callee
    returned immediately (e.g both `dsl_pool_dirty_space()` and
    `dsl_pool_undirty_space()` return immediately when space is 0).

In any case and no matter how we got there, the easy fix would be to
just broadcast to all waiters whenever `dp_dirty_total` hits 0. That
said and given that we've never hit this before, it would make sense
to think more on why the above situation occured.

Attempting to mimic what Prakash was doing in the issue filed, I
created a dataset with `sync=always` and started doing contiguous
writes in a file within that dataset. I observed with DTrace that even
though we update the pool's dirty data accounting when we would dirty
stuff, the accounting wouldn't be decremented incrementally as we were
done with the ZIOs of those writes (the reason being that
`dbuf_write_physdone()` isn't be called as we go through the override
code paths, and thus `dsl_pool_undirty_space()` is never called). As a
result we'd have to wait until we get to `dsl_pool_sync()` where we
zero out all dirty data accounting for the pool and the current TXG's
metadata.

In addition, as Matt noted and I later verified, the same issue would
arise when using dedup.

In both cases (sync & dedup) we shouldn't have to wait until
`dsl_pool_sync()` zeros out the accounting data. According to the
comment in that part of the code, the reasons why we do the zeroing,
have nothing to do with what we observe:
````
/*
 * We have written all of the accounted dirty data, so our
 * dp_space_towrite should now be zero.  However, some seldom-used
 * code paths do not adhere to this (e.g. dbuf_undirty(), also
 * rounding error in dbuf_write_physdone).
 * Shore up the accounting of any dirtied space now.
 */
dsl_pool_undirty_space(dp, dp->dp_dirty_pertxg[txg & TXG_MASK], txg);
````

Ideally what we want to do is to undirty in the accounting exactly what
we dirty (I use the word ideally as we can still have rounding errors).
This would make the behavior of the system more clear and predictable.

Another interesting issue that I observed with DTrace was that we
wouldn't update any of the pool's dirty data accounting whenever we
would dirty and/or undirty MOS data. In addition, every time we would
change the size of a dbuf through `dbuf_new_size()` we wouldn't update
the accounted space dirtied in the appropriate dirty record, so when
ZIOs are done we would undirty less that we dirtied from the pool's
accounting point of view.

For the first two issues observed (sync & dedup) this patch ensures
that we still update the pool's accounting when we undirty data,
regardless of the write being physical or not.

For changes in the MOS, we first ensure to zero out the pool's dirty
data accounting in `dsl_pool_sync()` after we synced the MOS. Then we
can go ahead and enable the update of the pool's dirty data accounting
wheneve we change MOS data.

Another fix is that we now update the accounting explicitly for
counting errors in `dbuf_write_done()`.

Finally, `dbuf_new_size()` updates the accounted space of the
appropriate dirty record correctly now.

The problem is that we still don't know how the bug came up in the
issue filled. That said the issues fixed seem to be very relevant, so
instead of going with the broadcasting solution right away,
I decided to leave this patch as is.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
External-issue: DLPX-47285
Closes openzfs#9137
* contrib/initramfs: include /etc/default/zfs and /etc/zfs/zfs-functions
At least debian needs /etc/default/zfs and /etc/zfs/zfs-functions for
its initramfs. Include both in build when initramfs is configured.

* contrib/initramfs: include 60-zvol.rules and zvol_id
Include 60-zvol.rules and zvol_id and set udev as predependency instead
of debians zdev. This makes debians additional zdev hook unneeded.

* Correct initconfdir substitution for some distros
Not every Linux distro is using @sysconfdir@/default but @initconfdir@
which is already determined by configure. Let's use it.

* systemd: prevent possible conflict between systemd and sysvinit
Systemd will not load a sysvinit service if a unit exists with the same
name. This prevents conflicts between sysvinit and systemd.
In ZFS there is one sysvinit service that does not have a systemd
service but a target counterpart, zfs-import.target.
Usually it does not make any sense to install both but it is possisble.
Let's prevent any conflict by masking zfs-import.service by default.
This does not harm even if init.d/zfs-import does not exist.

Reviewed-by: Chris Wedgwood <cw@f00f.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Tested-by: Alex Ingram <reimu@reimuhakurei.net>
Tested-by: Dreamcat4 <dreamcat4@gmail.com>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes openzfs#7904 
Closes openzfs#9089
On systems with large amounts of storage and high fragmentation, a huge 
amount of space can be used by storing metaslab range trees. Since 
metaslabs are only unloaded during a txg sync, and only if they have 
been inactive for 8 txgs, it is possible to get into a state where all 
of the system's memory is consumed by range trees and metaslabs, and 
txgs cannot sync. While ZFS knows how to evict ARC data when needed, 
it has no such mechanism for range tree data. This can result in boot 
hangs for some system configurations.

First, we add the ability to unload metaslabs outside of syncing 
context. Second, we store a multilist of all loaded metaslabs, sorted 
by their selection txg, so we can quickly identify the oldest 
metaslabs.  We use a multilist to reduce lock contention during heavy 
write workloads. Finally, we add logic that will unload a metaslab 
when we're loading a new metaslab, if we're using more than a certain 
fraction of the available memory on range trees.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes openzfs#9128
It used to be possible for zfs receive (and other operations related 
to clone swap) to bypass refquotas. This can cause a number of issues, 
and there should be an automated test for it.

Added tests for rollback and receive not overriding refquota.

Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes openzfs#9139
Existing zfs initramfs script logic will attempt to set the 'noop' 
scheduler if it's available on the vdev block devices. Newer kernels 
have the similar 'none' scheduler on multiqueue devices; this change 
alters the initramfs script logic to also attempt to set this scheduler 
if it's available.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Garrett Fields <ghfields@gmail.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Colm Buckley <colm@tuatha.org>
Closes openzfs#9042
Uses obj-m instead, due to kernel changes.

See LKML: Masahiro Yamada, Tue, 6 Aug 2019 19:03:23 +0900

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Dominic Pearson <dsp@technoanimal.net>
Closes openzfs#9169
There are two different deadlock scenarios, but they share a common
link, which is
thread 1 holding sa_lock and trying to get zap->zap_rwlock:
    zap_lockdir_impl+0x858/0x16c0 [zfs]
    zap_lockdir+0xd2/0x100 [zfs]
    zap_lookup_norm+0x7f/0x100 [zfs]
    zap_lookup+0x12/0x20 [zfs]
    sa_setup+0x902/0x1380 [zfs]
    zfsvfs_init+0x3d6/0xb20 [zfs]
    zfsvfs_create+0x5dd/0x900 [zfs]
    zfs_domount+0xa3/0xe20 [zfs]

and thread 2 trying to get sa_lock, either in sa_setup:
   sa_setup+0x742/0x1380 [zfs]
   zfsvfs_init+0x3d6/0xb20 [zfs]
   zfsvfs_create+0x5dd/0x900 [zfs]
   zfs_domount+0xa3/0xe20 [zfs]
or in sa_build_index:
   sa_build_index+0x13d/0x790 [zfs]
   sa_handle_get_from_db+0x368/0x500 [zfs]
   zfs_znode_sa_init.isra.0+0x24b/0x330 [zfs]
   zfs_znode_alloc+0x3da/0x1a40 [zfs]
   zfs_zget+0x39a/0x6e0 [zfs]
   zfs_root+0x101/0x160 [zfs]
   zfs_domount+0x91f/0xea0 [zfs]

From there, there are different locking paths back to something
holding zap->zap_rwlock.

The deadlock scenarios involve multiple different ZFS filesystems
being mounted.  sa_lock is common to these scenarios, and the sa
struct involved is private to a mount.  Therefore, these must be
referring to different sa_lock instances and these deadlocks can't
occur in practice.

The fix, from Brian Behlendorf, is to remove sa_lock from lockdep
coverage by initializing it with MUTEX_NOLOCKDEP.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jeff Dike <jdike@akamai.com>
Closes openzfs#9110
When there are many snapshots, calls to zfs_ioc_space_snaps() (e.g. from
`zfs destroy -nv pool/fs@snap1%snap10000`) can be very slow, resulting
in poor performance because we are holding the dp_config_rwlock the
entire time, blocking spa_sync() from continuing.  With around ten
thousand snapshots, we've seen up to 500 seconds in this ioctl,
iterating over up to 50,000,000 bpobjs, ~99% of which are the empty
bpobj.

By creating a fast path for zfs_ioc_space_snaps() handling of the
empty_bpobj, we can achieve a ~5x performance improvement of this ioctl
(when there are many snapshots, and the deadlist is mostly
empty_bpobj's).

Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-58348
Closes openzfs#8744
Automake can perform program name transformations at install time.
However, arc_summary has its own name transformation taking place,
which interferes with the automake transforms. The automake transforms
must be taken into account in order to resolve the conflict.

Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
$fs used with the wrong sed command where should be $mntpnt instead
to match a variable exported by read_mtab()

The fix is mostly to reuse the sed command found in read_mtab()

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Alexey Smirnoff <fling@member.fsf.org>
Closes openzfs#9168
Split long lines where adding license info to dist archive.

Remove extra colon from target line.

Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes openzfs#9189
Fix some switch() fall-though compiler errors:

    abd.c:1504:9: error: this statement may fall through

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes openzfs#9170
The ancient version of blkid (v2.17.2) used in CentOS 6 will not
detect the newly created pool unless it has been written to.
Force a pool sync so `zpool import` will detect the newly created
pool.

Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#9199
When checking ZFS_IOC_* numbers, print which numbers are wrong rather
than silently failing.

Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes openzfs#9187
Reuse enum value ZFS_IOC_BASE for `('Z' << 8)`.
This is helpful on FreeBSD where ZFS_IOC_BASE has a different value and
`('Z' << 8)` is wrong.

Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes openzfs#9188
Document the ZFS_DKMS_ENABLE_DEBUGINFO option in the userland
configuration file, as done with the other ZFS_DKMS_* options.

It has been introduced with commit e45c173 ("dkms: Enable
debuginfo option to be set with zfs sysconfig file") but isn't
mentioned anywhere other than the 'dkms.conf' file (generated).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Closes openzfs#9191
Automake can perform program name transformations at install time.
However, arc_summary has its own name transformation taking place,
which interferes with the automake transforms. The automake transforms
must be taken into account in order to resolve the conflict.

Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
The mdb_set_uint32 function requires that the values passed in be
decimal.  This was overlooked initially because the matching Linux
function accepts both decimal and hexadecimal values.

Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes openzfs#9125
Closes openzfs#9195
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Antonio Russo <antonio.e.russo@gmail.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Closes openzfs#9174
The slog tests fail when attempting to create pools using file vdevs
that already exist from previous test runs. Remove these files in the
setup for the test.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes openzfs#9194
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Commit a887d65 updated the dbufstats such that escalated privileges
are required.  Since all tests under cli_user are run with normal
privileges move this test case to a location where it will be run
required privileges.

Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#9118 
Closes openzfs#9196
Split the arguments for ${TEST_RUNNER} across multiple lines for
clarity. Also added quotes in the message to match the invoked command.

Unquoted variables in argument lists are subject to splitting. In this
particular case we can't quote the variable because it is an optional
argument. Use the method suggested in the description linked below,
instead.

The technique is to use an unquoted variable with an alternate value.

https://github.com/koalaman/shellcheck/wiki/SC2086

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes openzfs#9212
Other than this test, zpool list -p is not well tested by any of the 
automated tests.  Add a test for zpool list -p.

Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes openzfs#9134
The double-colon looked like a typo, but it's actually an obscure
feature. Rules with :: may appear multiple times and are run
independently of one another in the order they appear. The use of ::
for distclean-local was conventional, not accidental.

Add comments to indicate the intentional use of double-colon rules.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes openzfs#9210
Currently, the 'zfs rollback' code can end up deadlocked due to
the way the kernel handles unreferenced inodes on a suspended fs.
Essentially, the zfs_resume_fs() code path may cause zfs to spawn
new threads as it reinstantiates the suspended fs's zil. When a
new thread is spawned, the kernel may attempt to free memory for
that thread by freeing some unreferenced inodes. If it happens to
select inodes that are a a part of the suspended fs a deadlock
will occur because freeing inodes requires holding the fs's
z_teardown_inactive_lock which is still held from the suspend.

This patch corrects this issue by adding an additional reference
to all inodes that are still present when a suspend is initiated.
This prevents them from being freed by the kernel for any reason.

Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes openzfs#9203
Copy link

@mmaybee mmaybee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty reasonable, however I have a couple of comments:

  1. I think we should leave out the "user-space" version of the interface until such time as it will actually be consumed. At the moment it is just going to be dead code.

  2. The naming: abd_get_from_pages/abd_free_from_pages violates normal naming conventions. I think we should either embed the extra code from abd_free_from_pages in abd_put() (leveraging ABD_FLAG_DIO_PAGE) or we should rename abd_get_from_pages to abd_alloc_from_pages. I lean toward the former, as it reduces the number of interfaces.

@bwatkinson
Copy link
Author

bwatkinson commented Sep 6, 2019

@mmaybee

  1. I am pretty neutral about this. I am fine with removing the user space version as it is not necessary at this point. I just figured it might be easier for us to implement it than someone else having to maybe do it at a later time. @behlendorf do you have an opinion on whether to remove the user space version?

  2. I actually agree with you on this. I am prefer the former as well. I was originally going to go this route anyways. I just need to check for the ABD_FLAG_DIO_PAGE in abd_put() so we do not run the abd_verify() function there. If we were to call verify after removing the sg_table that would lead to segfaults (I know because I saw it happen lol). It is any easy enough change though and I will go ahead and update that code and push the changes Monday.

@mmaybee and @behlendorf I think we could be ready for an official pull release against the zfsonlinux master branch once we get the idea of how do we only limit read and write for Direct IO to the PAGE_SIZE limitation. I have been really thinking about how to go about this, but it definitely is tricky. No matter what, I believe we are kinda stuck in some ways on dn->datablksz check due to checksum verifies for reads. I think a bigger issue also comes when we think about a Direct IO request (that is within the PAGE_SIZE limitation), but the offset tiggers having to pull in 2 datablksz chunks to perform the checksum verifies (AKA the request is in between both). We are limited to the data buffer size the user gives us, so we can not grab and write more data into the user buffer than what is provided. I was thinking we would have to eventually having to perform a data copy again at some point... Also, what do we do about compression? Technically the semantics of O_DIRECT should lead to no data copies; however, with compression an implicit data copy is required. I may just be overthinking the datablksz restriction and the compression as well. It is also possible that I do not have a complete understanding of the issue, so please correct me if I am wrong on either of these thoughts. I am still learning the intricacies of the ZFS software stack, so I might just be out in left field on these thoughts : ). I think till we iron out the right way to only restrict to to the PAGE_SIZE limitation, we should hold off on an official pull request on master. Thoughts or opinions on this?

@mmaybee
Copy link

mmaybee commented Sep 6, 2019

For data services like compression and encryption, my opinion is that "Direct IO" is really just a "reduced data copy" interface. That is, we perform the normal IO pipeline processing (with the necessary data copies for the data transforms) but avoid the final "copy out" because we have a mapped buffer from the user. The only alternative I could think of would be to return the "raw" data when doing direct IO so that we return a compressed and/or encrypted block... but I don't think this is really useful.

For relaxing the alignment constraints to PAGE_SIZE: my approach here would be to use the "multi list" functionality that I added to the abd interface to allow one to combine together multiple abd buffers into a single logical buffer (I did this for the aggregation functionality in vdev_queue). If we have a read that does not cover an entire block then we "add in" an abd buffer to the beginning or end of the one created from the users buffer to construct one the size of the data block, and then perform a normal read of the block. For the write side, we would create a similar abd buffer, but then would also need to to a read/modify/write for the partial blocks.

@behlendorf
Copy link
Owner

1 I think we should leave out the "user-space" version

That's fine with me, I agree if it's just going to be dead code let's not include it until it's needed.

my opinion is that "Direct IO" is really just a "reduced data copy" interface.

I agree completely. The behavior of Direct IO has always been described as a best effort to "minimize cache effects of the I/O to and from this file". When things like compression and encryption are involved there's realistically only so much which can be done.

For relaxing the alignment constraints to PAGE_SIZE: my approach here would be to use the "multi list" functionality that I added to the abd interface to allow one to combine together multiple abd buffers into a single logical buffer

This makes the most sense to me. @mmaybee if you prefer, this could be pushed as a stand alone PR to improve the aggregation logic, or folded in to this larger change. Either way is OK with me.

@mmaybee
Copy link

mmaybee commented Sep 6, 2019

Let me try to fold my stuff into this code and make sure that it actually works :-)

@bwatkinson
Copy link
Author

I will go ahead and remove the user space interface and also update the get_from_user_pages to use abd_put as well. I will get those changes pushed up on Monday.

Makes sense on the “reduce data copy” with Direct IO.

@mmaybee, I am guessing you are already working on the “add in” abd buffer stuff for loosing the constraints to only PAGE_SIZE for Direct IO? I am definitely willing to help out with this, so please don’t hesitate to push some work my way if you have already started on this. Otherwise, I can start working on modifications to allow for “add in” buffers.

Also, I definitely don’t mind continuing to integrate the changes into this branch as we get prepared for an official PR. The current process of me migrating code from the Slack Channel into this branch works for me, but if either of you think it would be more beneficial to just push changes directly to this branch just let me know.

@mmaybee
Copy link

mmaybee commented Sep 6, 2019

@bwatkinson, yea, I already have an implementation of the abd extension. I am now folding it into your bits. Once I have done a sanity check I will toss it over to you and we can work on applying it to the direct IO path.

@bwatkinson
Copy link
Author

@mmaybee sounds good!

… updated the abd code so now abd_put is only used to free up the sg_table instead of having to call abd_free.
@bwatkinson
Copy link
Author

@bwatkinson, yea, I already have an implementation of the abd extension. I am now folding it into your bits. Once I have done a sanity check I will toss it over to you and we can work on applying it to the direct IO path.

@mmaybee I have pushed up the changes we talked about with only using abd_put() with the Direct IO path. I also removed the user space version of abd_get_from_user_pages(). Just let me know when you have the abd extension ready (or you would like me to help out with anything else) and we can work on applying it to the Direct IO pathl.

@behlendorf
Copy link
Owner

@bwatkinson in order to open a zfsonlinux PR we're going to want to get this rebased on the current version of the master branch. It should be pretty straight forward, though there may be a couple conflicts to resolve.

@bwatkinson
Copy link
Author

@bwatkinson in order to open a zfsonlinux PR we're going to want to get this rebased on the current version of the master branch. It should be pretty straight forward, though there may be a couple conflicts to resolve.

@behlendorf I did a rebase, I believe, last week, so hopefully there will not be too many conflicts when going to rebase again for the PR. I did notice that some of the organization has changed with the module/os/linux/(spl/zfs), but hopefully the conflicts will not be too out of control : ). I figured I would do the rebase after we integrated @mmaybee's abd extension code.

@bwatkinson
Copy link
Author

bwatkinson commented Sep 9, 2019

@behlendorf and @mmaybee, so I have actually found another bug, which I was hoping you both might have some insights into. So, if we write using Direct IO and immediately delete file, we get to a NULL pointer dereference in arc_release(). I attached a quick stack trace, but it definitely appears to be a race condition. If I wait 1 second and then delete the file, after using Direct IO to write the file, the null pointer deference is no longer an issue.

Also, as a quick test, I also tried setting primarycache=none and the issue still occurs.

Screen Shot 2019-09-09 at 4 25 41 PM

I have traced the issue down to the following section of code in dbuf_free_range():

Screen Shot 2019-09-10 at 10 13 48 AM

We are currently calling dbuf_unoverride. What strange is we are marking each of the dbuf's as DB_NOFILL in dmu_write_direct() and DB_UNCACHED in dmu_write_direct_done(), so at some point we are getting to another dbuf that falls down to the db->db_last_dirty != NULL check. Any ideas why we would be meeting the if condition to call dbuf_unoverride here if we do not wait a certain amount of time before deleting the file just written using Direct IO?

@mmaybee
Copy link

mmaybee commented Sep 10, 2019

The call do dbuf_unoverride() is correct if we try to "free" this block in the same txg that we originally wrote it in. We use the override logic to commit the direct IO block pointer to disk (this is why we "abuse" the dbuf state with DB_NOFILL and DB_UNCACHED). A direct IO dbuf is never "filled" with data and so has no cached content... yet it is dirty and has a block pointer that must be written out in syncing context.

I suspect that you may be looking at the wrong call to dbuf_unoverride() though. You are correct that we should not get to the call from dbuf_free_range(). But there is also a call from dbuf_undirty(). This is only protected by a DB_NOFILL check, and by the time we get here our state should be DB_UNCACHED. We need logic here and in dbuf_unoverride() that handles the direct IO case. (i.e., dirty, but uncached => no dr_data).

@bwatkinson
Copy link
Author

The call do dbuf_unoverride() is correct if we try to "free" this block in the same txg that we originally wrote it in. We use the override logic to commit the direct IO block pointer to disk (this is why we "abuse" the dbuf state with DB_NOFILL and DB_UNCACHED). A direct IO dbuf is never "filled" with data and so has no cached content... yet it is dirty and has a block pointer that must be written out in syncing context.

I suspect that you may be looking at the wrong call to dbuf_unoverride() though. You are correct that we should not get to the call from dbuf_free_range(). But there is also a call from dbuf_undirty(). This is only protected by a DB_NOFILL check, and by the time we get here our state should be DB_UNCACHED. We need logic here and in dbuf_unoverride() that handles the direct IO case. (i.e., dirty, but uncached => no dr_data).

@mmaybee, yeah the stack trace was leading me to believe that we were hitting dbuf_unoverride() from dbuf_free_range(), but very well could be happening in dbuf_undirty(). I noticed the call to dbuf_unoverride() in there as well, but the stack trace got me all screwed up lol. In any case, I still need to add the logic in both spots anyways. I will let you know if this fixes the issue. On a side note, I noticed we might have mistake in dmu_write_direct_done() with an ASSERT. I believe we need to have ASSERT(dr->dt.dl.dr_data == NULL) correct?

Screen Shot 2019-09-10 at 1 53 00 PM

@mmaybee
Copy link

mmaybee commented Sep 10, 2019

@bwatkinson yup, correct... would have shown up when we did a debug compile! :-)
BTW, I think, probably, in both dbuf_undirty() and dbuf_unoverride() you just need to put an: if (db->db_state != DB_UNCACHED) around the calls to the arc functions trying to de-reference dr_data.

@bwatkinson
Copy link
Author

bwatkinson commented Sep 10, 2019

@bwatkinson yup, correct... would have shown up when we did a debug compile! :-)
BTW, I think, probably, in both dbuf_undirty() and dbuf_unoverride() you just need to put an: if (db->db_state != DB_UNCACHED) around the calls to the arc functions trying to de-reference dr_data.

@mmaybee, so here is the modification I made to dbuf_undirty():

Screen Shot 2019-09-10 at 2 25 23 PM

I just added the extra check for DB_UNCACHED. Also, below is the modifications I made to dbuf_unoverride():

Screen Shot 2019-09-10 at 2 30 14 PM

I believe we are okay just bailing at this point in dbuf_unoverride(); however, I was not certain if I should move the check down to the arc_release() and only call it if db->db_state != DB_UNCACHED.

@mmaybee
Copy link

mmaybee commented Sep 10, 2019

@bwatkinson For the dbuf_unoverride() case, we still need to free the block that was written out in dmu_write_direct(). So we need to get to the call to zio_free(). Otherwise we will leak the block. I would just put the if-statement before the call to arc_release().
For dbuf_undirty(), we need to make sure we call dbuf_unoverride() (for the reasons I stated above). So I would create a new if-statement to check for UNCACHED prior to the ASSERTs and call to arc_buf_destroy().

@bwatkinson
Copy link
Author

@bwatkinson For the dbuf_unoverride() case, we still need to free the block that was written out in dmu_write_direct(). So we need to get to the call to zio_free(). Otherwise we will leak the block. I would just put the if-statement before the call to arc_release().
For dbuf_undirty(), we need to make sure we call dbuf_unoverride() (for the reasons I stated above). So I would create a new if-statement to check for UNCACHED prior to the ASSERTs and call to arc_buf_destroy().

@mmaybee, yeah I just thought about that before you replied... Sorry, I went brain dead there for a bit lol. I will update the code correctly, check that everything works, and push up the changes. Sorry for all the confusion on this.

…ould cause a NULL pointer derefence. Just needed to check if the dbuf was in a DB_UNCACHED state, and if so, not call any of the ARC related functions in dbuf_unoverride and dbuf_undirty.
behlendorf pushed a commit that referenced this pull request Dec 14, 2019
After spa_vdev_remove_aux() is called, the config nvlist is no longer
valid, as it's been replaced by the new one (with the specified device
removed).  Therefore any pointers into the nvlist are no longer valid.
So we can't save the result of
`fnvlist_lookup_string(nv, ZPOOL_CONFIG_PATH)` (in vd_path) across the
call to spa_vdev_remove_aux().

Instead, use spa_strdup() to save a copy of the string before calling
spa_vdev_remove_aux.

Found by AddressSanitizer:

ERROR: AddressSanitizer: heap-use-after-free on address ...
READ of size 34 at 0x608000a1fcd0 thread T686
    #0 0x7fe88b0c166d  (/usr/lib/x86_64-linux-gnu/libasan.so.4+0x5166d)
    #1 0x7fe88a5acd6e in spa_strdup spa_misc.c:1447
    #2 0x7fe88a688034 in spa_vdev_remove vdev_removal.c:2259
    #3 0x55ffbc7748f8 in ztest_vdev_aux_add_remove ztest.c:3229
    #4 0x55ffbc769fba in ztest_execute ztest.c:6714
    #5 0x55ffbc779a90 in ztest_thread ztest.c:6761
    #6 0x7fe889cbc6da in start_thread
    openzfs#7 0x7fe8899e588e in __clone

0x608000a1fcd0 is located 48 bytes inside of 88-byte region
freed by thread T686 here:
    #0 0x7fe88b14e7b8 in __interceptor_free
    #1 0x7fe88ae541c5 in nvlist_free nvpair.c:874
    #2 0x7fe88ae543ba in nvpair_free nvpair.c:844
    #3 0x7fe88ae57400 in nvlist_remove_nvpair nvpair.c:978
    #4 0x7fe88a683c81 in spa_vdev_remove_aux vdev_removal.c:185
    #5 0x7fe88a68857c in spa_vdev_remove vdev_removal.c:2221
    #6 0x55ffbc7748f8 in ztest_vdev_aux_add_remove ztest.c:3229
    openzfs#7 0x55ffbc769fba in ztest_execute ztest.c:6714
    openzfs#8 0x55ffbc779a90 in ztest_thread ztest.c:6761
    openzfs#9 0x7fe889cbc6da in start_thread

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes openzfs#9706
@bghira
Copy link

bghira commented Feb 18, 2020

openzfs#10018

@behlendorf behlendorf closed this Feb 18, 2020
behlendorf pushed a commit that referenced this pull request Feb 28, 2020
After spa_vdev_remove_aux() is called, the config nvlist is no longer
valid, as it's been replaced by the new one (with the specified device
removed).  Therefore any pointers into the nvlist are no longer valid.
So we can't save the result of
`fnvlist_lookup_string(nv, ZPOOL_CONFIG_PATH)` (in vd_path) across the
call to spa_vdev_remove_aux().

Instead, use spa_strdup() to save a copy of the string before calling
spa_vdev_remove_aux.

Found by AddressSanitizer:

ERROR: AddressSanitizer: heap-use-after-free on address ...
READ of size 34 at 0x608000a1fcd0 thread T686
    #0 0x7fe88b0c166d  (/usr/lib/x86_64-linux-gnu/libasan.so.4+0x5166d)
    #1 0x7fe88a5acd6e in spa_strdup spa_misc.c:1447
    #2 0x7fe88a688034 in spa_vdev_remove vdev_removal.c:2259
    #3 0x55ffbc7748f8 in ztest_vdev_aux_add_remove ztest.c:3229
    #4 0x55ffbc769fba in ztest_execute ztest.c:6714
    #5 0x55ffbc779a90 in ztest_thread ztest.c:6761
    #6 0x7fe889cbc6da in start_thread
    openzfs#7 0x7fe8899e588e in __clone

0x608000a1fcd0 is located 48 bytes inside of 88-byte region
freed by thread T686 here:
    #0 0x7fe88b14e7b8 in __interceptor_free
    #1 0x7fe88ae541c5 in nvlist_free nvpair.c:874
    #2 0x7fe88ae543ba in nvpair_free nvpair.c:844
    #3 0x7fe88ae57400 in nvlist_remove_nvpair nvpair.c:978
    #4 0x7fe88a683c81 in spa_vdev_remove_aux vdev_removal.c:185
    #5 0x7fe88a68857c in spa_vdev_remove vdev_removal.c:2221
    #6 0x55ffbc7748f8 in ztest_vdev_aux_add_remove ztest.c:3229
    openzfs#7 0x55ffbc769fba in ztest_execute ztest.c:6714
    openzfs#8 0x55ffbc779a90 in ztest_thread ztest.c:6761
    openzfs#9 0x7fe889cbc6da in start_thread

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes openzfs#9706
@bwatkinson bwatkinson deleted the direct branch January 22, 2021 02:49
behlendorf pushed a commit that referenced this pull request Feb 22, 2023
Under certain loads, the following panic is hit:

    panic: page fault
    KDB: stack backtrace:
    #0 0xffffffff805db025 at kdb_backtrace+0x65
    #1 0xffffffff8058e86f at vpanic+0x17f
    #2 0xffffffff8058e6e3 at panic+0x43
    #3 0xffffffff808adc15 at trap_fatal+0x385
    #4 0xffffffff808adc6f at trap_pfault+0x4f
    #5 0xffffffff80886da8 at calltrap+0x8
    #6 0xffffffff80669186 at vgonel+0x186
    openzfs#7 0xffffffff80669841 at vgone+0x31
    openzfs#8 0xffffffff8065806d at vfs_hash_insert+0x26d
    openzfs#9 0xffffffff81a39069 at sfs_vgetx+0x149
    openzfs#10 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    openzfs#11 0xffffffff8065a28c at lookup+0x45c
    openzfs#12 0xffffffff806594b9 at namei+0x259
    openzfs#13 0xffffffff80676a33 at kern_statat+0xf3
    openzfs#14 0xffffffff8067712f at sys_fstatat+0x2f
    openzfs#15 0xffffffff808ae50c at amd64_syscall+0x10c
    openzfs#16 0xffffffff808876bb at fast_syscall_common+0xf8

The page fault occurs because vgonel() will call VOP_CLOSE() for active
vnodes. For this reason, define vop_close for zfsctl_ops_snapshot. While
here, define vop_open for consistency.

After adding the necessary vop, the bug progresses to the following
panic:

    panic: VERIFY3(vrecycle(vp) == 1) failed (0 == 1)
    cpuid = 17
    KDB: stack backtrace:
    #0 0xffffffff805e29c5 at kdb_backtrace+0x65
    #1 0xffffffff8059620f at vpanic+0x17f
    #2 0xffffffff81a27f4a at spl_panic+0x3a
    #3 0xffffffff81a3a4d0 at zfsctl_snapshot_inactive+0x40
    #4 0xffffffff8066fdee at vinactivef+0xde
    #5 0xffffffff80670b8a at vgonel+0x1ea
    #6 0xffffffff806711e1 at vgone+0x31
    openzfs#7 0xffffffff8065fa0d at vfs_hash_insert+0x26d
    openzfs#8 0xffffffff81a39069 at sfs_vgetx+0x149
    openzfs#9 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    openzfs#10 0xffffffff80661c2c at lookup+0x45c
    openzfs#11 0xffffffff80660e59 at namei+0x259
    openzfs#12 0xffffffff8067e3d3 at kern_statat+0xf3
    openzfs#13 0xffffffff8067eacf at sys_fstatat+0x2f
    openzfs#14 0xffffffff808b5ecc at amd64_syscall+0x10c
    openzfs#15 0xffffffff8088f07b at fast_syscall_common+0xf8

This is caused by a race condition that can occur when allocating a new
vnode and adding that vnode to the vfs hash. If the newly created vnode
loses the race when being inserted into the vfs hash, it will not be
recycled as its usecount is greater than zero, hitting the above
assertion.

Fix this by dropping the assertion.

FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252700
Reviewed-by: Andriy Gapon <avg@FreeBSD.org>
Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Alek Pinchuk <apinchuk@axcient.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Submitted-by: Klara, Inc.
Sponsored-by: rsync.net
Closes openzfs#14501
behlendorf pushed a commit that referenced this pull request May 28, 2023
Under certain loads, the following panic is hit:

    panic: page fault
    KDB: stack backtrace:
    #0 0xffffffff805db025 at kdb_backtrace+0x65
    #1 0xffffffff8058e86f at vpanic+0x17f
    #2 0xffffffff8058e6e3 at panic+0x43
    #3 0xffffffff808adc15 at trap_fatal+0x385
    #4 0xffffffff808adc6f at trap_pfault+0x4f
    #5 0xffffffff80886da8 at calltrap+0x8
    #6 0xffffffff80669186 at vgonel+0x186
    openzfs#7 0xffffffff80669841 at vgone+0x31
    openzfs#8 0xffffffff8065806d at vfs_hash_insert+0x26d
    openzfs#9 0xffffffff81a39069 at sfs_vgetx+0x149
    openzfs#10 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    openzfs#11 0xffffffff8065a28c at lookup+0x45c
    openzfs#12 0xffffffff806594b9 at namei+0x259
    openzfs#13 0xffffffff80676a33 at kern_statat+0xf3
    openzfs#14 0xffffffff8067712f at sys_fstatat+0x2f
    openzfs#15 0xffffffff808ae50c at amd64_syscall+0x10c
    openzfs#16 0xffffffff808876bb at fast_syscall_common+0xf8

The page fault occurs because vgonel() will call VOP_CLOSE() for active
vnodes. For this reason, define vop_close for zfsctl_ops_snapshot. While
here, define vop_open for consistency.

After adding the necessary vop, the bug progresses to the following
panic:

    panic: VERIFY3(vrecycle(vp) == 1) failed (0 == 1)
    cpuid = 17
    KDB: stack backtrace:
    #0 0xffffffff805e29c5 at kdb_backtrace+0x65
    #1 0xffffffff8059620f at vpanic+0x17f
    #2 0xffffffff81a27f4a at spl_panic+0x3a
    #3 0xffffffff81a3a4d0 at zfsctl_snapshot_inactive+0x40
    #4 0xffffffff8066fdee at vinactivef+0xde
    #5 0xffffffff80670b8a at vgonel+0x1ea
    #6 0xffffffff806711e1 at vgone+0x31
    openzfs#7 0xffffffff8065fa0d at vfs_hash_insert+0x26d
    openzfs#8 0xffffffff81a39069 at sfs_vgetx+0x149
    openzfs#9 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    openzfs#10 0xffffffff80661c2c at lookup+0x45c
    openzfs#11 0xffffffff80660e59 at namei+0x259
    openzfs#12 0xffffffff8067e3d3 at kern_statat+0xf3
    openzfs#13 0xffffffff8067eacf at sys_fstatat+0x2f
    openzfs#14 0xffffffff808b5ecc at amd64_syscall+0x10c
    openzfs#15 0xffffffff8088f07b at fast_syscall_common+0xf8

This is caused by a race condition that can occur when allocating a new
vnode and adding that vnode to the vfs hash. If the newly created vnode
loses the race when being inserted into the vfs hash, it will not be
recycled as its usecount is greater than zero, hitting the above
assertion.

Fix this by dropping the assertion.

FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252700
Reviewed-by: Andriy Gapon <avg@FreeBSD.org>
Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Alek Pinchuk <apinchuk@axcient.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Submitted-by: Klara, Inc.
Sponsored-by: rsync.net
Closes openzfs#14501
behlendorf pushed a commit that referenced this pull request May 30, 2023
Under certain loads, the following panic is hit:

    panic: page fault
    KDB: stack backtrace:
    #0 0xffffffff805db025 at kdb_backtrace+0x65
    #1 0xffffffff8058e86f at vpanic+0x17f
    #2 0xffffffff8058e6e3 at panic+0x43
    #3 0xffffffff808adc15 at trap_fatal+0x385
    #4 0xffffffff808adc6f at trap_pfault+0x4f
    #5 0xffffffff80886da8 at calltrap+0x8
    #6 0xffffffff80669186 at vgonel+0x186
    openzfs#7 0xffffffff80669841 at vgone+0x31
    openzfs#8 0xffffffff8065806d at vfs_hash_insert+0x26d
    openzfs#9 0xffffffff81a39069 at sfs_vgetx+0x149
    openzfs#10 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    openzfs#11 0xffffffff8065a28c at lookup+0x45c
    openzfs#12 0xffffffff806594b9 at namei+0x259
    openzfs#13 0xffffffff80676a33 at kern_statat+0xf3
    openzfs#14 0xffffffff8067712f at sys_fstatat+0x2f
    openzfs#15 0xffffffff808ae50c at amd64_syscall+0x10c
    openzfs#16 0xffffffff808876bb at fast_syscall_common+0xf8

The page fault occurs because vgonel() will call VOP_CLOSE() for active
vnodes. For this reason, define vop_close for zfsctl_ops_snapshot. While
here, define vop_open for consistency.

After adding the necessary vop, the bug progresses to the following
panic:

    panic: VERIFY3(vrecycle(vp) == 1) failed (0 == 1)
    cpuid = 17
    KDB: stack backtrace:
    #0 0xffffffff805e29c5 at kdb_backtrace+0x65
    #1 0xffffffff8059620f at vpanic+0x17f
    #2 0xffffffff81a27f4a at spl_panic+0x3a
    #3 0xffffffff81a3a4d0 at zfsctl_snapshot_inactive+0x40
    #4 0xffffffff8066fdee at vinactivef+0xde
    #5 0xffffffff80670b8a at vgonel+0x1ea
    #6 0xffffffff806711e1 at vgone+0x31
    openzfs#7 0xffffffff8065fa0d at vfs_hash_insert+0x26d
    openzfs#8 0xffffffff81a39069 at sfs_vgetx+0x149
    openzfs#9 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    openzfs#10 0xffffffff80661c2c at lookup+0x45c
    openzfs#11 0xffffffff80660e59 at namei+0x259
    openzfs#12 0xffffffff8067e3d3 at kern_statat+0xf3
    openzfs#13 0xffffffff8067eacf at sys_fstatat+0x2f
    openzfs#14 0xffffffff808b5ecc at amd64_syscall+0x10c
    openzfs#15 0xffffffff8088f07b at fast_syscall_common+0xf8

This is caused by a race condition that can occur when allocating a new
vnode and adding that vnode to the vfs hash. If the newly created vnode
loses the race when being inserted into the vfs hash, it will not be
recycled as its usecount is greater than zero, hitting the above
assertion.

Fix this by dropping the assertion.

FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252700
Reviewed-by: Andriy Gapon <avg@FreeBSD.org>
Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Alek Pinchuk <apinchuk@axcient.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Submitted-by: Klara, Inc.
Sponsored-by: rsync.net
Closes openzfs#14501
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.