Skip to content

Commit

Permalink
Fix read errors race after block cloning
Browse files Browse the repository at this point in the history
Investigating read errors triggering panic fixed in openzfs#16042 I've
found that we have a race in a sync process between the moment
dirty record for cloned block is removed and the moment dbuf is
destroyed.  If dmu_buf_hold_array_by_dnode() take a hold on a
cloned dbuf before it is synced/destroyed, then dbuf_read_impl()
may see it still in DB_NOFILL state, but without the dirty record.
Such case is not an error, but equivalent to DB_CACHED, since
the dbuf block pointer is already updated by dbuf_write_ready().
Unfortunately it is impossible to safely change the dbuf state
to DB_CACHED there, since there may already be another cloning
in progress, that dropped dbuf lock before creating a new dirty
record, protected only by the range lock.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
  • Loading branch information
amotin committed Apr 2, 2024
1 parent 39be46f commit 12433ce
Showing 1 changed file with 11 additions and 4 deletions.
15 changes: 11 additions & 4 deletions module/zfs/dbuf.c
Original file line number Diff line number Diff line change
Expand Up @@ -1581,6 +1581,7 @@ dbuf_read_impl(dmu_buf_impl_t *db, zio_t *zio, uint32_t flags,
}

if (db->db_state == DB_UNCACHED) {
uncached:
if (db->db_blkptr == NULL) {
bpp = NULL;
} else {
Expand All @@ -1593,12 +1594,18 @@ dbuf_read_impl(dmu_buf_impl_t *db, zio_t *zio, uint32_t flags,
ASSERT3S(db->db_state, ==, DB_NOFILL);

/*
* Block cloning: If we have a pending block clone,
* we don't want to read the underlying block, but the content
* of the block being cloned, so we have the most recent data.
* If we have a pending block clone, we don't want to read
* the underlying block, but the content of the block being
* cloned, pointed by the dirty record, so we have the most
* recent data. If there is no dirty record, then we hit a
* race in a sync process when the dirty record is already
* removed, while the dbuf is not yet destroyed. Such case
* is equivalent to uncached.
*/
dr = list_head(&db->db_dirty_records);
if (dr == NULL || !dr->dt.dl.dr_brtwrite) {
if (dr == NULL)
goto uncached;
if (!dr->dt.dl.dr_brtwrite) {
err = EIO;
goto early_unlock;
}
Expand Down

0 comments on commit 12433ce

Please sign in to comment.