ZFS traverse_visitbp optimization to limit prefetch. #11803

Merged: 1 commit merged into openzfs:master on Apr 15, 2021

Conversation

@jsai20 (Contributor) commented Mar 26, 2021

Traversal code, traverse_visitbp(), visits blocks recursively. An
indirect (non-L0) block of size 128K can contain 1024 block pointers
of 128 bytes each. A full traverse, or an incremental traverse where
all blocks were modified, can therefore visit a large number of blocks
pointed to by a single indirect block. The traversal code issues
prefetch for the blocks traversed below an indirect block, which can
queue a large number of async reads on the vdev queue. So, account for
the prefetch issued for blocks pointed to by an indirect block and
limit the maximum prefetch issued in one go.

Module Param:
zfs_traverse_indirect_prefetch_limit: limit on prefetch while traversing
an indirect block

Local counters:
prefetched: counts the number of prefetches done.
pidx: index of the next block to be prefetched.
ptidx: index at which the next batch of prefetch is triggered.

Keep "ptidx" somewhere in the middle of the blocks already prefetched,
so that prefetched blocks get a large enough time window before their
demand reads are issued.

Signed-off-by: Jitendra Patidar jitendra.patidar@nutanix.com
Closes #11802
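
For illustration, a minimal standalone sketch of the windowed prefetch these counters describe (this is not the actual dmu_traverse.c code; the helper names, the default limit of 32, and the exact trigger arithmetic are assumptions based on the description above):

    /* Toy model of the prefetch window in traverse_visitbp(). */
    #include <stdio.h>

    #define NCHILDREN 1024  /* blkptrs in a 128K indirect block */

    static void prefetch(int idx) { printf("prefetch child %4d\n", idx); }
    static void demand_read(int idx) { (void)idx; /* recursive visit */ }

    int main(void)
    {
            int prefetchlimit = 32; /* zfs_traverse_indirect_prefetch_limit */
            int prefetched = 0;     /* prefetches issued so far */
            int pidx = 0;           /* next child index to prefetch */
            int ptidx = 0;          /* demand index triggering more prefetch */

            for (int i = 0; i < NCHILDREN; i++) {
                    if (prefetchlimit != 0 && i == ptidx) {
                            /* Issue at most prefetchlimit prefetches ahead. */
                            while (pidx < NCHILDREN &&
                                pidx < i + prefetchlimit) {
                                    prefetch(pidx);
                                    pidx++;
                                    prefetched++;
                            }
                            /*
                             * Trigger the next batch when the demand reads
                             * reach the middle of the window just issued, so
                             * prefetch I/O keeps a head start over them.
                             */
                            ptidx = i + (pidx - i) / 2 + 1;
                    }
                    demand_read(i);
            }
            printf("total prefetched: %d of %d\n", prefetched, NCHILDREN);
            return (0);
    }

With a limit of 32, each refill issues roughly 16 new prefetches, so between about 16 and 32 reads are in flight ahead of the demand position, instead of all 1024 children of an indirect block being queued at once.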

@behlendorf added the Status: Code Review Needed label Mar 26, 2021
@behlendorf requested a review from amotin March 26, 2021 17:42
@ahrens self-requested a review March 26, 2021 18:09
@adamdmoss (Contributor) commented

I gave this a spin for a couple of days. Nothing exploded. Not sure I could measure a perf/mem gain/loss in casual use.

I have a very mild reservation, which is that I would expect users to tune prefetch tunables according to how their userspace workload behaves, i.e. how useful prefetch appears to be under reasonably predictable userspace test cases. Whereas sanely tuning the traversal limit seems rather difficult, and depends on opaque ZFS structural decisions.

@jsai20 force-pushed the traverse_visibp_optimization branch from a614ff8 to a803e65 on March 30, 2021 06:47
@jsai20 (Contributor, Author) commented Mar 30, 2021

> I gave this a spin for a couple of days. Nothing exploded. Not sure I could measure a perf/mem gain/loss in casual use.
>
> I have a very mild reservation, which is that I would expect users to tune prefetch tunables according to how their userspace workload behaves, i.e. how useful prefetch appears to be under reasonably predictable userspace test cases. Whereas sanely tuning the traversal limit seems rather difficult, and depends on opaque ZFS structural decisions.

Thanks for the testing and your comment.
As per my understanding, on a system with limited resources (I/O bandwidth, system memory) shared between a block-traversal workload (like zfs diff or zfs send) and a primary user/application workload, it is good to limit the prefetch issued by traversal in one go to a reasonable number.
On a high-end system, which has very low I/O latency from the underlying storage and a large amount of system memory (a larger ARC), and where the primary workload is not consuming much of the available I/O bandwidth and ARC, it is probably OK to do bulk prefetch in one go, although I am not sure bulk prefetch is really necessary there.
In the current code, traversing an indirect block can bulk-prefetch up to 1024 blocks (as a 128K indirect block can contain 1024 block pointers). If bulk prefetch is needed for some user/application workload on a high-end system configuration, the current behavior can be restored by tuning "zfs_traverse_indirect_prefetch_limit" to a value >= 1024.

@jsai20 force-pushed the traverse_visibp_optimization branch from a803e65 to 2167ff7 on March 30, 2021 08:19
@amotin (Member) left a comment

I see sense in this change. Originally, until d7958b4, ZFS used 8 times smaller indirect blocks (16KB instead of 128KB), which means 64 times less prefetch size here. We at iXsystems found this indirect block size pretty bad for random workloads, causing excessive read/write inflation, which is why we are still using 32KB now. But in the general case I think this prefetch limit idea is good. The implementation does not look great to me, though. Please see inline comments.

(5 resolved inline review comments on module/zfs/dmu_traverse.c)
@amotin (Member) commented Mar 30, 2021

> I have a very mild reservation, which is that I would expect users to tune prefetch tunables according to how their userspace workload behaves, i.e. how useful prefetch appears to be under reasonably predictable userspace test cases. Whereas sanely tuning the traversal limit seems rather difficult, and depends on opaque ZFS structural decisions.

@adamdmoss Unlike zfetch, this code is not about speculative prefetch, but more about read-ahead. As I understand it, most if not all blocks that are prefetched here will be read later; the only question is how much later. With the existing code it means 128MB per indirect block level (1024 pointers per 128K indirect block, each covering a 128K data block), which for, let's say, the L5 indirects of a huge file means 1024^5 data blocks ahead. That does not make much sense, and the data can get evicted from the ARC sooner than processing gets to it.

(1 resolved inline review comment on module/zfs/dmu_traverse.c)
@jsai20 (Contributor, Author) commented Mar 30, 2021

> I see sense in this change. Originally, until d7958b4, ZFS used 8 times smaller indirect blocks (16KB instead of 128KB), which means 64 times less prefetch size here. We at iXsystems found this indirect block size pretty bad for random workloads, causing excessive read/write inflation, which is why we are still using 32KB now. But in the general case I think this prefetch limit idea is good. The implementation does not look great to me, though. Please see inline comments.

Thanks for the review. Please see responses to comments inline.

@jsai20 force-pushed the traverse_visibp_optimization branch from 2167ff7 to 3417862 on March 31, 2021 05:09
(2 resolved inline review comments on module/zfs/dmu_traverse.c)
@jsai20 force-pushed the traverse_visibp_optimization branch 2 times, most recently from f0c25a4 to adb3783 on April 2, 2021 05:43
@amotin (Member) left a comment

I have no more objections. It would still be good to see some benchmarks showing that the reduced prefetch depth and removed prefetch of the first block do not cause performance regressions.

@jsai20 (Contributor, Author) commented Apr 6, 2021

> I have no more objections. It would still be good to see some benchmarks showing that the reduced prefetch depth and removed prefetch of the first block do not cause performance regressions.

Thanks @amotin for the review.
If you are OK with it, we can start prefetch from index 1 (removing the first-block prefetch) only when prefetchlimit == 1, and start at index 0 for everything else, so that any possibility of regression from removing the first-block prefetch is avoided:

    prefetchlimit = zfs_traverse_indirect_prefetch_limit;
    if (prefetchlimit == 1)
            pidx = 1;       /* skip prefetching the first block */
    else
            pidx = 0;       /* prefetch from the first block */

And, with this, setting the next value of "ptidx" inside the for loop would become: ptidx = MAX(pidx, i + 1);

@jsai20 force-pushed the traverse_visibp_optimization branch from adb3783 to df1a053 on April 6, 2021 12:46
@jsai20 (Contributor, Author) commented Apr 6, 2021

> I have no more objections. It would still be good to see some benchmarks showing that the reduced prefetch depth and removed prefetch of the first block do not cause performance regressions.

Just updated this change as proposed in my previous comment.

@amotin (Member) commented Apr 6, 2021

> Just updated this change.

I don't like this update. It complicates the code without proven motivation. Let's revert it and run at least some reasonable benchmarks comparing behavior with this patch and without.

@jsai20 (Contributor, Author) commented Apr 6, 2021

>> Just updated this change.
>
> I don't like this update. It complicates the code without proven motivation. Let's revert it and run at least some reasonable benchmarks comparing behavior with this patch and without.

Ok.

@jsai20 jsai20 force-pushed the traverse_visibp_optimization branch from df1a053 to 9870854 Compare April 6, 2021 17:04
@jsai20 (Contributor, Author) commented Apr 7, 2021

Perf Testing:
This optimization (limiting the prefetch done while traversing indirect blocks) is primarily meant to minimize negative impact on primary workloads running on the system in parallel with block-traversal workloads (zfs send). Such negative impact is possible due to the bulk prefetch the traversal code does while traversing indirect blocks.
And since the necessary, limited prefetch is still done by the traversal code, it shouldn't have any negative impact (no regression) on zfs send performance itself.

Ran a couple of basic test scenarios to validate that there is no negative impact (no regression) on zfs send with this optimization.

Test System Memory and CPU Config:

Total online memory: 12G
CPU(s): 4
Model name: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz

Test Scenario 1:
Source dataset containing small files, with a small number of indirect blocks (L2 = 1, L1 = 2).
Without the optimization, two L1 blocks (at index 0 and index 1) get prefetched during traversal, while with the optimization only one block (at index 1, skipping the block at index 0) gets prefetched.
Took 2 runs both with and without the optimization to account for variation.
Averaging the numbers from the 2 runs, performance looks similar with and without the optimization.

$ sudo dd if=/dev/urandom of=/testpool/testds/134M-file bs=1k count=131200
131200+0 records in
131200+0 records out
134348800 bytes (134 MB) copied, 7.55323 s, 17.8 MB/s
$ ls -i /testpool/testds/134M-file
256 /testpool/testds/134M-file
$

$ sudo zdb -ddddd testpool/testds 256 |egrep " L0 " |wc -l
1025
$ sudo zdb -ddddd testpool/testds 256 |egrep " L1 " |wc -l
2
$ sudo zdb -ddddd testpool/testds 256 |egrep " L2 " |wc -l
1
$

$ for i in $(seq 1 200); do sudo dd if=/dev/urandom of=/testpool/testds/134M-file-$i bs=1k count=131200; done

$ ls /testpool/testds/
134M-file 134M-file-119 134M-file-14 134M-file-160 134M-file-181 134M-file-21 134M-file-42 134M-file-63 134M-file-84
134M-file-1 134M-file-12 134M-file-140 134M-file-161 134M-file-182 134M-file-22 134M-file-43 134M-file-64 134M-file-85
134M-file-10 134M-file-120 134M-file-141 134M-file-162 134M-file-183 134M-file-23 134M-file-44 134M-file-65 134M-file-86
134M-file-100 134M-file-121 134M-file-142 134M-file-163 134M-file-184 134M-file-24 134M-file-45 134M-file-66 134M-file-87
134M-file-101 134M-file-122 134M-file-143 134M-file-164 134M-file-185 134M-file-25 134M-file-46 134M-file-67 134M-file-88
134M-file-102 134M-file-123 134M-file-144 134M-file-165 134M-file-186 134M-file-26 134M-file-47 134M-file-68 134M-file-89
134M-file-103 134M-file-124 134M-file-145 134M-file-166 134M-file-187 134M-file-27 134M-file-48 134M-file-69 134M-file-9
134M-file-104 134M-file-125 134M-file-146 134M-file-167 134M-file-188 134M-file-28 134M-file-49 134M-file-7 134M-file-90
134M-file-105 134M-file-126 134M-file-147 134M-file-168 134M-file-189 134M-file-29 134M-file-5 134M-file-70 134M-file-91
134M-file-106 134M-file-127 134M-file-148 134M-file-169 134M-file-19 134M-file-3 134M-file-50 134M-file-71 134M-file-92
134M-file-107 134M-file-128 134M-file-149 134M-file-17 134M-file-190 134M-file-30 134M-file-51 134M-file-72 134M-file-93
134M-file-108 134M-file-129 134M-file-15 134M-file-170 134M-file-191 134M-file-31 134M-file-52 134M-file-73 134M-file-94
134M-file-109 134M-file-13 134M-file-150 134M-file-171 134M-file-192 134M-file-32 134M-file-53 134M-file-74 134M-file-95
134M-file-11 134M-file-130 134M-file-151 134M-file-172 134M-file-193 134M-file-33 134M-file-54 134M-file-75 134M-file-96
134M-file-110 134M-file-131 134M-file-152 134M-file-173 134M-file-194 134M-file-34 134M-file-55 134M-file-76 134M-file-97
134M-file-111 134M-file-132 134M-file-153 134M-file-174 134M-file-195 134M-file-35 134M-file-56 134M-file-77 134M-file-98
134M-file-112 134M-file-133 134M-file-154 134M-file-175 134M-file-196 134M-file-36 134M-file-57 134M-file-78 134M-file-99
134M-file-113 134M-file-134 134M-file-155 134M-file-176 134M-file-197 134M-file-37 134M-file-58 134M-file-79
134M-file-114 134M-file-135 134M-file-156 134M-file-177 134M-file-198 134M-file-38 134M-file-59 134M-file-8
134M-file-115 134M-file-136 134M-file-157 134M-file-178 134M-file-199 134M-file-39 134M-file-6 134M-file-80
134M-file-116 134M-file-137 134M-file-158 134M-file-179 134M-file-2 134M-file-4 134M-file-60 134M-file-81
134M-file-117 134M-file-138 134M-file-159 134M-file-18 134M-file-20 134M-file-40 134M-file-61 134M-file-82
134M-file-118 134M-file-139 134M-file-16 134M-file-180 134M-file-200 134M-file-41 134M-file-62 134M-file-83
$
$ sudo zfs snapshot testpool/testds@snap1

Without Optimization:
$ cat /sys/module/zfs/parameters/zfs_traverse_indirect_prefetch_limit
cat: /sys/module/zfs/parameters/zfs_traverse_indirect_prefetch_limit: No such file or directory

1st Run:
$ time sudo zfs send -v -c testpool/testds@snap1 >/dev/null
full send of testpool/testds@snap1 estimated size is 25.2G
total estimated size is 25.2G
TIME SENT SNAPSHOT testpool/testds@snap1
06:12:14 149M testpool/testds@snap1
06:12:15 253M testpool/testds@snap1
06:12:16 393M testpool/testds@snap1
06:12:17 523M testpool/testds@snap1
06:12:18 695M testpool/testds@snap1
….
06:12:34 1.80G testpool/testds@snap1
06:12:35 1.86G testpool/testds@snap1
06:12:36 1.96G testpool/testds@snap1
….
06:16:54 25.0G testpool/testds@snap1
06:16:55 25.0G testpool/testds@snap1
06:16:56 25.1G testpool/testds@snap1

real 4m44.335s
user 0m0.170s
sys 0m8.695s

2nd run (After Reboot):
$ time sudo zfs send -v -c testpool/testds@snap1 >/dev/null
full send of testpool/testds@snap1 estimated size is 25.2G
total estimated size is 25.2G
TIME SENT SNAPSHOT testpool/testds@snap1
07:05:27 115M testpool/testds@snap1
07:05:28 244M testpool/testds@snap1
07:05:29 341M testpool/testds@snap1
07:05:30 416M testpool/testds@snap1
07:05:31 515M testpool/testds@snap1
….
07:05:50 1.68G testpool/testds@snap1
07:05:51 1.77G testpool/testds@snap1
07:05:52 1.82G testpool/testds@snap1
….
07:10:39 25.0G testpool/testds@snap1
07:10:40 25.1G testpool/testds@snap1
07:10:41 25.1G testpool/testds@snap1

real 5m16.421s
user 0m0.174s
sys 0m8.571s

With Optimization:
$ cat /sys/module/zfs/parameters/zfs_traverse_indirect_prefetch_limit
32

1st run:
$ time sudo zfs send -v -c testpool/testds@snap1 >/dev/null
full send of testpool/testds@snap1 estimated size is 25.2G
total estimated size is 25.2G
TIME SENT SNAPSHOT testpool/testds@snap1
06:34:34 169M testpool/testds@snap1
06:34:35 290M testpool/testds@snap1
06:34:36 384M testpool/testds@snap1
06:34:37 534M testpool/testds@snap1
06:34:38 686M testpool/testds@snap1
….
06:34:54 1.78G testpool/testds@snap1
06:34:55 1.84G testpool/testds@snap1
06:34:56 1.90G testpool/testds@snap1
….
06:39:15 25.0G testpool/testds@snap1
06:39:16 25.1G testpool/testds@snap1
06:39:17 25.2G testpool/testds@snap1

real 4m45.321s
user 0m0.165s
sys 0m8.708s

2nd run (After Reboot):
$ time sudo zfs send -v -c testpool/testds@snap1 >/dev/null
full send of testpool/testds@snap1 estimated size is 25.2G
total estimated size is 25.2G
TIME SENT SNAPSHOT testpool/testds@snap1
06:52:45 160M testpool/testds@snap1
06:52:46 320M testpool/testds@snap1
06:52:47 390M testpool/testds@snap1
06:52:48 522M testpool/testds@snap1
06:52:49 686M testpool/testds@snap1
…..
06:53:06 1.78G testpool/testds@snap1
06:53:07 1.83G testpool/testds@snap1
06:53:08 1.88G testpool/testds@snap1
…..
06:57:51 25.0G testpool/testds@snap1
06:57:52 25.0G testpool/testds@snap1
06:57:53 25.1G testpool/testds@snap1

real 5m9.334s
user 0m0.157s
sys 0m8.635s

Test Scenario 2:
Test dataset containing a large file, such that it has multiple L1 indirect blocks pointed to by a single L2.
This exercises the bulk prefetch done without the optimization versus the limited prefetch done with it.

Took 2 runs both with and without the optimization, to account for variation between runs.
Performance numbers are stable across runs and are the same both with and without the optimization.

$ sudo dd if=/dev/urandom of=/testpool/testds/large-file bs=1k count=26240000
26240000+0 records in
26240000+0 records out
26869760000 bytes (27 GB) copied, 1483.99 s, 18.1 MB/s
$
$ ls /testpool/testds/
large-file
$ ls -i /testpool/testds/large-file
2 /testpool/testds/large-file
$ sudo zdb -ddddd testpool/testds 2 |egrep " L0 " |wc -l
205000
$ sudo zdb -ddddd testpool/testds 2 |egrep " L1 " |wc -l
201
$ sudo zdb -ddddd testpool/testds 2 |egrep " L2 " |wc -l
1
$

Without Optimization:
$ cat /sys/module/zfs/parameters/zfs_traverse_indirect_prefetch_limit
cat: /sys/module/zfs/parameters/zfs_traverse_indirect_prefetch_limit: No such file or directory

1st Run:
$ time sudo zfs send -v -c testpool/testds@snap1 >/dev/null
full send of testpool/testds@snap1 estimated size is 25.1G
total estimated size is 25.1G
TIME SENT SNAPSHOT testpool/testds@snap1
07:54:52 82.9M testpool/testds@snap1
07:54:53 195M testpool/testds@snap1
07:54:54 269M testpool/testds@snap1
07:54:55 342M testpool/testds@snap1
07:54:56 420M testpool/testds@snap1
…..
07:55:13 1.72G testpool/testds@snap1
07:55:14 1.87G testpool/testds@snap1
07:55:15 1.99G testpool/testds@snap1
….
07:59:05 24.9G testpool/testds@snap1
07:59:06 24.9G testpool/testds@snap1
07:59:07 25.0G testpool/testds@snap1

real 4m17.322s
user 0m0.166s
sys 0m8.632s

2nd Run (After reboot):
$ time sudo zfs send -v -c testpool/testds@snap1 >/dev/null
full send of testpool/testds@snap1 estimated size is 25.1G
total estimated size is 25.1G
TIME SENT SNAPSHOT testpool/testds@snap1
08:06:03 79.6M testpool/testds@snap1
08:06:04 189M testpool/testds@snap1
08:06:05 266M testpool/testds@snap1
08:06:06 334M testpool/testds@snap1
08:06:07 408M testpool/testds@snap1
…..
08:06:23 1.67G testpool/testds@snap1
08:06:24 1.84G testpool/testds@snap1
08:06:25 1.98G testpool/testds@snap1
…..
08:10:20 24.9G testpool/testds@snap1
08:10:21 25.0G testpool/testds@snap1
08:10:22 25.1G testpool/testds@snap1

real 4m21.343s
user 0m0.187s
sys 0m8.701s

With Optimization:
$ cat /sys/module/zfs/parameters/zfs_traverse_indirect_prefetch_limit
32

1st Run:
$ time sudo zfs send -v -c testpool/testds@snap1 >/dev/null
full send of testpool/testds@snap1 estimated size is 25.1G
total estimated size is 25.1G
TIME SENT SNAPSHOT testpool/testds@snap1
08:18:13 85.8M testpool/testds@snap1
08:18:14 199M testpool/testds@snap1
08:18:15 278M testpool/testds@snap1
08:18:16 353M testpool/testds@snap1
08:18:17 426M testpool/testds@snap1
…..
08:18:30 1.34G testpool/testds@snap1
08:18:31 1.45G testpool/testds@snap1
08:18:32 1.62G testpool/testds@snap1
…..
08:22:30 24.9G testpool/testds@snap1
08:22:31 25.0G testpool/testds@snap1
08:22:32 25.0G testpool/testds@snap1

real 4m21.322s
user 0m0.162s
sys 0m8.604s

2nd Run (After Reboot):
$ time sudo zfs send -v -c testpool/testds@snap1 >/dev/null
full send of testpool/testds@snap1 estimated size is 25.1G
total estimated size is 25.1G
TIME SENT SNAPSHOT testpool/testds@snap1
08:38:08 77.3M testpool/testds@snap1
08:38:09 169M testpool/testds@snap1
08:38:10 252M testpool/testds@snap1
08:38:11 334M testpool/testds@snap1
08:38:12 405M testpool/testds@snap1
…..
08:38:28 1.66G testpool/testds@snap1
08:38:29 1.83G testpool/testds@snap1
08:38:30 1.97G testpool/testds@snap1
…..
08:42:26 24.9G testpool/testds@snap1
08:42:27 25.0G testpool/testds@snap1
08:42:28 25.1G testpool/testds@snap1

real 4m21.309s
user 0m0.157s
sys 0m8.525s

@behlendorf (Contributor) left a comment

The rationale for this seems reasonable to me. I've added some additional review comments. Thanks for providing that test data showing this shouldn't affect the common case. I presume you ended up focusing on reducing the prefetch here because it was causing a noticeable performance problem in your environment. Do you have any data which shows how much this change improved things for your use case?

(4 resolved inline review comments on module/zfs/dmu_traverse.c)
@jsai20 force-pushed the traverse_visibp_optimization branch from 9870854 to 693b4c7 on April 13, 2021 14:50
@jsai20 (Contributor, Author) commented Apr 13, 2021

> The rationale for this seems reasonable to me. I've added some additional review comments. Thanks for providing that test data showing this shouldn't affect the common case. I presume you ended up focusing on reducing the prefetch here because it was causing a noticeable performance problem in your environment. Do you have any data which shows how much this change improved things for your use case?

Thanks @behlendorf for reviewing the changes.
Our internal performance team did some performance testing of this fix along with one more improvement I made, around accounting ARC usage by prefetch buffers: defining a minimum limit of prefetch buffers eligible to be skipped on eviction, effectively avoiding eviction of prefetch buffers while prefetch buffer usage is within the defined limit.
Results were positive and helped improve performance for both the primary workload and the send/recv workload in our test environment, for some workloads.
I don't have performance results with only this change; possibly, I could try to get consolidated results for both changes (this one and the other ARC change, as mentioned).
I am thinking of opening an improvement request for the other change too, maybe in the next couple of days.

@behlendorf added the Status: Accepted label and removed the Status: Code Review Needed label Apr 13, 2021
Traversal code, traverse_visitbp(), visits blocks recursively. An
indirect (non-L0) block of size 128K can contain 1024 block pointers
of 128 bytes each. A full traverse, or an incremental traverse where
all blocks were modified, can therefore visit a large number of blocks
pointed to by a single indirect block. The traversal code issues
prefetch for the blocks traversed below an indirect block, which can
queue a large number of async reads on the vdev queue. So, account for
the prefetch issued for blocks pointed to by an indirect block and
limit the maximum prefetch issued in one go.

Module Param:
zfs_traverse_indirect_prefetch_limit: limit on prefetch while traversing
an indirect block.

Local counters:
prefetched: counts the number of prefetches done.
pidx: index of the next block to be prefetched.
ptidx: index at which the next batch of prefetch is triggered.

Keep "ptidx" somewhere in the middle of the blocks already prefetched,
so that prefetched blocks get a large enough time window before their
demand reads are issued.

Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com>
Closes openzfs#11802
@jsai20 force-pushed the traverse_visibp_optimization branch from 693b4c7 to 18ca9ec on April 14, 2021 04:53
@behlendorf merged commit 08795ab into openzfs:master Apr 15, 2021
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Apr 21, 2021
@jsai20 deleted the traverse_visibp_optimization branch April 23, 2021 11:51
ghost pushed a commit to truenas/zfs that referenced this pull request May 6, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 6, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 6, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 7, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 10, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 10, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 10, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 13, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 25, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 27, 2021
sempervictus pushed a commit to sempervictus/zfs that referenced this pull request May 31, 2021
ahrens added a commit to ahrens/zfs that referenced this pull request Mar 10, 2023
When traversing a tree of block pointers (e.g. for `zfs destroy <fs>` or
`zfs send`), we prefetch the indirect blocks that will be needed, in
`traverse_prefetch_metadata()`.  In the case of `zfs destroy <fs>`, we
do a little traversing each txg, and resume the traversal the next txg.
So the indirect blocks that will be needed, and thus are candidates for
prefetching, do not include blocks that are before the resume point.

The problem is that the logic for determining if the indirect blocks are
before the resume point is incorrect, causing the (up to 1024) L1
indirect blocks that are inside the first L2 to not be prefetched.  In
practice, if we are able to read many more than 1024 blocks per txg,
then this will be inconsequential.  But if i/o latency is more than a
few milliseconds, almost no L1's will be prefetched, so they will be
read serially, and thus the destroying will be very slow.  This can be
observed as `zpool get freeing` decreasing very slowly.

Specifically: When we first examine the L2 that contains the block we'll
be resuming from, we have not yet resumed, so `td_resume` is nonzero.
At this point, all calls to `traverse_prefetch_metadata()` will fail,
even if the L1 in question is after the resume point.  It isn't until
the callback is issued for the resume point that we zero out
`td_resume`, but by this point we've already attempted and failed to
prefetch everything under this L2 indirect block.

This commit addresses the issue by reusing the existing
`resume_skip_check()` to determine if the L1's bookmark is before or
after the resume point.  To do so, this function is made non-mutating
(the caller now zeros `td_resume`).

Note, this bug likely predates (was not introduced by) openzfs#11803.

Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
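
A standalone toy model of the gating bug described above (illustrative only; plain indices stand in for ZFS bookmarks, and this is not the actual resume_skip_check() or traverse_prefetch_metadata() interface):

    #include <stdio.h>

    #define NL1 1024        /* L1 indirect blocks under one L2 */

    int main(void)
    {
            int resume = 700;       /* resume point within this L2's children */
            int td_resume = 1;      /* nonzero until the resume callback runs */
            int buggy = 0, fixed = 0;

            /*
             * The prefetch pass over all L1s happens when the L2 is first
             * examined, i.e. before td_resume has been zeroed.
             */
            for (int l1 = 0; l1 < NL1; l1++) {
                    if (td_resume == 0)     /* buggy gate: always fails here */
                            buggy++;
                    if (l1 >= resume)       /* fixed gate: compare bookmarks */
                            fixed++;
            }
            /* Prints: L1 prefetches issued: buggy=0, fixed=324 */
            printf("L1 prefetches issued: buggy=%d, fixed=%d\n", buggy, fixed);
            return (0);
    }

With the buggy gate, none of the 1024 L1s under the first L2 are prefetched, so they end up being read serially, which matches the slow `zpool get freeing` behavior described in the commit.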
ahrens added a commit to ahrens/zfs that referenced this pull request Mar 11, 2023
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Mar 19, 2023
behlendorf pushed a commit that referenced this pull request Mar 24, 2023
Closes #14603
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Mar 26, 2023
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Mar 26, 2023
pcd1193182 pushed a commit to pcd1193182/zfs that referenced this pull request Sep 26, 2023