ZFS traverse_visitbp optimization to limit prefetch. #11803

Merged: 1 commit merged into openzfs:master on Apr 15, 2021

Conversation

@jsai20 (Contributor) commented Mar 26, 2021

Traversal code, traverse_visitbp(), visits blocks recursively. An
indirect (non-L0) block of size 128K can contain 1024 block pointers
of 128 bytes each. A full traverse, or an incremental traverse where
all blocks were modified, can therefore visit a large number of blocks
pointed to by a single indirect block. The traversal code issues
prefetch for the blocks traversed below an indirect block, which can
queue a large number of async reads on the vdev queue. So, account for
the prefetch issued for blocks pointed to by an indirect block and
limit the maximum prefetch issued in one go.

Module Param:
zfs_traverse_indirect_prefetch_limit: limit on prefetch while traversing
an indirect block

Local counters:
prefetched: counts the number of prefetches done.
pidx: index of the next block to be prefetched.
ptidx: index at which the next batch of prefetch is triggered.

Keep "ptidx" somewhere in the middle of the blocks already prefetched,
so that prefetched blocks get a large enough time window before their
demand reads are issued.

Signed-off-by: Jitendra Patidar jitendra.patidar@nutanix.com
Closes #11802
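
For illustration, a minimal standalone sketch of the windowed prefetch these counters describe (this is not the actual dmu_traverse.c code; the helper names, the default limit of 32, and the exact trigger arithmetic are assumptions based on the description above):

    /* Toy model of the prefetch window in traverse_visitbp(). */
    #include <stdio.h>

    #define NCHILDREN 1024  /* blkptrs in a 128K indirect block */

    static void prefetch(int idx) { printf("prefetch child %4d\n", idx); }
    static void demand_read(int idx) { (void)idx; /* recursive visit */ }

    int main(void)
    {
            int prefetchlimit = 32; /* zfs_traverse_indirect_prefetch_limit */
            int prefetched = 0;     /* prefetches issued so far */
            int pidx = 0;           /* next child index to prefetch */
            int ptidx = 0;          /* demand index triggering more prefetch */

            for (int i = 0; i < NCHILDREN; i++) {
                    if (prefetchlimit != 0 && i == ptidx) {
                            /* Issue at most prefetchlimit prefetches ahead. */
                            while (pidx < NCHILDREN &&
                                pidx < i + prefetchlimit) {
                                    prefetch(pidx);
                                    pidx++;
                                    prefetched++;
                            }
                            /*
                             * Trigger the next batch when the demand reads
                             * reach the middle of the window just issued, so
                             * prefetch I/O keeps a head start over them.
                             */
                            ptidx = i + (pidx - i) / 2 + 1;
                    }
                    demand_read(i);
            }
            printf("total prefetched: %d of %d\n", prefetched, NCHILDREN);
            return (0);
    }

With a limit of 32, each refill issues roughly 16 new prefetches, so between about 16 and 32 reads are in flight ahead of the demand position, instead of all 1024 children of an indirect block being queued at once.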

@behlendorf added the Status: Code Review Needed label Mar 26, 2021
@behlendorf requested a review from amotin March 26, 2021 17:42
@ahrens self-requested a review March 26, 2021 18:09
@adamdmoss (Contributor) commented

I gave this a spin for a couple of days. Nothing exploded. Not sure I could measure a perf/mem gain/loss in casual use.

I have a very mild reservation, which is that I would expect users to tune prefetch tunables according to how their userspace workload behaves, i.e. how useful prefetch appears to be under reasonably predictable userspace test cases. Whereas sanely tuning the traversal limit seems rather difficult, and depends on opaque ZFS structural decisions.

@jsai20 force-pushed the traverse_visibp_optimization branch from a614ff8 to a803e65 on March 30, 2021 06:47
@jsai20 (Contributor, Author) commented Mar 30, 2021

> I gave this a spin for a couple of days. Nothing exploded. Not sure I could measure a perf/mem gain/loss in casual use.
>
> I have a very mild reservation, which is that I would expect users to tune prefetch tunables according to how their userspace workload behaves, i.e. how useful prefetch appears to be under reasonably predictable userspace test cases. Whereas sanely tuning the traversal limit seems rather difficult, and depends on opaque ZFS structural decisions.

Thanks for the testing and your comment.
As per my understanding, on a system with limited resources (I/O bandwidth, system memory) shared between a block-traversal workload (like zfs diff or zfs send) and a primary user/application workload, it is good to limit the prefetch issued by traversal in one go to a reasonable number.
On a high-end system, which has very low I/O latency from the underlying storage and a large amount of system memory (a larger ARC), and where the primary workload is not consuming much of the available I/O bandwidth and ARC, it is probably OK to do bulk prefetch in one go, although I am not sure bulk prefetch is really necessary there.
In the current code, traversing an indirect block can bulk-prefetch up to 1024 blocks (as a 128K indirect block can contain 1024 block pointers). If bulk prefetch is needed for some user/application workload on a high-end system configuration, the current behavior can be restored by tuning "zfs_traverse_indirect_prefetch_limit" to a value >= 1024.

@jsai20 force-pushed the traverse_visibp_optimization branch from a803e65 to 2167ff7 on March 30, 2021 08:19
@amotin (Member) left a comment

I see sense in this change. Originally, until d7958b4, ZFS used 8 times smaller indirect blocks (16KB instead of 128KB), which means 64 times less prefetch size here. We at iXsystems found this indirect block size pretty bad for random workloads, causing excessive read/write inflation, which is why we are still using 32KB now. But in the general case I think this prefetch limit idea is good. The implementation does not look great to me, though. Please see inline comments.

(5 resolved inline review comments on module/zfs/dmu_traverse.c)
@amotin (Member) commented Mar 30, 2021

> I have a very mild reservation, which is that I would expect users to tune prefetch tunables according to how their userspace workload behaves, i.e. how useful prefetch appears to be under reasonably predictable userspace test cases. Whereas sanely tuning the traversal limit seems rather difficult, and depends on opaque ZFS structural decisions.

@adamdmoss Unlike zfetch, this code is not about speculative prefetch, but more about read-ahead. As I understand it, most if not all blocks that are prefetched here will be read later; the only question is how much later. With the existing code it means 128MB per indirect block level (1024 pointers per 128K indirect block, each covering a 128K data block), which for, let's say, the L5 indirects of a huge file means 1024^5 data blocks ahead. That does not make much sense, and the data can get evicted from the ARC sooner than processing gets to it.

(1 resolved inline review comment on module/zfs/dmu_traverse.c)
@jsai20 (Contributor, Author) commented Mar 30, 2021

> I see sense in this change. Originally, until d7958b4, ZFS used 8 times smaller indirect blocks (16KB instead of 128KB), which means 64 times less prefetch size here. We at iXsystems found this indirect block size pretty bad for random workloads, causing excessive read/write inflation, which is why we are still using 32KB now. But in the general case I think this prefetch limit idea is good. The implementation does not look great to me, though. Please see inline comments.

Thanks for the review. Please see responses to comments inline.

@jsai20 force-pushed the traverse_visibp_optimization branch from 2167ff7 to 3417862 on March 31, 2021 05:09
(2 resolved inline review comments on module/zfs/dmu_traverse.c)
@jsai20 force-pushed the traverse_visibp_optimization branch 2 times, most recently from f0c25a4 to adb3783 on April 2, 2021 05:43
@amotin (Member) left a comment

I have no more objections. It would still be good to see some benchmarks showing that the reduced prefetch depth and removed prefetch of the first block do not cause performance regressions.

@jsai20 (Contributor, Author) commented Apr 6, 2021

> I have no more objections. It would still be good to see some benchmarks showing that the reduced prefetch depth and removed prefetch of the first block do not cause performance regressions.

Thanks @amotin for the review.
If you are OK with it, we can start prefetch from index 1 (removing the first-block prefetch) only when prefetchlimit == 1, and start at index 0 for everything else, so that any possibility of regression from removing the first-block prefetch is avoided:

    prefetchlimit = zfs_traverse_indirect_prefetch_limit;
    if (prefetchlimit == 1)
            pidx = 1;       /* skip prefetching the first block */
    else
            pidx = 0;       /* prefetch from the first block */

And, with this, setting the next value of "ptidx" inside the for loop would become: ptidx = MAX(pidx, i + 1);

@jsai20 force-pushed the traverse_visibp_optimization branch from adb3783 to df1a053 on April 6, 2021 12:46
@jsai20 (Contributor, Author) commented Apr 6, 2021

> I have no more objections. It would still be good to see some benchmarks showing that the reduced prefetch depth and removed prefetch of the first block do not cause performance regressions.

Just updated this change as proposed in my previous comment.

@amotin (Member) commented Apr 6, 2021

> Just updated this change.

I don't like this update. It complicates the code without proven motivation. Let's revert it and run at least some reasonable benchmarks comparing behavior with this patch and without.

@jsai20 (Contributor, Author) commented Apr 6, 2021

>> Just updated this change.
>
> I don't like this update. It complicates the code without proven motivation. Let's revert it and run at least some reasonable benchmarks comparing behavior with this patch and without.

Ok.

@jsai20 jsai20 force-pushed the traverse_visibp_optimization branch from df1a053 to 9870854 Compare April 6, 2021 17:04
@jsai20 (Contributor, Author) commented Apr 7, 2021

Perf Testing:
This optimization (limiting the prefetch done while traversing indirect blocks) is primarily meant to minimize negative impact on primary workloads running on the system in parallel with block-traversal workloads (zfs send). Such negative impact is possible due to the bulk prefetch the traversal code does while traversing indirect blocks.
And since the necessary, limited prefetch is still done by the traversal code, it shouldn't have any negative impact (no regression) on zfs send performance itself.

Ran a couple of basic test scenarios to validate that there is no negative impact (no regression) on zfs send with this optimization.

Test System Memory and CPU Config:

Total online memory: 12G
CPU(s): 4
Model name: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz

Test Scenario 1:
Source dataset containing small files, with a small number of indirect blocks (L2 = 1, L1 = 2).
Without the optimization, two L1 blocks (at index 0 and index 1) get prefetched during traversal, while with the optimization only one block (at index 1, skipping the block at index 0) gets prefetched.
Took 2 runs both with and without the optimization to account for variation.
Averaging the numbers from the 2 runs, performance looks similar with and without the optimization.

$ sudo dd if=/dev/urandom of=/testpool/testds/134M-file bs=1k count=131200
131200+0 records in
131200+0 records out
134348800 bytes (134 MB) copied, 7.55323 s, 17.8 MB/s
$ ls -i /testpool/testds/134M-file
256 /testpool/testds/134M-file
$

$ sudo zdb -ddddd testpool/testds 256 |egrep " L0 " |wc -l
1025
$ sudo zdb -ddddd testpool/testds 256 |egrep " L1 " |wc -l
2
$ sudo zdb -ddddd testpool/testds 256 |egrep " L2 " |wc -l
1
$

$ for i in $(seq 1 200); do sudo dd if=/dev/urandom of=/testpool/testds/134M-file-$i bs=1k count=131200; done

$ ls /testpool/testds/
134M-file 134M-file-119 134M-file-14 134M-file-160 134M-file-181 134M-file-21 134M-file-42 134M-file-63 134M-file-84
134M-file-1 134M-file-12 134M-file-140 134M-file-161 134M-file-182 134M-file-22 134M-file-43 134M-file-64 134M-file-85
134M-file-10 134M-file-120 134M-file-141 134M-file-162 134M-file-183 134M-file-23 134M-file-44 134M-file-65 134M-file-86
134M-file-100 134M-file-121 134M-file-142 134M-file-163 134M-file-184 134M-file-24 134M-file-45 134M-file-66 134M-file-87
134M-file-101 134M-file-122 134M-file-143 134M-file-164 134M-file-185 134M-file-25 134M-file-46 134M-file-67 134M-file-88
134M-file-102 134M-file-123 134M-file-144 134M-file-165 134M-file-186 134M-file-26 134M-file-47 134M-file-68 134M-file-89
134M-file-103 134M-file-124 134M-file-145 134M-file-166 134M-file-187 134M-file-27 134M-file-48 134M-file-69 134M-file-9
134M-file-104 134M-file-125 134M-file-146 134M-file-167 134M-file-188 134M-file-28 134M-file-49 134M-file-7 134M-file-90
134M-file-105 134M-file-126 134M-file-147 134M-file-168 134M-file-189 134M-file-29 134M-file-5 134M-file-70 134M-file-91
134M-file-106 134M-file-127 134M-file-148 134M-file-169 134M-file-19 134M-file-3 134M-file-50 134M-file-71 134M-file-92
134M-file-107 134M-file-128 134M-file-149 134M-file-17 134M-file-190 134M-file-30 134M-file-51 134M-file-72 134M-file-93
134M-file-108 134M-file-129 134M-file-15 134M-file-170 134M-file-191 134M-file-31 134M-file-52 134M-file-73 134M-file-94
134M-file-109 134M-file-13 134M-file-150 134M-file-171 134M-file-192 134M-file-32 134M-file-53 134M-file-74 134M-file-95
134M-file-11 134M-file-130 134M-file-151 134M-file-172 134M-file-193 134M-file-33 134M-file-54 134M-file-75 134M-file-96
134M-file-110 134M-file-131 134M-file-152 134M-file-173 134M-file-194 134M-file-34 134M-file-55 134M-file-76 134M-file-97
134M-file-111 134M-file-132 134M-file-153 134M-file-174 134M-file-195 134M-file-35 134M-file-56 134M-file-77 134M-file-98
134M-file-112 134M-file-133 134M-file-154 134M-file-175 134M-file-196 134M-file-36 134M-file-57 134M-file-78 134M-file-99
134M-file-113 134M-file-134 134M-file-155 134M-file-176 134M-file-197 134M-file-37 134M-file-58 134M-file-79
134M-file-114 134M-file-135 134M-file-156 134M-file-177 134M-file-198 134M-file-38 134M-file-59 134M-file-8
134M-file-115 134M-file-136 134M-file-157 134M-file-178 134M-file-199 134M-file-39 134M-file-6 134M-file-80
134M-file-116 134M-file-137 134M-file-158 134M-file-179 134M-file-2 134M-file-4 134M-file-60 134M-file-81
134M-file-117 134M-file-138 134M-file-159 134M-file-18 134M-file-20 134M-file-40 134M-file-61 134M-file-82
134M-file-118 134M-file-139 134M-file-16 134M-file-180 134M-file-200 134M-file-41 134M-file-62 134M-file-83
$
$ sudo zfs snapshot testpool/testds@snap1

Without Optimization:
$ cat /sys/module/zfs/parameters/zfs_traverse_indirect_prefetch_limit
cat: /sys/module/zfs/parameters/zfs_traverse_indirect_prefetch_limit: No such file or directory

1st Run:
$ time sudo zfs send -v -c testpool/testds@snap1 >/dev/null
full send of testpool/testds@snap1 estimated size is 25.2G
total estimated size is 25.2G
TIME SENT SNAPSHOT testpool/testds@snap1
06:12:14 149M testpool/testds@snap1
06:12:15 253M testpool/testds@snap1
06:12:16 393M testpool/testds@snap1
06:12:17 523M testpool/testds@snap1
06:12:18 695M testpool/testds@snap1
….
06:12:34 1.80G testpool/testds@snap1
06:12:35 1.86G testpool/testds@snap1
06:12:36 1.96G testpool/testds@snap1
….
06:16:54 25.0G testpool/testds@snap1
06:16:55 25.0G testpool/testds@snap1
06:16:56 25.1G testpool/testds@snap1

real 4m44.335s
user 0m0.170s
sys 0m8.695s

2nd run (After Reboot):
$ time sudo zfs send -v -c testpool/testds@snap1 >/dev/null
full send of testpool/testds@snap1 estimated size is 25.2G
total estimated size is 25.2G
TIME SENT SNAPSHOT testpool/testds@snap1
07:05:27 115M testpool/testds@snap1
07:05:28 244M testpool/testds@snap1
07:05:29 341M testpool/testds@snap1
07:05:30 416M testpool/testds@snap1
07:05:31 515M testpool/testds@snap1
….
07:05:50 1.68G testpool/testds@snap1
07:05:51 1.77G testpool/testds@snap1
07:05:52 1.82G testpool/testds@snap1
….
07:10:39 25.0G testpool/testds@snap1
07:10:40 25.1G testpool/testds@snap1
07:10:41 25.1G testpool/testds@snap1

real 5m16.421s
user 0m0.174s
sys 0m8.571s

With Optimization:
$ cat /sys/module/zfs/parameters/zfs_traverse_indirect_prefetch_limit
32

1st run:
$ time sudo zfs send -v -c testpool/testds@snap1 >/dev/null
full send of testpool/testds@snap1 estimated size is 25.2G
total estimated size is 25.2G
TIME SENT SNAPSHOT testpool/testds@snap1
06:34:34 169M testpool/testds@snap1
06:34:35 290M testpool/testds@snap1
06:34:36 384M testpool/testds@snap1
06:34:37 534M testpool/testds@snap1
06:34:38 686M testpool/testds@snap1
….
06:34:54 1.78G testpool/testds@snap1
06:34:55 1.84G testpool/testds@snap1
06:34:56 1.90G testpool/testds@snap1
….
06:39:15 25.0G testpool/testds@snap1
06:39:16 25.1G testpool/testds@snap1
06:39:17 25.2G testpool/testds@snap1

real 4m45.321s
user 0m0.165s
sys 0m8.708s

2nd run (After Reboot):
$ time sudo zfs send -v -c testpool/testds@snap1 >/dev/null
full send of testpool/testds@snap1 estimated size is 25.2G
total estimated size is 25.2G
TIME SENT SNAPSHOT testpool/testds@snap1
06:52:45 160M testpool/testds@snap1
06:52:46 320M testpool/testds@snap1
06:52:47 390M testpool/testds@snap1
06:52:48 522M testpool/testds@snap1
06:52:49 686M testpool/testds@snap1
…..
06:53:06 1.78G testpool/testds@snap1
06:53:07 1.83G testpool/testds@snap1
06:53:08 1.88G testpool/testds@snap1
…..
06:57:51 25.0G testpool/testds@snap1
06:57:52 25.0G testpool/testds@snap1
06:57:53 25.1G testpool/testds@snap1

real 5m9.334s
user 0m0.157s
sys 0m8.635s

Test Scenario 2:
Test dataset containing a large file, such that it has multiple L1 indirect blocks pointed to by a single L2.
This exercises the bulk prefetch done without the optimization versus the limited prefetch done with it.

Took 2 runs both with and without the optimization, to account for variation between runs.
Performance numbers are stable across runs and are the same both with and without the optimization.

$ sudo dd if=/dev/urandom of=/testpool/testds/large-file bs=1k count=26240000
26240000+0 records in
26240000+0 records out
26869760000 bytes (27 GB) copied, 1483.99 s, 18.1 MB/s
$
$ ls /testpool/testds/
large-file
$ ls -i /testpool/testds/large-file
2 /testpool/testds/large-file
$ sudo zdb -ddddd testpool/testds 2 |egrep " L0 " |wc -l
205000
$ sudo zdb -ddddd testpool/testds 2 |egrep " L1 " |wc -l
201
$ sudo zdb -ddddd testpool/testds 2 |egrep " L2 " |wc -l
1
$

Without Optimization:
$ cat /sys/module/zfs/parameters/zfs_traverse_indirect_prefetch_limit
cat: /sys/module/zfs/parameters/zfs_traverse_indirect_prefetch_limit: No such file or directory

1st Run:
$ time sudo zfs send -v -c testpool/testds@snap1 >/dev/null
full send of testpool/testds@snap1 estimated size is 25.1G
total estimated size is 25.1G
TIME SENT SNAPSHOT testpool/testds@snap1
07:54:52 82.9M testpool/testds@snap1
07:54:53 195M testpool/testds@snap1
07:54:54 269M testpool/testds@snap1
07:54:55 342M testpool/testds@snap1
07:54:56 420M testpool/testds@snap1
…..
07:55:13 1.72G testpool/testds@snap1
07:55:14 1.87G testpool/testds@snap1
07:55:15 1.99G testpool/testds@snap1
….
07:59:05 24.9G testpool/testds@snap1
07:59:06 24.9G testpool/testds@snap1
07:59:07 25.0G testpool/testds@snap1

real 4m17.322s
user 0m0.166s
sys 0m8.632s

2nd Run (After reboot):
$ time sudo zfs send -v -c testpool/testds@snap1 >/dev/null
full send of testpool/testds@snap1 estimated size is 25.1G
total estimated size is 25.1G
TIME SENT SNAPSHOT testpool/testds@snap1
08:06:03 79.6M testpool/testds@snap1
08:06:04 189M testpool/testds@snap1
08:06:05 266M testpool/testds@snap1
08:06:06 334M testpool/testds@snap1
08:06:07 408M testpool/testds@snap1
…..
08:06:23 1.67G testpool/testds@snap1
08:06:24 1.84G testpool/testds@snap1
08:06:25 1.98G testpool/testds@snap1
…..
08:10:20 24.9G testpool/testds@snap1
08:10:21 25.0G testpool/testds@snap1
08:10:22 25.1G testpool/testds@snap1

real 4m21.343s
user 0m0.187s
sys 0m8.701s

With Optimization:
$ cat /sys/module/zfs/parameters/zfs_traverse_indirect_prefetch_limit
32

1st Run:
$ time sudo zfs send -v -c testpool/testds@snap1 >/dev/null
full send of testpool/testds@snap1 estimated size is 25.1G
total estimated size is 25.1G
TIME SENT SNAPSHOT testpool/testds@snap1
08:18:13 85.8M testpool/testds@snap1
08:18:14 199M testpool/testds@snap1
08:18:15 278M testpool/testds@snap1
08:18:16 353M testpool/testds@snap1
08:18:17 426M testpool/testds@snap1
…..
08:18:30 1.34G testpool/testds@snap1
08:18:31 1.45G testpool/testds@snap1
08:18:32 1.62G testpool/testds@snap1
…..
08:22:30 24.9G testpool/testds@snap1
08:22:31 25.0G testpool/testds@snap1
08:22:32 25.0G testpool/testds@snap1

real 4m21.322s
user 0m0.162s
sys 0m8.604s

2nd Run (After Reboot):
$ time sudo zfs send -v -c testpool/testds@snap1 >/dev/null
full send of testpool/testds@snap1 estimated size is 25.1G
total estimated size is 25.1G
TIME SENT SNAPSHOT testpool/testds@snap1
08:38:08 77.3M testpool/testds@snap1
08:38:09 169M testpool/testds@snap1
08:38:10 252M testpool/testds@snap1
08:38:11 334M testpool/testds@snap1
08:38:12 405M testpool/testds@snap1
…..
08:38:28 1.66G testpool/testds@snap1
08:38:29 1.83G testpool/testds@snap1
08:38:30 1.97G testpool/testds@snap1
…..
08:42:26 24.9G testpool/testds@snap1
08:42:27 25.0G testpool/testds@snap1
08:42:28 25.1G testpool/testds@snap1

real 4m21.309s
user 0m0.157s
sys 0m8.525s

@behlendorf (Contributor) left a comment

The rationale for this seems reasonable to me. I've added some additional review comments. Thanks for providing that test data showing this shouldn't affect the common case. I presume you ended up focusing on reducing the prefetch here because it was causing a noticeable performance problem in your environment. Do you have any data which shows how much this change improved things for your use case?

(4 resolved inline review comments on module/zfs/dmu_traverse.c)
@jsai20 force-pushed the traverse_visibp_optimization branch from 9870854 to 693b4c7 on April 13, 2021 14:50
@jsai20 (Contributor, Author) commented Apr 13, 2021

> The rationale for this seems reasonable to me. I've added some additional review comments. Thanks for providing that test data showing this shouldn't affect the common case. I presume you ended up focusing on reducing the prefetch here because it was causing a noticeable performance problem in your environment. Do you have any data which shows how much this change improved things for your use case?

Thanks @behlendorf for reviewing the changes.
Our internal performance team did some performance testing of this fix along with one more improvement I made, around accounting ARC usage by prefetch buffers: defining a minimum limit of prefetch buffers eligible to be skipped on eviction, effectively avoiding eviction of prefetch buffers while prefetch buffer usage is within the defined limit.
Results were positive and helped improve performance for both the primary workload and the send/recv workload in our test environment, for some workloads.
I don't have performance results with only this change; possibly, I could try to get consolidated results for both changes (this one and the other ARC change, as mentioned).
I am thinking of opening an improvement request for the other change too, maybe in the next couple of days.

@behlendorf added the Status: Accepted label and removed the Status: Code Review Needed label Apr 13, 2021
Traversal code, traverse_visitbp(), visits blocks recursively. An
indirect (non-L0) block of size 128K can contain 1024 block pointers
of 128 bytes each. A full traverse, or an incremental traverse where
all blocks were modified, can therefore visit a large number of blocks
pointed to by a single indirect block. The traversal code issues
prefetch for the blocks traversed below an indirect block, which can
queue a large number of async reads on the vdev queue. So, account for
the prefetch issued for blocks pointed to by an indirect block and
limit the maximum prefetch issued in one go.

Module Param:
zfs_traverse_indirect_prefetch_limit: limit on prefetch while traversing
an indirect block.

Local counters:
prefetched: counts the number of prefetches done.
pidx: index of the next block to be prefetched.
ptidx: index at which the next batch of prefetch is triggered.

Keep "ptidx" somewhere in the middle of the blocks already prefetched,
so that prefetched blocks get a large enough time window before their
demand reads are issued.

Signed-off-by: Jitendra Patidar <jitendra.patidar@nutanix.com>
Closes openzfs#11802
@jsai20 force-pushed the traverse_visibp_optimization branch from 693b4c7 to 18ca9ec on April 14, 2021 04:53
@behlendorf merged commit 08795ab into openzfs:master Apr 15, 2021
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Apr 21, 2021
@jsai20 deleted the traverse_visibp_optimization branch April 23, 2021 11:51
ghost pushed a commit to truenas/zfs that referenced this pull request May 6, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 6, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 6, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 7, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 10, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 10, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 10, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 13, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 25, 2021
ghost pushed a commit to truenas/zfs that referenced this pull request May 27, 2021
sempervictus pushed a commit to sempervictus/zfs that referenced this pull request May 31, 2021
ahrens added a commit to ahrens/zfs that referenced this pull request Mar 10, 2023
When traversing a tree of block pointers (e.g. for `zfs destroy <fs>` or
`zfs send`), we prefetch the indirect blocks that will be needed, in
`traverse_prefetch_metadata()`.  In the case of `zfs destroy <fs>`, we
do a little traversing each txg, and resume the traversal the next txg.
So the indirect blocks that will be needed, and thus are candidates for
prefetching, do not include blocks that are before the resume point.

The problem is that the logic for determining if the indirect blocks are
before the resume point is incorrect, causing the (up to 1024) L1
indirect blocks that are inside the first L2 to not be prefetched.  In
practice, if we are able to read many more than 1024 blocks per txg,
then this will be inconsequential.  But if i/o latency is more than a
few milliseconds, almost no L1's will be prefetched, so they will be
read serially, and thus the destroying will be very slow.  This can be
observed as `zpool get freeing` decreasing very slowly.

Specifically: When we first examine the L2 that contains the block we'll
be resuming from, we have not yet resumed, so `td_resume` is nonzero.
At this point, all calls to `traverse_prefetch_metadata()` will fail,
even if the L1 in question is after the resume point.  It isn't until
the callback is issued for the resume point that we zero out
`td_resume`, but by this point we've already attempted and failed to
prefetch everything under this L2 indirect block.

This commit addresses the issue by reusing the existing
`resume_skip_check()` to determine if the L1's bookmark is before or
after the resume point.  To do so, this function is made non-mutating
(the caller now zeros `td_resume`).

Note, this bug likely predates (was not introduced by) openzfs#11803.

Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
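
A standalone toy model of the gating bug described above (illustrative only; plain indices stand in for ZFS bookmarks, and this is not the actual resume_skip_check() or traverse_prefetch_metadata() interface):

    #include <stdio.h>

    #define NL1 1024        /* L1 indirect blocks under one L2 */

    int main(void)
    {
            int resume = 700;       /* resume point within this L2's children */
            int td_resume = 1;      /* nonzero until the resume callback runs */
            int buggy = 0, fixed = 0;

            /*
             * The prefetch pass over all L1s happens when the L2 is first
             * examined, i.e. before td_resume has been zeroed.
             */
            for (int l1 = 0; l1 < NL1; l1++) {
                    if (td_resume == 0)     /* buggy gate: always fails here */
                            buggy++;
                    if (l1 >= resume)       /* fixed gate: compare bookmarks */
                            fixed++;
            }
            /* Prints: L1 prefetches issued: buggy=0, fixed=324 */
            printf("L1 prefetches issued: buggy=%d, fixed=%d\n", buggy, fixed);
            return (0);
    }

With the buggy gate, none of the 1024 L1s under the first L2 are prefetched, so they end up being read serially, which matches the slow `zpool get freeing` behavior described in the commit.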
ahrens added a commit to ahrens/zfs that referenced this pull request Mar 11, 2023
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Mar 19, 2023
behlendorf pushed a commit that referenced this pull request Mar 24, 2023
Closes #14603
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Mar 26, 2023
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Mar 26, 2023
pcd1193182 pushed a commit to pcd1193182/zfs that referenced this pull request Sep 26, 2023