Increase L2ARC write rate and headroom #15457

Merged: 1 commit, Nov 9, 2023

Conversation

shodanshok
Contributor

@shodanshok shodanshok commented Oct 26, 2023

Current L2ARC write rate and headroom parameters are very conservative:
l2arc_write_max=8M and l2arc_headroom=2 (ie: a full L2ARC writes at
8 MB/s, scanning 16/32 MB of ARC tail each time; a warming L2ARC runs
at 2x these rates).

These values were selected 15+ years ago based on then-current SSD
size, performance and endurance. Today we have multi-TB, fast and
cheap SSDs which can sustain much higher read/write rates.

For this reason, this patch increases l2arc_write_max to 32M and
l2arc_headroom to 8 (4x increase for both).
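
For reference, a minimal sketch of how the two tunables could be inspected on a running system. It assumes a Linux host where OpenZFS exposes module parameters as files under /sys/module/zfs/parameters (other platforms use different mechanisms); the helper names are illustrative and not part of the patch.

```python
from pathlib import Path

# Assumption: Linux OpenZFS exposes module parameters as plain-text files here.
PARAM_DIR = Path("/sys/module/zfs/parameters")

# Defaults before and after this patch, as stated in the description above.
OLD_DEFAULTS = {"l2arc_write_max": 8 * 1024**2, "l2arc_headroom": 2}
NEW_DEFAULTS = {"l2arc_write_max": 32 * 1024**2, "l2arc_headroom": 8}


def read_tunables(names=tuple(NEW_DEFAULTS), param_dir=PARAM_DIR):
    """Return {name: int} for every tunable file that exists and parses."""
    values = {}
    for name in names:
        try:
            values[name] = int((Path(param_dir) / name).read_text().strip())
        except (OSError, ValueError):
            continue  # module not loaded, or unexpected file content
    return values


def compare_with_defaults(values):
    """Classify each value as 'old default', 'new default' or 'custom'."""
    report = {}
    for name, value in values.items():
        if value == NEW_DEFAULTS.get(name):
            report[name] = "new default"
        elif value == OLD_DEFAULTS.get(name):
            report[name] = "old default"
        else:
            report[name] = "custom"
    return report


if __name__ == "__main__":
    current = read_tunables()
    for name, status in compare_with_defaults(current).items():
        print(f"{name} = {current[name]} ({status})")
```

On a host without the ZFS module loaded, `read_tunables()` simply returns an empty dict and nothing is printed.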


@amotin
Member

amotin commented Oct 26, 2023

  • I support the write size increase, it is definitely overdue. The only question is how high to set it. It would be good to recalculate it into a TBW value for a typical drive over its lifetime, considering worst-case 24x7 operation.
  • Completely disabling headroom, though, I consider dangerous: we do not want to repeatedly scan a terabyte of ARC in 4KB blocks where almost nothing is L2ARC-eligible. There should remain some safety barrier.

@shodanshok
Contributor Author

This patch is in draft form not because it does anything complex (it just changes two constants used for default values), but because I would like to get feedback from others.

I deployed increased l2arc_write_max and l2arc_headroom=0 on some KVM servers with very good results. Current enterprise TLC SSDs have endurance to spare as L2ARC devices and even consumer SSDs are more than enough, so these changes should not pose any practical issues to device lifetime. L2ARC hit rate is very good, for example:

L2ARC breakdown:                                                    4.0M
        Hit ratio:                                     89.7 %       3.6M
        Miss ratio:                                    10.3 %     411.2k
        Feeds:                                                    780.2k

VMs deployed on such servers "feel" much more SSD-backed, even after a host reboot.

However, these KVM hosts have only 64-192 GB of RAM, so I don't really know whether l2arc_headroom=0 is appropriate for bigger machines.

Thanks.

@amotin
Member

amotin commented Oct 26, 2023

I think we've already discussed that before. L2ARC was designed to cache data that are about to be evicted from ARC. Headroom controls how much more data we expect to be evicted from ARC per second, and therefore how much L2ARC should care about. If the buffers first in line for eviction belong to some other pool or are not L2ARC-eligible (IIRC we've discussed that during ARC warmup too much data was ineligible due to ongoing prefetch), then L2ARC does not need to write anything and should not need to look deeper; it should stop. This logic is still valid when ARC is warm. We can discuss how good an idea it is to write to L2ARC while ARC is not yet full, and what we should do with prefetched data and headroom in that case, but the fix here would likely be not a blind headroom disable, but some changes to the code logic. That is my feedback.

@amotin
Member

amotin commented Oct 26, 2023

Just on the level of ideas: when persistent L2ARC is enabled, while ARC is still cold and L2ARC is not full, L2ARC could write only MFU buffers and without headroom. That would give persistent L2ARC a boost of the most useful data in case of a reboot. After ARC has warmed up, operation could return to the original algorithm, including headroom.

@shodanshok
Contributor Author

I support the write size increase, it is definitely overdue. The only question is how high to set it. It would be good to recalculate it into a TBW value for a typical drive over its lifetime, considering worst-case 24x7 operation.

I find a plain TBW value to be overly pessimistic, as the cache device is not going to write at full speed all the time. At the current 8 MB/s, a worst-case estimate is 8 * 86400 * 365 MB, roughly 240 TiB (about 250 TB) per year, while the SSDs of one KVM server (2x 500 GB Samsung 850 EVO) are 6 years old and each has written a total of ~60.5 TB (only ~10 TB/year). Since the last reboot, 9 days ago:

                                      capacity     operations     bandwidth 
pool                                alloc   free   read  write   read  write
cache                                   -      -      -      -      -      -
  pci-0000:01:00.1-ata-3.0-part5     261G   159G      2      2   166K   136K
  pci-0000:01:00.1-ata-4.0-part5     255G   165G      2      2   161K   136K

As a side note, this server has been running with l2arc_write_max=268435456 (256M) for at least 6 months.
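
The worst-case endurance estimate above can be reproduced with a few lines of arithmetic. This is a rough sketch that assumes the device writes continuously at l2arc_write_max for a whole year and reads the "M" suffix as MiB; the function names are illustrative.

```python
SECONDS_PER_YEAR = 86400 * 365
MIB = 1024 ** 2
TIB = 1024 ** 4


def worst_case_tib_per_year(write_max_bytes):
    """TiB written in a year if the device writes at write_max nonstop."""
    return write_max_bytes * SECONDS_PER_YEAR / TIB


def observed_tb_per_year(total_tb_written, years_in_service):
    """Average yearly writes taken from the drive's lifetime counter."""
    return total_tb_written / years_in_service


for label, rate in (("8M", 8 * MIB), ("32M", 32 * MIB), ("256M", 256 * MIB)):
    print(f"l2arc_write_max={label}: {worst_case_tib_per_year(rate):.1f} TiB/year worst case")

# Figures reported for the 850 EVO drives: ~60.5 TB written over 6 years.
print(f"observed: {observed_tb_per_year(60.5, 6):.1f} TB/year")
```

The gap between the theoretical ceiling and the observed figure is the point being made: the feed thread rarely has a full write_max worth of eligible data to write.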

Completely disabling headroom, though, I consider dangerous: we do not want to repeatedly scan a terabyte of ARC in 4KB blocks where almost nothing is L2ARC-eligible. There should remain some safety barrier.

I share that concern, even if on these 64-192 GB servers I did not see anything wrong. Maybe because I am using 128K recordsize? Anyway, anything scanning 1-4 GB of ARC should be OK as l2arc_headroom.

I think we've already discussed that before.

Maybe in #15201?

If the buffers first in line for eviction belong to some other pool or are not L2ARC-eligible (IIRC we've discussed that during ARC warmup too much data was ineligible due to ongoing prefetch), then L2ARC does not need to write anything and should not need to look deeper; it should stop. This logic is still valid when ARC is warm.

Is this the current logic? I don't remember the feed thread doing that (stopping after some ineligible buffers are found).

the fix here would likely be not a blind headroom disable

I agree. At the same time, I remember this very useful comment #15201 (comment) stating that the ARC sublists only contain eligible buffers, so the feed thread should not really scan the entire ARC.

Thanks.

@amotin
Member

amotin commented Oct 26, 2023

If the buffers first in line for eviction belong to some other pool or are not L2ARC-eligible (IIRC we've discussed that during ARC warmup too much data was ineligible due to ongoing prefetch), then L2ARC does not need to write anything and should not need to look deeper; it should stop. This logic is still valid when ARC is warm.

Is this the current logic? I don't remember the feed thread doing that (stopping after some ineligible buffers are found).

The feed thread scans up to headroom, but skips ineligible buffers. If none of the scanned buffers are eligible, nothing will be written.

the fix here would likely be not a blind headroom disable

I agree. At the same time, I remember this very useful comment #15201 (comment) stating that the ARC sublists only contain eligible buffers, so the feed thread should not really scan the entire ARC.

The sublists contain buffers eligible for eviction. That does not mean they are all eligible for L2ARC: some may already be in L2ARC, some may belong to a different pool, some are from datasets with secondarycache disabled, some are prefetches.
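
The behaviour described in this comment can be illustrated with a toy model. This is only a sketch: `Buf`, `l2arc_eligible` and `feed_once` are hypothetical names, the real feed thread works on ARC headers under locks, and the per-pass write limit is simplified to write_max.

```python
from dataclasses import dataclass


@dataclass
class Buf:
    """Hypothetical stand-in for an ARC buffer header."""
    size: int
    pool: str = "tank"
    in_l2arc: bool = False
    secondarycache: bool = True   # dataset allows L2ARC caching
    prefetch: bool = False


def l2arc_eligible(buf, pool):
    """Apply the four ineligibility reasons listed above."""
    return (buf.pool == pool and not buf.in_l2arc
            and buf.secondarycache and not buf.prefetch)


def feed_once(eviction_tail, pool, write_max, headroom):
    """One simplified feed pass over buffers ordered from the eviction end.

    Scans at most write_max * headroom bytes, skipping ineligible buffers,
    and selects eligible ones until write_max bytes are queued.
    Returns (selected_buffers, bytes_scanned).
    """
    scan_budget = write_max * headroom
    scanned = queued = 0
    selected = []
    for buf in eviction_tail:
        if scanned + buf.size > scan_budget:
            break                      # headroom exhausted: stop looking deeper
        scanned += buf.size
        if not l2arc_eligible(buf, pool):
            continue                   # skipped, but still counts as scanned
        if queued + buf.size > write_max:
            break                      # this pass is full
        selected.append(buf)
        queued += buf.size
    return selected, scanned


# Eight prefetched buffers sit in front of one eligible buffer.
tail = [Buf(1, prefetch=True) for _ in range(8)] + [Buf(1)]
selected, scanned = feed_once(tail, "tank", write_max=4, headroom=2)
print(len(selected), scanned)
```

In this example the whole headroom is spent on prefetched buffers, so the eligible buffer behind them is never reached and nothing is written in this pass.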

@shodanshok
Contributor Author

The feed thread scans up to headroom, but skips ineligible buffers. If none of the scanned buffers are eligible, nothing will be written.

Ok, sure, I misunderstood the previous post.

The sublists contain buffers eligible for eviction. That does not mean they are all eligible for L2ARC: some may already be in L2ARC, some may belong to a different pool, some are from datasets with secondarycache disabled, some are prefetches.

You are right.

I agree that completely disabling the headroom limit can be too much. At the same time, I am somewhat surprised that I never saw the feed thread cause any significant load, even on servers with l2arc_headroom=0. This is probably due to limited memory and the default recordsize (128K).

What about setting l2arc_headroom=32? If you feel that is reasonable, I can update this patch.

Thanks.

@amotin
Member

amotin commented Oct 27, 2023

What about setting l2arc_headroom=32? If you feel that is reasonable, I can update this patch.

With the new write limit it would mean up to 1GB/s of scanned buffers, or up to 4GB/s considering boosts due to compressed and cold ARC, or up to 16GB/s considering all traversed lists. Sure such write speeds are reachable in real life, but not by every system. Also not every system has so much ARC in general. This value would not be completely insane, but feels quite aggressive.

But before that I would prefer some code review/cleanup to be done there. I don't see the sense of l2arc_headroom_boost these days. I think in the case of compressed ARC we should just measure the headroom in terms of HDR_GET_PSIZE(), not HDR_GET_LSIZE(). That would match both how much we write to the L2ARC and how much we evict from ARC. With better math we could reduce headroom by dropping the compression boost and adjusting only the general one.
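
A sketch of the arithmetic behind the 1 GB/s, 4 GB/s and 16 GB/s figures above. The 2x factors for compressed and cold ARC and the factor of 4 for the traversed lists are inferred from the numbers quoted in this thread, not read from the source, so treat them as assumptions.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def scan_upper_bounds(write_max, headroom,
                      compress_boost=2, cold_boost=2, lists=4):
    """Return (base, boosted, all_lists) scan bounds in bytes per second.

    base      = write_max * headroom
    boosted   = base with the compressed-ARC and cold-ARC boosts applied
    all_lists = boosted, multiplied by the number of traversed lists
    The three multipliers are assumptions taken from the thread's figures.
    """
    base = write_max * headroom
    boosted = base * compress_boost * cold_boost
    return base, boosted, boosted * lists


for headroom in (8, 32):
    base, boosted, total = scan_upper_bounds(32 * MIB, headroom)
    print(f"headroom={headroom}: {base / GIB:g} / {boosted / GIB:g} / {total / GIB:g} GiB/s")
```

With l2arc_write_max=32M, a headroom of 32 gives 1, 4 and 16 GiB/s, while a headroom of 8 gives 0.25, 1 and 4 GiB/s.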

@shodanshok
Contributor Author

With the new write limit it would mean up to 1GB/s of scanned buffers, or up to 4GB/s considering boosts due to compressed and cold ARC, or up to 16GB/s considering all traversed lists. Sure such write speeds are reachable in real life, but not by every system. Also not every system has so much ARC in general. This value would not be completely insane, but feels quite aggressive.

Yes, it would remain quite aggressive. Maybe a safer approach is the simpler one: as I increased l2arc_write_max by 4x, let l2arc_headroom be increased by the same 4x (instead of the proposed 16x). This means 256 MB/s of scanned buffers per sublist in steady state, and up to 1 GB/s per sublist in case of cold and compressed ARC.

But before that I would prefer some code review/cleanup to be done there. I don't see the sense of l2arc_headroom_boost these days. I think in the case of compressed ARC we should just measure the headroom in terms of HDR_GET_PSIZE(), not HDR_GET_LSIZE(). That would match both how much we write to the L2ARC and how much we evict from ARC. With better math we could reduce headroom by dropping the compression boost and adjusting only the general one.

I think the general idea was "if compression is enabled, consider a 2x data reduction rate". Better math would be fine, but as a hand-wave rule I find it quite reasonable.

As current values are so undersized, I am updating this PR with l2arc_write_max=32M and l2arc_headroom=8, hoping they will be more appropriate for modern SSDs.

Thanks.

@amotin
Member

amotin commented Oct 27, 2023

I think the general idea was "if compression is enabled, consider a 2x data reduction rate". Better math would be fine, but as a hand-wave rule I find it quite reasonable.

If ARC is compressed, then we write the data to L2ARC exactly as they are in ARC. We do not need to guess, we know the exact physical size.

As current values are so undersized, I am updating this PR with l2arc_write_max=32M and l2arc_headroom=8, hoping they will be more appropriate for modern SSDs.

I have no objections.
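
To illustrate the point about knowing the exact physical size, here is a toy comparison of the two ways of spending a headroom budget: counting logical sizes against a headroom doubled by a fixed 2x boost, versus counting physical sizes against the plain headroom. The buffer sizes and the 4:1 compression ratio are made up for the example, and `buffers_within` is a hypothetical helper, not OpenZFS code.

```python
KIB = 1024
MIB = 1024 ** 2


def buffers_within(sizes, budget):
    """Count how many leading buffers fit within a byte budget."""
    total = count = 0
    for size in sizes:
        if total + size > budget:
            break
        total += size
        count += 1
    return count


# Hypothetical ARC tail: 128 KiB logical records stored compressed at 32 KiB (4:1).
tail = [(128 * KIB, 32 * KIB)] * 64   # (logical size, physical size)
headroom = 1 * MIB

# Rule of thumb: count logical sizes against a headroom doubled by a fixed 2x boost.
by_lsize = buffers_within([lsize for lsize, _ in tail], headroom * 2)

# Exact accounting: count physical sizes against the plain headroom.
by_psize = buffers_within([psize for _, psize in tail], headroom)

print(by_lsize, by_psize)  # 16 32
```

With data that compresses 4:1 the fixed boost covers half as many buffers as the physical sizes would justify; with incompressible data it would cover twice as many.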

@shodanshok shodanshok marked this pull request as ready for review October 27, 2023 17:14
@shodanshok
Contributor Author

shodanshok commented Oct 27, 2023

I just updated the man page.

The above "cold and compressed ARC" calculation was done considering a 2x boost from a cold ARC, which is not actually true. Do you think I should set l2arc_write_boost the same as l2arc_write_max (32M)?

EDIT: no, I'm wrong, l2arc_write_boost is defined the same as l2arc_write_max. I will re-update the man page to reflect the new values.

@shodanshok
Contributor Author

I see some CI tests failing... can the failures be related to this patch?

@behlendorf
Contributor

I see some CI tests failing... can the failures be related to this patch?

It looks like it could be due to the pool layouts for some of the test cases. I do see the following warning in the CI console logs before the failures. Although, based on the log message it should have capped this to something safe.

[ 3254.330747] NOTICE: l2arc_write_max or l2arc_write_boost plus the overhead of
log blocks (persistent L2ARC, 0 bytes) exceeds the size of the cache device (guid 14312536273815924369),
resetting them to the default (33554432)

@shodanshok
Contributor Author

I see some CI tests failing... can the failures be related to this patch?

It looks like it could be due to the pool layouts for some of the test cases. I do see the following warning in the CI console logs before the failures. Although, based on the log message it should have capped this to something safe.

[ 3254.330747] NOTICE: l2arc_write_max or l2arc_write_boost plus the overhead of
log blocks (persistent L2ARC, 0 bytes) exceeds the size of the cache device (guid 14312536273815924369),
resetting them to the default (33554432)

Interesting. Do you think it is an issue with the test suite, or should I implement a cap for l2arc_write_max in the code itself?

Thanks.

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Nov 1, 2023
@behlendorf
Contributor

It's surprising. I've resubmitted those CI runs, let's see how reproducible it is.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Nov 7, 2023
@behlendorf behlendorf merged commit 887a3c5 into openzfs:master Nov 9, 2023
18 of 19 checks passed
amotin added a commit to amotin/zfs that referenced this pull request Nov 13, 2023
PR openzfs#15457 exposed weird logic in L2ARC write sizing. If it appeared
bigger than device size, instead of limiting write it reset all the
system-wide tunables to their default.  Aside from being excessive,
it did not actually help with the problem, still allowing an infinite
loop to happen.

This patch removes the tunables reverting logic, but instead limits
L2ARC writes (or at least eviction/trim) to 1/4 of the capacity.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
amotin added a commit to amotin/zfs that referenced this pull request Nov 14, 2023
behlendorf pushed a commit that referenced this pull request Nov 14, 2023
PR #15457 exposed weird logic in L2ARC write sizing. If it appeared
bigger than device size, instead of limiting write it reset all the
system-wide tunables to their default.  Aside from being excessive,
it did not actually help with the problem, still allowing an infinite
loop to happen.

This patch removes the tunables reverting logic, but instead limits
L2ARC writes (or at least eviction/trim) to 1/4 of the capacity.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #15519
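
As a rough illustration of the limit this follow-up commit describes, capping a write pass at one quarter of the cache device capacity could look like the sketch below. The function name is hypothetical, the log block overhead mentioned in the kernel message is ignored, and the 64 MiB device size is an assumed example value.

```python
MIB = 1024 ** 2


def clamp_write_size(requested, device_capacity):
    """Cap a single L2ARC write pass at one quarter of the device capacity."""
    return min(requested, device_capacity // 4)


# A tiny cache device (assumed 64 MiB) versus a roomier 1 GiB one.
print(clamp_write_size(32 * MIB, 64 * MIB) // MIB)    # 16
print(clamp_write_size(32 * MIB, 1024 * MIB) // MIB)  # 32
```

Unlike resetting the tunables, a clamp of this kind affects only the undersized device and leaves the system-wide settings alone.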
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Dec 12, 2023
Current L2ARC write rate and headroom parameters are very conservative:
l2arc_write_max=8M and l2arc_headroom=2 (ie: a full L2ARC writes at
8 MB/s, scanning 16/32 MB of ARC tail each time; a warming L2ARC runs
at 2x these rates).

These values were selected 15+ years ago based on then-current SSD
size, performance and endurance. Today we have multi-TB, fast and
cheap SSDs which can sustain much higher read/write rates.

For this reason, this patch increases l2arc_write_max to 32M and
l2arc_headroom to 8 (4x increase for both).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Gionatan Danti <g.danti@assyoma.it>
Closes openzfs#15457
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Dec 12, 2023
mmatuska pushed a commit to mmatuska/zfs that referenced this pull request Dec 27, 2023
behlendorf pushed a commit that referenced this pull request Jan 9, 2024
@shodanshok shodanshok deleted the l2tune branch June 29, 2024 17:22
@shodanshok shodanshok mentioned this pull request Nov 9, 2024
13 tasks
@adamdmoss
Contributor

IMHO this is not only sensible but perhaps even still too conservative. But a definite start!

ptr1337 pushed a commit to CachyOS/zfs that referenced this pull request Nov 14, 2024
ptr1337 pushed a commit to CachyOS/zfs that referenced this pull request Nov 21, 2024