More adaptive ARC eviction. #14359

amotin · 2023-01-07T17:39:33Z

Traditionally, ARC adaptation was limited to the MRU/MFU distribution. But for years, people with metadata-centric workloads have demanded mechanisms to also manage the data/metadata distribution, which in original ZFS was just a FIFO. As a result, ZFS effectively got separate states for data and metadata, minimum and maximum metadata limits, etc., but it all required manual tuning, was not adaptive, and at its heart remained a bad FIFO.

This change removes most of the existing eviction logic, rewriting it from scratch. It makes MRU/MFU adaptation individual for data and metadata, and makes the distribution between data and metadata themselves adaptive in the same way. Since most of the required state separation was already done, it only required making the arcs_size state field specific per data/metadata.

The adaptation logic is still based on the previous concept of ghost hits; it just now balances ARC capacity between 4 states: MRU data, MRU metadata, MFU data and MFU metadata. To simplify arc_c changes, instead of arc_p measured in bytes, this code uses 3 variables, arc_meta, arc_pd and arc_pm, representing the ARC balance between metadata and data, between MRU and MFU for data, and between MRU and MFU for metadata respectively, as 32-bit fixed point fractions. Since we care about the math result only when we need to evict, this moves all the logic from arc_adapt() to arc_evict(). That reduces per-block overhead, since per-block operations are limited to stats collection, now moved from arc_adapt() to arc_access() and using cheaper wmsums. This also allows removing the ugly ARC_HDR_DO_ADAPT flag from many places.
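As a rough illustration of the fixed-point scheme described above, the sketch below splits a capacity into four per-state targets using three fractions. It is not the code from this PR: the `sketch_` names, the struct, the choice of 2^32 to represent 1.0, and the reading of each fraction as the metadata share or the MRU share are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 1.0 is represented as 2^32 (32 fractional bits). */
#define	SKETCH_FRAC_ONE	(1ULL << 32)

/* Hypothetical container for the four per-state targets, in bytes. */
struct sketch_targets {
	uint64_t mru_data;
	uint64_t mfu_data;
	uint64_t mru_meta;
	uint64_t mfu_meta;
};

/*
 * Return floor(bytes * frac / 2^32) without overflowing 64 bits.
 * The byte count is split into high and low 32-bit halves first.
 */
static uint64_t
sketch_frac_mul(uint64_t bytes, uint64_t frac)
{
	assert(frac <= SKETCH_FRAC_ONE);
	uint64_t hi = bytes >> 32;
	uint64_t lo = bytes & 0xffffffffULL;
	return (hi * frac + ((lo * frac) >> 32));
}

/*
 * Split capacity c using three fractions (assumed meaning):
 *   meta: share of c given to metadata
 *   pd:   share of the data part given to MRU
 *   pm:   share of the metadata part given to MRU
 */
static struct sketch_targets
sketch_split(uint64_t c, uint64_t meta, uint64_t pd, uint64_t pm)
{
	struct sketch_targets t;
	uint64_t meta_bytes = sketch_frac_mul(c, meta);
	uint64_t data_bytes = c - meta_bytes;

	t.mru_meta = sketch_frac_mul(meta_bytes, pm);
	t.mfu_meta = meta_bytes - t.mru_meta;
	t.mru_data = sketch_frac_mul(data_bytes, pd);
	t.mfu_data = data_bytes - t.mru_data;
	return (t);
}
```

Because the fractions are relative, changing the total capacity rescales all four targets without touching the three variables, which matches the stated goal of simplifying arc_c changes.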

This change also removes a number of metadata-specific tunables, some of which were actually not functioning correctly, since not all metadata are equal and some (like L2ARC headers) are not really evictable. Instead it introduces a single opaque knob, zfs_arc_meta_balance, which tunes ARC's reaction to ghost hits and allows the administrator to give more or less preference to metadata without setting strict limits.

Some old code parts like arc_evict_meta() are simply removed, because since the introduction of ABD ARC they really make no sense: only headers referenced by a small number of buffers are not evictable, and they are really not evictable no matter what this code does. Instead, just call arc_prune_async() if too much metadata appears not evictable.

How Has This Been Tested?

By manually simulating different access patterns, I was able to observe the expected arc_meta, arc_pd and arc_pm changes.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

adamdmoss · 2023-01-17T23:24:15Z

(sorry for the low-fidelity pic :) ) - I get a panic during import when I test this PR - looks like some incompatibility with L2ARC rebuild -

amotin · 2023-01-18T01:37:40Z

I get a panic during import when I test this PR - looks like some incompatibility with L2ARC rebuild

@adamdmoss Thank you for the report. It appears I unintentionally changed the persistent L2ARC on-disk format. I added a simple shim to fix it.

adamdmoss · 2023-01-18T02:16:57Z

Verified fixed - thanks!

devZer0 · 2023-02-01T15:12:53Z

@amotin , thanks for making this. i currently try to test this.

i have a question

manpage is telling:

 zfs_arc_meta_balance=500 (uint)
         Balance between metadata and data on ghost hits.  Values above 100 increase metadata caching by proportionally reducing effect of ghost data hits on target data/metadata rate.

what does a value of "500" exactly mean? what should it be set to in order to maximise arc being used for metadata and avoid metadata eviction? proportional relation to what?

amotin · 2023-02-01T16:22:07Z

what does a value of "500" exactly mean? what should it be set to in order to maximise arc being used for metadata and avoid metadata eviction? proportional relation to what?

@devZer0 It means data ghost hits cause a 5 times smaller reduction of the metadata cache than metadata ghost hits cause of the data cache. There is no upper limit. The higher you set it, the smaller the pressure on metadata will be. It is not absolute: some metadata will likely still be evicted, otherwise there would be no ghost state to indicate pressure. But after some time it should settle at a balance point where data and metadata ghost hits (read "almost a cache hit, but no") balance each other according to this coefficient. That is the whole point of being adaptive.
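To make the coefficient concrete, here is a minimal sketch of the weighting described above. It is not the actual ARC code: the function name and the idea of comparing raw ghost-hit counts directly are assumptions for illustration. Only the 100-based scale of zfs_arc_meta_balance is taken from the man page text.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: net pressure on the metadata target.
 * Positive means metadata should grow, negative means it should shrink.
 * balance mirrors zfs_arc_meta_balance: 100 weighs both kinds of ghost
 * hits equally, 500 makes data ghost hits count 5 times less.
 */
static int64_t
sketch_meta_pressure(uint64_t meta_ghost_hits, uint64_t data_ghost_hits,
    uint32_t balance)
{
	assert(balance > 0);
	/* Scale down data ghost hits; assumes counts small enough not to overflow. */
	uint64_t weighted_data = data_ghost_hits * 100 / balance;
	return ((int64_t)meta_ghost_hits - (int64_t)weighted_data);
}
```

In this sketch, with balance at 500, 100 metadata ghost hits offset 500 data ghost hits, so the pressure is zero at that ratio. With balance at 100, the same counts push the metadata target down.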

allanjude

Reviewed-by: Allan Jude <allan@klarasystems.com>

include/sys/arc_impl.h

module/zfs/arc.c

amotin · 2023-03-02T15:04:25Z

While there, I decided to remove the unusable spa argument from arc_evict_impl() and reorder the remaining ones more logically.

behlendorf · 2023-03-02T21:35:13Z

@ahrens @grwilson I'd like to integrate this ARC change early next week, after a long weekend of stress testing. If you have a chance to look it over before then that would be great.


amotin · 2023-03-06T18:49:53Z

I've noticed there is no longer any reason for arcstat_dnode_size to be an aggsum, since it is now read only once per arc_evict(), so I demoted it to a cheaper wmsum.

behlendorf · 2023-03-08T19:18:38Z

Merged. These changes worked as intended in my testing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes openzfs#14359

nasbdh9 · 2023-03-20T09:47:19Z

Are these changes expected to be ported to 2.1.10?

amotin · 2023-03-20T14:29:53Z

Are these changes expected to be ported to 2.1.10?

No. Like a few other ARC refactoring PRs of mine, it will stay in 2.2. Those are quite big and invasive changes for a minor release.

New features:
- Fully adaptive ARC eviction (#14359)
- Block cloning (#13392)
- Scrub error log (#12812, #12355)
- Linux container support (#14070, #14097, #12263)
- BLAKE3 Checksums (#12918)
- Corrective "zfs receive" (#9372)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>


New Features
- Block cloning (#13392)
- Linux container support (#14070, #14097, #12263)
- Scrub error log (#12812, #12355)
- BLAKE3 checksums (#12918)
- Corrective "zfs receive"
- Vdev and zpool user properties

Performance
- Fully adaptive ARC (#14359)
- SHA2 checksums (#13741)
- Edon-R checksums (#13618)
- Zstd early abort (#13244)
- Prefetch improvements (#14603, #14516, #14402, #14243, #13452)
- General optimization (#14121, #14123, #14039, #13680, #13613, #13606, #13576, #13553, #12789, #14925, #14948)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

gertvdijk · 2024-12-11T13:54:55Z

Hi @amotin thanks again so much for your work on this one. My tests on 2.2 (2.2.6) show a much more stable ARC size and use compared to the problematic prune storms on 2.1.x reported here: #9966 (comment).

I did notice that there is some residue in the code and docs. If I understand correctly, the module parameter zfs_arc_meta_strategy is removed in 2.2, but the arc_strategy enum is still in the header file include/sys/arc.h; is that intentional or can this be removed?

zfs/include/sys/arc.h

Lines 110 to 113 in e0039c7

    
    typedef enum arc_strategy {
    	ARC_STRATEGY_META_ONLY		= 0, /* Evict only meta data buffers */
    	ARC_STRATEGY_META_BALANCED	= 1, /* Evict data buffers if needed */
    } arc_strategy_t;

Also in openzfs-docs repo this module parameter is still mentioned for tuning - should I go and fix that with a note for 2.2+? 😃
https://github.com/openzfs/openzfs-docs/blob/2df53a3b8594b8663257dce1f4032f71f6880006/docs/Performance%20and%20Tuning/Module%20Parameters.rst#L2572-L2600

amotin · 2024-12-11T16:05:31Z

@gertvdijk You are right, it should be removed. Would you like to create a PR, or would you prefer me to? About openzfs-docs I have no idea, I have never touched it. I guess it could benefit from a PR as well.

amotin force-pushed the arc_evict branch from 2516da5 to 23a7996 Compare January 7, 2023 17:42

amotin added Status: Code Review Needed Ready for review and testing Status: Design Review Needed Architecture or design is under discussion labels Jan 7, 2023

amotin force-pushed the arc_evict branch 2 times, most recently from 0feb27f to 310fbc7 Compare January 7, 2023 22:15

amotin mentioned this pull request Jan 7, 2023

Merge metadata+data for eviction purposes #14014

Closed

amotin force-pushed the arc_evict branch 3 times, most recently from 8e24992 to 3b9dd0d Compare January 9, 2023 20:16

amotin requested review from behlendorf, ahrens and grwilson January 10, 2023 14:17

amotin force-pushed the arc_evict branch from 3b9dd0d to 785ca56 Compare January 18, 2023 01:35

malventano mentioned this pull request Jan 31, 2023

Data throughput causing apparent (directory) metadata eviction with metadata_size << arc_meta_min #10508

Open

allanjude approved these changes Feb 4, 2023

amotin force-pushed the arc_evict branch from 785ca56 to 12539bb Compare February 15, 2023 01:54

behlendorf reviewed Mar 2, 2023

amotin force-pushed the arc_evict branch 2 times, most recently from df3c3af to 2d948d6 Compare March 2, 2023 15:03

behlendorf approved these changes Mar 2, 2023

behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Mar 2, 2023

amotin force-pushed the arc_evict branch from 2d948d6 to ff9066b Compare March 6, 2023 18:47

behlendorf merged commit a8d83e2 into openzfs:master Mar 8, 2023

amotin deleted the arc_evict branch March 8, 2023 19:23

vaclavskala mentioned this pull request Mar 10, 2023

Change zfs_arc_meta_limit_percent default to 100 #10191

Closed

prakashsurya mentioned this pull request Mar 29, 2023

OOM triggered, suspect ARC to blame #14686

Closed

amotin mentioned this pull request Apr 10, 2023

arc_prune and arc_evict at 100% even with no disk activity #14005

Open

lenghenglong mentioned this pull request May 16, 2023

arc_prune causes 8K file random read performance to decrease #14826

Open

chrismuzyn mentioned this pull request Aug 21, 2023

ZFS 2.1.x on RH/CentOS 7 #14262

Open

syntaxerrormmm mentioned this pull request Feb 22, 2024

OpenZFS 2.2+ breaks the zfs.py script blind-oracle/zabbix-zfs#18

Closed

krzotr mentioned this pull request Jul 5, 2024

More adaptive ARC eviction - No information about deprecation of module parameters openzfs/openzfs-docs#513

Closed

tkittich mentioned this pull request Oct 1, 2024

MRU, MFU don't adapt to their targets #16576

Closed

amotin removed the Status: Design Review Needed Architecture or design is under discussion label Dec 11, 2024
