L2ARC metadata caching partially broken #15201

Closed
shodanshok opened this issue Aug 23, 2023 · 14 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

@shodanshok
Contributor

shodanshok commented Aug 23, 2023

System information

Type                  Version/Name
Distribution Name     Debian
Distribution Version  12.1
Kernel Version        6.1.0-11-amd64
Architecture          x86-64
OpenZFS Version       2.1.11

Describe the problem you're observing

L2ARC metadata caching seems partially broken, in the sense that the L2ARC caches far too little metadata. For example, walking a directory with ~100000 files via find results in only ~5MB of data on the L2ARC device, vs ~20MB of compressed metadata when forcing L1ARC eviction via echo 3 > /proc/sys/vm/drop_caches. With successive find runs, more metadata lands on the L2ARC. This happens on a test machine with no memory pressure, with l2arc_headroom=0 (to have all L1ARC buffers cached on the L2ARC) and l2arc_noprefetch=0 (so even prefetched buffers are eligible for the L2ARC). See below for more details. Data caching seems much less affected.

# test pool
root@debian12:~/zfs/zfs-2.1.11# zpool status
  pool: tank
 state: ONLINE
remove: Removal of vdev 1 copied 116K in 0h0m, completed on Tue Aug 22 16:19:31 2023
        240 memory used for removed device mappings
config:

        NAME               STATE     READ WRITE CKSUM
        tank               ONLINE       0     0     0
          vdb              ONLINE       0     0     0
        cache
          /root/l2arc.img  ONLINE       0     0     0

# test dataset, primarycache/secondarycache=all
root@debian12:~/zfs/zfs-2.1.11# zfs list tank/test
NAME        USED  AVAIL     REFER  MOUNTPOINT
tank/test   297M  14.7G      297M  /tank/test

# stat-ing files
root@debian12:~/zfs/zfs-2.1.11# time find /tank/test/fsmark/ -exec stat {} \+ > /dev/null

# L2ARC cache - only ~5MB cached
root@debian12:~/zfs/zfs-2.1.11# zpool iostat -v
                     capacity     operations     bandwidth
pool               alloc   free   read  write   read  write
-----------------  -----  -----  -----  -----  -----  -----
tank                299M  15.2G     22      5   469K  60.2K
  vdb               299M  15.2G     22      5   469K  60.2K
  indirect-1           -      -      0      0      0      0
cache                  -      -      -      -      -      -
  /root/l2arc.img  5.36M  3.99G      0      2  6.17K  72.1K
-----------------  -----  -----  -----  -----  -----  -----

# exporting and reimporting the pool to empty L1ARC and only use reconstructed L2ARC, notice how many l2misses happen
root@debian12:~/zfs/zfs-2.1.11# zpool export tank; zpool import tank
root@debian12:~/zfs/zfs-2.1.11# time find /tank/test/fsmark/ -exec stat {} \+ > /dev/null
root@debian12:~/zfs/zfs-2.1.11/module# arcstat -f time,read,miss,l2read,l2miss 1 (on another terminal)
    time  read  miss  l2read  l2miss
23:30:25     3     0       0       0
23:30:26  4.9K   439     439     278
23:30:27   47K   269     270     137
23:30:28   48K   360     359     321
23:30:29   48K   530     530     377
23:30:30   51K   745     745     588
23:30:31   47K   519     519     449
23:30:32   49K   109     109     109
23:30:33   47K   522     522     517
23:30:34   51K   606     606     586
23:30:35   18K    41      41      41
23:30:36     0     0       0       0
23:30:37     0     0       0       0

# re-walk the files, now more metadata are cached
root@debian12:~/zfs/zfs-2.1.11# time find /tank/test/fsmark/ -exec stat {} \+ > /dev/null
root@debian12:~/zfs/zfs-2.1.11# zpool iostat -v
                     capacity     operations     bandwidth
pool               alloc   free   read  write   read  write
-----------------  -----  -----  -----  -----  -----  -----
tank                299M  15.2G     16      1   258K  23.6K
  vdb               299M  15.2G     16      1   258K  23.6K
  indirect-1           -      -      0      0      0      0
cache                  -      -      -      -      -      -
  /root/l2arc.img  9.37M  3.99G      4      1  55.9K  39.1K
-----------------  -----  -----  -----  -----  -----  -----

# force L1ARC eviction, now L2ARC caches much more metadata
root@debian12:~/zfs/zfs-2.1.11# echo 3 >/proc/sys/vm/drop_caches
root@debian12:~/zfs/zfs-2.1.11# zpool iostat -v
                     capacity     operations     bandwidth
pool               alloc   free   read  write   read  write
-----------------  -----  -----  -----  -----  -----  -----
tank                299M  15.2G     12      1   195K  17.9K
  vdb               299M  15.2G     12      1   195K  17.9K
  indirect-1           -      -      0      0      0      0
cache                  -      -      -      -      -      -
  /root/l2arc.img  18.6M  3.98G      3      1  42.3K  90.7K
-----------------  -----  -----  -----  -----  -----  -----

# exporting and reimporting the pool again, see how L2ARC fares much better now
root@debian12:~/zfs/zfs-2.1.11# zpool export tank; zpool import tank
root@debian12:~/zfs/zfs-2.1.11# time find /tank/test/fsmark/ -exec stat {} \+ > /dev/null 
root@debian12:~/zfs/zfs-2.1.11/module# arcstat -f time,read,miss,l2read,l2miss 1 (on another terminal)
    time  read  miss  l2read  l2miss
23:33:18     4     2       2       2
23:33:19   54K   960     960      16
23:33:20   58K   548     548      11
23:33:21   57K   391     391       9
23:33:22   60K   508     508       5
23:33:23   58K   547     547       4
23:33:24   57K   541     541      27
23:33:25   54K   612     612     275
23:33:26   14K    31      31      31
23:33:27     0     0       0       0

I experimented with some changes to zfs/arc.c in order to scan all sublists (4 by default) at each l2arc_write_buffers run. This change increased the amount of cached metadata (roughly doubling it), but did not seem to completely solve the issue.

Describe how to reproduce the problem

Create a metadata-rich dataset (i.e., many small files), walk it via find + stat, and check the amount of cached L2ARC metadata via zpool iostat -v.

Include any warning/errors/backtraces from the system logs

None

@shodanshok shodanshok added the Type: Defect Incorrect behavior (e.g. crash, hang) label Aug 23, 2023
@amotin
Member

amotin commented Aug 23, 2023

L2ARC was never intended to duplicate all of ARC. Only blocks that are close to eviction from ARC are written to L2ARC. And considering the much smaller amount of metadata, and therefore its lower eviction rate, I am not exactly surprised that less metadata is reloaded from persistent L2ARC. Before persistency, L2ARC cached blocks that were not worth keeping in RAM; now it appears that only those survive a reboot. ;)
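
To illustrate, here is a toy model of that (made-up names and sizes, nothing here is the actual arc.c code): a feed pass starts at the coldest end of an eviction-ordered list and stops after a headroom budget, so buffers far from eviction are never even looked at.

/*
 * Toy model, not arc.c: scan from the coldest end of an eviction-ordered
 * list, give up once the headroom budget is spent. A zero budget (in the
 * spirit of l2arc_headroom=0) removes this particular limit, but not the
 * other constraints.
 */
#include <stdio.h>

#define NBUFS   12
#define BUFSIZE 16384                   /* pretend every buffer is 16K */

int main(void)
{
        long headroom = 4 * BUFSIZE;    /* per-pass scan budget */
        long scanned = 0;

        /* index NBUFS-1 = coldest (closest to eviction), index 0 = hottest */
        for (int i = NBUFS - 1; i >= 0; i--) {
                if (headroom != 0 && scanned >= headroom) {
                        printf("stop: headroom spent, buffers 0..%d never considered\n", i);
                        break;
                }
                printf("pass considers buffer %d for L2ARC\n", i);
                scanned += BUFSIZE;
        }
        return 0;
}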

@shodanshok
Contributor Author

Only blocks that are close to eviction from ARC are written to L2ARC.

In this case, things should work differently because a) the ARC is not warm, so the to-be-copied buffers are scanned from the head rather than the tail, and b) l2arc_headroom=0 explicitly asks for all ARC content to be scanned.

In general, I feel that we have some lurking issue in the L2ARC code (see also here: #15201), as the L2ARC feed rate should not depend on how fast or how many ARC sublists are traversed. The basic idea is that, at each feed thread run, some eligible buffers are discovered and added to the L2ARC device. If the buffers are in RAM (as shown by arc_summary), why are they not copied to the L2ARC?

Am I missing something? Thanks.

@amotin
Member

amotin commented Aug 24, 2023

You may be right about l2arc_headroom=0, I was not aware of that special case. Aside from that, what comes to mind is the ARC_FLAG_L2CACHE flag, which may not be set for some speculatively prefetched data or who knows what else, based on the assumption that if we can speculatively prefetch something now, we may be able to do it again later, so it may make sense to leave the data on primary storage rather than overload the L2ARC device, since it may be a bottleneck. L2ARC is expected to have lower latency than the main pool, but it often cannot compete on bulk throughput.
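
A made-up sketch of that kind of per-header gating (this is not the real l2arc_write_eligible(), just an illustration of flags that can disqualify a buffer even while it sits in ARC):

/*
 * Made-up sketch, not the real l2arc_write_eligible(): a header can be
 * skipped purely because of how it was brought in, regardless of whether
 * it is still cached in ARC.
 */
#include <stdbool.h>
#include <stdio.h>

#define FLAG_L2CACHE   0x1      /* secondarycache allows L2 for this buffer */
#define FLAG_PREFETCH  0x2      /* brought in by speculative prefetch       */
#define FLAG_IO_BUSY   0x4      /* read/write still in progress             */

struct toy_hdr {
        unsigned flags;
        bool     already_in_l2;
};

/* models the l2arc_noprefetch tunable; the reporter runs with it off (=0) */
static bool l2_noprefetch = true;

static bool toy_write_eligible(const struct toy_hdr *h)
{
        if (!(h->flags & FLAG_L2CACHE))
                return false;           /* flagged "not for L2"    */
        if (h->already_in_l2)
                return false;           /* no point rewriting it   */
        if (h->flags & FLAG_IO_BUSY)
                return false;           /* skip in-flight buffers  */
        if (l2_noprefetch && (h->flags & FLAG_PREFETCH))
                return false;           /* prefetched data skipped */
        return true;
}

int main(void)
{
        struct toy_hdr metadata = { FLAG_L2CACHE, false };
        struct toy_hdr prefetch = { FLAG_L2CACHE | FLAG_PREFETCH, false };

        printf("metadata eligible: %d\n", toy_write_eligible(&metadata));
        printf("prefetch eligible: %d\n", toy_write_eligible(&prefetch));
        return 0;
}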

@shodanshok
Contributor Author

I don't think the uncached metadata are due to ARC_FLAG_L2CACHE: I put some printk calls into arc.c, in the function

l2arc_write_eligible(uint64_t spa_guid, arc_buf_hdr_t *hdr)

to show why so much metadata was not being sent to L2, and the "not to cache in L2" flag was rarely the cause.

Rather, it seems that l2arc_write_buffers simply misses a lot of ARC data when iterating through the sublists and the ARC headers here:

for (; hdr; hdr = hdr_prev) {

Forcing l2arc_write_buffers to walk all four sublists, rather than a randomly selected one, at each l2arc_feed_thread iteration shows much more metadata being sent to the L2ARC. This feels wrong: if the metadata are in ARC (and they are, since no memory pressure evicted them and arc_summary shows no change), they should simply be copied during one of the later l2arc_feed_thread iterations (i.e., when the randomly selected sublist is the "right" one).
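
As a back-of-the-envelope check of that expectation (plain probability, nothing ZFS-specific): with four sublists and one picked uniformly at random per feed pass, the sublist holding a given header should be picked fairly quickly, as long as the header stays on the list.

/*
 * Chance that the sublist holding a given header has been picked at least
 * once within k passes, with one of 4 sublists chosen at random per pass.
 */
#include <stdio.h>

int main(void)
{
        double p_never = 1.0;   /* probability the right sublist was never picked */

        for (int k = 1; k <= 10; k++) {
                p_never *= 0.75;
                printf("after %2d passes: %.0f%%\n", k, 100.0 * (1.0 - p_never));
        }
        return 0;
}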

Hence I feel something is potentially wrong with how the L2ARC is fed. Any ideas?

@amotin
Member

amotin commented Aug 25, 2023

I haven't verified it, but I may have an idea of what is going on and why it affects metadata more than data. The problem is that the arcs_list returned by l2arc_sublist_lock() are lists of EVICTABLE ARC headers. ARC headers referenced by the dbuf cache are NOT evictable. Headers for blocks backing open and/or cached dnodes are NOT evictable. Headers with active I/O, and the respective indirect blocks, are NOT evictable until the end of the I/O or TXG. There may be other cases that I can't recall right now. So none of those headers will be written to L2ARC until you flush or heavily evict the dnode and dbuf caches.
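
As a toy model of why those headers never show up (made-up names, not ARC code): anything still referenced from above simply is not on the list the feed pass walks, no matter how many passes run.

/*
 * Toy model, not ARC code: only headers with a zero reference count are on
 * the "evictable" list the feed pass walks. A header pinned by the dbuf
 * cache, an open dnode or in-flight I/O has a non-zero refcount and is
 * invisible to the feed.
 */
#include <stdio.h>

struct toy_hdr {
        const char *name;
        int         refcount;   /* >0 means pinned somewhere above ARC */
};

int main(void)
{
        struct toy_hdr hdrs[] = {
                { "dnode block (file still open)",       2 },
                { "indirect block (held by dbuf cache)", 1 },
                { "cold data block",                     0 },
                { "cold metadata block",                 0 },
        };

        for (size_t i = 0; i < sizeof (hdrs) / sizeof (hdrs[0]); i++) {
                if (hdrs[i].refcount == 0)
                        printf("feed pass sees:   %s\n", hdrs[i].name);
                else
                        printf("never considered: %s (refcount %d)\n",
                            hdrs[i].name, hdrs[i].refcount);
        }
        return 0;
}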

@amotin
Member

amotin commented Aug 25, 2023

And once more the most useful blocks are not likely to get into persistent L2ARC. :)

@shodanshok
Contributor Author

shodanshok commented Aug 26, 2023

Thanks for the analysis, it makes a lot of sense. With persistent L2ARC, and especially when l2arc_headroom=0, the current behavior is excessively conservative: I think it would be better to simply copy all blocks into the L2ARC. In general, the L2ARC shows its age, having been added when SSDs were small and expensive. Any chance of getting this fixed?

@amotin
Member

amotin commented Aug 26, 2023

We do not have full sorted per-state lists of ARC headers, only of the evictable ones, sorted by last access time exactly for the purposes of eviction. The only place where all headers are tracked is the ARC hash table. But there everything is in absolutely random order, and writing everything from there would break all the existing L2ARC logic in more usual configurations, which, even though old, still makes sense, at least if the system spends more time running than rebooting.

I think we should instead focus on using special vdevs more often. IIRC there is already an option to not write to L2ARC data that is stored on a special vdev. Maybe we could dynamically reserve the unused part of a special vdev for a sort of embedded L2ARC, like we already do for the embedded ZIL on main vdevs. Though since a special vdev expects redundancy while the L2ARC does not, it may be a weird idea. At the very least you may do it manually with partitions, resizing them later if needed, if the placement is properly thought out in advance.

@shodanshok
Contributor Author

shodanshok commented Aug 26, 2023

We do not have full sorted per-state lists of ARC headers, only of the evictable ones, sorted by last access time exactly for the purposes of eviction. The only place where all headers are tracked is the ARC hash table. But there everything is in absolutely random order, and writing everything from there would break all the existing L2ARC logic in more usual configurations, which, even though old, still makes sense, at least if the system spends more time running than rebooting.

Ok, I missed that the ARC sublists only include evictable buffers; that makes sense. Just to understand better: why did changing l2arc_write_buffers to scan all four sublists result in much more metadata being written to the L2ARC? Was it simply because, by scanning earlier, more buffers were not yet marked as "active" (and non-evictable)?

I think we should instead focus on using special vdevs more often. IIRC there is already an option to not write to L2ARC data that is stored on a special vdev. Maybe we could dynamically reserve the unused part of a special vdev for a sort of embedded L2ARC, like we already do for the embedded ZIL on main vdevs. Though since a special vdev expects redundancy while the L2ARC does not, it may be a weird idea. At the very least you may do it manually with partitions, resizing them later if needed, if the placement is properly thought out in advance.

While special vdevs are a great addition, I really like the L2ARC: being expendable, one can stripe multiple (relatively) cheap devices to greatly enhance pool performance. Moreover, it is much more dynamic than special vdevs. For example, for datasets hosting only big files (virtual machines, databases, etc.), using a special vdev means migrating either all of that data onto the special devices or none of it (depending on the block cutoff selected via special_small_blocks).

Thanks.

@amotin
Member

amotin commented Aug 26, 2023

Ok, I missed that the ARC sublists only include evictable buffers; that makes sense. Just to understand better: why did changing l2arc_write_buffers to scan all four sublists result in much more metadata being written to the L2ARC? Was it simply because, by scanning earlier, more buffers were not yet marked as "active" (and non-evictable)?

There may be other factors, but since the number of sublists is equal to the number of CPUs on a large system it may take a number of iterations to scan through all the headers. Scanning more at a time should obviously increase the chances a lot.
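
For a rough sense of scale (plain coupon-collector math, no ZFS specifics): the expected number of one-random-sublist-per-pass iterations needed before every sublist has been visited at least once grows quickly with the sublist count, and headers can move on and off the evictable lists in the meantime.

/*
 * Expected number of random one-sublist-per-pass iterations until every
 * sublist of an n-sublist multilist has been visited at least once:
 * n * H(n) (coupon collector).
 */
#include <stdio.h>

int main(void)
{
        for (int n = 4; n <= 64; n *= 2) {
                double h = 0.0;

                for (int k = 1; k <= n; k++)
                        h += 1.0 / k;
                printf("%2d sublists -> ~%5.1f feed passes to visit them all\n",
                    n, n * h);
        }
        return 0;
}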

While special vdevs are a great addition, I really like the L2ARC: being expendable, one can stripe multiple (relatively) cheap devices to greatly enhance pool performance. Moreover, it is much more dynamic than special vdevs. For example, for datasets hosting only big files (virtual machines, databases, etc.), using a special vdev means migrating either all of that data onto the special devices or none of it (depending on the block cutoff selected via special_small_blocks).

I think actually combining special vdevs with L2ARC could give the best of both worlds: metadata on special vdevs would reduce the time for random access, pool import, random management tasks, etc., and you would not need to worry about whether it is cached or not; at the same time, the L2ARC for data can do dynamic caching of huge amounts of data, and its existing algorithm should be acceptable for that without tricks like l2arc_headroom=0, or even with them, if you so prefer.

@zfsuser

zfsuser commented Aug 27, 2023

A remark regarding the special VDEVs:

While they are a great option for enterprise systems, they are typically not feasible for compact SOHO systems. L2ARC works fine with a single NVMe drive, but special VDEVs require two to three drives (e.g. SATA or NVMe) to preserve the pool redundancy. A typical compact SOHO system won't have enough drive bays and/or drive interfaces to support the required number of special VDEV devices in addition to the pool data drives.

Furthermore, privately-owned SOHO systems in countries with high energy prices are often switched off overnight (due to missing suspend-to-RAM support), so having all relevant data in the persistent L2ARC would be very welcome.

I understand that the problem is non-trivial, but I wanted to point out that using special VDEVs is not always a solution.

@shodanshok
Contributor Author

shodanshok commented Aug 27, 2023

There may be other factors, but since the number of sublists is equal to the number of CPUs on a large system it may take a number of iterations to scan through all the headers. Scanning more at a time should obviously increase the chances a lot.

To tell the truth, my test VM has only a single vCPU which, if I understand the code correctly, randomly iterates between the four sublists. What surprised me was not that iterating over the sublists one at a time loads the L2ARC slowly, but that it produces a different total amount of cached metadata (i.e., 4MB vs 8-10MB).

at the same time, the L2ARC for data can do dynamic caching of huge amounts of data, and its existing algorithm should be acceptable for that without tricks like l2arc_headroom=0, or even with them, if you so prefer.

l2arc_headroom=0 was an excellent suggestion given by @gamanakis in the persistent L2ARC thread to get a "persistent" L1ARC as well. I found it very useful, but now I understand why it does not really provide persistence for all ARC buffers (and the most used ones are not going to be copied to the L2ARC).

@rincebrain
Contributor

rincebrain commented Aug 28, 2023

Could be interesting to have a command or flag that forcibly injects the hot metadata onto the L2ARC, since, if I follow correctly, it's never going to be eligible, but it would still benefit from an initial cached load, even if outside of the cold-load case its cost is amortized almost infinitely by how hot it is.

Also, as pointed out above, given the average sizes I've seen for pool metadata versus data size, and the size of consumer SSDs, let alone enterprise ones, it really seems like you could get away with having an option to force load the entire pool's metadata onto the L2ARC and keep it there.

Yeah, a special vdev or hybrid L2ARC out of special space would be great, but as pointed out, since specials aren't a transient home, that would be problematic.

@shodanshok
Contributor Author

Could be interesting to have a command or flag that forcibly injects the hot metadata onto the L2ARC

I agree, but I can't find any obvious place to add this logic. Basically, one needs to reclaim memory to let hot buffers be cached in the L2ARC, but this causes L1ARC eviction of those very same buffers, which means lower performance. In other words: echo 3 > /proc/sys/vm/drop_caches can be reasonable before a planned reboot, but not on an otherwise working system.

Also, as pointed out above, given the average sizes I've seen for pool metadata versus data size, and the size of consumer SSDs, let alone enterprise ones, it really seems like you could get away with having an option to force load the entire pool's metadata onto the L2ARC and keep it there.

This would be interesting, but caution should be used, as such a metadata pre-load could lead to cache thrashing.

Anyway, as L2ARC is working as intended, I will close this issue.
Thanks.
