L2ARC metadata caching partially broken #15201
Comments
L2ARC was never intended to duplicate all of the ARC. Only blocks that are close to eviction from the ARC are written to L2ARC. And considering the much smaller amount of metadata, and hence its lower eviction rate, I am not exactly surprised that less metadata is reloaded from the persistent L2ARC. Before persistence, L2ARC cached blocks that were not worth keeping in RAM; now, it seems, only those blocks survive a reboot. ;)
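As a side note for anyone reproducing this, the amount of content restored into L2ARC at pool import time can be read from the arcstats kstat on Linux. A minimal sketch, assuming recent OpenZFS counter names (they may differ slightly between releases):

    # Persistent L2ARC rebuild results after the last pool import
    grep -E '^l2_rebuild_(success|size|asize|bufs)' /proc/spl/kstat/zfs/arcstats

    # Current L2ARC contents
    grep -E '^l2_(size|asize)' /proc/spl/kstat/zfs/arcstats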
In this case, things should work differently because a) the ARC is not warm, so the to-be-copied buffers are scanned from the head rather than the tail, and b) in general I feel that we have some lurking issue with the L2ARC code (see also here: #15201), as the L2ARC feed rate should not depend on how fast, or how many, ARC sublists are traversed. The basic idea is that, at each feed thread run, some eligible buffers are discovered and added to the L2ARC device. If buffers are in RAM (as shown by …). Am I missing something? Thanks.
You may be right about …
I don't think the uncached metadata are due to ARC_FLAG_L2CACHE: I put some instrumentation at zfs/arc.c line 8143 (commit 11fbcac) …
Rather, it seems that the function at zfs/arc.c line 9249 (commit 11fbcac) …
Forcing … Hence I feel something is potentially wrong with how the L2ARC is fed. Any ideas?
I haven't verified it, but I may have an idea of what is going on and why it affects metadata more than data. The problem is that the arcs_list returned by l2arc_sublist_lock() are lists of EVICTABLE ARC headers. ARC headers referenced by the dbuf cache are NOT evictable. Headers for blocks backing open and/or cached dnodes are NOT evictable. Headers with active I/O, and the respective indirect blocks, are NOT evictable until the end of the I/O or TXG. There may be other cases that I can't recall right now. So all those headers will not be written to L2ARC until you flush or heavily evict the dnode and dbuf caches.
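One way to see this effect in practice is to release the holds and watch the L2ARC grow. A minimal sketch for Linux follows; it assumes a pool with a cache device attached and uses drop_caches only as a blunt way to release dnode/dbuf holds:

    # L2ARC contents before releasing any holds
    grep -E '^l2_(size|asize)' /proc/spl/kstat/zfs/arcstats

    # Dropping dentries/inodes releases many dnode/dbuf holds, so the
    # corresponding ARC headers become evictable (and hence L2ARC-eligible)
    echo 3 > /proc/sys/vm/drop_caches

    # Give the feed thread a few passes, then check again
    sleep 30
    grep -E '^l2_(size|asize)' /proc/spl/kstat/zfs/arcstats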
And once more, the most useful blocks are not likely to get into the persistent L2ARC. :)
Thanks for the analysis, it makes a lot of sense. With persistent L2ARC, and especially when …
We do not have full sorted per-state lists of ARC headers, only of evictable ones, sorted by last access time exactly for the purposes of eviction. The only place where we have all headers tracked is the ARC hash table. But there we have everything in absolutely random order, and writing everything from there would break all the existing L2ARC logic in more usual configurations, which, even though old, still makes sense, at least if the system spends more time running than rebooting. I think we should instead focus on using special vdevs more often. IIRC there is already an option to not write to L2ARC data that is stored on a special vdev. Maybe we could dynamically reserve an unused part of the special vdev for a sort of embedded L2ARC, like we already do for the embedded ZIL on main vdevs. Though since a special vdev expects redundancy while L2ARC does not, it may be a weird idea. At the very least you may do it manually with partitions, resizing those later if needed, if the placement is properly thought out in advance.
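The option referenced above is presumably the l2arc_exclude_special module parameter found in recent OpenZFS releases; if your build has it, it can be checked and toggled like this (a sketch, Linux paths assumed):

    # 1 = do not also cache in L2ARC blocks that already reside on a special vdev
    cat /sys/module/zfs/parameters/l2arc_exclude_special
    echo 1 > /sys/module/zfs/parameters/l2arc_exclude_special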
Ok, I missed that the ARC sublists only include evictable buffers, and it makes sense. Just to understand better: why does changing …
While special vdevs are a great addition, I really like L2ARC: being expendable, one can stripe multiple (relatively) cheap devices to greatly enhance pool performance. Moreover, it is much more dynamic than special vdevs. For example, for datasets hosting big files only (virtual machines, databases, etc.), using a special vdev means migrating either all of such data or none at all onto the special devices (depending on the selected block-size cutoff via …). Thanks.
There may be other factors, but since the number of sublists is equal to the number of CPUs, on a large system it may take a number of feed iterations to scan through all the headers. Scanning more sublists at a time should obviously increase the chances a lot.
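To get a feeling for how thinly each feed pass samples the headers on a given system, the relevant module parameters can be inspected; a quick check on Linux, assuming a reasonably recent OpenZFS build:

    # Number of sublists per multilist; 0 means auto (at least 4, typically one per CPU)
    cat /sys/module/zfs/parameters/zfs_multilist_num_sublists

    # How often the L2ARC feed thread runs and how much it may write per pass
    cat /sys/module/zfs/parameters/l2arc_feed_secs
    cat /sys/module/zfs/parameters/l2arc_write_max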
I think actually combining special vdevs with L2ARC could give the best of both worlds: metadata on special vdevs would reduce the time spent on random access, pool import, assorted management tasks, etc., and you would not need to worry about whether it is cached or not; at the same time, L2ARC for data can do dynamic caching of huge amounts of data, and its existing algorithm should be acceptable for that without tricks like l2arc_headroom=0 (or even with them, if you so prefer).
A remark regarding the special vdevs: while they are a great option for enterprise systems, they are typically not feasible for compact SOHO systems. L2ARC works fine with a single NVMe drive, but for special vdevs two to three e.g. SATA or NVMe drives are required to keep the pool redundancy. A typical compact SOHO system won't have enough drive bays and/or drive interfaces to support the required number of special vdev drives in addition to the pool data drives. Furthermore, at least privately owned SOHO systems in countries with high energy prices are often switched off overnight (due to missing suspend-to-RAM support), therefore having all relevant data in the persistent L2ARC would be very welcome. I understand that the problem is non-trivial, but I wanted to point out that using special vdevs is not always a solution.
To tell the truth, my test VM only has a single vCPU which, if I correctly understand the code, randomly iterates between the four sublists. What surprised me was not that iterating such sublists one at a time loaded the L2ARC slowly, but that it produces a different total of cached metadata (i.e. 4 MB vs 8-10 MB).
It could be interesting to have a command or flag that forcibly injects the hot metadata into the L2ARC, since, if I follow correctly, it is never going to be eligible, yet it would still benefit from an initial cached load, even if outside of the cold load its cost is amortized almost infinitely by how hot it is. Also, as pointed out above, given the average sizes I've seen for pool metadata versus data size, and the size of consumer SSDs, let alone enterprise ones, it really seems like you could get away with an option to force-load the entire pool's metadata onto the L2ARC and keep it there. Yes, a special vdev or a hybrid L2ARC carved out of special space would be great, but as pointed out, since specials aren't a transient home, that would be problematic.
I agree, but I can't find any obvious place to add this logic. Basically, one needs to reclaim memory to let hot buffers be cached in L2ARC, but this will cause ARC eviction of those very same buffers, which means lower performance. In other words: …
This would be interesting, but caution should be used, since such a metadata pre-load can lead to cache thrashing. Anyway, as the L2ARC is working as intended, I will close this issue.
System information
Describe the problem you're observing
L2ARC metadata caching seems partially broken, in the sense that L2ARC caches far too little metadata. For example, walking a directory with ~100000 files via find results in only ~5 MB of L2ARC data, versus ~20 MB of compressed metadata when forcing L1 eviction via echo 3 > /proc/sys/vm/drop_caches. With successive find runs, more metadata lands on L2ARC. This happens on a test machine with no memory pressure and with l2arc_headroom=0 (to have all L1 ARC buffers cached on L2ARC) plus l2arc_noprefetch=0 (so even prefetched buffers are eligible for L2ARC). See below for more details. Data caching seems much less affected.

I experimented with some changes to zfs/arc.c in order to scan all sublists (4 by default) at each l2arc_write_buffers call. This change increased the cached metadata (roughly doubling it), but did not seem to completely solve the issue.
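For reference, the two tunables mentioned above can be set at runtime on Linux roughly as follows (a sketch, assuming the standard module parameter paths):

    # Scan eligible buffers regardless of their distance from the tail of the ARC lists
    echo 0 > /sys/module/zfs/parameters/l2arc_headroom

    # Make prefetched buffers eligible for L2ARC caching as well
    echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch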
Describe how to reproduce the problem
Create a metadata-rich dataset (i.e. many small files), walk it via find + stat, and check the cached L2ARC metadata via zpool iostat -v.
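A concrete run of these steps might look like this (a sketch; the pool name tank and the directory path are placeholders):

    # Walk a metadata-rich directory tree, stat'ing every file
    find /tank/manyfiles -type f -exec stat {} + > /dev/null

    # Check how much ended up on the cache device
    zpool iostat -v tank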
Include any warning/errors/backtraces from the system logs
None