Ability for Dynamic Storage Tiering - NVMe (superfast) + SSD (mid-tier) + HDD (slow) - manipulate 'btrfs balance' profiles #610

Open
TheLinuxGuy opened this issue Apr 2, 2023 · 10 comments
Labels
enhancement kernel something in kernel has to be done too question Not a bug, clarifications, undocumented behaviour

Comments

@TheLinuxGuy

Could btrfs implement a feature to support multiple devices of different speeds/types, with a profiling algorithm for data balancing? In other words: dynamic storage tiering.

Assume a user with a combined btrfs filesystem consisting of:

  • 1TB NVME (tier 1)
  • 4TB SSD (tier 2)
  • 20TB HDD (tier 3)

To keep things simple, assume no redundancy within each tier. The goal is maximum performance, with data placement optimized within some user-configurable limits (e.g., how much NVMe space should be kept "free" for writeback caching of new I/O).

As I understand it, btrfs-balance already does some of this optimization by spreading disk space utilization evenly across the disks. This feature request asks for more options to change how btrfs-balance works and how new writes are handled, so that 'tier 1' is always the priority.

The least-used data blocks, those not recently accessed, would be "downgraded" (moved to a lower tier) as filesystem usage grows and the faster tiers need to be purged or rebalanced.
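
A rough sketch of the demotion policy described above, in Python purely for illustration. All names here (`Extent`, `pick_demotions`, `reserve_bytes`) are hypothetical and not btrfs APIs, and it assumes a last-access time is tracked per extent, which, as far as I know, btrfs does not currently do.

```python
from dataclasses import dataclass


@dataclass
class Extent:
    name: str
    size: int            # bytes
    last_access: float   # hypothetical per-extent access timestamp
    tier: int            # 1 = NVMe, 2 = SSD, 3 = HDD


def pick_demotions(extents, tier, capacity, reserve_bytes):
    """Return the extents to move from `tier` to `tier + 1`.

    The coldest extents (oldest last_access) are picked first, until the
    tier has at least `reserve_bytes` free for new writes.
    """
    on_tier = sorted(
        (e for e in extents if e.tier == tier),
        key=lambda e: e.last_access,
    )
    used = sum(e.size for e in on_tier)
    target_used = capacity - reserve_bytes
    demote = []
    for extent in on_tier:
        if used <= target_used:
            break
        demote.append(extent)
        used -= extent.size
    return demote
```

The `reserve_bytes` argument stands in for the "how much NVMe space should be left free" setting; a real implementation would also have to decide when to promote data back up.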

@TheLinuxGuy
Author

Also, from my research, it seems that Netgear may already have forked btrfs to achieve this, implementing their own storage-tiering algorithm in their now-defunct ReadyNAS OS.

See page 10 of https://www.downloads.netgear.com/files/GDC/READYNAS-100/ReadyNAS_FlexRAID_Optimization_Guide.pdf and https://unix.stackexchange.com/questions/623460/tiered-storage-with-btrfs-how-is-it-done?answertab=modifieddesc#tab-top

@kdave kdave added enhancement question Not a bug, clarifications, undocumented behaviour kernel something in kernel has to be done too labels Apr 3, 2023
@kdave
Owner

kdave commented Apr 3, 2023

I was not aware of that, thanks for the links. It seems that ReadyNAS is not maintained, and I can't find any git repositories, assuming it's built on top of Linux. Their page also does not mention 'btrfs' anywhere. Storage tiers are a feature people ask for, so it's no surprise that somebody implemented them outside of upstream Linux, but merging that back would be desirable. I haven't seen the code, so it's hard to tell how it was implemented and whether it would be acceptable. Vendors often don't have to deal with backward compatibility or long-term support, so it's "cheaper" for them to do private extensions instead.

@Forza-tng
Contributor

There is a patch set for metadata-on-SSD somewhere. I think it would be a good middle ground if it were accepted into the mainline kernel. https://patchwork.kernel.org/project/linux-btrfs/patch/20200405082636.18016-2-kreijack@libero.it/

@Duncaen

Duncaen commented Apr 4, 2023

https://www.downloads.netgear.com/files/GPL/ReadyNASOS_V6.10.8_WW_src.zip

The paths I looked at are:

btrfs-tools-4.16/debian/patches/0010-Add-btrfs-balance-sweep-subcommand-for-dat-tiering.patch
linux-4.4.218-x86_64/fs/btrfs

I haven't looked at the full diff, since the kernel is pretty old and much has changed, but basically it looks like it adds another sort function for the devices in __btrfs_alloc_chunk2 (now btrfs_create_chunk), sorting them by a class attribute.
It then adds an ioctl for a "sweep" filter for balance.
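
To illustrate the idea of sorting devices by a class attribute before allocating a chunk, here is a minimal sketch in Python. It is not the Netgear code or the kernel's allocator: `Device`, `dev_class`, and `order_for_allocation` are made-up names, and the rule "faster class first, then most free space" is only my assumption about what such a sort could look like.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    dev_class: int    # hypothetical tier class: lower number = faster device
    free_bytes: int


def order_for_allocation(devices, chunk_size):
    """Order candidate devices for a new chunk.

    Devices without room for the chunk are skipped. The rest are sorted by
    class first, and by free space (largest first) within a class.
    """
    candidates = [d for d in devices if d.free_bytes >= chunk_size]
    return sorted(candidates, key=lambda d: (d.dev_class, -d.free_bytes))
```

With this ordering, new chunks land on the fastest class that still has room, and slower classes are only used once the faster ones fill up.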

@studyfranco

This would be a fantastic addition to Btrfs. I'd like to emphasize the importance of being able to specify sub-volume affinity. Imagine having sub-volumes for /, /var/log, and /home. Here's the concept:

  • Data from / has the highest priority and is initially stored on tier 1, but it can be moved to tier 3 when it's not actively used.
  • Data from /var/log is initially written to tier 3. If there's no free space available on that tier, the data is written on another tier.
  • Data from /home is initially written with priority on tier 1. If there's no space available, it can be moved to tier 2. Eventually, when it's not actively used, it can be shifted to tier 3.

In this system, data from / is given the highest priority for storage space on tier 1, with a lower priority for /var/log and /home on the same tier. Similarly, data from /var/log is given the highest priority for storage space on tier 3, with a lower priority for / and /home on that tier.

I imagine two parameters to implement this:

  • driver_write_priority: Allows users to define the order of data writes on the disks and set the priority of each sub-volume on the disks.
  • drive_unused_data: A parameter to handle data that is not actively used.
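
As a minimal sketch of how a per-sub-volume write priority could be resolved, assuming the configuration is just an ordered list of tiers per sub-volume: the mapping `WRITE_PRIORITY` and the function `choose_tier` below are hypothetical, and nothing like them exists in Btrfs today.

```python
# Hypothetical configuration: tiers to try, in order, for each sub-volume.
WRITE_PRIORITY = {
    "/": [1, 2, 3],
    "/var/log": [3, 2, 1],
    "/home": [1, 2, 3],
}


def choose_tier(subvol, size, free_by_tier, write_priority=WRITE_PRIORITY):
    """Return the first tier in the sub-volume's order with room for `size`.

    `free_by_tier` maps tier number to free bytes. Returns None when no
    tier can hold the write.
    """
    for tier in write_priority[subvol]:
        if free_by_tier.get(tier, 0) >= size:
            return tier
    return None
```

For example, a write to /var/log goes to tier 3 while it has space and falls back to tier 2 once tier 3 is full. Handling unused data (the second parameter) would need a separate background policy.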

This level of control over data placement within sub-volumes would be a game-changer. It allows for finely tuned optimization of storage resources based on specific usage scenarios. It would further solidify Btrfs as a powerful and flexible file system for data management.

@Forza-tng
Contributor

@TheLinuxGuy, @studyfranco It might be worth having a look at the Btrfs preferred metadata patches. kakra/linux#26

They do not explicitly deal in tiers, but they do introduce metadata-only, metadata-preferred, data-only and data-preferred priorities.
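
A small Python sketch of how four such priorities could steer allocation. The fallback order used here (the "-only" devices first, then "-preferred", then the opposite "-preferred", and never the opposite "-only") is my assumption for illustration and may not match the patches exactly; `Hint`, `ORDER`, and `candidate_devices` are made-up names.

```python
from enum import Enum


class Hint(Enum):
    METADATA_ONLY = "metadata-only"
    METADATA_PREFERRED = "metadata-preferred"
    DATA_PREFERRED = "data-preferred"
    DATA_ONLY = "data-only"


# Assumed preference order per chunk type; hints that are missing from a
# list are never used for that chunk type.
ORDER = {
    "metadata": [Hint.METADATA_ONLY, Hint.METADATA_PREFERRED, Hint.DATA_PREFERRED],
    "data": [Hint.DATA_ONLY, Hint.DATA_PREFERRED, Hint.METADATA_PREFERRED],
}


def candidate_devices(chunk_type, devices):
    """Return device names eligible for `chunk_type`, best match first.

    `devices` maps a device name to its Hint. Ties are broken by name so
    the result is deterministic.
    """
    rank = {hint: i for i, hint in enumerate(ORDER[chunk_type])}
    eligible = [name for name, hint in devices.items() if hint in rank]
    return sorted(eligible, key=lambda name: (rank[devices[name]], name))
```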

@kakra

kakra commented Nov 26, 2023

> @TheLinuxGuy, @studyfranco It might be worth having a look at the Btrfs preferred metadata patches. kakra/linux#26
>
> They do not explicitly deal in tiers, but they do introduce metadata-only, metadata-preferred, data-only and data-preferred priorities.

Rebased to 6.6 LTS: kakra/linux#31

@studyfranco

> @TheLinuxGuy, @studyfranco It might be worth having a look at the Btrfs preferred metadata patches. kakra/linux#26
> They do not explicitly deal in tiers, but they do introduce metadata-only, metadata-preferred, data-only and data-preferred priorities.
>
> Rebased to 6.6 LTS: kakra/linux#31

This is a very good start, but my use case (and my proposal) is more complex.
I have a hybrid system, and Btrfs with this feature would be the best file system for home use: no lost space, no compromise, and the most adaptable when we want to play games.

@kakra

kakra commented Nov 29, 2023

> This is a very good start, but my use case (and my proposal) is more complex.
> I have a hybrid system, and Btrfs with this feature would be the best file system for home use: no lost space, no compromise, and the most adaptable when we want to play games.

Currently I'm solving it this way:

I have two NVMe drives, each with a 64GB metadata-preferred partition for btrfs. The remaining space forms an md-raid1 array, which is used for bcache. All HDDs (4x 4TB) are data-preferred btrfs devices on bcache backing partitions in writeback mode, attached to the md-raid1 cache.

This way, metadata lives on native NVMe, because bcache doesn't handle CoW metadata very efficiently, and I still get the benefit of having hot data on NVMe. I'm using these patches to exclude some I/O traffic from being cached (e.g. backup or maintenance jobs with idle I/O priority): kakra/linux#32

I achieve a cache hit rate of 96% and a bypass-hit rate of 95% (I/O requests that should have bypassed the cache but were already in it) for an 800 GB cache and 4.2TB of used btrfs storage.

Actually, combining bcache with preferred metadata worked magic: cache hit rates went up and response times went down a lot. Transfer rates peak at around 2 GB/s, which is slower than native NVMe but still very good. Average transfer rates are around 300-500 MB/s, with data coming partly from the cache and partly from HDD. Migrating this setup from a single SSD to dual NVMe improved perceived responsiveness a lot. Still, due to CoW and btrfs data raid1, bcache cannot work optimally and wastes some space and performance. A better integration of the two would be useful, where bcache would know about btrfs raid1 and store data just once, or where CoW would inform bcache about unused blocks.

Yeshey added a commit to Yeshey/TechNotes that referenced this issue May 25, 2024
I'm losing the space of the cache, which is a really big drawback for me. I'm following this issue: kdave/btrfs-progs#610; maybe I'll switch to a single btrfs partition across the two drives if they ever implement it, so I don't lose the extra space from the SSD cache
@bugsquasher1991

I would like to add that it would also be a great idea to support tiered storage at the directory or file level, that is, to make the "tiering" a property of the directory or file itself:

  • automatic tiering (default)
  • store on non-rotating drive
  • store on rotating drive

This could be done for both data and metadata.

As proposed, we could even have different "tier levels" defined for use cases like NVMe <-> SATA SSD <-> HDD.

By making tiering a property of a file or directory, people could mark files they always want to be quickly accessible (e.g., without spin-up time), so that the filesystem stores them on the fast cache SSDs of a pool. This would be a cool way to decide which files are stored on the cache, as opposed to only being able to go by the most recently accessed data and keeping that in the cache.

A use case could be, e.g., a home server where personal files, pictures etc. should always be available without delay, while large media files can be stored on slower rotational drives that take time to spin up.
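
A minimal sketch of how such a property could be resolved, assuming a file inherits the setting of its nearest parent directory unless it has its own. The property values and the `effective_tier` helper are hypothetical; the sketch just uses a plain dictionary instead of any real filesystem attribute.

```python
import posixpath

# Hypothetical property values, mirroring the three options listed above.
AUTO = "auto"
NON_ROTATING = "non-rotating"
ROTATING = "rotating"


def effective_tier(path, props):
    """Return the tiering property that applies to `path`.

    `props` maps paths to explicitly set properties. The lookup walks from
    the path up through its parent directories and falls back to AUTO.
    """
    current = posixpath.normpath(path)
    while True:
        if current in props:
            return props[current]
        if current in ("/", "", "."):
            return AUTO
        current = posixpath.dirname(current)
```

With `props = {"/srv/photos": NON_ROTATING, "/srv/media": ROTATING}`, everything under /srv/photos would be pinned to the fast devices, everything under /srv/media to the HDDs, and all other paths would stay on automatic tiering.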


7 participants