btrfs balance single targeting a single devid didn't move all the data to that device #907

Open
gdevenyi opened this issue Oct 10, 2024 · 7 comments

@gdevenyi

I executed:
btrfs balance start --force -sconvert=single,devid=2 -dconvert=single,devid=2 -mconvert=single,devid=2 /storage

With the intention of removing devices 1, 3, 4, and 5 from my btrfs filesystem.

The command ran overnight, and afterwards I found that devices 4 and 5 had been vacated of data, but devices 1 and 2 held equal amounts, although everything was now stored as "single".

@Zygo

Zygo commented Oct 11, 2024

The devid filter selects block groups based on which devices they currently occupy. So your command is asking for everything currently allocated on device 2 to be relocated onto whichever devices are preferred by the target raid profile. For single profile, this is the device with the most free space, with data distributed across multiple devices when multiple devices have equal free space. Given your statement of the final result, the largest devices were likely devices 1 and 2, since that was where the data ended up.
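To illustrate why the data ended up split between devices 1 and 2, here is a toy Python model of the two rules described above. It is only a sketch, not btrfs code: the function names, the block group layout, the device sizes, and the lowest-devid tie-break are all assumptions made for illustration, and the model ignores the space that relocation frees on the source devices.

```python
# Toy model only. Real chunk allocation happens in the kernel; all numbers are made up.

def devid_filter(block_groups, devid):
    """Select block groups that currently occupy `devid` (the source, not the target)."""
    return [bg for bg in block_groups if devid in bg]


def allocate_single(unallocated, chunks):
    """Place `chunks` 1 GiB chunks, each on the device with the most unallocated space."""
    free = dict(unallocated)
    placed = {devid: 0 for devid in free}
    for _ in range(chunks):
        # Assumption: ties go to the lowest devid (max() returns the first maximum).
        devid = max(sorted(free), key=lambda d: free[d])
        if free[devid] == 0:
            raise OSError("ENOSPC: no unallocated space left")
        free[devid] -= 1
        placed[devid] += 1
    return placed


# Hypothetical block groups, each listed by the devids it occupies.
block_groups = [(1, 2), (3, 4), (2, 5), (1, 3)]
selected = devid_filter(block_groups, 2)
print(selected)  # [(1, 2), (2, 5)]: only block groups touching devid 2 are balanced

# Hypothetical unallocated space in GiB; devices 1 and 2 are the largest.
unallocated = {1: 1000, 2: 1000, 3: 200, 4: 200, 5: 200}
placed = allocate_single(unallocated, chunks=600)
print(placed)  # {1: 300, 2: 300, 3: 0, 4: 0, 5: 0}
```

In this model the devid never acts as a destination: the new chunks alternate between the two devices with the most unallocated space, which matches the reported outcome.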

There currently isn't a good way to move data from all devices to one device while also changing profile without moving the data multiple times. As a workaround, you can alternate between two steps, repeating them until all data has been removed from devices 1, 3, 4, and 5:

  1. Resize devices 1, 3, 4, and 5 smaller, in 1 GiB increments, so that they have less unallocated space than device 2.
  2. Perform some of the conversion to single while device 2 has more unallocated space than the others, then stop the balance when the unallocated space is equal.

This reduces the number of data movements, but it requires a shell for loop or a small python-btrfs script to control the raw kernel ioctls and handle the switches between resizing and balancing.

If this is a feature request: it's a fairly straightforward patch to disable allocation on some devices (a variant of the existing allocation preferences patch with the "allocate nothing" extension). Once that is merged, then this operation can be performed in two steps:

  1. Disable allocation on devices 1, 3, 4, and 5 (set preference to "none" or whatever name the final implementation uses)
  2. Balance using the command line above with no devid filter: btrfs balance start -dconvert=single -mconvert=dup /storage

With allocation disabled on devices other than device 2, balance will have no choice but to reallocate all the data there.

@gdevenyi
Author

> Given your statement of the final result, the largest devices were likely devices 1 and 2, since that was where the data ended up.

Yes, this is correct.

> If this is a feature request

I guess it is now, since it is not currently possible to "un-balance" data off of a disk in preparation for removal. It sounds like the "allocate nothing" preference will address this.

@kdave kdave added the bug label Nov 8, 2024
@kakra

kakra commented Dec 4, 2024

> If this is a feature request: it's a fairly straightforward patch to disable allocation on some devices (a variant of the existing allocation preferences patch with the "allocate nothing" extension). Once that is merged, then this operation can be performed in two steps:

I've implemented a none-preferred mode in my patch set. I left out a none-only mode because I wanted to avoid unexpected out-of-space situations. For @gdevenyi's use case, if data is still allocated to the none-preferred device after the balance, there's always the option to add more space and balance that device ID again.

kakra/linux#36

@Zygo

Zygo commented Dec 4, 2024

> I wanted to avoid a none-only to avoid unexpected out-of-space situations.

Please do not do this. This is a frequent misunderstanding, and I realize you and others who have proposed this in the past have good intentions, but breaking the -only preferences is not the way.

The purpose of "none-only" is to force the filesystem to give up on allocation immediately, and return ENOSPC quickly, before making a mess that will take days to clean up. The unexpected out-of-space situation that preferred tries to avoid is the purpose of the -only feature.

The "preferred" variants allow data to spill over onto low-preference devices when space runs low. This is highly undesirable in cases like e.g. reducing an array from 12 disks down to 8. Balances on arrays that large can take weeks or months to run. If someone dumps a lot of data on the filesystem and it spills over onto devices we're trying to remove, we can lose days of IO time putting the data on the wrong devices before anyone notices (it's a month-long balance, we don't check what it's doing every day), and then more days of IO time taking the data off again. We'd rather just have the filesystem fill up and stop accepting more data, so there's no need to do extra work to clean up in the unexpected out-of-space scenario.
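The difference between the two behaviours can be sketched with a small Python model. This is an illustration under assumptions, not the patch set's logic: the mode names "none-preferred" and "none-only" come from this thread, while the functions, the device sizes, and the tie-break rule are hypothetical.

```python
import errno

# Toy model only. It mimics the behaviour described above, not the kernel patches.

def pick_device(free, excluded, mode):
    """Return the devid for the next 1 GiB chunk, or raise ENOSPC."""
    allowed = [d for d in sorted(free) if d not in excluded and free[d] > 0]
    if not allowed and mode == "none-preferred":
        # Spill over onto the devices we were trying to keep empty.
        allowed = [d for d in sorted(free) if free[d] > 0]
    if not allowed:
        # With "none-only" the filesystem is simply full.
        raise OSError(errno.ENOSPC, "no eligible device has unallocated space")
    return max(allowed, key=lambda d: free[d])


def write_chunks(free, excluded, mode, chunks):
    """Allocate `chunks` chunks and report how many landed on each device."""
    free = dict(free)
    placed = {d: 0 for d in free}
    for _ in range(chunks):
        d = pick_device(free, excluded, mode)
        free[d] -= 1
        placed[d] += 1
    return placed


# Hypothetical unallocated GiB per devid; devices 1 and 3 are being drained.
free = {1: 5, 2: 3, 3: 5}
spilled = write_chunks(free, excluded={1, 3}, mode="none-preferred", chunks=5)
print(spilled)  # {1: 1, 2: 3, 3: 1}: two chunks landed on devices being removed

try:
    write_chunks(free, excluded={1, 3}, mode="none-only", chunks=5)
except OSError as exc:
    print("none-only stopped early:", exc)
```

In the "none-only" run the writer gets an error after three chunks and nothing has to be moved off devices 1 and 3 later, which is the outcome argued for above.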

Putting data on a device we've told btrfs not to is bad. What if we set none-only on the devices because they're failing? Replacing none-only with none-preferred would put more data on a bad device.

Ideally there should be a sanity check to make sure the filesystem has the minimum number of drives for each profile, and reject a "-only" preference if it would mean e.g. there's only one drive for raid1 data. If, as a result of an "-only" preference, there's not enough space to allocate something, then the filesystem is merely full. There's no need to start putting data in unexpected places unless the user explicitly requests that by using -preferred instead of -only.
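Such a sanity check might look roughly like the following Python sketch. The function name and its interface are hypothetical and not taken from any existing patch; the table lists only a few profiles, with the minimum device counts that btrfs documents for them.

```python
# Hypothetical sanity check, for illustration only.

# Minimum number of devices for a few profiles (not an exhaustive list).
MIN_DEVICES = {"single": 1, "dup": 1, "raid1": 2, "raid1c3": 3, "raid1c4": 4}


def check_only_preference(all_devids, none_only, profiles):
    """Return a list of problems; an empty list means the preference is acceptable."""
    eligible = set(all_devids) - set(none_only)
    problems = []
    for profile in profiles:
        need = MIN_DEVICES[profile]
        if len(eligible) < need:
            problems.append(
                f"{profile} needs {need} device(s), "
                f"only {len(eligible)} would remain allocatable"
            )
    return problems


devids = [1, 2, 3, 4, 5]
# Leaving only device 2 allocatable is fine for single data and dup metadata...
ok = check_only_preference(devids, none_only=[1, 3, 4, 5], profiles=["single", "dup"])
# ...but would have to be rejected for raid1 data.
bad = check_only_preference(devids, none_only=[1, 3, 4, 5], profiles=["raid1"])
print(ok)
print(bad)
```

The check only counts devices. Running out of space on the remaining devices is deliberately not treated as an error here, since in that case the filesystem is merely full.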

Note that even if all existing drives are "none-only", btrfs can still allocate metadata in block groups that already exist. So it's not necessarily a problem even if on paper the configuration seems insane. It doesn't add any new failure modes compared to filling up all the drives on an unpatched btrfs. If we solved that problem, the solution would work without modification on preferred metadata too.

@kakra

kakra commented Dec 4, 2024

> The "preferred" variants allow data to spill over onto low-preference devices when space runs low. This is highly undesirable in cases like e.g. reducing an array from 12 disks down to 8. Balances on arrays that large can take weeks or months to run. If someone dumps a lot of data on the filesystem and it spills over onto devices we're trying to remove, we can lose days of IO time putting the data on the wrong devices before anyone notices (it's a month-long balance, we don't check what it's doing every day), and then more days of IO time taking the data off again. We'd rather just have the filesystem fill up and stop accepting more data, so there's no need to do extra work to clean up in the unexpected out-of-space scenario.

Okay, I understand the use-case. Then I should add the -only case, too.

But doesn't that essentially make it an operation of btrfs dev remove? Except that it doesn't immediately kick in a relocation of data?

@Zygo

Zygo commented Dec 5, 2024

> an operation of btrfs dev remove? Except that it doesn't immediately kick in a relocation of data

Exactly. dev remove works on only one device at a time, so if you need to remove many devices at once, e.g. 1, 2, and 3, the first remove of device 1 pushes data onto devices 2 and 3, then the remove of device 2 has to relocate that data off of device 2, putting more data on 3, then the remove of device 3 has to relocate all the data on 3. Some of the data on device 1 will be relocated 3 times. If it's a striped profile like raid10, all the data on the entire filesystem might be relocated 3 times.

With the none-only preference, we can mark devices 1, 2, and 3 none-only, then do the deletes. Each delete will not put any new data on devices 1, 2, or 3. Thus all the data on these devices is relocated only once.
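The saving can be estimated with a toy Python model of sequential removals under the single profile. Everything here is an assumption for illustration: six equally sized devices, 30 GiB used on each, 1 GiB chunks, and a "most free space" rule with ties going to the lowest devid.

```python
# Toy model only: counts how many GiB are relocated when devices are removed in turn.

def remove_devices(used, capacity, to_remove, none_only=()):
    """Remove devices one at a time; return (total GiB relocated, final usage)."""
    used = dict(used)
    moved = 0
    for victim in to_remove:
        amount = used.pop(victim)
        # Devices marked none-only never receive relocated data.
        targets = [d for d in sorted(used) if d not in none_only]
        for _ in range(amount):
            d = max(targets, key=lambda t: capacity - used[t])
            if capacity - used[d] == 0:
                raise OSError("ENOSPC: remaining devices are full")
            used[d] += 1
            moved += 1
    return moved, used


used = {devid: 30 for devid in range(1, 7)}  # hypothetical: 30 GiB on each of 6 devices

moved_plain, final_plain = remove_devices(used, capacity=100, to_remove=[1, 2, 3])
moved_marked, final_marked = remove_devices(
    used, capacity=100, to_remove=[1, 2, 3], none_only={1, 2, 3}
)
print(moved_plain)   # 111 GiB moved: 30 + 36 + 45
print(moved_marked)  # 90 GiB moved: each chunk relocated exactly once
print(final_plain == final_marked)  # True: same end state either way
```

Both runs end with 60 GiB on each of devices 4, 5, and 6, but the unmarked run moves 21 GiB more because data pushed onto devices 2 and 3 has to be moved again. With striped profiles the gap would be larger, as noted above.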

@kakra

kakra commented Dec 5, 2024

This is a very easy explanation to follow. I'll add that use-case to my patch then. Thanks @Zygo.

kakra added a commit to kakra/linux that referenced this issue Dec 6, 2024
This is useful where you want to prevent new allocations of chunks to
a set of multiple disks which are going to be removed from the pool.
This acts as a multiple `btrfs dev remove` on steroids that can remove
multiple disks in parallel without moving data to disks which would be
removed in the next round. In such cases, it will avoid moving the
same data multiple times, and thus avoid placing it on potentially bad
disks.

Thanks to @Zygo for the explanation and suggestion.

Link: kdave/btrfs-progs#907 (comment)
Signed-off-by: Kai Krakow <kai@kaishome.de>