
COW cp (--reflink) doesn't work across different datasets: Invalid cross-device link #15345

Open
darkbasic opened this issue Oct 3, 2023 · 29 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@darkbasic

System information

Type Version/Name
Distribution Name Arch Linux
Distribution Version
Kernel Version 6.6.0-rc4
Architecture amd64
OpenZFS Version git branch zfs-2.2-release + 6.6 compatibility patches

Describe the problem you're observing

Reflinking doesn't work across different datasets.
Since https://lore.kernel.org/linux-btrfs/cover.1645194730.git.josef@toxicpanda.com/T/#mf251325026fe2e15ed5119856bf654ba4f0d298b btrfs has allowed reflinking across different subvolumes, so it should be possible to achieve something similar on Linux with ZFS.
Not being able to reflink across different datasets vastly reduces the utility of reflinking.

Describe how to reproduce the problem

cp -a --reflink=always /path/to/first/dataset/file /path/to/second/dataset/

Include any warning/errors/backtraces from the system logs

[niko@arch-phoenix ~]$ cp --reflink=always ~/.cache/yay/chromium-wayland-vaapi/chromium-117.0.5938.132.tar.xz .
cp: failed to clone './chromium-117.0.5938.132.tar.xz' from '/home/niko/.cache/yay/chromium-wayland-vaapi/chromium-117.0.5938.132.tar.xz': Invalid cross-device link

P.S.
I have been told that --reflink=auto should be able to clone blocks across different datasets, but this isn't the case:

[niko@arch-phoenix ~]$ time cp ~/.cache/yay/chromium-wayland-vaapi/chromium-117.0.5938.132.tar.xz . && rm chromium-117.0.5938.132.tar.xz

real	0m3.136s
user	0m0.000s
sys	0m2.852s
[niko@arch-phoenix .cache]$ time cp yay/chromium-wayland-vaapi/chromium-117.0.5938.132.tar.xz . && rm chromium-117.0.5938.132.tar.xz

real	0m0.127s
user	0m0.000s
sys	0m0.127s

It took 3 seconds compared to 0.1 seconds when the dataset was the same, suggesting that reflinking didn't work across datasets.

@darkbasic darkbasic added the Type: Defect Incorrect behavior (e.g. crash, hang) label Oct 3, 2023
@rincebrain
Contributor

See here.

@darkbasic
Author

@rincebrain interesting, but judging by the time results it somehow doesn't work.

@robn
Member

robn commented Oct 3, 2023

--reflink=always calls ioctl(FICLONE). Linux inspects this call before passing it to the filesystem, and will reject it if source and destination files are not on a filesystem with the same superblock (and before 5.18, the same mountpoint).

This is not an OpenZFS bug as such, because if Linux would pass the call down, we would quite happily service it. Working around Linux's check here is extremely difficult, if it's even possible.

(the Btrfs example is not really relevant; OpenZFS and Btrfs have a fundamentally different construction and purpose).
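To make the mechanism concrete, here is a minimal C sketch (not OpenZFS or coreutils code; the function name is made up) of what `cp --reflink=always` boils down to: a single FICLONE ioctl, which the VFS answers with EXDEV for files on different superblocks before the filesystem driver is ever consulted.

```c
/* A sketch of cp --reflink=always: one FICLONE ioctl. When source and
 * destination sit on different superblocks (different datasets, or a
 * mounted snapshot), the VFS rejects it with EXDEV before ZFS sees it. */
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>   /* FICLONE */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int reflink_file(const char *src_path, const char *dst_path) {
    int src = open(src_path, O_RDONLY);
    if (src < 0)
        return -1;
    int dst = open(dst_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (dst < 0) {
        close(src);
        return -1;
    }
    int ret = ioctl(dst, FICLONE, src);   /* share blocks, copy no data */
    if (ret < 0)
        fprintf(stderr, "FICLONE: %s\n", strerror(errno));  /* e.g. EXDEV */
    close(src);
    close(dst);
    return ret;
}
```

On a reflink-capable filesystem within one superblock the ioctl succeeds; across datasets it fails with exactly the "Invalid cross-device link" shown in the issue description.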

--reflink=auto calls the copy_file_range() syscall, which in this case means "make a new file with the same contents as this existing one and I don't care how you do it". Often OpenZFS can service this with a clone, but not always (for many good reasons). If it can't, it'll fall back to a regular content copy.

Call time is not a very good indicator of whether a clone or a copy was done. To tell whether a file was cloned you currently have to dig around with zdb. But in any case, a full copy is acceptable behaviour here, because copy_file_range() (and therefore --reflink=auto) allows it.

Yes, this sucks. We'll keep working on it, but it's complicated.
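The --reflink=auto strategy described above can likewise be sketched in C (an illustration under stated assumptions: glibc 2.27+ for copy_file_range(); the function name is invented): ask the kernel via copy_file_range(), which a filesystem such as OpenZFS may satisfy with a block clone, and fall back to an ordinary read/write loop when the kernel refuses.

```c
/* Sketch of --reflink=auto: try copy_file_range() (the filesystem may
 * clone, may copy, or may refuse with EXDEV/EOPNOTSUPP); finish any
 * remainder with a plain userspace read/write copy. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int copy_maybe_clone(const char *src_path, const char *dst_path) {
    int src = open(src_path, O_RDONLY);
    if (src < 0)
        return -1;
    int dst = open(dst_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (dst < 0) {
        close(src);
        return -1;
    }
    off_t left = lseek(src, 0, SEEK_END);
    lseek(src, 0, SEEK_SET);

    while (left > 0) {
        ssize_t n = copy_file_range(src, NULL, dst, NULL, (size_t)left, 0);
        if (n > 0) {
            left -= n;
            continue;
        }
        /* Kernel refused or stalled: ordinary read/write from here on. */
        char buf[65536];
        ssize_t r = read(src, buf, sizeof buf);
        if (r <= 0 || write(dst, buf, (size_t)r) != r)
            break;
        left -= r;
    }
    close(src);
    close(dst);
    return left == 0 ? 0 : -1;
}
```

Whether the result was a clone or a copy is invisible to the caller, which is exactly why the timing numbers earlier in the thread are only suggestive.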

@rincebrain
Contributor

btrfs doesn't trip this because it presents subvolumes as the same "mountpoint", so the check doesn't bite it.

@darkbasic
Author

The problem is not only with different datasets, but with snapshots as well:

[niko@arch-phoenix ~]$ cp --reflink=always /home/.zfs/snapshot/zrepl_20231004_074556_000/niko/devel/linux-mainline.tar.gz .
cp: failed to clone './linux-mainline.tar.gz' from '/home/.zfs/snapshot/zrepl_20231004_074556_000/niko/devel/linux-mainline.tar.gz': Invalid cross-device link
[niko@arch-phoenix ~]$ mount | grep home
rpool/home on /home type zfs (rw,nodev,relatime,xattr,posixacl,casesensitive)
rpool/home@zrepl_20231004_074556_000 on /home/.zfs/snapshot/zrepl_20231004_074556_000 type zfs (ro,relatime,xattr,posixacl,casesensitive)

Accessing a snapshot creates a different mountpoint, which triggers the same issue.
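The "different mount" situation is visible from userspace: each mounted dataset or snapshot reports its own device ID. A small C sketch of the kind of same-filesystem check involved (a caveat: Linux actually compares superblocks, and st_dev is only a close proxy; btrfs subvolumes pass the kernel's check despite reporting distinct st_dev values):

```c
/* Each ZFS dataset/snapshot mount reports its own st_dev. Differing
 * values predict the EXDEV that FICLONE returns, since each mount has
 * its own superblock. */
#include <sys/stat.h>

int same_device(const char *a, const char *b) {
    struct stat sa, sb;
    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev;  /* 0 predicts EXDEV for FICLONE */
}
```

Running this on a file in /home and its counterpart under /home/.zfs/snapshot/... would report different devices, matching the mount output above.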

@robn
Member

robn commented Oct 4, 2023

Filesystems and snapshots are different datasets. Or, put another way, they're different mounts, and so different superblocks, and so Linux rejects the request.

I understand that this is frustrating, but no amount of pointing it out is going to make a quick fix happen. I'm aware of four possible solutions (or shapes of solutions):

  • Linux lifting the restriction
  • Adding OpenZFS-specific calls that Linux doesn't know about (and so won't intercept), and then adding support for those to common tools (like cp)
  • Adding zfs clonefile or similar command to do clones directly inside OpenZFS
  • Significantly modifying OpenZFS to use the same superblock for all mounts

I've been quietly exploring all of these options for a few weeks now. They are all difficult and/or complicated, for different reasons, and I also have very little time available to look at it. If you've got some other idea, I'm happy to hear it.

@darkbasic
Author

My only suggestion is to reach out to Kent Overstreet @koverstreet and ask what his plans for bcachefs are regarding this.
Lifting the Linux restriction would obviously be the best course of action, and working with a (soon-to-be) mainline filesystem would make that much easier.

@darkbasic
Author

darkbasic commented Oct 6, 2023

The reason why --reflink=auto (which in turn calls the copy_file_range() syscall) might not create clones is encryption. We can't clone encrypted blocks across datasets because the key material is partially bound to the source dataset (actually its encryption root). #14705 has a start on this.

@robn
Member

robn commented Oct 6, 2023

If you're going to take something I wrote and pass it off as your own you should at least adjust it to match the context. That's why it didn't work in your case. It can and does create clones in many other situations.

In any case, there's no bug here. These limitations are well understood and will be worked on as time and interest allows.

@darkbasic
Author

If you're going to take something I wrote and pass it off as your own you should at least adjust it to match the context.

That was not my intention and I'm sorry if it felt that way. I'm just trying my best to juggle the relevant information across the two threads so that anybody who stumbles upon either of them will understand what's going on.

That's why it didn't work in your case

I thought that was clear enough, but I've edited my past message to change "does" to "might".

@oromenahar
Contributor

Linux lifting the restriction

I think getting a patch into the Linux kernel to lift the cross-device link restriction could be pretty hard. There is no use case in the kernel and no in-kernel (fs) driver which needs this. I don't think the Linux kernel will accept, for example, a new ioctl flag FICLONEC for cross-device links without any use case inside the kernel, and I think the existing flags won't be changed either, because cross reflinking over different datasets doesn't make sense. I don't know for sure, but I guess so.

@darkbasic
Author

Unfortunately even bcachefs won't be a suitable candidate, because it already supports reflinking across different subvolumes.

@jittygitty

@darkbasic Perhaps, but regardless, Kent's very smart and knowledgeable and I think nice enough to give some objective and thoughtful opinions that may be helpful.

@lvd2

lvd2 commented Oct 31, 2023

cross reflinking over different datasets doesn't make sense. I don't know for sure, but I guess so.

From the user point of view, it makes lots of sense, like reorganizing data (files) between datasets.

@darkbasic
Author

Perhaps, but regardless, Kent's very smart and knowledgeable and I think nice enough to give some objective and thoughtful opinions that may be helpful.

I did write to him and I linked this issue, but he simply replied that bcachefs can already reflink between subvolumes.
I guess he's pretty busy with his own stuff.

From the user point of view, it makes lots of sense, like reorganizing data (files) between datasets.

Definitely.

@TerraTech
Contributor

Personally, I'd be OK with zfs clonefile: handling it internally would sidestep the substantial roadblocks to a functional implementation. I'd surmise this would work as a cross-platform solution as well.

FWIW - I'm just glad that this now exists (think large VM base/fluid type images), even if there are some barriers preventing it from full (envisioned) functionality.

@jittygitty

@darkbasic Yea, I read some of the Linux kernel dev emails and noticed many jumping on Kent as he was trying to get bcachefs upstream, seemed very "tense and stressful" for him. (skipping linking painful emails)

Interesting quote from Kent:
"Right now what I'm hearing, in particular from Redhat, is that they want it upstream in order to commit more resources. Which, I know, is not what kernel people want to hear, but it's the chicken-and-the-egg situation I'm in." (source https://lore.kernel.org/lkml/20230706173819.36c67pf42ba4gmv4@moria.home.lan/ )

Anyway, was really great to see we got a Halloween "present" and looks like Bcachefs was finally merged into the 6.7 kernel by Linus at the end of October!
So now Linux has OCFS2, Btrfs, XFS, Bcachefs, and ZFS all supporting reflinks.
Will be interesting to do feature/performance benchmark comparisons as well as seeing how the various filesystems do when put on a zfs ZVOL. I wish that ZVOLs had gotten a bit more love, but I digress...

Have any parts of ZFS gotten "rusty" at all over the years?? ( https://lwn.net/Articles/934692/ )

@kimono-koans

kimono-koans commented Nov 17, 2023

+1

FWIW I have a use case re: httm. I implemented FICLONE for httm specifically to make use of this feature with respect to ZFS.

Personally, I'd be ok with zfs clonefile, handling it internally and the barrier to a functional implementation would not have such a substantial roadblock. I'd surmise this would work as a cross-platform solution as well.

+1

Not that I deserve an opinion re: implementation, but I think a new subcommand/arg zfs clonefile and library function make more sense the greater the variance from default functionality. If ZFS could allow clones not simply from snapshots to live datasets (something btrfs does), but across datasets (rpool/srv to rpool/program) or to a sub-dataset (rpool/srv to rpool/srv/program), which Linux probably never will allow, then it makes sense to me to add a new subcommand.

@EchterAgo

@darkbasic I'm seeing the same on Ubuntu with coreutils 9.1 installed: --reflink=auto does not attempt to call copy_file_range. I am able to make it successfully reflink across datasets by specifying --sparse=never --reflink=auto.

@scineram

scineram commented Nov 24, 2023

Coreutils should have a --reflink=zfs option that would simply call the ZFS-internal clone function and have ZFS do its own internal checks, across datasets or within an encrypted clone family.

@EchterAgo

For some reason the --sparse=auto detection in coreutils 9.1 fails for me, resulting in cp always trying a sparse copy unless I specify --sparse=never. When doing sparse copies cp does not use copy_file_range. I'm currently trying to build coreutils 9.4 to see if things changed.

@EchterAgo

From trying to copy a 1GiB file:

lseek(3, 0, SEEK_DATA)                  = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
lseek(3, 0, SEEK_HOLE)                  = 1073741824
lseek(3, 0, SEEK_SET)                   = 0

Seems to me that indicates there is no hole in the source file, so why does cp still treat it as sparse?
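For reference, the probe visible in that strace output can be written as a small C check (a sketch of the idea, not the actual coreutils implementation): seeking for the first hole from offset 0 and comparing against the file size, since SEEK_HOLE reports a virtual hole at EOF for files with no real holes.

```c
/* Sparseness probe matching the lseek() calls in the strace output:
 * if the first hole found from offset 0 is at EOF, the file has no
 * holes and need not take the sparse-copy path. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int is_sparse(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    off_t size = lseek(fd, 0, SEEK_END);
    off_t hole = lseek(fd, 0, SEEK_HOLE);  /* first hole at/after offset 0 */
    close(fd);
    if (size < 0 || hole < 0)
        return -1;           /* e.g. SEEK_HOLE unsupported here */
    return hole < size;      /* a hole before EOF => file is sparse */
}
```

By this check, the 1 GiB file above (hole offset 1073741824 == file size) is not sparse, which is what makes cp's behaviour puzzling.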

@EchterAgo

Newer coreutils versions also added a --debug switch for cp which might be helpful for diagnosing this: coreutils/coreutils@d899f9e

@darkbasic
Author

I am able to make it successfully reflink across datasets by specifying --sparse=never --reflink=auto

You don't use native encryption, right? Otherwise I'm a little bit puzzled because it's not supposed to work with encryption.

P.S. You should disable block cloning altogether until things get figured out because of corruption reports on Gentoo.

@EchterAgo

I am able to make it successfully reflink across datasets by specifying --sparse=never --reflink=auto

You don't use native encryption, right? Otherwise I'm a little bit puzzled because it's not supposed to work with encryption.

P.S. You should disable block cloning altogether until things get figured out because of corruption reports on Gentoo.

No encryption, but LZ4 compression. I mean reflinking works if I pass the right options; it's the sparse detection that seems to fail in coreutils 9.1. I created the source file using dd if=/dev/urandom of=/path/file bs=1M count=1024, so it is definitely not a sparse file, yet cp treats it as one.

I know about the data corruption issue, thanks.

@strugee
Contributor

strugee commented Nov 24, 2023

I know about the data corruption issue, thanks.

Does anyone have an issue number for those of us who don't?

@darkbasic
Author

Here it is: #15526

@mschiff

mschiff commented Nov 24, 2023

* Linux lifting the restriction
* Adding OpenZFS-specific calls that Linux doesn't know about (and so won't intercept), and then adding support for those to common tools (like `cp`)
* Adding `zfs clonefile` or similar command to do clones directly inside OpenZFS
* Significantly modifying OpenZFS to use the same superblock for all mounts

I think zfs clonefile would be a good start. I have one more idea:

  • zfs-clonefile being its own binary, which could be symlinked as cp or mv and mimic the behaviour of the linked command when called with --reflink

or

  • maybe it could fork cp or mv and then 'capture' the relevant syscalls, turning them into ZFS-internal calls that create clones successfully? But I'm not sure this would even be possible.
