Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

samefile() is not implemented #246

Closed
WeatherGod opened this issue Jul 30, 2024 · 4 comments · Fixed by #261
Closed

samefile() is not implemented #246

WeatherGod opened this issue Jul 30, 2024 · 4 comments · Fixed by #261
Labels
compatibility 🤝 Compatibility with stdlib pathlib enhancement 🚀 New feature or request
Milestone

Comments

@WeatherGod
Copy link

Is there a plan for implementing UPath.samefile()? I can see how such a method might be difficult to define universally, and that it isn't the same thing as UPath.__eq__(), which is implemented, but this broke some code of mine that was usually using Path.samefile() and then failed when I passed it a UPath instead of a Path.

If this is a TODO, then is there anything I can do to help?

@ap--
Copy link
Collaborator

ap-- commented Jul 30, 2024

Thanks for opening the issue and for offering to contribute.

We could base an implementation for samefile() on the one from pathlib-abc:

https://github.com/barneygale/pathlib-abc/blob/e429fd2ae079d3c623e687715a8133f4c4769ed2/pathlib_abc/__init__.py#L556-L566

The problem is: st_ino and st_dev are basically 0 for filesystems other than local.

@property
def st_ino(self) -> int:
"""inode"""
ino = self._info.get("ino")
if isinstance(ino, int):
return ino
return self._seq[1]
@property
def st_dev(self) -> int:
"""device"""
dev = self._info.get("dev")
if isinstance(dev, int):
return dev
return self._seq[2]

So first we'd need to make sure that if a UPath with another protocol is provided that this method returns False. And then we'd need to make sure that this works on other filesystems too. I wonder why only st_ino and st_dev are compared and the other fields are ignored. I would assume that all stat fields are identical when two files are the same...

For a start, we could test if just comparing the stat_results for both will cover all tested filesystems.

A test would be implemented here:

def test_samefile(self):
pass

And we should at least add one test comparing two different protocols.

Let me know if anything is unclear,
Cheers,
Andreas 😃

@ap-- ap-- added enhancement 🚀 New feature or request compatibility 🤝 Compatibility with stdlib pathlib labels Jul 30, 2024
@WeatherGod
Copy link
Author

I suspect that only those two fields are compared for the sake of optimization. In the context of local filesystems, if those two things are the same between two fstats, then there is no need to check anything else because the query has been satisfied. Also, depending on the nature of the stat object, I could imagine possible race conditions if they can change underneath you, so fewer comparisons is your friend.

My own naive implementation would be to fast-path PosixUPath vs. (PosixUPath/Path) comparisons (is that the correct class for local protocol?), and then for everything else, normalize the paths and do an equality check (which would come back False for differing protocol). I think this risks some false negatives, and it isn't the fastest.

Actually, put a pin that idea... Would it make sense to have UPath.samefile() check if self and other are the same type and if so, defer to their own specialized ._samefile() implementation specific to their protocol if it is available and fall back to some inefficient, but general approach otherwise?

@ap--
Copy link
Collaborator

ap-- commented Aug 3, 2024

I would suggest starting with the general case. People don't use UPath or pathlib for performance, they usually use it for convenience.

If performance becomes an important enough issue for users, I would recommend moving the critical code path to fsspec. The UPath instance provides access to the fsspec filesystem, storage_options and urlpath for exactly this reason.

Collecting concrete cases where this was necessary will then inform us where we can have the most impact with optimizations in universal_pathlib.

@ap--
Copy link
Collaborator

ap-- commented Aug 3, 2024

Implementation wise I think I would start with comparing the stat objects for both if protocols match. For this it would be best to just implement __eq__ on the stat object in upath/_stat.py . In there we should compare the fsspec info dicts that are used to construct the stat_object. It might boil down to compare all normalized fields and the info extras that weren't parsed. To see if we need to compare the path too it would be good to have a concrete test case.

@ap-- ap-- added this to the v0.3.0 milestone Aug 23, 2024
@ap-- ap-- closed this as completed in #261 Aug 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compatibility 🤝 Compatibility with stdlib pathlib enhancement 🚀 New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants