-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use PurePosixPath for names of zipfile-entries (or even better: str
)?
#409
Conversation
This commit ensures that all zip-file names that are used as parameters in methods of ZipArchiveOperations use '/' in names of zip-archive items. That is due to the fact that zip enforces the use '/' in the names of its items (zip-APPNOTE.TXT 6.3.3 from September 1st, 2012). Until PR datalad#409 is merged this will ensure that the proper file names are used to check for items in zip-archives. The commit can be reverted once PR datalad#409 is merged.
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #409 +/- ##
==========================================
+ Coverage 92.02% 92.06% +0.03%
==========================================
Files 122 122
Lines 9033 9034 +1
==========================================
+ Hits 8313 8317 +4
+ Misses 720 717 -3
☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not convinced that this is a good move. I understand that this switch better reflects the internal nature of a zipfile. The cost of that switch is that now FileSystemItem
or more generally a PathBasedItem
is no longer having a platform-specific identifier. Consumers will now have to account for that in their code.
The use case you put forth is: "Consequently, those paths cannot be used to retrieve items from a zipfile.ZipFile without conversion to a POSIX-style path"
I would argue that this is not particularly relevant here. The ZipFile
instance to retrieve from via path is an internal detail that is not exposed. At the same time, users can set fp=True
and get direct access to file content -- plus a platform path to write to (given an externally supplied extraction base dir).
WDYT?
I implemented #415 that aims to show a different approach. It solves the problem posed here, but without the issues that I pointed out above. |
I am not sure yet, but I would like to put forward a few thoughts:
with TarArchiveOperations('<some archive with backslash-names>') as archive_ops:
for item in archive_ops:
assert item.name in archive_ops # <--- this will fail on Windows of the archive names contain backslashes! The same problem exists for the zip-Archive operations. We could probably ignore this and say that we do not support archives with backslashes in directory- or file-names on Windows platforms?! ... We could also only allows strings to name members of collections?! I know, this seems like a never-ending story. I don't want to be nagging (naggy?), but I think we have to be careful to get this right in order to reduce future maintenance costs and efforts. In this light I actually grow more and more fond of the idea of just using |
str
)?
There is no need to apologize for identifying a flaw that needs to be fixed. To break this down: This issue is not in the linked #415, but at
>>> PureWindowsPath(*PurePosixPath('d/a\\b\\c.txt').parts)
PureWindowsPath('d/a/b/c.txt') This means that we must relay the POSIX nature of the archive member path to the users, because there is no way to express this as a platform (windows) path -- and also no way to extract this file under an equivalent name on an FS that uses windows-semantics. If feel like a clean(er) solution would be to change |
Rational from datalad#409: ```py >>> PureWindowsPath(*PurePosixPath('d/a\\b\\c.txt').parts) PureWindowsPath('d/a/b/c.txt') ``` This means that we must relay the POSIX nature of the archive member path to the users, because there is no way to express this as a platform (windows) path -- and also no way to extract this file under an equivalent name on an FS that uses windows-semantics. So a type mismatch can be used to trigger mitigation strategies. If feel like a clean(er) solution would be to change `TarFileItem` to declare to have a `name` of type `PurePosixPath`.
Rational from datalad#409: ```py >>> PureWindowsPath(*PurePosixPath('d/a\\b\\c.txt').parts) PureWindowsPath('d/a/b/c.txt') ``` This means that we must relay the POSIX nature of the archive member path to the users, because there is no way to express this as a platform (windows) path -- and also no way to extract this file under an equivalent name on an FS that uses windows-semantics. So a type mismatch can be used to trigger mitigation strategies. If feel like a clean(er) solution would be to change `TarFileItem` to declare to have a `name` of type `PurePosixPath`.
Rational from datalad#409: ```py >>> PureWindowsPath(*PurePosixPath('d/a\\b\\c.txt').parts) PureWindowsPath('d/a/b/c.txt') ``` This means that we must relay the POSIX nature of the archive member path to the users, because there is no way to express this as a platform (windows) path -- and also no way to extract this file under an equivalent name on an FS that uses windows-semantics. So a type mismatch can be used to trigger mitigation strategies. If feel like a clean(er) solution would be to change `TarFileItem` to declare to have a `name` of type `PurePosixPath`. For the same reason and rational, a symlink target must also be communicated in POSIX form.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be merged, once the suggested changes to the ZipfileItem class (type change) are pushed.
This commit modifies zip-iterators to use `PurePosixPath` instead of `PurePath`. The reason is that, according to zip-APPNOTE.TXT version 6.3.3 from 2012-09-01, all slashes in file names (contained in zip) should be forward slashes, i.e. '/' (see https://pkware.cachefly.net/webdocs/APPNOTE/APPNOTE-6.3.3.TXT) zip-APPNOTE.TXT from 2006-09-26 states that all slashes should be forward slashes. It seem to be safe to assume that almost all zip-files we encounter will be using forward slashes and using a `PurePath` on Windows to represent names of files in a zip-archive will not reflect the content properly. In addition, those paths cannot be used to retrieve items from a `zipfile.ZipFile` without conversion to a posix-style path. We should therefore use `PurePosixPath`-instances to represent the name of elements in a zip-archive.
44eb81f
to
c0c325b
Compare
This commit changes the type of ZipfileItem.name to `PurePosixPath`. This is to ensure that instances will properly represent zipfile-items with challenging names, e.g. backslash in a name, correctly on all platforms. This has become necessary because the preivously used `PurePath`-type would yield `PureWindowsPath`- instances on Windows. Those could not properly handle zipfile-item names that contain backslashes (because `PureWindowsPath` interprets backslashes as directory name-separators and would therfore interpret a name like "a\b\c" as file "c" in the directory "a/b". The commit also includes a zipfile-item with a name that is not a valid Windows-path name in the test zip-file. (The referred to name does not contain a backslash, because on Windows the backslash does not make it into the zip-file item name.)
c0c325b
to
3a94df7
Compare
Changes are done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, LGTM!
This commit ensures that all zip-file names that are used as parameters in methods of ZipArchiveOperations use '/' in names of zip-archive items. That is due to the fact that zip enforces the use '/' in the names of its items (zip-APPNOTE.TXT 6.3.3 from September 1st, 2012). Until PR datalad#409 is merged this will ensure that the proper file names are used to check for items in zip-archives. The commit can be reverted once PR datalad#409 is merged.
This commit ensures that all zip-file names that are used as parameters in methods of ZipArchiveOperations use '/' in names of zip-archive items. That is due to the fact that zip enforces the use '/' in the names of its items (zip-APPNOTE.TXT 6.3.3 from September 1st, 2012). Until PR datalad#409 is merged this will ensure that the proper file names are used to check for items in zip-archives. The commit can be reverted once PR datalad#409 is merged.
This commit ensures that all zip-file names that are used as parameters in methods of ZipArchiveOperations use '/' in names of zip-archive items. That is due to the fact that zip enforces the use '/' in the names of its items (zip-APPNOTE.TXT 6.3.3 from September 1st, 2012). Until PR datalad#409 is merged this will ensure that the proper file names are used to check for items in zip-archives. The commit can be reverted once PR datalad#409 is merged.
This commit ensures that all zip-file names that are used as parameters in methods of ZipArchiveOperations use '/' in names of zip-archive items. That is due to the fact that zip enforces the use '/' in the names of its items (zip-APPNOTE.TXT 6.3.3 from September 1st, 2012). Until PR datalad#409 is merged this will ensure that the proper file names are used to check for items in zip-archives. The commit can be reverted once PR datalad#409 is merged.
This PR modifies zip-iterators to use
PurePosixPath
for names of archive members instead ofPurePath
. The rationale for this change is that according to zip-APPNOTE.TXT version 6.3.3 from September 1st, 2012, all slashes in file names (contained in zip) must be forward slashes, i.e. '/'. From September 26th, 2006 onwards zip-APPNOTE.TXT version 6.3.0 states that all slashes should be forward slashes.It seems to be safe to assume that almost all zip-files we encounter will be using forward slashes. Because
PurePath
instances are OS-specific, aPurePath
on Windows will contain\
-slashes and not properly represent the content of the archive. Consequently, those paths cannot be used to retrieve items from azipfile.ZipFile
without conversion to a POSIX-style path.We should therefore use
PurePosixPath
-instances to represent the name of elements in a zip-archive.This touches again on issue #402. Maybe we should introduce different base classes, one for objects with platform-dependent paths, which would use
PurePath
, and one for objects with platform-independent, fixed, path-formats, e.g. zip-file items.