Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use PurePosixPath for names of zipfile-entries (or even better: str)? #409

Merged
merged 2 commits into from
Jun 20, 2023

Conversation

christian-monch
Copy link
Contributor

@christian-monch christian-monch commented Jun 8, 2023

This PR modifies zip-iterators to use PurePosixPath for names of archive members instead of PurePath. The rationale for this change is that according to zip-APPNOTE.TXT version 6.3.3 from September 1st, 2012, all slashes in file names (contained in zip) must be forward slashes, i.e. '/'. From September 26th, 2006 onwards zip-APPNOTE.TXT version 6.3.0 states that all slashes should be forward slashes.

It seems to be safe to assume that almost all zip-files we encounter will be using forward slashes. Because PurePath instances are OS-specific, a PurePath on Windows will contain \-slashes and not properly represent the content of the archive. Consequently, those paths cannot be used to retrieve items from a zipfile.ZipFile without conversion to a POSIX-style path.

We should therefore use PurePosixPath-instances to represent the name of elements in a zip-archive.

This touches again on issue #402. Maybe we should introduce different base classes, one for objects with platform-dependent paths, which would use PurePath, and one for objects with platform-independent, fixed, path-formats, e.g. zip-file items.

@christian-monch christian-monch requested a review from mih as a code owner June 8, 2023 05:11
christian-monch added a commit to christian-monch/datalad-next that referenced this pull request Jun 8, 2023
This commit ensures that all zip-file names
that are used as parameters in methods of
ZipArchiveOperations use '/' in names
of zip-archive items.

That is due to the fact that zip enforces
the use '/' in the names of its items
(zip-APPNOTE.TXT 6.3.3 from September 1st,
2012).

Until PR datalad#409
is merged this will ensure that the proper
file names are used to check for items in
zip-archives. The commit can be reverted
once PR  datalad#409
is merged.
@codecov
Copy link

codecov bot commented Jun 8, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.03 🎉

Comparison is base (a51e6f0) 92.02% compared to head (3a94df7) 92.06%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #409      +/-   ##
==========================================
+ Coverage   92.02%   92.06%   +0.03%     
==========================================
  Files         122      122              
  Lines        9033     9034       +1     
==========================================
+ Hits         8313     8317       +4     
+ Misses        720      717       -3     
Impacted Files Coverage Δ
...atalad_next/iter_collections/tests/test_iterzip.py 100.00% <100.00%> (ø)
datalad_next/iter_collections/zipfile.py 100.00% <100.00%> (ø)

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Member

@mih mih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not convinced that this is a good move. I understand that this switch better reflects the internal nature of a zipfile. The cost of that switch is that now FileSystemItem or more generally a PathBasedItem is no longer having a platform-specific identifier. Consumers will now have to account for that in their code.

The use case you put forth is: "Consequently, those paths cannot be used to retrieve items from a zipfile.ZipFile without conversion to a POSIX-style path"

I would argue that this is not particularly relevant here. The ZipFile instance to retrieve from via path is an internal detail that is not exposed. At the same time, users can set fp=True and get direct access to file content -- plus a platform path to write to (given an externally supplied extraction base dir).

WDYT?

@mih
Copy link
Member

mih commented Jun 11, 2023

I implemented #415 that aims to show a different approach. It solves the problem posed here, but without the issues that I pointed out above.

@christian-monch
Copy link
Contributor Author

christian-monch commented Jun 15, 2023

I am not convinced that this is a good move. I understand that this switch better reflects the internal nature of a zipfile. The cost of that switch is that now FileSystemItem or more generally a PathBasedItem is no longer having a platform-specific identifier. Consumers will now have to account for that in their code.

The use case you put forth is: "Consequently, those paths cannot be used to retrieve items from a zipfile.ZipFile without conversion to a POSIX-style path"

I would argue that this is not particularly relevant here. The ZipFile instance to retrieve from via path is an internal detail that is not exposed. At the same time, users can set fp=True and get direct access to file content -- plus a platform path to write to (given an externally supplied extraction base dir).

WDYT?

I am not sure yet, but I would like to put forward a few thoughts:

  1. I like the PurePath.as_posix()-approach from Update TarArchiveOperations and equip with tests  #415. But I would like to mention the following differences between Windows and Linux:

    a. PurePath('d/a\\b\\c.txt').as_posix() == 'd/a/b/c.txt' (on Windows)

    b. PurePath('d/a\\b\\c.txt').as_posix() == 'd/a\\b\\c.txt' (on Linux)

  2. The previous means that the following code snippet will fail on Windows if an item with a name like 'd/a\\b\\c.txt', i.e. a name that contains backslashes, exists in the archive. That means PurePath cannot always be used to retrieve an item, even if we use the PurePath-instance stored in item.name (the assert below will fail on Windows):

with TarArchiveOperations('<some archive with backslash-names>') as archive_ops:
    for item in archive_ops:
        assert item.name in archive_ops       # <--- this will fail on Windows of the archive names contain backslashes!

The same problem exists for the zip-Archive operations.

We could probably ignore this and say that we do not support archives with backslashes in directory- or file-names on Windows platforms?! ... We could also only allows strings to name members of collections?!

I know, this seems like a never-ending story. I don't want to be nagging (naggy?), but I think we have to be careful to get this right in order to reduce future maintenance costs and efforts. In this light I actually grow more and more fond of the idea of just using str for names of archive-members.

@christian-monch christian-monch changed the title Use PurePosixPath for names of zipfile-entries Use PurePosixPath for names of zipfile-entries (or even better: str)? Jun 15, 2023
@mih
Copy link
Member

mih commented Jun 16, 2023

There is no need to apologize for identifying a flaw that needs to be fixed.

To break this down: This issue is not in the linked #415, but at

name=PurePath(PurePosixPath(member.name)),
, because

>>> PureWindowsPath(*PurePosixPath('d/a\\b\\c.txt').parts)
PureWindowsPath('d/a/b/c.txt')

This means that we must relay the POSIX nature of the archive member path to the users, because there is no way to express this as a platform (windows) path -- and also no way to extract this file under an equivalent name on an FS that uses windows-semantics.

If feel like a clean(er) solution would be to change TarFileItem to declare to have a name of type PurePosixPath (and likewise for ZipFileItem. Plus some documentation at that declaration that points out the motivation. We might also consider to derive both from an intermediate ArchiveFileSystemItem that does that.

mih added a commit to mih/datalad-next that referenced this pull request Jun 16, 2023
Rational from datalad#409:

```py
>>> PureWindowsPath(*PurePosixPath('d/a\\b\\c.txt').parts)
PureWindowsPath('d/a/b/c.txt')
```

This means that we must relay the POSIX nature of the archive member
path to the users, because there is no way to express this as a platform
(windows) path -- and also no way to extract this file under an
equivalent name on an FS that uses windows-semantics. So a type
mismatch can be used to trigger mitigation strategies.

If feel like a clean(er) solution would be to change `TarFileItem` to
declare to have a `name` of type `PurePosixPath`.
mih added a commit to mih/datalad-next that referenced this pull request Jun 16, 2023
Rational from datalad#409:

```py
>>> PureWindowsPath(*PurePosixPath('d/a\\b\\c.txt').parts)
PureWindowsPath('d/a/b/c.txt')
```

This means that we must relay the POSIX nature of the archive member
path to the users, because there is no way to express this as a platform
(windows) path -- and also no way to extract this file under an
equivalent name on an FS that uses windows-semantics. So a type
mismatch can be used to trigger mitigation strategies.

If feel like a clean(er) solution would be to change `TarFileItem` to
declare to have a `name` of type `PurePosixPath`.
mih added a commit to mih/datalad-next that referenced this pull request Jun 19, 2023
Rational from datalad#409:

```py
>>> PureWindowsPath(*PurePosixPath('d/a\\b\\c.txt').parts)
PureWindowsPath('d/a/b/c.txt')
```

This means that we must relay the POSIX nature of the archive member
path to the users, because there is no way to express this as a platform
(windows) path -- and also no way to extract this file under an
equivalent name on an FS that uses windows-semantics. So a type
mismatch can be used to trigger mitigation strategies.

If feel like a clean(er) solution would be to change `TarFileItem` to
declare to have a `name` of type `PurePosixPath`.

For the same reason and rational, a symlink target must also be
communicated in POSIX form.
Copy link
Member

@mih mih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be merged, once the suggested changes to the ZipfileItem class (type change) are pushed.

This commit modifies zip-iterators to use
`PurePosixPath` instead of `PurePath`. The
reason is that, according to zip-APPNOTE.TXT
version 6.3.3 from 2012-09-01, all slashes in
file names (contained in zip) should be
forward slashes, i.e. '/' (see
https://pkware.cachefly.net/webdocs/APPNOTE/APPNOTE-6.3.3.TXT)
zip-APPNOTE.TXT from 2006-09-26 states that
all slashes should be forward slashes.

It seem to be safe to assume that almost all
zip-files we encounter will be using forward
slashes and using a `PurePath` on Windows
to represent names of files in a zip-archive
will not reflect the content properly. In
addition, those paths cannot be used to
retrieve items from a `zipfile.ZipFile`
without conversion to a posix-style path.

We should therefore use `PurePosixPath`-instances
to represent the name of elements in a
zip-archive.
@christian-monch christian-monch force-pushed the fix-zip-filename branch 2 times, most recently from 44eb81f to c0c325b Compare June 20, 2023 08:41
This commit changes the type of ZipfileItem.name
to `PurePosixPath`. This is to ensure that instances
will properly represent zipfile-items with
challenging names, e.g. backslash in a name,
correctly on all platforms.

This has become necessary because the preivously
used `PurePath`-type would yield `PureWindowsPath`-
instances on Windows. Those could not properly
handle zipfile-item names that contain backslashes
(because `PureWindowsPath` interprets backslashes
as directory name-separators and would therfore
interpret a name like "a\b\c" as file "c" in the
directory "a/b".

The commit also includes a zipfile-item with a
name that is not a valid Windows-path name in the
test zip-file. (The referred to name does not
contain a backslash, because on Windows the
backslash does not make it into the zip-file item
name.)
@christian-monch
Copy link
Contributor Author

I think this can be merged, once the suggested changes to the ZipfileItem class (type change) are pushed.

Changes are done.

Copy link
Member

@mih mih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM!

@mih mih merged commit 1c8ce51 into datalad:main Jun 20, 2023
christian-monch added a commit to christian-monch/datalad-next that referenced this pull request Jun 21, 2023
This commit ensures that all zip-file names
that are used as parameters in methods of
ZipArchiveOperations use '/' in names
of zip-archive items.

That is due to the fact that zip enforces
the use '/' in the names of its items
(zip-APPNOTE.TXT 6.3.3 from September 1st,
2012).

Until PR datalad#409
is merged this will ensure that the proper
file names are used to check for items in
zip-archives. The commit can be reverted
once PR  datalad#409
is merged.
christian-monch added a commit to christian-monch/datalad-next that referenced this pull request Jun 22, 2023
This commit ensures that all zip-file names
that are used as parameters in methods of
ZipArchiveOperations use '/' in names
of zip-archive items.

That is due to the fact that zip enforces
the use '/' in the names of its items
(zip-APPNOTE.TXT 6.3.3 from September 1st,
2012).

Until PR datalad#409
is merged this will ensure that the proper
file names are used to check for items in
zip-archives. The commit can be reverted
once PR  datalad#409
is merged.
christian-monch added a commit to christian-monch/datalad-next that referenced this pull request Aug 25, 2023
This commit ensures that all zip-file names
that are used as parameters in methods of
ZipArchiveOperations use '/' in names
of zip-archive items.

That is due to the fact that zip enforces
the use '/' in the names of its items
(zip-APPNOTE.TXT 6.3.3 from September 1st,
2012).

Until PR datalad#409
is merged this will ensure that the proper
file names are used to check for items in
zip-archives. The commit can be reverted
once PR  datalad#409
is merged.
mih pushed a commit to christian-monch/datalad-next that referenced this pull request Dec 18, 2023
This commit ensures that all zip-file names
that are used as parameters in methods of
ZipArchiveOperations use '/' in names
of zip-archive items.

That is due to the fact that zip enforces
the use '/' in the names of its items
(zip-APPNOTE.TXT 6.3.3 from September 1st,
2012).

Until PR datalad#409
is merged this will ensure that the proper
file names are used to check for items in
zip-archives. The commit can be reverted
once PR  datalad#409
is merged.
@christian-monch christian-monch deleted the fix-zip-filename branch July 16, 2024 10:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants