Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Double check and document (if needed) .name conventions for iter_collections #554

Closed
mih opened this issue Dec 6, 2023 · 1 comment · Fixed by #583
Closed

Double check and document (if needed) .name conventions for iter_collections #554

mih opened this issue Dec 6, 2023 · 1 comment · Fixed by #583

Comments

@mih
Copy link
Member

mih commented Dec 6, 2023

I cannot put my finger on it, but I feel the nature of the value of .name for a reported item is non-uniform.

I am not aware of a documentation for the convention, if how iterators should behave.

The .name should always be a PurePath|Path objects.

However, should it carry the base path (key argument to the iterator) in it? Or not?

It would make sense to me to never have it. Rational:

There is no reason while a (file) collection should be on a file system. It could have a very non-Path base address and no hierarchical organization inside (think object store: some URL, plus some object ID).

We need to go through and verify that we did not overfit to a filesystem use case in an implementation.

@mih
Copy link
Member Author

mih commented Jan 5, 2024

#581 brought up a related issue. It makes sense to have the name attribute be as cheap as possible, and only provide access to more complex/expensive types via lazily evaluated properties.

mih added a commit to mih/datalad-next that referenced this issue Jan 5, 2024
Previously, the `name` attribute was some kind of `PurePath` instance.
The precise nature varied across item types and corresponding
iterators.

This led to two problems:

- Performance (see datalad#581): the creation of a Path instance is not
  particularly cheap. If a consumer does not need such an instance,
  and unconditional creation is a waste of resources
- Abstraction (see datalad#430): the Path representation of a path-like
  name does not always match the exact semantics of the original
  name (it might have a trailing slash that has a purpose, etc)

Both problems are now addressed by relaxing the type of `.name`
of a `PathBasedItem`. It can now be anything. Consumers that require
an actual Path instance can use the `.path` attribute. An analog
access is also implemented for `.link_target` (which now remains
literal), that is accompanied by `.link_target_path`, where it
was necessary.

This change in approach removed the need for the `_ZipFileDirPath`
workaround that was used to re-establish compatibility of a Path-based
`.name` with the conventions in a `ZipFile` collection/container.
With the `.name` attribute retaining its native semantics, the
workaround is no longer needed as was removed.

In order to document the (lack of) implemented homogeneous conventions
for path-based items, the docstrings of all iterators have been amended
with information on the nature of the `.name` attribute yielded by them.
The corresponding data classes have received docstrings for the newly
added properties for Path access.

These properties uniformly use the `cached_property` decorator.
For the lifetime of an item, the nature of such a path should never
change, and caching it automatically addresses the significant
creation cost on accessing a path representation repeatedly.

Closes datalad#554
Closes datalad#581
mih added a commit to mih/datalad-next that referenced this issue Jan 5, 2024
Previously, the `name` attribute was some kind of `PurePath` instance.
The precise nature varied across item types and corresponding
iterators.

This led to two problems:

- Performance (see datalad#581): the creation of a Path instance is not
  particularly cheap. If a consumer does not need such an instance,
  and unconditional creation is a waste of resources
- Abstraction (see datalad#430): the Path representation of a path-like
  name does not always match the exact semantics of the original
  name (it might have a trailing slash that has a purpose, etc)

Both problems are now addressed by relaxing the type of `.name`
of a `PathBasedItem`. It can now be anything. Consumers that require
an actual Path instance can use the `.path` attribute. An analog
access is also implemented for `.link_target` (which now remains
literal), that is accompanied by `.link_target_path`, where it
was necessary.

This change in approach removed the need for the `_ZipFileDirPath`
workaround that was used to re-establish compatibility of a Path-based
`.name` with the conventions in a `ZipFile` collection/container.
With the `.name` attribute retaining its native semantics, the
workaround is no longer needed as was removed.

In order to document the (lack of) implemented homogeneous conventions
for path-based items, the docstrings of all iterators have been amended
with information on the nature of the `.name` attribute yielded by them.
The corresponding data classes have received docstrings for the newly
added properties for Path access.

These properties uniformly use the `cached_property` decorator.
For the lifetime of an item, the nature of such a path should never
change, and caching it automatically addresses the significant
creation cost on accessing a path representation repeatedly.

Closes datalad#554
Closes datalad#581
@mih mih closed this as completed in #583 Jan 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant