Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Repository: Add the as_path context manager #6151

Merged
merged 2 commits into from
Oct 20, 2023

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Oct 16, 2023

The node repository interface intentionally does not provide access to
its file objects through filepaths on the file system. This is because,
for efficiency reasons, the content of a repository may not actually be
stored as individual files on a file system, but for example are stored
in an object store.

Therefore, the contents of the repository can only be retrieved as a
file-like object or read as a string or list of bytes into memory.
Certain use-cases require a file to be made available through a filepath.
An example is when it needs to be passed to an API that only accepts a
filepath, such as numpy.loadfromtxt.

Currently, the user will have to manually copy the content of the repo's
content to a temporary file on disk, and pass the temporary filepath.
This results in clients having to often resport to the following snippet:

import pathlib
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp_path:

    # Copy the entire content to the temporary folder
    dirpath = pathlib.Path(tmp_path)
    node.base.repository.copy_tree(dirpath)

    # Or copy the content of a file. Should use streaming
    # to avoid reading everything into memory
    filepath = (dirpath / 'some_file.txt')
    with filepath.open('rb') as target:
        with node.base.repository.open('rb') as source:
            shutil.copyfileobj(source, target)

    # Now use `filepath` to library call, e.g.
    numpy.loadtxt(filepath)

This logic is now provided under the as_path context manager. This
will make it easy to access repository content as files on the local
file system. A warning is added to the docs explaining the inefficiency
of the content having to be read and written to a temporary directory
first, encouraging it only to be used when the alternative is not an
option.

@sphuber sphuber force-pushed the feature/repo-as-filepath branch 3 times, most recently from acd17b2 to 05d4ffc Compare October 16, 2023 12:23
docs/source/topics/data_types.rst Outdated Show resolved Hide resolved
Comment on lines 321 to 323
In [4]: with single_file.open() as handle:
print(handle.read())
Out[4]: 'The file content'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While simple, this is not a good example IMHO, because it ends up anyway putting everything in memory in handle.read(), defeating the purpose and making it not clear why we have to go via the new interface.

We should thing at something simple but really streaming. E.g. just pseudo code:

CHUNK_SIZE = 65536
with single_file.open() as handle:
    while True:
        chunk = handle.read(CHUNK_SIZE)
        if not chunk: 
            break
        #process chunk here

Or something simple, like counting length:

CHUNK_SIZE = 65536

length = 0
with single_file.open() as handle:
    while True:
        chunk = handle.read(CHUNK_SIZE)
        if not chunk: 
            break
        length += len(chunk)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about that, but I find this example is quite complex and risks confusing the reader. Maybe something like this:

import shutil

# Copy a large file from repo to file on disk without loading in memory
with single_file.open(mode='rb') as source:
    with open('copy.txt', mode='wb') as target:
        shutil.copyfileobj(source, target)

Think this will be more intuitive and an actual use case for some users.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK!

docs/source/topics/data_types.rst Show resolved Hide resolved

.. code-block:: ipython

In [9]: with folder.open('subdir/file3.txt') as handle:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as before

docs/source/topics/data_types.rst Show resolved Hide resolved
tests/orm/nodes/test_repository.py Show resolved Hide resolved
The node repository interface intentionally does not provide access to
its file objects through filepaths on the file system. This is because,
for efficiency reasons, the content of a repository may not actually be
stored as individual files on a file system, but for example are stored
in an object store.

Therefore, the contents of the repository can only be retrieved as a
file-like object or read as a string or list of bytes into memory.
Certain use-cases require a file to be made available through a filepath.
An example is when it needs to be passed to an API that only accepts a
filepath, such as `numpy.loadfromtxt`.

Currently, the user will have to manually copy the content of the repo's
content to a temporary file on disk, and pass the temporary filepath.
This results in clients having to often resport to the following snippet:

    import pathlib
    import shutil
    import tempfile

    with tempfile.TemporaryDirectory() as tmp_path:

        # Copy the entire content to the temporary folder
        dirpath = pathlib.Path(tmp_path)
        node.base.repository.copy_tree(dirpath)

        # Or copy the content of a file. Should use streaming
        # to avoid reading everything into memory
        filepath = (dirpath / 'some_file.txt')
        with filepath.open('rb') as target:
            with node.base.repository.open('rb') as source:
                shutil.copyfileobj(source, target)

        # Now use `filepath` to library call, e.g.
        numpy.loadtxt(filepath)

This logic is now provided under the `as_path` context manager. This
will make it easy to access repository content as files on the local
file system. The snippet above is simplified to:

    with node.base.repository.as_path() as filepath:
        numpy.loadtxt(filepath)

The method is exposed directly in the interface of the `FolderData` and
`SinglfileData` data types. A warning is added to the docs explaining
the inefficiency of the content having to be read and written to a
temporary directory first, encouraging it only to be used when the
alternative is not an option.
@sphuber sphuber force-pushed the feature/repo-as-filepath branch from 05d4ffc to f664650 Compare October 17, 2023 08:21
@sphuber sphuber requested a review from giovannipizzi October 17, 2023 08:22

In [5]: with single_file.as_path() as filepath:
numpy.loadtxt(filepath)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you mention above generally shutil.copyfileobj, shouldn't we have an example of that here as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But shutil.copyfileobj takes a file handle, not a filepath. Of course I could then do

with single_file.as_path() as filepath:
    with filepath.open() as source:
        with open('output.txt', 'wb') as target:
            shutil.copyfileobj(source, target)

That would be defeating the point. Or did you mean another method like shutil.copyfileobj but that uses filepaths?

@sphuber sphuber merged commit b0546e8 into aiidateam:main Oct 20, 2023
18 checks passed
@sphuber sphuber deleted the feature/repo-as-filepath branch October 20, 2023 06:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants