
Repository: Add the `as_path` context manager #6151

Merged

sphuber merged 2 commits into aiidateam:main from sphuber:feature/repo-as-filepath

Oct 20, 2023

Contributor

sphuber commented Oct 16, 2023

The node repository interface intentionally does not provide access to
its file objects through filepaths on the file system. This is because,
for efficiency reasons, the content of a repository may not actually be
stored as individual files on a file system but, for example, in an
object store.

Therefore, the contents of the repository can only be retrieved as a
file-like object or read into memory as a string or bytes.
Certain use cases require a file to be made available through a filepath.
An example is when it needs to be passed to an API that only accepts a
filepath, such as numpy.loadtxt.

Currently, the user has to manually copy the content of the repository
to a temporary file on disk and pass the temporary filepath. This
results in clients often having to resort to the following snippet:

import pathlib
import shutil
import tempfile

import numpy

with tempfile.TemporaryDirectory() as tmp_path:

    # Copy the entire content to the temporary folder
    dirpath = pathlib.Path(tmp_path)
    node.base.repository.copy_tree(dirpath)

    # Or copy the content of a single file. Use streaming
    # to avoid reading everything into memory
    filepath = dirpath / 'some_file.txt'
    with filepath.open('wb') as target:
        with node.base.repository.open('some_file.txt', 'rb') as source:
            shutil.copyfileobj(source, target)

    # Now pass `filepath` to the library call, e.g.
    numpy.loadtxt(filepath)

This logic is now provided by the as_path context manager, which
makes it easy to access repository content as files on the local
file system. A warning is added to the docs explaining the inefficiency
of the content having to be read and written to a temporary directory
first, so that it is only used when there is no alternative.
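
As a rough illustration of the pattern that such a context manager wraps, here is a minimal sketch that does not use AiiDA at all. The `InMemoryRepository` class and its `open` and `as_path` methods are hypothetical stand-ins and are not the implementation added in this PR. The sketch only shows the idea of streaming content to a temporary directory, yielding a filepath, and cleaning up on exit.

```python
import contextlib
import io
import pathlib
import shutil
import tempfile
from typing import Dict, Iterator


class InMemoryRepository:
    """Hypothetical stand-in for a repository whose content is not stored as plain files."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self._objects = objects

    def open(self, name: str) -> io.BytesIO:
        """Return a binary file-like object for the named object."""
        return io.BytesIO(self._objects[name])

    @contextlib.contextmanager
    def as_path(self, name: str) -> Iterator[pathlib.Path]:
        """Yield a temporary filepath holding a copy of the named object."""
        with tempfile.TemporaryDirectory() as tmp_path:
            filepath = pathlib.Path(tmp_path) / name
            # Stream the content to disk instead of reading it all into memory.
            with self.open(name) as source, filepath.open('wb') as target:
                shutil.copyfileobj(source, target)
            yield filepath
        # The temporary directory, and the copy in it, is removed on exit.


repository = InMemoryRepository({'some_file.txt': b'1.0 2.0\n3.0 4.0\n'})

with repository.as_path('some_file.txt') as filepath:
    # Any API that only accepts a filepath can be called here.
    content = filepath.read_bytes()
    existed_inside = filepath.exists()

existed_after = filepath.exists()
```

The filepath is only valid inside the `with` block, which is also why the copy cost is paid on every call.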

sphuber force-pushed the feature/repo-as-filepath branch 3 times, most recently from acd17b2 to 05d4ffc Compare

October 16, 2023 12:23

giovannipizzi requested changes


docs/source/topics/data_types.rst Outdated

Comment on lines 321 to 323

    
                  In [4]: with single_file.open() as handle:
                              print(handle.read())

                  Out[4]: 'The file content'

Member

giovannipizzi Oct 17, 2023

While simple, this is not a good example IMHO, because it still ends up putting everything in memory with handle.read(), which defeats the purpose and makes it unclear why we have to go via the new interface.

We should think of something simple that really streams. E.g., just pseudocode:

CHUNK_SIZE = 65536
with single_file.open() as handle:
    while True:
        chunk = handle.read(CHUNK_SIZE)
        if not chunk: 
            break
        #process chunk here

Or something simple, like counting length:

CHUNK_SIZE = 65536

length = 0
with single_file.open() as handle:
    while True:
        chunk = handle.read(CHUNK_SIZE)
        if not chunk: 
            break
        length += len(chunk)
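
For reference, a self-contained version of the chunked loop could look like the sketch below. Here `io.BytesIO` stands in for the handle returned by `single_file.open()`, and `iter_chunks` is a hypothetical helper name, not something that exists in AiiDA.

```python
import io
from typing import BinaryIO, Iterator

CHUNK_SIZE = 65536


def iter_chunks(handle: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield successive chunks from a file-like object until it is exhausted."""
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:
            break
        yield chunk


# Stand-in for a handle opened in binary mode: 150000 bytes of in-memory content.
handle = io.BytesIO(b'x' * 150_000)

length = 0
number_of_chunks = 0
for chunk in iter_chunks(handle):
    # Only one chunk is held in memory at a time.
    length += len(chunk)
    number_of_chunks += 1
```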

Contributor Author

sphuber Oct 17, 2023

I thought about that, but I find this example is quite complex and risks confusing the reader. Maybe something like this:

import shutil

# Copy a large file from repo to file on disk without loading in memory
with single_file.open(mode='rb') as source:
    with open('copy.txt', mode='wb') as target:
        shutil.copyfileobj(source, target)

Think this will be more intuitive and an actual use case for some users.

Member

giovannipizzi Oct 19, 2023

OK!


docs/source/topics/data_types.rst Outdated

    
              .. code-block:: ipython

                  In [9]: with folder.open('subdir/file3.txt') as handle:

Member

giovannipizzi Oct 17, 2023

Same comment as before


sphuber added 2 commits

October 17, 2023 09:42


          Repository: Add the as_path context manager

16e79f6

The node repository interface intentionally does not provide access to
its file objects through filepaths on the file system. This is because,
for efficiency reasons, the content of a repository may not actually be
stored as individual files on a file system but, for example, in an
object store.

Therefore, the contents of the repository can only be retrieved as a
file-like object or read into memory as a string or bytes.
Certain use cases require a file to be made available through a filepath.
An example is when it needs to be passed to an API that only accepts a
filepath, such as `numpy.loadtxt`.

Currently, the user has to manually copy the content of the repository
to a temporary file on disk and pass the temporary filepath. This
results in clients often having to resort to the following snippet:

    import pathlib
    import shutil
    import tempfile

    import numpy

    with tempfile.TemporaryDirectory() as tmp_path:

        # Copy the entire content to the temporary folder
        dirpath = pathlib.Path(tmp_path)
        node.base.repository.copy_tree(dirpath)

        # Or copy the content of a single file. Use streaming
        # to avoid reading everything into memory
        filepath = dirpath / 'some_file.txt'
        with filepath.open('wb') as target:
            with node.base.repository.open('some_file.txt', 'rb') as source:
                shutil.copyfileobj(source, target)

        # Now pass `filepath` to the library call, e.g.
        numpy.loadtxt(filepath)

This logic is now provided under the `as_path` context manager. This
will make it easy to access repository content as files on the local
file system. The snippet above is simplified to:

    with node.base.repository.as_path('some_file.txt') as filepath:
        numpy.loadtxt(filepath)

The method is exposed directly in the interface of the `FolderData` and
`SinglefileData` data types. A warning is added to the docs explaining
the inefficiency of the content having to be read and written to a
temporary directory first, so that it is only used when there is no
alternative.


          Address PR comments

f664650

sphuber force-pushed the feature/repo-as-filepath branch from 05d4ffc to f664650 Compare

October 17, 2023 08:21

sphuber requested a review from giovannipizzi

October 17, 2023 08:22

giovannipizzi reviewed


docs/source/topics/data_types.rst


		In [5]: with single_file.as_path() as filepath:
		            numpy.loadtxt(filepath)

Member

giovannipizzi Oct 19, 2023

Since you mention shutil.copyfileobj above, shouldn't we have an example of that here as well?

Contributor Author

sphuber Oct 19, 2023

But shutil.copyfileobj takes a file handle, not a filepath. Of course I could then do

with single_file.as_path() as filepath:
    with filepath.open('rb') as source:
        with open('output.txt', 'wb') as target:
            shutil.copyfileobj(source, target)

That would be defeating the point. Or did you mean another method like shutil.copyfileobj but that uses filepaths?
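
If a path-based copy is what is meant, one candidate from the standard library is `shutil.copyfile(src, dst)`, which takes filepaths rather than handles. A minimal sketch, in which a temporary file stands in for the `filepath` yielded by `as_path()` and the `source.txt` and `output.txt` names are arbitrary:

```python
import pathlib
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp_path:
    dirpath = pathlib.Path(tmp_path)

    # Stand-in for the `filepath` yielded by `single_file.as_path()`.
    filepath = dirpath / 'source.txt'
    filepath.write_bytes(b'The file content')

    # `shutil.copyfile` accepts paths directly, so no handles need to be opened.
    target = dirpath / 'output.txt'
    shutil.copyfile(filepath, target)

    copied = target.read_bytes()
```

Note that with `as_path()` this would still write the content to disk twice, once to the temporary directory and once to the target.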

sphuber merged commit b0546e8 into aiidateam:main

18 checks passed

sphuber deleted the feature/repo-as-filepath branch

October 20, 2023 06:14
