-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Repository: Add the as_path
context manager
#6151
Conversation
acd17b2
to
05d4ffc
Compare
docs/source/topics/data_types.rst
Outdated
In [4]: with single_file.open() as handle: | ||
print(handle.read()) | ||
Out[4]: 'The file content' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While simple, this is not a good example IMHO, because it ends up anyway putting everything in memory in handle.read()
, defeating the purpose and making it not clear why we have to go via the new interface.
We should thing at something simple but really streaming. E.g. just pseudo code:
CHUNK_SIZE = 65536
with single_file.open() as handle:
while True:
chunk = handle.read(CHUNK_SIZE)
if not chunk:
break
#process chunk here
Or something simple, like counting length:
CHUNK_SIZE = 65536
length = 0
with single_file.open() as handle:
while True:
chunk = handle.read(CHUNK_SIZE)
if not chunk:
break
length += len(chunk)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about that, but I find this example is quite complex and risks confusing the reader. Maybe something like this:
import shutil
# Copy a large file from repo to file on disk without loading in memory
with single_file.open(mode='rb') as source:
with open('copy.txt', mode='wb') as target:
shutil.copyfileobj(source, target)
Think this will be more intuitive and an actual use case for some users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK!
docs/source/topics/data_types.rst
Outdated
|
||
.. code-block:: ipython | ||
|
||
In [9]: with folder.open('subdir/file3.txt') as handle: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as before
The node repository interface intentionally does not provide access to its file objects through filepaths on the file system. This is because, for efficiency reasons, the content of a repository may not actually be stored as individual files on a file system, but for example are stored in an object store. Therefore, the contents of the repository can only be retrieved as a file-like object or read as a string or list of bytes into memory. Certain use-cases require a file to be made available through a filepath. An example is when it needs to be passed to an API that only accepts a filepath, such as `numpy.loadfromtxt`. Currently, the user will have to manually copy the content of the repo's content to a temporary file on disk, and pass the temporary filepath. This results in clients having to often resport to the following snippet: import pathlib import shutil import tempfile with tempfile.TemporaryDirectory() as tmp_path: # Copy the entire content to the temporary folder dirpath = pathlib.Path(tmp_path) node.base.repository.copy_tree(dirpath) # Or copy the content of a file. Should use streaming # to avoid reading everything into memory filepath = (dirpath / 'some_file.txt') with filepath.open('rb') as target: with node.base.repository.open('rb') as source: shutil.copyfileobj(source, target) # Now use `filepath` to library call, e.g. numpy.loadtxt(filepath) This logic is now provided under the `as_path` context manager. This will make it easy to access repository content as files on the local file system. The snippet above is simplified to: with node.base.repository.as_path() as filepath: numpy.loadtxt(filepath) The method is exposed directly in the interface of the `FolderData` and `SinglfileData` data types. A warning is added to the docs explaining the inefficiency of the content having to be read and written to a temporary directory first, encouraging it only to be used when the alternative is not an option.
05d4ffc
to
f664650
Compare
|
||
In [5]: with single_file.as_path() as filepath: | ||
numpy.loadtxt(filepath) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you mention above generally shutil.copyfileobj
, shouldn't we have an example of that here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But shutil.copyfileobj
takes a file handle, not a filepath. Of course I could then do
with single_file.as_path() as filepath:
with filepath.open() as source:
with open('output.txt', 'wb') as target:
shutil.copyfileobj(source, target)
That would be defeating the point. Or did you mean another method like shutil.copyfileobj
but that uses filepaths?
The node repository interface intentionally does not provide access to
its file objects through filepaths on the file system. This is because,
for efficiency reasons, the content of a repository may not actually be
stored as individual files on a file system, but for example are stored
in an object store.
Therefore, the contents of the repository can only be retrieved as a
file-like object or read as a string or list of bytes into memory.
Certain use-cases require a file to be made available through a filepath.
An example is when it needs to be passed to an API that only accepts a
filepath, such as
numpy.loadfromtxt
.Currently, the user will have to manually copy the content of the repo's
content to a temporary file on disk, and pass the temporary filepath.
This results in clients having to often resport to the following snippet:
This logic is now provided under the
as_path
context manager. Thiswill make it easy to access repository content as files on the local
file system. A warning is added to the docs explaining the inefficiency
of the content having to be read and written to a temporary directory
first, encouraging it only to be used when the alternative is not an
option.