Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PseudoPotentialData: allow str or Path for source argument #98

Merged
merged 1 commit into from
May 17, 2021

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented May 14, 2021

Fixes #87

Currently, the source of the pseudopotential passed at construction time
of the PseudoPotentialData that represents it had to be a binary
stream. Here we introduce the possibility to allow the caller that pass
a str or Path object with the absolute path to the file content on
disk that should be wrapped.

There is a potential inefficiency in the implementation after this change.
It is not only the constructor of SinglefileData, and therefore our
plugin PseudoPotentialData and all its subclasses, that needs to be
able to support the different source types and convert the paths to a
stream if that is what passed, the set_file method, which is called by
the constructor needs to apply the same checks and conversions. Here, we
need to therefore make sure that the conversion from filepath to byte
stream is only done once and not in the set_file call of every
subclass.

Therefore it is important that the set_file method first calls the
prepare_source class method, which will only convert to bytestream if
isn't already one, and that should then be passed to the superclass
call. Note that this should be called straight after the prepare_source
since the base class will set the actual file and filename attributes
that are methods that may be invoked afterwards will rely on.

@sphuber sphuber requested review from bosonie and mbercx May 14, 2021 07:10
@sphuber
Copy link
Contributor Author

sphuber commented May 14, 2021

There were a few tricky implementation details to consider, but I think this implementation could work. Question for me would be if we should still support str given that we support Path. On the one hand it would be good to start forcing people to use pathlib. But on the other hand, I feel we are introducing this code to cater to people that don't code too often and simply want to drop in an absolute path as a string. So maybe restricting to Path would defeat the entire purpose of these changes

@bosonie
Copy link
Collaborator

bosonie commented May 14, 2021

Thanks @sphuber.
I first have general comments.

  1. I believe we should support string since SinglefileData supports it.
  2. I think the calls to parse_element of each subclass of PseudopotentialsData should be done before the call to the super. Just for efficiency. The call to the super calls in turn the super (SinglefileData) that sets the object. No need to do that if we then discover that the file is not of the right type and the element can not be parsed. And the parse_element method does not use any other attributes of the class.

Copy link
Collaborator

@bosonie bosonie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left an important comment.
The rest seems fine to me. I tested it in some siesta calculations and works well.

Comment on lines 82 to 87
if isinstance(source, (str, pathlib.Path)):
with open(source, 'rb') as handle:
source = io.BytesIO(handle.read())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here something needs to be added in order to give a name to the source. Without a name, the attribute filename would be set to file.txt when the absolute path is passed, while to the actual file name when the steam is passed.
I think something like that is sufficient:

if isinstance(source, (str, pathlib.Path)):
      save_name = str(source)
      with open(source, 'rb') as handle:
            source = io.BytesIO(handle.read())
      source.name = save_name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, although we probably only want to do this when the filename argument is None.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we should just give the name to the stream as in the case an external stream is passed. If this name will be used or not shouldn't matter here, it is decided within the SinglefileData.set_file method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yeah, that's true

@bosonie
Copy link
Collaborator

bosonie commented May 14, 2021

I think the calls to parse_element of each subclass of PseudopotentialsData should be done before the call to the super. Just for efficiency. The call to the super calls in turn the super (SinglefileData) that sets the object. No need to do that if we then discover that the file is not of the right type and the element can not be parsed. And the parse_element method does not use any other attributes of the class.

Regarding this I now realized that in the set_file of SinglefileData there is also the validation of the passed quantity, checking whether it is an existing file or absolute path. For this reason maybe it made sense at the beginning to have it as first call. However these checks should now be done (and are already done partially) in prepare_source. We should compare https://github.com/aiidateam/aiida-core/blob/403f7e7d896f8408d0cacee9fe41c030a1072eaf/aiida/orm/nodes/data/singlefile.py#L78-L86 with the checks now present in prepare_source.

@sphuber
Copy link
Contributor Author

sphuber commented May 14, 2021

I think the calls to parse_element of each subclass of PseudopotentialsData should be done before the call to the super. Just for efficiency. The call to the super calls in turn the super (SinglefileData) that sets the object. No need to do that if we then discover that the file is not of the right type and the element can not be parsed. And the parse_element method does not use any other attributes of the class.

I am not convinced this is correct. I gave the accessing of filename attributes as an example, but there are other things that could break. For example the base class checks whether modification is allowed. In general, one should call the superclass as soon as possible and I think here the same applies. The slight increase in efficiency in case an invalid file is passed does not outweigh the potential problems that could arise by calling the super too late.

Regarding this I now realized that in the set_file of SinglefileData there is also the validation of the passed quantity, checking whether it is an existing file or absolute path. For this reason maybe it made sense at the beginning to have it as first call.
However these checks should now be done (and are already done partially) in prepare_source.

The check on the existence of the file we already do in fact, albeit indirect. As the prepare_source docstring says, it raises FileNotFoundError if the source file cannot be found, which is better than the ValueError that the SinglefileData class raises.
The second check to make sure it is an absolute path I did not add indeed, but it will be circumvented anyway since we pass a bytestream to the base class. I don't see why we should require an absolute path though, why shouldn't a relative path work? I think this choice was just made because the absolute filepath requirement appears in other parts of aiida-core and the original implementer of SinglefileData just kept it without a good reason. The only thing I can think of is that with relative paths it might be easier to make subtle mistakes if it is not clear what the cwd is, but I think that shouldn't be a reason for requiring an absolute path.

@bosonie
Copy link
Collaborator

bosonie commented May 14, 2021

Nothing comes in my mind that should justify why relative paths are not accepted. Ok to leave it as it is!

@bosonie
Copy link
Collaborator

bosonie commented May 14, 2021

Another idea that is coming to me now. Just an idea we can discuss another time.
in the PseudopotentialData subclasses, why don't we parse the element directly from the file content (self.get_object_content). It is already set since the super is called before. This would avoid to call the prepare_source in the child classes, but it would require the parse_element method to belong to the class.

@sphuber
Copy link
Contributor Author

sphuber commented May 14, 2021

Another idea that is coming to me now. Just an idea we can discuss another time.
in the PseudopotentialData subclasses, why don't we parse the element directly from the file content (self.get_object_content). It is already set since the super is called before. This would avoid to call the prepare_source in the child classes, but it would require the parse_element method to belong to the class.

The argument against this approach is that get_object_content would require to read the content from the repository again. This is almost always going to be slower than simply rewinding the byte stream that we already have in memory.

@sphuber sphuber force-pushed the fix/087/pseudo-data-construct-filepath branch from 6fcfd80 to 0fbd7fc Compare May 14, 2021 14:53
@sphuber
Copy link
Contributor Author

sphuber commented May 14, 2021

Thanks for the review @bosonie . I addressed the question of the filename and added tests for it.

Copy link
Collaborator

@bosonie bosonie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks to you! Also in this case would be useful to have a release soon...is it possible? Sorry for that!

@sphuber
Copy link
Contributor Author

sphuber commented May 14, 2021

I will give @mbercx some more time to respond and then I could make a release, we have some other fixes on develop anyway.

Copy link
Member

@mbercx mbercx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sphuber! Gave the code a quick pass and it all looks good. Also did a bit of field testing for the UpfData class and all 👍 . Just left a few nitpicky comments regarding the docstrings (which still need to be propagated from the JthXmlData class to others).


:param stream: a filelike object with the binary content of the file.
.. note:: this method will first analyse the type of the ``source`` and if it is a filepath will convert it
to a binary stream of the content located at that filepath. This is then passed on to the superclass. This
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
to a binary stream of the content located at that filepath. This is then passed on to the superclass. This
to a binary stream of the content located at that filepath, which is then passed on to the superclass. This

:param stream: a filelike object with the binary content of the file.
.. note:: this method will first analyse the type of the ``source`` and if it is a filepath will convert it
to a binary stream of the content located at that filepath. This is then passed on to the superclass. This
needs to be done first, because it will properly set the file and filename attributes that are expected to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
needs to be done first, because it will properly set the file and filename attributes that are expected to
needs to be done first, because it will properly set the file and filename attributes that are expected

.. note:: this method will first analyse the type of the ``source`` and if it is a filepath will convert it
to a binary stream of the content located at that filepath. This is then passed on to the superclass. This
needs to be done first, because it will properly set the file and filename attributes that are expected to
be there by other methods. Straight after the superclass call, the source seeker needs to be reset to zero
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
be there by other methods. Straight after the superclass call, the source seeker needs to be reset to zero
by other methods. Straight after the superclass call, the source seeker needs to be reset to zero

this way the conversion from filepath to byte stream will be performed only once. Otherwise, each subclass
would perform the conversion over and over again.

:param source: the source pseudopotential content, either a binary stream, or a `str` or `Path` to the absolute
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
:param source: the source pseudopotential content, either a binary stream, or a `str` or `Path` to the absolute
:param source: the source pseudopotential content, either a binary stream, or a ``str`` or ``Path`` to the absolute

would perform the conversion over and over again.

:param source: the source pseudopotential content, either a binary stream, or a `str` or `Path` to the absolute
path of the file on disk.
:param filename: optional explicit filename to give to the file stored in the repository.
:raises ValueError: if the element symbol is invalid.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method also raises a TypeError in case the source type is not valid. Should we "bubble up" this raises information to this docstring from the prepare_source docstring?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

"""Get pseudopotenial data node from database with matching md5 checksum or create a new one if not existent.

:param stream: a filelike object with the binary content of the file.
:param source: the source pseudopotential content, either a binary stream, or a `str` or `Path` to the absolute
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
:param source: the source pseudopotential content, either a binary stream, or a `str` or `Path` to the absolute
:param source: the source pseudopotential content, either a binary stream, or a ``str`` or ``Path`` to the absolute

Also: Does the path have to be absolute? I did some field testing and it seemed to work just fine for relative paths?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it shouldn't. The SinglefileData base class actually imposes this requirement, but I didn't see a good reason for it and so I made it such that it supports relative paths as well. If it is incorrect, it will simply raise FileNotFoundError. I will update the docstrings.


.. note:: if the ``source`` is a valid file on disk, its content is read and returned as a stream of bytes.

:raises TypeError: if the source is not a str, ``pathlib.Path`` instance or binary stream.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
:raises TypeError: if the source is not a str, ``pathlib.Path`` instance or binary stream.
:raises TypeError: if the source is not a ``str``, ``pathlib.Path`` instance or binary stream.

"""
if not isinstance(source, (str, pathlib.Path)) and not cls.is_readable_byte_stream(source):
raise TypeError(
f'`source` should be a str or `pathlib.Path` of a filepath on disk or a stream of bytes, got: {source}'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
f'`source` should be a str or `pathlib.Path` of a filepath on disk or a stream of bytes, got: {source}'
f'`source` should be a `str` or `pathlib.Path` of a filepath on disk or a stream of bytes, got: {source}'

@sphuber sphuber force-pushed the fix/087/pseudo-data-construct-filepath branch from 0fbd7fc to 9af79fe Compare May 17, 2021 07:53
Currently, the source of the pseudopotential passed at construction time
of the `PseudoPotentialData` that represents it had to be a binary
stream. Here we introduce the possibility to allow the caller that pass
a `str` or `Path` object with the absolute path to the file content on
disk that should be wrapped. Note that, as opposed to the base class
`SinglefileData`, the filepath is not required to be absolute. In
addition, it raises a `FileNotFoundError` instead of a `ValueError` if
the provided filepath does not exist.

There is a potential inefficiency in the implementation after this change.
It is not only the constructor of `SinglefileData`, and therefore our
plugin `PseudoPotentialData` and all its subclasses, that needs to be
able to support the different source types and convert the paths to a
stream if that is what passed, the `set_file` method, which is called by
the constructor needs to apply the same checks and conversions. Here, we
need to therefore make sure that the conversion from filepath to byte
stream is only done once and not in the `set_file` call of every
subclass.

Therefore it is important that the `set_file` method first calls the
`prepare_source` class method, which will only convert to bytestream if
isn't already one, and that should then be passed to the superclass
call. Note that this should be called straight after the `prepare_source`
since the base class will set the actual file and filename attributes
that are methods that may be invoked afterwards will rely on.
@sphuber sphuber force-pushed the fix/087/pseudo-data-construct-filepath branch from 9af79fe to 5200c49 Compare May 17, 2021 07:57
@sphuber sphuber merged commit d2aea51 into master May 17, 2021
@sphuber sphuber deleted the fix/087/pseudo-data-construct-filepath branch May 17, 2021 08:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PseudoPotentialData: improve error message when user try to instanciate with an abs path
3 participants