Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Another unicode issue in SinglefileData #3437

Closed
giovannipizzi opened this issue Oct 18, 2019 · 7 comments · Fixed by #4345
Closed

Another unicode issue in SinglefileData #3437

giovannipizzi opened this issue Oct 18, 2019 · 7 comments · Fixed by #4345
Labels
topic/data-types type/feature request status undecided type/question may redirect to mailinglist
Milestone

Comments

@giovannipizzi
Copy link
Member

In SinglefileData.get_content(), shouldn't we instead return bytes (and fix all places where this method is used, e.g. also in UpfData and the tools reading it?

def get_content(self):
"""Return the content of the single file stored for this data node.
:return: the content of the file as a string
"""
with self.open() as handle:
return handle.read()

Otherwise it means we are assuming this is a SingleutfstringData

Mentioning @sphuber

@giovannipizzi giovannipizzi added type/question may redirect to mailinglist type/feature request status undecided labels Oct 18, 2019
@sphuber
Copy link
Contributor

sphuber commented Oct 18, 2019

Not sure. The SinglefileData correctly writes the file using binary mode to just copy over bytes wholesale. The open method will by default open in text mode, but not specify any encoding so the default encoding of the OS is used. Is this not what we want? Should we always return byte strings and leave it up to the caller to do something with it. I am not sure if this would not cause unexpected things for users calling these methods, as they are top-level node API methods after all.

@ltalirz
Copy link
Member

ltalirz commented Oct 18, 2019

I would tend to agree with @sphuber here in that get_content is a shorthand for opening the file with default settings and reading - for other use cases there is node.open.
Perhaps it would make sense to document this in the help string?

@giovannipizzi
Copy link
Member Author

I fear that if the file was binary, people will use get_content and this will crash.
Ok to document, and I understand we don't want probably to change the API, but maybe we should have a get_binary_content (or have a flag binary=True), and use it internally? This came out because in the REST API PR, get_content() is used and then encoded in 'utf8' to return a bytestring, and this looks to me convoluted and source of bugs/crashes for binary data

@sphuber
Copy link
Contributor

sphuber commented Oct 18, 2019

For the REST API, if they know they want a byte string, they can always replace get_content with:

with node.open(mode='rb') as handle:
    return handle.read()

I think get_content should be a shortcut if we want human readable content. If one knows it can be binary or they want bytes, just use the open method

@greschd
Copy link
Member

greschd commented Oct 21, 2019

I have a related question: Is there a better way to copy a SinglefileData out of the repository (via the CLI) than verdi node cat PK filename > new_file? This command currently suffers from the same error, namely

Critical: failed to get the content of file `filename`: 'utf-8' codec can't decode byte 0xac in position 10: invalid start byte

@sphuber
Copy link
Contributor

sphuber commented Oct 21, 2019

Not really I am afraid. Some of the sub classes such as CifData have a specific endpoint verdi data cif content but that will most likely from the same problem

@giovannipizzi
Copy link
Member Author

About this last comment:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic/data-types type/feature request status undecided type/question may redirect to mailinglist
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants