Adding support for encoding in read_obo / UnicodeDecodeError for Obonet0.3.1 (Cell Ontology) #27

Closed
thomcsmits opened this issue Feb 27, 2023 · 5 comments

Comments

@thomcsmits

Similar to #25, I am getting a UnicodeDecodeError. The solution there was to upgrade to v0.3.0, but as far as I can tell, setup.py still doesn't specify an encoding?

I have my OBO file locally (I downloaded cl.obo from https://obofoundry.org/ontology/cl.html).

import networkx
import obonet

path = "./data/ontology/cl.obo"
graph = obonet.read_obo(path)

obonet==0.3.1
networkx==3.0

Without running a local version of obonet with the encoding specified, how can I best resolve this error?

Is there any chance of adding support for specifying an encoding in read_obo?

@dhimmel
Owner

dhimmel commented Feb 27, 2023

#25 is about an encoding issue while installing the obonet package and not when calling obonet.read_obo. So this sounds like a different problem?

Can you provide the error message that occurs from obonet.read_obo(path)?

@dhimmel
Owner

dhimmel commented Feb 27, 2023

By the way, does the following work?

# unversioned
obonet.read_obo("http://purl.obolibrary.org/obo/cl/cl-basic.obo")
# versioned
obonet.read_obo("https://github.com/obophenotype/cell-ontology/releases/download/v2023-02-19/cl-basic.obo")

@thomcsmits
Author

thomcsmits commented Feb 27, 2023

Thanks for the fast answer!! This is the error message:

Traceback (most recent call last):
  File "<dir>\src\ontology_obonet.py", line 23, in <module>
    graph = obonet.read_obo(path)
  File "<dir>\.venv\lib\site-packages\obonet\read.py", line 30, in read_obo
    typedefs, terms, instances, header = get_sections(obo_file)
  File "<dir>\.venv\lib\site-packages\obonet\read.py", line 77, in get_sections
    stanza_lines = list(stanza_lines)
  File "C:\Users\<user>\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 8161: character maps to <undefined>

Reading directly from the URL works, but downloading the same version from https://github.com/obophenotype/cell-ontology/releases/download/v2023-02-19/cl-basic.obo and referencing it locally gives the error above.

Thanks for the help, I will just use the URL!

@dhimmel
Owner

dhimmel commented Feb 27, 2023

Okay, I think the issue is that you're on Windows, where Python opens text files with a default encoding other than UTF-8 (cp1252, judging by your traceback). According to PEP 686, Python 3.15 will start using UTF-8 by default for opening files on Windows. You can also change the default for Python now by setting the environment variable PYTHONUTF8=1.
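
The failure mode can be reproduced with the standard library alone. The sketch below is illustrative and not taken from obonet: it uses a made-up one-line sample whose UTF-8 bytes include 0x9d (the byte named in the traceback), and shows that decoding the file as cp1252 fails while decoding it as utf-8 succeeds. Whether cl.obo contains this particular character is an assumption; the file name sample.obo is hypothetical.

```python
import locale
import sys
import tempfile
from pathlib import Path

# What open() falls back to when no encoding argument is given.
print("default text encoding:", locale.getpreferredencoding(False))
print("UTF-8 mode enabled:", bool(sys.flags.utf8_mode))

# Hypothetical sample line. U+201D (right double quotation mark) is
# b"\xe2\x80\x9d" in UTF-8, and byte 0x9d has no mapping in Python's
# cp1252 codec.
sample = "name: \u201cexample\u201d cell\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.obo"
    path.write_bytes(sample.encode("utf-8"))

    # Decoding with cp1252 is expected to fail on the 0x9d byte.
    try:
        path.read_text(encoding="cp1252")
        cp1252_error = None
    except UnicodeDecodeError as err:
        cp1252_error = err

    # Decoding with the encoding the file was written in succeeds.
    decoded = path.read_text(encoding="utf-8")

print("cp1252:", cp1252_error)
print("utf-8: ", decoded.strip())
```

Running the same script with PYTHONUTF8=1 set should change the first two printed lines, which is a quick way to confirm that the environment variable took effect.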

But this is a workaround. The best solution would be for obonet.read_obo to accept a character encoding that gets passed to the file opener. That way you could specify utf-8 for cl-basic.obo, or a different encoding for an ontology that uses one.
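
Until read_obo takes such a parameter, one possible stopgap is to open the file yourself with an explicit encoding and hand the open file object to read_obo. This is a minimal sketch, not obonet's own API: read_local_obo is a hypothetical helper name, and it relies on read_obo accepting an already-open text file in place of a path, which should be checked against the installed version.

```python
def read_local_obo(path, encoding="utf-8"):
    """Parse a local OBO file after decoding it with an explicit encoding.

    Hypothetical helper: assumes obonet.read_obo accepts an open text
    file object as well as a path or URL.
    """
    # Third-party import kept inside the function so that defining the
    # helper does not require obonet to be installed.
    import obonet

    # The encoding is chosen here instead of being left to the platform
    # default (cp1252 in the traceback above).
    with open(path, encoding=encoding) as handle:
        return obonet.read_obo(handle)
```

With the path from the original report, the call would be read_local_obo("./data/ontology/cl.obo"). Passing a different encoding covers an ontology that is not stored as UTF-8.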

dhimmel added a commit that referenced this issue Feb 28, 2023
@dhimmel
Owner

dhimmel commented Feb 28, 2023

Okay, e6ff647 was able to trigger an encoding error on the Windows CI job!

E       UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 347: character maps to <undefined>
