Valid UTF-8 bytes fails to be decoded #1825

Closed
XiaomoWu opened this issue Apr 26, 2019 · 0 comments · Fixed by #1827
XiaomoWu commented Apr 26, 2019

  • What's the problem?
    I have a valid UTF-8 byte sequence that decodes perfectly well with bytes.decode('utf-8'). However, when the decoded string is used as a column value, datatable fails to decode it.

  • How to reproduce the bug?
    Here is a reproducible example.

import datatable

text_as_bytes = b'\xe7\x94\xa8\xe8\xb5\xb7\xe6\x9d\xa5\xe8\xbf\x98\xe6\x98\xaf\xe5\xbe\x88\xe4\xb8\x8d\xe7\xa8\xb3\xe5\xae\x9a\xe3\x80\x82\xe5\xbe\x88\xe5\xa4\x9a\xe6\x8c\x89\xe9\x94\xae\xe9\x83\xbd\xe8\xa6\x81\xe7\x82\xb9\xe5\xa5\xbd\xe5\x87\xa0\xe6\xac\xa1\xe6\x89\x8d\xe8\xa1\x8c\xe3\x80\x82'

text_as_str = text_as_bytes.decode('utf-8')

# text_as_str is printed out correctly
print(text_as_str) # 用起来还是很不稳定。很多按键都要点好几次才行。

# create a datatable Frame from text_as_str
dt = datatable.Frame({'text': [text_as_str]})
dt # displaying the frame raises UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 363-364: invalid continuation byte

  • Your environment?
    Python: 3.7 + JupyterLab 0.35.4
    OS: Ubuntu 16.04
st-pasha self-assigned this Apr 26, 2019
st-pasha added the bug label Apr 26, 2019
st-pasha added this to the Release 0.9.0 milestone Apr 26, 2019
st-pasha added a commit that referenced this issue Apr 26, 2019
Fixed a UnicodeDecodeError that could be thrown when viewing a Frame with unicode characters in Jupyter notebook. The error only manifested for strings that were longer than 50 bytes in length.

Closes #1825
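
The commit message does not say why the 50-byte threshold matters. One plausible mechanism (an assumption, not something the issue or the commit confirms) is that a long string was cut at a fixed byte offset for display, which can split a multi-byte UTF-8 character. The sketch below reproduces that kind of failure in plain Python with the bytes from the report. The helper `truncate_utf8` and the constant `LIMIT` are hypothetical names used for illustration only and are not part of datatable.

```python
# Bytes from the report: 23 CJK characters, 3 bytes each (69 bytes total).
text_as_bytes = "用起来还是很不稳定。很多按键都要点好几次才行。".encode("utf-8")

LIMIT = 50  # hypothetical display limit in bytes (assumption)


def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Cut data to at most `limit` bytes without splitting a character."""
    if len(data) <= limit:
        return data
    cut = limit
    # UTF-8 continuation bytes look like 0b10xxxxxx. If the byte at the cut
    # point is a continuation byte, the cut falls inside a character, so
    # move back until it lands on the start of a character.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


# Naive truncation: 50 is not a multiple of 3, so the last character is
# split, and appending an ellipsis puts a non-continuation byte after it.
naive = text_as_bytes[:LIMIT] + "…".encode("utf-8")
try:
    naive.decode("utf-8")
    naive_failed = False
except UnicodeDecodeError as exc:
    naive_failed = True
    naive_reason = exc.reason

# Boundary-aware truncation decodes cleanly.
safe = truncate_utf8(text_as_bytes, LIMIT)
safe_text = safe.decode("utf-8") + "…"
```

Here the naive cut fails with the same "invalid continuation byte" reason as in the report, while the boundary-aware cut keeps 48 bytes (16 whole characters). Whether the actual fix in #1827 works this way is not stated in the thread.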