Valid UTF-8 bytes fails to be decoded #1825

XiaomoWu · 2019-04-26T20:32:21Z

What's the problem?
I have a valid UTF-8 byte sequence that decodes without error using bytes.decode('utf-8'). However, when the decoded string is used as a column, datatable fails to decode it.
How to reproduce the bug?
Here is a reproducible example.

import datatable

text_as_bytes = b'\xe7\x94\xa8\xe8\xb5\xb7\xe6\x9d\xa5\xe8\xbf\x98\xe6\x98\xaf\xe5\xbe\x88\xe4\xb8\x8d\xe7\xa8\xb3\xe5\xae\x9a\xe3\x80\x82\xe5\xbe\x88\xe5\xa4\x9a\xe6\x8c\x89\xe9\x94\xae\xe9\x83\xbd\xe8\xa6\x81\xe7\x82\xb9\xe5\xa5\xbd\xe5\x87\xa0\xe6\xac\xa1\xe6\x89\x8d\xe8\xa1\x8c\xe3\x80\x82'

text_as_str = text_as_bytes.decode('utf-8')

# text_as_str is printed out correctly
print(text_as_str) # 用起来还是很不稳定。很多按键都要点好几次才行。

# create a datatable with text_as_str
dt = datatable.Frame({'text': [text_as_str]})
dt # get error: UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 363-364: invalid continuation byte

Your environment?
Python: 3.7 + JupyterLab 0.35.4
OS: Ubuntu 16.04


Fixed a UnicodeDecodeError that could be thrown when viewing a Frame with Unicode characters in a Jupyter notebook. The error only manifested for strings longer than 50 bytes. Closes #1825
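
The fix note does not say what went wrong internally, so the following is only a guess at the general class of bug, not datatable's actual code. It sketches how a hypothetical renderer that cuts a UTF-8 string at a fixed byte offset can split a multi-byte character and make the result undecodable, and how backing up to a character boundary avoids that. The names `LIMIT`, `naive_truncate`, `safe_truncate`, and `try_decode` are made up for this illustration.

```python
# Illustration only: this is NOT datatable's code. It assumes a hypothetical
# renderer that cuts strings at a fixed byte offset before decoding them.
text = "用起来还是很不稳定。很多按键都要点好几次才行。"
raw = text.encode("utf-8")  # 23 characters, 3 bytes each -> 69 bytes

LIMIT = 50  # hypothetical display limit, chosen to mirror the 50-byte threshold


def naive_truncate(data: bytes, limit: int) -> bytes:
    """Cut at a fixed byte offset, possibly in the middle of a character."""
    return data[:limit]


def safe_truncate(data: bytes, limit: int) -> bytes:
    """Cut at or before `limit`, backing up to a character boundary."""
    if len(data) <= limit:
        return data
    end = limit
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx; if the first
    # dropped byte is one, the cut falls inside a character, so step back.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


def try_decode(data: bytes) -> str:
    """Decode as UTF-8, returning a short error description on failure."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"UnicodeDecodeError: {exc.reason}"


# Offset 50 is not a multiple of 3, so the naive cut splits a character.
print(try_decode(naive_truncate(raw, LIMIT) + b"..."))
print(try_decode(safe_truncate(raw, LIMIT) + b"..."))
```

With the string from this report, the naive cut leaves two bytes of a three-byte character in front of the appended suffix, so decoding fails, while the boundary-aware cut keeps the first 16 characters intact.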

st-pasha self-assigned this Apr 26, 2019

st-pasha added the bug label Apr 26, 2019

st-pasha added this to the Release 0.9.0 milestone Apr 26, 2019

st-pasha mentioned this issue Apr 26, 2019

Fixed UnicodeDecodeError when rendering Frame into html #1827

Merged

st-pasha closed this as completed in #1827 Apr 26, 2019
