BUG: REGR: read_csv with memory_map=True raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 262143: unexpected end of data #43540

michal-gh · 2021-09-12T19:13:16Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame(data=[ "a"*127 ]*4000)
# Put two-byte utf-8 encoded character at the end of the chunk (the utf-8 encoding of "ą" is b'\xc4\x85')
df.iloc[2047] = "a"*127 + "ą"
df.to_csv("./bugtest.csv", index=False, header=False, encoding="utf-8")
df1 = pd.read_csv("./bugtest.csv", header=None, memory_map=True) # <-- this fails

Traceback (most recent call last):
  File "/home/michal/.conda/envs/py39nlp/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 69, in __init__
    self._reader = parsers.TextReader(self.handles.handle, **kwds)
  File "pandas/_libs/parsers.pyx", line 542, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 745, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 843, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1917, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 262143: unexpected end of data
python-BaseException

Issue Description

This bug occurs when the end of the internal 256KB buffer falls inside a UTF-8 encoded multibyte character. When memory_map=True, the CSV parser uses the _MMapWrapper.read() method defined in common.py (L872):

def read(self, size: int = -1) -> str | bytes:
    # CSV c-engine uses read instead of iterating
    content: bytes = self.mmap.read(size)
    if self.decode:
        # memory mapping is applied before compression. Encoding should
        # be applied to the de-compressed data.
        return content.decode(self.encoding, errors=self.errors)
    return content

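This function is called with size set to 256KB (262144 bytes), so the end of the content buffer can fall in the middle of a multibyte character. When that happens, content.decode() sees a truncated sequence as its last bytes and the utf-8 codec raises the "unexpected end of data" error.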
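
The same failure can be shown without pandas. This is a minimal sketch, assuming a toy 4-byte chunk in place of the real 256KB buffer; the names data, chunk and error_reason are illustrative only and do not come from the pandas source:

```python
# Standalone illustration; no pandas involved.
data = "aaaą".encode("utf-8")  # b'aaa\xc4\x85', 5 bytes in total
chunk = data[:4]               # the chunk boundary falls inside "ą"

try:
    chunk.decode("utf-8")      # same call shape as content.decode(...)
    error_reason = None
except UnicodeDecodeError as exc:
    error_reason = exc.reason

print(error_reason)  # unexpected end of data
```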

The _MMapWrapper.read() method was added in "REGR: memory_map with non-UTF8 encoding" (#40994), so the bug is present in pandas 1.2.5 and newer versions.

Expected Behavior

read_csv should not raise an exception and should produce the same result as pd.read_csv("./bugtest.csv", header=None, memory_map=False).

Installed Versions

pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fsspec : 2021.07.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None


phofl · 2021-09-12T21:24:33Z

We changed a few encoding related things and added a new keyword encoding_errors. I think this should solve your problems.

cc @twoertwein This is expected now, isn't it?

twoertwein · 2021-09-12T22:05:20Z

Thank you @michal-gh for your carefully crafted example!

This issue happens only with the c-engine and memory_map=True:

engine="python", memory_map=False: ą
engine="python", memory_map=True: ą
engine="c", memory_map=False: ą
engine="c", memory_map=True: UnicodeDecodeError
engine="c", memory_map=True, encoding_errors="ignore": a

Changing the encoding_errors to "ignore" lets you at least "avoid" the error, but it truncates ą to a.
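
The truncation is what you would expect when each chunk is decoded on its own with errors="ignore": both halves of the split character are dropped. A small sketch of that effect, assuming a toy split point rather than the real 256KB boundary (data, first, second and text are illustrative names):

```python
data = "aaaą,b".encode("utf-8")     # b'aaa\xc4\x85,b'
first, second = data[:4], data[4:]  # split inside "ą"

# Decode each chunk independently, ignoring undecodable bytes.
text = first.decode("utf-8", errors="ignore") + second.decode("utf-8", errors="ignore")

print(text)  # aaa,b  (the "ą" is silently lost)
```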

I assume that is either a limitation of the IncrementalDecoder (codecs.getincrementaldecoder(encoding)(errors=errors)) or, more likely, of how it is being used. If I understand @michal-gh's description, it might be possible to put the decode command into a try-except, read more bytes, and then retry the decoding. But if we are unlucky, we might again have half of a multibyte character at the end of the now larger byte object (I honestly thought the IncrementalDecoder deals with that magically).

michal-gh · 2021-09-14T19:58:23Z

@twoertwein, you are right that IncrementalDecoder deals with such cases; the only problem is that it's not invoked by the code :-). The content.decode method is <function bytes.decode(encoding='utf-8', errors='strict')>. When I changed content.decode(self.encoding, errors=self.errors) to self.decoder.decode(content, final=False), the incremental decoder didn't raise the exception and correctly saved the state keeping the last byte:

self.decoder.getstate()
Out[16]: (b'\xc4', 0)
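
For reference, the same behaviour can be reproduced with the standard library alone. This is only a sketch with a toy 4-byte split, not the code from pandas; data, decoder, head, state and tail are names chosen for the illustration:

```python
import codecs

data = "aaaą".encode("utf-8")  # b'aaa\xc4\x85'
decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")

# First chunk ends inside "ą": no exception, the dangling byte is buffered.
head = decoder.decode(data[:4], final=False)
state = decoder.getstate()

# Second chunk supplies the missing continuation byte.
tail = decoder.decode(data[4:], final=False)

print(repr(head), state, repr(tail))  # 'aaa' (b'\xc4', 0) 'ą'
```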

I also think that it makes sense to special-case the most common encoding "utf-8". If the file is in utf-8 encoding, the decode operation is a no-op, so it can be bypassed by a simple if self.encoding == "utf-8": return content

twoertwein · 2021-09-14T23:43:53Z

@michal-gh Do you want to create a PR (and a separate PR for the UTF-8 special case)?

michal-gh · 2021-09-15T19:20:49Z

I will try to make these PRs, they shouldn't be too difficult.

twoertwein · 2021-09-15T22:34:46Z

The tricky part is probably passing final=True for the last line; otherwise the last line will only be partially decoded and no error will be thrown.
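
One way to get that behaviour is to treat an empty read as the end of the input and flush the decoder at that point. The helper below is a hypothetical sketch (decode_chunks is not a pandas function, and this is not necessarily how the eventual fix works); it reads from any binary file-like object:

```python
import codecs
import io


def decode_chunks(raw, size, encoding="utf-8", errors="strict"):
    """Yield decoded text from a binary file-like object, chunk by chunk."""
    decoder = codecs.getincrementaldecoder(encoding)(errors=errors)
    while True:
        chunk = raw.read(size)
        # An empty read marks the end of input, so flush with final=True.
        text = decoder.decode(chunk, final=not chunk)
        if text:
            yield text
        if not chunk:
            break


# Complete input: the split "ą" is reassembled across chunks.
complete = "".join(decode_chunks(io.BytesIO("aaaą".encode("utf-8")), size=4))

# Input that really ends in the middle of "ą": final=True surfaces the error.
try:
    "".join(decode_chunks(io.BytesIO(b"aaa\xc4"), size=4))
    truncated_error = None
except UnicodeDecodeError as exc:
    truncated_error = exc.reason
```

With final=False on every call, the second case would return "aaa" and the dangling byte would be dropped without any error.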

michal-gh · 2021-09-18T19:33:43Z

@twoertwein, I created PR #43647, which fixes the decode bug.

michal-gh added the Bug and Needs Triage labels Sep 12, 2021

twoertwein added the IO CSV and Regression labels and removed the Needs Triage label Sep 12, 2021

phofl added this to the 1.3.4 milestone Sep 13, 2021

michal-gh mentioned this issue Sep 18, 2021 in pull request #43647 (merged): BUG: REGR: read_csv with memory_map=True raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 262143: unexpected end of data

jreback closed this as completed in #43647 Sep 21, 2021
