BUG: REGR: read_csv with memory_map=True raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 262143: unexpected end of data #43540

Closed
2 of 3 tasks
michal-gh opened this issue Sep 12, 2021 · 7 comments · Fixed by #43647
Labels: Bug; IO CSV (read_csv, to_csv); Regression (functionality that used to work in a prior pandas version)
Milestone

Comments

@michal-gh
Contributor

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame(data=[ "a"*127 ]*4000)
# Put two-byte utf-8 encoded character at the end of the chunk (the utf-8 encoding of "ą" is b'\xc4\x85')
df.iloc[2047] = "a"*127 + "ą"
df.to_csv("./bugtest.csv", index=False, header=False, encoding="utf-8")
df1 = pd.read_csv("./bugtest.csv", header=None, memory_map=True) # <-- this fails

Traceback (most recent call last):
  File "/home/michal/.conda/envs/py39nlp/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 69, in __init__
    self._reader = parsers.TextReader(self.handles.handle, **kwds)
  File "pandas/_libs/parsers.pyx", line 542, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 745, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 843, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1917, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 262143: unexpected end of data
python-BaseException

Issue Description

This bug occurs when the end of the internal 256 KB buffer falls inside a multibyte UTF-8 encoded character. When memory_map=True, the CSV parser uses the _MMapWrapper.read() method defined in common.py (L872):

def read(self, size: int = -1) -> str | bytes:
    # CSV c-engine uses read instead of iterating
    content: bytes = self.mmap.read(size)
    if self.decode:
        # memory mapping is applied before compression. Encoding should
        # be applied to the de-compressed data.
        return content.decode(self.encoding, errors=self.errors)
    return content

Because this function is called with size=256KB, the content buffer can end in the middle of a multibyte character. When that happens, the utf-8 codec raises the "unexpected end of data" error.
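
The failure can be sketched without pandas. In this minimal illustration the 8-byte chunk_size is only a stand-in for the 256 KB read size, and the variable names are hypothetical rather than taken from the pandas source:

```python
# Standalone illustration (no pandas involved): a fixed-size read can end
# in the middle of a multibyte UTF-8 sequence.
data = ("a" * 7 + "ą").encode("utf-8")  # 7 ASCII bytes followed by b"\xc4\x85"
chunk_size = 8  # assumption: small stand-in for the parser's 256 KB reads

first_chunk = data[:chunk_size]  # ends with the lone lead byte b"\xc4"

error = None
try:
    # bytes.decode() is stateless, so it cannot wait for the second byte
    first_chunk.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc

print(error.reason)  # unexpected end of data
```

The same stateless decode call succeeds on the complete data, which is why the error depends on where the chunk boundary falls rather than on the file contents alone.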

The _MMapWrapper.read() method was added in REGR: memory_map with non-UTF8 encoding #40994, so the bug is present in pandas 1.2.5 and newer versions.

Expected Behavior

read_csv should not raise an exception and should produce the same result as pd.read_csv("./bugtest.csv", header=None, memory_map=False).

Installed Versions

pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fsspec : 2021.07.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@michal-gh michal-gh added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 12, 2021
@phofl
Member

phofl commented Sep 12, 2021

We changed a few encoding-related things and added a new keyword, encoding_errors. I think this should solve your problems.

cc @twoertwein This is expected now, isn't it?

@twoertwein
Member

Thank you @michal-gh for your carefully crafted example!

This issue happens only with the c-engine and memory_map=True:

engine="python", memory_map=False: ą
engine="python", memory_map=True: ą
engine="c", memory_map=False: ą
engine="c", memory_map=True: UnicodeDecodeError
engine="c", memory_map=True, encoding_errors="ignore": a

Changing the encoding_errors to "ignore" lets you at least "avoid" the error, but it truncates ą to a.
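
A rough sketch of why "ignore" loses the character, assuming each chunk is decoded independently with bytes.decode (the chunk size and names here are illustrative, not pandas internals):

```python
# Each half of the split character is invalid on its own, so errors="ignore"
# drops both halves and the character disappears without any error.
data = ("a" * 7 + "ą" + "b").encode("utf-8")
chunks = [data[:8], data[8:]]  # split between b"\xc4" and b"\x85"

text = "".join(chunk.decode("utf-8", errors="ignore") for chunk in chunks)
print(text)  # aaaaaaab
```

So the error goes away, but the data is silently altered, which is arguably worse than the exception.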

I assume that is either a limitation of the IncrementalDecoder (codecs.getincrementaldecoder(encoding)(errors=errors)) or, more likely, of how it is being used. If I understand @michal-gh's description correctly, it might be possible to wrap the decode call in a try-except, read more bytes, and then retry the decoding. But if we are unlucky, we might again have half of a multibyte character at the end of the now larger bytes object (I honestly thought the IncrementalDecoder dealt with that magically).

@twoertwein twoertwein added IO CSV read_csv, to_csv Regression Functionality that used to work in a prior pandas version and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 12, 2021
@phofl phofl added this to the 1.3.4 milestone Sep 13, 2021
@michal-gh
Contributor Author

@twoertwein, you are right that IncrementalDecoder deals with such cases; the only problem is that it's not invoked by the code :-). The content.decode method is <function bytes.decode(encoding='utf-8', errors='strict')>. When I changed content.decode(self.encoding, errors=self.errors) to self.decoder.decode(content, final=False), the incremental decoder didn't raise the exception and correctly saved its state, keeping the last byte:

self.decoder.getstate()
Out[16]: (b'\xc4', 0)
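
The same behaviour can be reproduced outside pandas with the standard codecs module. This is a small sketch with an 8-byte split standing in for the 256 KB boundary:

```python
import codecs

# An incremental decoder keeps an incomplete trailing sequence in its buffer
# instead of raising, and completes it when the next chunk arrives.
decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
data = ("a" * 7 + "ą").encode("utf-8")

first = decoder.decode(data[:8], final=False)  # the b"\xc4" byte is held back
state = decoder.getstate()                     # (b"\xc4", 0)
second = decoder.decode(data[8:], final=True)  # the buffered byte plus b"\x85"

print(repr(first), state, repr(second))
```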

I also think that it makes sense to special-case the most common encoding "utf-8". If the file is in utf-8 encoding, the decode operation is a no-op, so it can be bypassed by a simple if self.encoding == "utf-8": return content

@twoertwein
Member

@michal-gh Do you want to create a PR (and a separate PR for the UTF-8 special case)?

@michal-gh
Contributor Author

I will try to make these PRs, they shouldn't be too difficult.

@twoertwein
Member

The tricky part is probably to have final=True for the last line, otherwise the last line will only be partially decoded and no error will be thrown.
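
One way to handle the end of input, sketched as a standalone generator rather than as the actual change made in pandas: the function name iter_decoded and its parameters are hypothetical, and the empty read is used as the end-of-input signal.

```python
import codecs
import io


def iter_decoded(raw, chunk_size, encoding="utf-8", errors="strict"):
    """Yield decoded text from a binary file object, chunk by chunk."""
    decoder = codecs.getincrementaldecoder(encoding)(errors=errors)
    while True:
        content = raw.read(chunk_size)
        if not content:
            # End of input: final=True turns a dangling partial character
            # into an error instead of silently dropping it.
            tail = decoder.decode(b"", final=True)
            if tail:
                yield tail
            return
        text = decoder.decode(content, final=False)
        if text:
            yield text


# A complete file decodes correctly even though the chunk splits "ą".
complete = io.BytesIO(("a" * 7 + "ą").encode("utf-8"))
joined = "".join(iter_decoded(complete, chunk_size=8))

# A file that really ends mid-character is still reported.
truncated = io.BytesIO(b"aaaaaaa\xc4")
try:
    "".join(iter_decoded(truncated, chunk_size=8))
    truncated_raises = False
except UnicodeDecodeError:
    truncated_raises = True
```

Without the final=True call, the truncated input would decode to seven a characters and the stray byte would be discarded without any error.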

@michal-gh
Contributor Author

@twoertwein, I created PR #43647, which fixes the decode bug.

3 participants