BUG: pd.read_csv() fails when usecols contains multibyte unicode values #13253

Closed
hassanshamim opened this issue May 21, 2016 · 3 comments
Labels: IO CSV (read_csv, to_csv), Unicode (Unicode strings)

@hassanshamim
Contributor

hassanshamim commented May 21, 2016

See PR #13233 and the test test_usecols_with_multibyte_unicode_characters in pandas/io/tests/parser/usecols.py.

Code Sample, a copy-pastable example if possible

import pandas as pd
from pandas.compat import StringIO

s = '''あああ,いい,ううう,ええええ
0.056674973,8,True,a
2.613230982,2,False,b
3.568935038,7,False,a
'''
data = {
    'あああ': {
        0: 0.056674972999999997,
        1: 2.6132309819999997,
        2: 3.5689350380000002
    },
    'いい': {0: 8, 1: 2, 2: 7}
}
expected = pd.DataFrame(data)

df = pd.read_csv(StringIO(s), usecols=[u'あああ', u'いい'])
>>> df = pd.read_csv(StringIO(s), usecols=[u'あああ', u'いい'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "pandas/io/parsers.py", line 315, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "pandas/io/parsers.py", line 645, in __init__
    self._make_engine(self.engine)
  File "pandas/io/parsers.py", line 799, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "pandas/io/parsers.py", line 1263, in __init__
    raise ValueError("Usecols do not match names.")
ValueError: Usecols do not match names.

>>> df = pd.read_csv(StringIO(s), usecols=[u'あああ', u'いい'], engine='python')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "pandas/io/parsers.py", line 315, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "pandas/io/parsers.py", line 645, in __init__
    self._make_engine(self.engine)
  File "pandas/io/parsers.py", line 805, in _make_engine
    self._engine = klass(self.f, **self.options)
  File "pandas/io/parsers.py", line 1609, in __init__
    self.columns, self.num_original_columns = self._infer_columns()
  File "pandas/io/parsers.py", line 1905, in _infer_columns
    columns = self._handle_usecols(columns, columns[0])
  File "pandas/io/parsers.py", line 1958, in _handle_usecols
    col_indices.append(usecols_key.index(u))
ValueError: u'\u3042\u3042\u3042' is not in list
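
Both tracebacks are consistent with the header names being held as UTF-8 byte strings while usecols holds unicode strings, so the name lookup never finds a match. That reading is an assumption and is not confirmed anywhere in this thread. A minimal sketch of the mismatch, written for Python 3 with only the standard library, where `find_usecols_indices` and `normalize_names` are hypothetical helpers and not pandas functions:

```python
# Sketch of the suspected mismatch. On Python 3, bytes and str never compare
# equal, which mirrors how a non-ASCII byte string and a unicode string
# compare unequal on Python 2.

def find_usecols_indices(header, usecols):
    """Return the position of each requested column via list.index."""
    return [header.index(name) for name in usecols]


def normalize_names(names, encoding="utf-8"):
    """Hypothetical helper: decode any byte-string names to text."""
    return [n.decode(encoding) if isinstance(n, bytes) else n for n in names]


# Assumption: the parser sees the header as undecoded UTF-8 bytes.
raw_header = [s.encode("utf-8") for s in ["あああ", "いい", "ううう", "ええええ"]]
usecols = ["あああ", "いい"]

try:
    find_usecols_indices(raw_header, usecols)
    mismatch_error = None
except ValueError as exc:
    # Text names are "not in list" when the list holds byte names.
    mismatch_error = exc

# Decoding the header first lets the lookup succeed.
indices = find_usecols_indices(normalize_names(raw_header), usecols)
print(indices)
```

Decoding the names before the lookup is only one possible way to make both sides comparable; the thread does not say which approach the eventual patch took.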

Expected Output

        あああ  いい
0  0.056675   8
1  2.613231   2
2  3.568935   7

output of pd.show_versions()

INSTALLED VERSIONS

commit: 71f5139f14b38a1a3c1d9fd7d9219947399faec0
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 15.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1+54.g71f5139.dirty
nose: 1.3.7
pip: 8.1.2
setuptools: 20.7.0
Cython: 0.24
numpy: 1.11.0
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

@gfyoung
Member

gfyoung commented May 21, 2016

@hassanshamim : Also show that it fails for the Python engine too (just pass in `engine='python'` to get the error message, then copy/paste it into your original issue).

@jreback changed the title from "BUG: pd.read_csv() fails when usecols kwarg contains multibyte unicode values" to "BUG: pd.read_csv() fails when usecols contains multibyte unicode values" on May 22, 2016
@jreback added the Unicode (Unicode strings) and IO CSV (read_csv, to_csv) labels on May 22, 2016
@gfyoung
Member

gfyoung commented Jan 20, 2019

@jreback : This is a Python 2.x issue now (the test is enabled for Python 3.x)

https://github.com/pandas-dev/pandas/blob/01e7872b0/pandas/tests/io/parser/test_usecols.py#L388-L411

I think I have a patch for this, but I'm wondering whether it's worthwhile to add compatibility, or just close this out because of how close we are to releasing 0.24.0, the last Python 2.x-compatible release?
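
The Python 3 behaviour mentioned above can be checked with a port of the original reproduction. This is a sketch that assumes a Python 3 interpreter and an installed pandas, and it uses `io.StringIO` from the standard library in place of `pandas.compat.StringIO`:

```python
import io

import pandas as pd

csv_text = """あああ,いい,ううう,ええええ
0.056674973,8,True,a
2.613230982,2,False,b
3.568935038,7,False,a
"""

# On Python 3 every str is Unicode, so the u'' prefix is unnecessary.
df = pd.read_csv(io.StringIO(csv_text), usecols=["あああ", "いい"])
print(df)
```

Here the header and the usecols entries are both text, so the selection by name matches.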

@jreback
Contributor

jreback commented Jan 21, 2019

You can patch it if you want (0.24.x will be Python 2 compatible), or close it if it's not trivial.

gfyoung added a commit to forking-repos/pandas that referenced this issue Jan 21, 2019
@gfyoung added this to the 0.24.0 milestone on Jan 21, 2019
gfyoung added a commit to forking-repos/pandas that referenced this issue Jan 21, 2019
@jreback modified the milestones: 0.24.0, Contributions Welcome on Jan 21, 2019