BUG: Properly parse unicode usecols names in CSV #24856

Merged 1 commit on Jan 21, 2019

1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.24.0.rst
@@ -1705,6 +1705,7 @@ I/O
 ^^^

 - Bug in :func:`read_csv` in which a column specified with ``CategoricalDtype`` of boolean categories was not being correctly coerced from string values to booleans (:issue:`20498`)
+- Bug in :func:`read_csv` in which unicode column names were not being properly recognized with Python 2.x (:issue:`13253`)
 - Bug in :meth:`DataFrame.to_sql` when writing timezone aware data (``datetime64[ns, tz]`` dtype) would raise a ``TypeError`` (:issue:`9086`)
 - Bug in :meth:`DataFrame.to_sql` where a naive :class:`DatetimeIndex` would be written as ``TIMESTAMP WITH TIMEZONE`` type in supported databases, e.g. PostgreSQL (:issue:`23510`)
 - Bug in :meth:`read_excel()` when ``parse_cols`` is specified with an empty dataset (:issue:`9208`)
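
For context, here is a minimal Python 3 sketch of the kind of mismatch the gh-13253 entry describes. It assumes, going only by the fix in `pandas/io/parsers.py`, that under Python 2 the parsed header names were UTF-8 byte strings while `usecols` held unicode text. All variable names are illustrative, the header values are borrowed from the test file, and none of this is pandas code.

```python
# Illustration only: text vs. UTF-8 bytes, mimicking the Python 2 mismatch.
header = ["あああ", "いい", "ううう", "ええええ"]

# Assumption: the parser holds header names as UTF-8 encoded byte strings.
parsed_names = [name.encode("utf-8") for name in header]

# usecols passed as unicode text, as in the previously skipped test case.
usecols_text = {"あああ", "いい"}

# Text never compares equal to bytes, so no column is matched.
selected_before = [n for n in parsed_names if n in usecols_text]

# Encoding usecols the same way makes the membership test succeed.
usecols_bytes = {col.encode("utf-8") for col in usecols_text}
selected_after = [n.decode("utf-8") for n in parsed_names if n in usecols_bytes]

print(selected_before)
print(selected_after)
```

The second lookup mirrors the `col.encode("utf-8")` step this PR adds for the `"unicode"` case.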
29 changes: 21 additions & 8 deletions pandas/io/parsers.py
@@ -1296,15 +1296,28 @@ def _validate_usecols_arg(usecols):
     if usecols is not None:
         if callable(usecols):
             return usecols, None
-        # GH20529, ensure is iterable container but not string.
-        elif not is_list_like(usecols):
+
+        if not is_list_like(usecols):
+            # see gh-20529
+            #
+            # Ensure it is iterable container but not string.
             raise ValueError(msg)
-        else:
-            usecols_dtype = lib.infer_dtype(usecols, skipna=False)
-            if usecols_dtype not in ('empty', 'integer',
-                                     'string', 'unicode'):
-                raise ValueError(msg)
-        return set(usecols), usecols_dtype
+
+        usecols_dtype = lib.infer_dtype(usecols, skipna=False)
+
+        if usecols_dtype not in ("empty", "integer",
+                                 "string", "unicode"):
+            raise ValueError(msg)
+
+        usecols = set(usecols)
+
+        if usecols_dtype == "unicode":
+            # see gh-13253
+            #
+            # Python 2.x compatibility
+            usecols = {col.encode("utf-8") for col in usecols}
+
+        return usecols, usecols_dtype
     return usecols, None
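
The flattened control flow in this hunk can be sketched as a standalone Python 3 function. This is a rough approximation, not the pandas implementation: `is_list_like_sketch`, `infer_kind_sketch`, `validate_usecols_sketch`, and the error message are hypothetical stand-ins, and the real `is_list_like` and `lib.infer_dtype` cover many more cases. The `"unicode"` branch is left out because a Python 3 `str` would not need the Python 2 encoding step.

```python
def is_list_like_sketch(obj):
    """Hypothetical stand-in: iterable, but not a str or bytes."""
    return hasattr(obj, "__iter__") and not isinstance(obj, (str, bytes))


def infer_kind_sketch(values):
    """Hypothetical, heavily simplified stand-in for lib.infer_dtype."""
    values = list(values)
    if not values:
        return "empty"
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return "integer"
    if all(isinstance(v, str) for v in values):
        return "string"
    return "mixed"


def validate_usecols_sketch(usecols):
    """Return (usecols, kind), following the early-return layout of the hunk."""
    msg = "usecols must be list-like of all strings, all integers, or a callable"

    # None and callables pass through untouched.
    if usecols is None or callable(usecols):
        return usecols, None

    # Reject scalars and bare strings before inspecting the elements.
    if not is_list_like_sketch(usecols):
        raise ValueError(msg)

    kind = infer_kind_sketch(usecols)
    if kind not in ("empty", "integer", "string"):
        raise ValueError(msg)

    # The Python 2 only "unicode" encoding step would go here.
    return set(usecols), kind


print(validate_usecols_sketch(["あああ", "いい"]))
```

Replacing the `elif`/`else` chain with early exits leaves a single place between building the set and returning it, which is where the PR slots in the encoding step.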


5 changes: 2 additions & 3 deletions pandas/tests/io/parser/test_usecols.py
@@ -9,7 +9,7 @@
 import pytest

 from pandas._libs.tslib import Timestamp
-from pandas.compat import PY2, StringIO
+from pandas.compat import StringIO

 from pandas import DataFrame, Index
 import pandas.util.testing as tm
@@ -387,8 +387,7 @@ def test_usecols_with_mixed_encoding_strings(all_parsers, usecols):

 @pytest.mark.parametrize("usecols", [
     ["あああ", "いい"],
-    pytest.param([u"あああ", u"いい"], marks=pytest.mark.skipif(
-        PY2, reason="Buggy behavior: see gh-13253"))
+    [u"あああ", u"いい"]
 ])
 def test_usecols_with_multi_byte_characters(all_parsers, usecols):
     data = """あああ,いい,ううう,ええええ