[BUG] read_csv encoding behavior differs from pandas #12516

mattf · 2023-01-10T15:32:51Z

Describe the bug
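When writing code with import cudf as pd, read_csv keeps the file's leading UTF-8 byte order mark in the first column name ('\ufeffMonth'), while pandas returns 'Month'.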

Steps/Code to reproduce bug

In [1]: import cudf as pd

In [2]: pd.__version__
Out[2]: '22.12.0'

In [3]: import opendatasets as od
   ...: od.download('https://www.kaggle.com/kianwee/agricultural-raw-material-prices-19902020')
Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Downloading agricultural-raw-material-prices-19902020.zip to ./agricultural-raw-material-prices-19902020
  0%|                                                                                                                            | 0.00/22.8k [00:00<?, ?B/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 22.8k/22.8k [00:00<00:00, 42.5MB/s]

In [4]: pd.read_csv('./agricultural-raw-material-prices-19902020/agricultural_raw_material.csv').columns[0]
Out[4]: '\ufeffMonth'

In [5]: import pandas

In [6]: pandas.read_csv('./agricultural-raw-material-prices-19902020/agricultural_raw_material.csv').columns[0]
Out[6]: 'Month'

In [7]: ! hexdump -C ./agricultural-raw-material-prices-19902020/agricultural_raw_material.csv | head
00000000  ef bb bf 4d 6f 6e 74 68  2c 43 6f 61 72 73 65 20  |...Month,Coarse |
00000010  77 6f 6f 6c 20 50 72 69  63 65 2c 43 6f 61 72 73  |wool Price,Coars|
00000020  65 20 77 6f 6f 6c 20 70  72 69 63 65 20 25 20 43  |e wool price % C|
00000030  68 61 6e 67 65 2c 43 6f  70 72 61 20 50 72 69 63  |hange,Copra Pric|
00000040  65 2c 43 6f 70 72 61 20  70 72 69 63 65 20 25 20  |e,Copra price % |
00000050  43 68 61 6e 67 65 2c 43  6f 74 74 6f 6e 20 50 72  |Change,Cotton Pr|
00000060  69 63 65 2c 43 6f 74 74  6f 6e 20 70 72 69 63 65  |ice,Cotton price|
00000070  20 25 20 43 68 61 6e 67  65 2c 46 69 6e 65 20 77  | % Change,Fine w|
00000080  6f 6f 6c 20 50 72 69 63  65 2c 46 69 6e 65 20 77  |ool Price,Fine w|
00000090  6f 6f 6c 20 70 72 69 63  65 20 25 20 43 68 61 6e  |ool price % Chan|

In [8]: ! file -i ./agricultural-raw-material-prices-19902020/agricultural_raw_material.csv
./agricultural-raw-material-prices-19902020/agricultural_raw_material.csv: text/csv; charset=utf-8

Expected behavior
The same behavior as pandas: the first column name should be 'Month', without the leading '\ufeff'.
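
For context, the difference can be reproduced with the Python standard library alone. The sketch below decodes the first nine bytes from the hexdump in the report (ef bb bf followed by "Month,") with two codecs. It only illustrates where '\ufeff' comes from; it makes no claim about how pandas or cuDF decode the file internally.

```python
# First nine bytes of the file, copied from the hexdump in the report:
# ef bb bf (the UTF-8 BOM) followed by "Month,".
raw = bytes.fromhex("efbbbf4d6f6e74682c")

# The plain "utf-8" codec keeps the BOM, which decodes to the single
# code point U+FEFF.
plain = raw.decode("utf-8")

# The "utf-8-sig" codec drops a leading BOM if one is present.
sig = raw.decode("utf-8-sig")

print(repr(plain))
print(repr(sig))
```

The three bytes EF BB BF are the UTF-8 encoding of U+FEFF, so the hexdump and the '\ufeffMonth' column name show the same mark, before and after decoding.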


mattf · 2023-01-10T15:38:06Z

this issue will be incredibly frustrating to users...

In [18]: df = pd.read_csv('./agricultural-raw-material-prices-19902020/agricultural_raw_material.csv')

In [19]: df
Out[19]: 
     Month Coarse wool Price Coarse wool price % Change  ... Soft sawnwood price % Change Wood pulp Price  Wood pulp price % Change
0    Apr-90            482.34                          -  ...                            -          829.29                         -
1    May-90            447.26                     -7.27%  ...                       -2.63%          842.51                     1.59%
2    Jun-90            440.99                     -1.40%  ...                       -6.10%          831.35                    -1.32%
3    Jul-90            418.44                     -5.11%  ...                        5.03%          798.83                    -3.91%
4    Aug-90            418.44                      0.00%  ...                       -0.83%          818.74                     2.49%
..      ...               ...                        ...  ...                          ...             ...                       ...
356  Dec-19              <NA>                       <NA>  ...                         <NA>           875.0                     0.00%
357  Jan-20              <NA>                       <NA>  ...                         <NA>           875.0                     0.00%
358  Feb-20              <NA>                       <NA>  ...                         <NA>           875.0                     0.00%
359  Mar-20              <NA>                       <NA>  ...                         <NA>           875.0                     0.00%
360  Apr-20              <NA>                       <NA>  ...                         <NA>            <NA>                      <NA>

[361 rows x 25 columns]

In [20]: df.Month
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/.local/lib/python3.9/site-packages/cudf/utils/utils.py:272, in GetAttrGetItemMixin.__getattr__(self, key)
    271 try:
--> 272     return self[key]
    273 except KeyError:

File ~/.local/lib/python3.9/site-packages/nvtx/nvtx.py:101, in annotate.__call__.<locals>.inner(*args, **kwargs)
    100 libnvtx_push_range(self.attributes, self.domain.handle)
--> 101 result = func(*args, **kwargs)
    102 libnvtx_pop_range(self.domain.handle)

File ~/.local/lib/python3.9/site-packages/cudf/core/dataframe.py:1157, in DataFrame.__getitem__(self, arg)
   1156 if _is_scalar_or_zero_d_array(arg) or isinstance(arg, tuple):
-> 1157     return self._get_columns_by_label(arg, downcast=True)
   1159 elif isinstance(arg, slice):

File ~/.local/lib/python3.9/site-packages/nvtx/nvtx.py:101, in annotate.__call__.<locals>.inner(*args, **kwargs)
    100 libnvtx_push_range(self.attributes, self.domain.handle)
--> 101 result = func(*args, **kwargs)
    102 libnvtx_pop_range(self.domain.handle)

File ~/.local/lib/python3.9/site-packages/cudf/core/dataframe.py:1893, in DataFrame._get_columns_by_label(self, labels, downcast)
   1888 """
   1889 Return columns of dataframe by `labels`
   1890 
   1891 If downcast is True, try and downcast from a DataFrame to a Series
   1892 """
-> 1893 new_data = super()._get_columns_by_label(labels, downcast)
   1894 if downcast:

File ~/.local/lib/python3.9/site-packages/nvtx/nvtx.py:101, in annotate.__call__.<locals>.inner(*args, **kwargs)
    100 libnvtx_push_range(self.attributes, self.domain.handle)
--> 101 result = func(*args, **kwargs)
    102 libnvtx_pop_range(self.domain.handle)

File ~/.local/lib/python3.9/site-packages/cudf/core/frame.py:418, in Frame._get_columns_by_label(self, labels, downcast)
    414 """
    415 Returns columns of the Frame specified by `labels`
    416 
    417 """
--> 418 return self._data.select_by_label(labels)

File ~/.local/lib/python3.9/site-packages/cudf/core/column_accessor.py:343, in ColumnAccessor.select_by_label(self, key)
    342         return self._select_by_label_with_wildcard(key)
--> 343 return self._select_by_label_grouped(key)

File ~/.local/lib/python3.9/site-packages/cudf/core/column_accessor.py:463, in ColumnAccessor._select_by_label_grouped(self, key)
    462 def _select_by_label_grouped(self, key: Any) -> ColumnAccessor:
--> 463     result = self._grouped_data[key]
    464     if isinstance(result, cudf.core.column.ColumnBase):

KeyError: 'Month'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
Cell In [20], line 1
----> 1 df.Month

File ~/.local/lib/python3.9/site-packages/cudf/utils/utils.py:274, in GetAttrGetItemMixin.__getattr__(self, key)
    272     return self[key]
    273 except KeyError:
--> 274     raise AttributeError(
    275         f"{type(self).__name__} object has no attribute {key}"
    276     )

AttributeError: DataFrame object has no attribute Month

In [21]: df['Month']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In [21], line 1
----> 1 df['Month']

File ~/.local/lib/python3.9/site-packages/nvtx/nvtx.py:101, in annotate.__call__.<locals>.inner(*args, **kwargs)
     98 @wraps(func)
     99 def inner(*args, **kwargs):
    100     libnvtx_push_range(self.attributes, self.domain.handle)
--> 101     result = func(*args, **kwargs)
    102     libnvtx_pop_range(self.domain.handle)
    103     return result

File ~/.local/lib/python3.9/site-packages/cudf/core/dataframe.py:1157, in DataFrame.__getitem__(self, arg)
   1095 """
   1096 If *arg* is a ``str`` or ``int`` type, return the column Series.
   1097 If *arg* is a ``slice``, return a new DataFrame with all columns
   (...)
   1154 8  8  8  8
   1155 """
   1156 if _is_scalar_or_zero_d_array(arg) or isinstance(arg, tuple):
-> 1157     return self._get_columns_by_label(arg, downcast=True)
   1159 elif isinstance(arg, slice):
   1160     return self._slice(arg)

File ~/.local/lib/python3.9/site-packages/nvtx/nvtx.py:101, in annotate.__call__.<locals>.inner(*args, **kwargs)
     98 @wraps(func)
     99 def inner(*args, **kwargs):
    100     libnvtx_push_range(self.attributes, self.domain.handle)
--> 101     result = func(*args, **kwargs)
    102     libnvtx_pop_range(self.domain.handle)
    103     return result

File ~/.local/lib/python3.9/site-packages/cudf/core/dataframe.py:1893, in DataFrame._get_columns_by_label(self, labels, downcast)
   1886 @_cudf_nvtx_annotate
   1887 def _get_columns_by_label(self, labels, downcast=False):
   1888     """
   1889     Return columns of dataframe by `labels`
   1890 
   1891     If downcast is True, try and downcast from a DataFrame to a Series
   1892     """
-> 1893     new_data = super()._get_columns_by_label(labels, downcast)
   1894     if downcast:
   1895         if is_scalar(labels):

File ~/.local/lib/python3.9/site-packages/nvtx/nvtx.py:101, in annotate.__call__.<locals>.inner(*args, **kwargs)
     98 @wraps(func)
     99 def inner(*args, **kwargs):
    100     libnvtx_push_range(self.attributes, self.domain.handle)
--> 101     result = func(*args, **kwargs)
    102     libnvtx_pop_range(self.domain.handle)
    103     return result

File ~/.local/lib/python3.9/site-packages/cudf/core/frame.py:418, in Frame._get_columns_by_label(self, labels, downcast)
    412 @_cudf_nvtx_annotate
    413 def _get_columns_by_label(self, labels, downcast=False):
    414     """
    415     Returns columns of the Frame specified by `labels`
    416 
    417     """
--> 418     return self._data.select_by_label(labels)

File ~/.local/lib/python3.9/site-packages/cudf/core/column_accessor.py:343, in ColumnAccessor.select_by_label(self, key)
    341     if any(isinstance(k, slice) for k in key):
    342         return self._select_by_label_with_wildcard(key)
--> 343 return self._select_by_label_grouped(key)

File ~/.local/lib/python3.9/site-packages/cudf/core/column_accessor.py:463, in ColumnAccessor._select_by_label_grouped(self, key)
    462 def _select_by_label_grouped(self, key: Any) -> ColumnAccessor:
--> 463     result = self._grouped_data[key]
    464     if isinstance(result, cudf.core.column.ColumnBase):
    465         return self.__class__({key: result})

KeyError: 'Month'

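The 'Month' column appears to be there, but it isn't: its actual name is '\ufeffMonth', and the leading character is invisible when the DataFrame is printed.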
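
Until the reader handles this, one possible workaround is to clean the column labels after loading. The helper below is a hypothetical sketch (strip_bom_from_labels is not a cuDF or pandas function) and is shown on a plain list of labels so that it runs without a GPU. Assigning the result back with df.columns = ... is an assumption about typical DataFrame usage and was not run against cuDF 22.12.

```python
BOM = "\ufeff"  # U+FEFF, the code point a decoded UTF-8 BOM turns into


def strip_bom_from_labels(labels):
    """Return a list of labels with a leading U+FEFF removed from string labels."""
    cleaned = []
    for label in labels:
        # Only string labels can carry the mark; other label types pass through.
        if isinstance(label, str) and label.startswith(BOM):
            label = label[len(BOM):]
        cleaned.append(label)
    return cleaned


# Labels as they would come back from the affected read_csv call (abbreviated).
labels = ["\ufeffMonth", "Coarse wool Price", "Wood pulp Price"]
fixed = strip_bom_from_labels(labels)
print(fixed)

# Assumed usage on a DataFrame (not executed here):
#     df.columns = strip_bom_from_labels(df.columns)
```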

GregoryKimball · 2023-06-06T03:19:36Z

@mattf would you please share more information about this issue? Is there a whitespace problem in the column name?

Update, courtesy ChatGPT:

The Unicode character '\ufeff' is known as the "ZERO WIDTH NO-BREAK SPACE" (ZWNBSP). It is a special character used primarily in text encoding and processing. It is a non-printing character that behaves like a whitespace character but does not allow line-breaking at its position.

So the request could be to remove all non-printing characters from column names. Seems like something that should happen in the cuDF-python layer, rather than the libcudf layer.
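
A rough sketch of what "remove all non-printing characters" could mean at the Python layer, using only the standard library. The function name drop_nonprinting is hypothetical, and treating every Unicode general category that starts with "C" (control, format, and so on) as non-printing is an assumption, not an existing cuDF behavior.

```python
import unicodedata


def drop_nonprinting(label: str) -> str:
    """Remove characters whose Unicode general category starts with 'C'.

    This covers control (Cc) and format (Cf) characters; U+FEFF is in Cf.
    Spaces are in category Zs and are kept.
    """
    return "".join(
        ch for ch in label if not unicodedata.category(ch).startswith("C")
    )


cleaned = drop_nonprinting("\ufeffMonth")
print(cleaned)
```

Note that this is broader than BOM handling: it would also drop tabs and other control characters anywhere in a label, which may not match what pandas does.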

GregoryKimball · 2023-06-06T03:24:05Z

@davidwendt do we have a strings API to "remove all non-printing" characters? Maybe a version of "normalize" could do this?

davidwendt · 2023-06-06T14:58:31Z

This is actually a BOM (byte order mark) which is sometimes added to the beginning of text files to identify the encoding.
https://en.wikipedia.org/wiki/Byte_order_mark
Note the hexdump provided in the description shows the BOM is EFBBBF which identifies the file as UTF-8 encoded.
So I'm not sure where the FEFF is coming from. I downloaded the file from the kaggle page and I don't see the FEFF in the hexdump either.
Regardless, it may be possible for the CSV reader to check for these 3 bytes at the beginning of the file and just skip them if it's UTF-8 and throw an error if it is UTF-16 or UTF-32 for example.

Once the 3 bytes are removed, cuDF should be able to read the file.
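
The check described above could look roughly like the following. This is a Python sketch of the idea only, not the libcudf implementation; skip_utf8_bom and the error message are made up for illustration. The BOM constants come from the standard codecs module.

```python
import codecs

# UTF-32 marks are listed before UTF-16 ones because BOM_UTF32_LE
# (ff fe 00 00) begins with the same two bytes as BOM_UTF16_LE (ff fe).
_UNSUPPORTED_BOMS = (
    (codecs.BOM_UTF32_LE, "UTF-32-LE"),
    (codecs.BOM_UTF32_BE, "UTF-32-BE"),
    (codecs.BOM_UTF16_LE, "UTF-16-LE"),
    (codecs.BOM_UTF16_BE, "UTF-16-BE"),
)


def skip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM; reject UTF-16/UTF-32 BOMs."""
    if data.startswith(codecs.BOM_UTF8):
        # Skip the three bytes ef bb bf.
        return data[len(codecs.BOM_UTF8):]
    for bom, name in _UNSUPPORTED_BOMS:
        if data.startswith(bom):
            raise ValueError(f"unsupported encoding: {name} byte order mark found")
    # No BOM at all: hand the buffer back unchanged.
    return data


header = skip_utf8_bom(b"\xef\xbb\xbfMonth,Coarse wool Price")
print(header)
```

A buffer with no BOM is returned unchanged, so the check is safe to run on every input.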

Adds support to the CSV reader to skip the UTF-8 BOM bytes at the beginning of a text source. Some text files may contain a BOM (byte order mark) at the beginning of the file to identify its encoding.
https://en.wikipedia.org/wiki/Byte_order_mark
cuDF only supports UTF-8 encoding, so skipping these bytes allows us to support these types of files.

Closes #12516

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #13516

mattf added the "Needs Triage" and "bug" labels Jan 10, 2023

GregoryKimball added this to the Modernize CSV reader milestone Feb 9, 2023

GregoryKimball added the "0 - Backlog", "0 - Waiting on Author", and "Python" labels and removed the "Needs Triage" label Jun 6, 2023

GregoryKimball removed the "0 - Waiting on Author" label Jun 6, 2023

davidwendt mentioned this issue Jun 6, 2023: Support UTF-8 BOM in CSV reader #13516 (merged)

rapids-bot bot closed this as completed in #13516 Jun 7, 2023
