[BUG] read_csv encoding behavior differs from pandas #12516
Comments
this issue will be incredibly frustrating to users...
this 'Month' column appears to be there, but isn't
@mattf would you please share more information about this issue? Is there a whitespace problem in the column name? Update, courtesy ChatGPT:
So the request could be to remove all non-printing characters from column names. Seems like something that should happen in the cuDF-python layer, rather than the libcudf layer.
@davidwendt do we have a strings API to "remove all non-printing" characters? Maybe a version of "normalize" could do this?
This is actually a BOM (byte order mark), which is sometimes added to the beginning of text files to identify the encoding. Once the 3 bytes are removed, cuDF should be able to read the file.
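To make the symptom concrete, here is a minimal sketch using only the Python standard library (not cuDF or pandas). The file contents and the `Month`/`Sales` columns are made up for illustration. It shows how a leading UTF-8 BOM ends up glued to the first column name, so the column looks present when printed but does not match `"Month"`:

```python
import codecs
import csv
import io

# Hypothetical file contents: a UTF-8 BOM followed by a small CSV.
raw = codecs.BOM_UTF8 + b"Month,Sales\nJan,10\nFeb,12\n"


def header(data: bytes, encoding: str) -> list:
    """Decode the bytes and return the first CSV row (the column names)."""
    text = data.decode(encoding)
    return next(csv.reader(io.StringIO(text)))


with_bom = header(raw, "utf-8")         # BOM survives decoding as U+FEFF
without_bom = header(raw, "utf-8-sig")  # BOM is dropped while decoding

print("Month" in with_bom)     # False
print("Month" in without_bom)  # True
print(repr(with_bom[0]))       # '\ufeffMonth'
```

U+FEFF is invisible in most terminals and notebooks, which is why the column name looks correct at a glance.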
Adds support to the CSV reader to skip the UTF-8 BOM bytes at the beginning of a text source. Some text files contain a BOM (byte order mark) at the beginning of the file to identify its encoding (see https://en.wikipedia.org/wiki/Byte_order_mark). cuDF only supports UTF-8 encoding, so skipping these bytes allows it to read these files.
Closes #12516
Authors:
- David Wendt (https://github.com/davidwendt)
Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Vukasin Milovanovic (https://github.com/vuule)
URL: #13516
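For anyone on a version without this fix, the idea can be sketched in a few lines of Python. This is not the libcudf implementation, only an illustration of the same check; `skip_utf8_bom` is a hypothetical helper name.

```python
import codecs


def skip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (bytes EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Made-up CSV bytes, with and without a BOM.
with_bom = codecs.BOM_UTF8 + b"Month,Sales\nJan,10\n"
no_bom = b"Month,Sales\nJan,10\n"

cleaned = skip_utf8_bom(with_bom)
untouched = skip_utf8_bom(no_bom)
```

Only a BOM at the very start of the buffer is dropped; the same byte sequence appearing later in the data is left alone.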
Describe the bug
Writing code with import cudf as pd
Steps/Code to reproduce bug
Expected behavior
Same behavior as pandas