[BUG] CSV reader cannot handle unquoted quote character appearing in a field #11948

Open
revans2 opened this issue Oct 19, 2022 · 5 comments
Labels
0 - Backlog In queue waiting for assignment bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Contributor

revans2 commented Oct 19, 2022

Describe the bug
This is directly from NVIDIA/spark-rapids#6435. If a CSV file has a field like abc"" (an unquoted field that contains quote characters), the cudf CSV parser stops processing any further data.

Steps/Code to reproduce bug
Create a file test.csv with the following data in it.

134324937434,#1991 N Grayhawk,"",Menlo Park,89025,AB,United States,US
208564744937,"63,trevion Way","",st Lothian,h7f4h8,"",United Kingdom,GB
132709376823,16 Oakland PARK RD,"",ring,l1w1e4,South,Canada,CA
224867848652,7 kingwell Court,"",United,s7jd9,South United,United Kingdom,GB
169636884295,30 cartuja Road,"",Halifax,L0R 9p2,ON,Canada,CA
859473321609,Street,"",Manchester,92220,OR,United States,US
141096112545,99 rue des,"",Australia,jsd9je,"",France,FR
160397658930,5 Rise,"",walligshngton,RY6 8LT,FORT,United Kingdom,GB
726367494002,1852 Townsend st,666,Wallsend,90382,CA,United States,US
187644735867,Bärbel-HAMPDEN-Ping 37,"",Miami,13355,"",Kingdom,ZZ
948475348324,155 sw City ct,Rochdale,Germany,30864,FL,Australia,QQ
164083193213,abc"","",Jerez Fra.,11401,Cadiz,Spain,ES
198732413077,3p Grove Rochdale road,BAW,Fulifax,HX4 trW,"",Israel,GB
227433927227,95 novem blvd,"",RAW VILLAGE,3173,XYZ,Australia,IL

Now try to read it using CUDF. The last two rows are skipped, and abc"" is read back without its final " character.

(From Spark, using the RAPIDS plugin for Apache Spark)

+------------+--------------------+--------+-------------+-------+------------+--------------+---+
|         _c0|                 _c1|     _c2|          _c3|    _c4|         _c5|           _c6|_c7|
+------------+--------------------+--------+-------------+-------+------------+--------------+---+
|134324937434|    #1991 N Grayhawk|    null|   Menlo Park|  89025|          AB| United States| US|
|208564744937|      63,trevion Way|    null|   st Lothian| h7f4h8|        null|United Kingdom| GB|
|132709376823|  16 Oakland PARK RD|    null|         ring| l1w1e4|       South|        Canada| CA|
|224867848652|    7 kingwell Court|    null|       United|  s7jd9|South United|United Kingdom| GB|
|169636884295|     30 cartuja Road|    null|      Halifax|L0R 9p2|          ON|        Canada| CA|
|859473321609|              Street|    null|   Manchester|  92220|          OR| United States| US|
|141096112545|          99 rue des|    null|    Australia| jsd9je|        null|        France| FR|
|160397658930|              5 Rise|    null|walligshngton|RY6 8LT|        FORT|United Kingdom| GB|
|726367494002|    1852 Townsend st|     666|     Wallsend|  90382|          CA| United States| US|
|187644735867|Bärbel-HAMPDEN-Pi...|    null|        Miami|  13355|        null|       Kingdom| ZZ|
|948475348324|      155 sw City ct|Rochdale|      Germany|  30864|          FL|     Australia| QQ|
|164083193213|                abc"|    null|   Jerez Fra.|  11401|       Cadiz|         Spain| ES|
+------------+--------------------+--------+-------------+-------+------------+--------------+---+

Without the plugin I get back

+------------+--------------------+--------+-------------+-------+------------+--------------+---+
|         _c0|                 _c1|     _c2|          _c3|    _c4|         _c5|           _c6|_c7|
+------------+--------------------+--------+-------------+-------+------------+--------------+---+
|134324937434|    #1991 N Grayhawk|    null|   Menlo Park|  89025|          AB| United States| US|
|208564744937|      63,trevion Way|    null|   st Lothian| h7f4h8|        null|United Kingdom| GB|
|132709376823|  16 Oakland PARK RD|    null|         ring| l1w1e4|       South|        Canada| CA|
|224867848652|    7 kingwell Court|    null|       United|  s7jd9|South United|United Kingdom| GB|
|169636884295|     30 cartuja Road|    null|      Halifax|L0R 9p2|          ON|        Canada| CA|
|859473321609|              Street|    null|   Manchester|  92220|          OR| United States| US|
|141096112545|          99 rue des|    null|    Australia| jsd9je|        null|        France| FR|
|160397658930|              5 Rise|    null|walligshngton|RY6 8LT|        FORT|United Kingdom| GB|
|726367494002|    1852 Townsend st|     666|     Wallsend|  90382|          CA| United States| US|
|187644735867|Bärbel-HAMPDEN-Pi...|    null|        Miami|  13355|        null|       Kingdom| ZZ|
|948475348324|      155 sw City ct|Rochdale|      Germany|  30864|          FL|     Australia| QQ|
|164083193213|               abc""|    null|   Jerez Fra.|  11401|       Cadiz|         Spain| ES|
|198732413077|3p Grove Rochdale...|     BAW|      Fulifax|HX4 trW|        null|        Israel| GB|
|227433927227|       95 novem blvd|    null|  RAW VILLAGE|   3173|         XYZ|     Australia| IL|
+------------+--------------------+--------+-------------+-------+------------+--------------+---+

Which is also what I get back from pandas.

>>> pd.read_csv("./test.csv", header=None)
               0                       1         2              3        4             5               6   7
0   134324937434        #1991 N Grayhawk       NaN     Menlo Park    89025            AB   United States  US
1   208564744937          63,trevion Way       NaN     st Lothian   h7f4h8           NaN  United Kingdom  GB
2   132709376823      16 Oakland PARK RD       NaN           ring   l1w1e4         South          Canada  CA
3   224867848652        7 kingwell Court       NaN         United    s7jd9  South United  United Kingdom  GB
4   169636884295         30 cartuja Road       NaN        Halifax  L0R 9p2            ON          Canada  CA
5   859473321609                  Street       NaN     Manchester    92220            OR   United States  US
6   141096112545              99 rue des       NaN      Australia   jsd9je           NaN          France  FR
7   160397658930                  5 Rise       NaN  walligshngton  RY6 8LT          FORT  United Kingdom  GB
8   726367494002        1852 Townsend st       666       Wallsend    90382            CA   United States  US
9   187644735867  Bärbel-HAMPDEN-Ping 37       NaN          Miami    13355           NaN         Kingdom  ZZ
10  948475348324          155 sw City ct  Rochdale        Germany    30864            FL       Australia  QQ
11  164083193213                   abc""       NaN     Jerez Fra.    11401         Cadiz           Spain  ES
12  198732413077  3p Grove Rochdale road       BAW        Fulifax  HX4 trW           NaN          Israel  GB
13  227433927227           95 novem blvd       NaN    RAW VILLAGE     3173           XYZ       Australia  IL

Expected behavior
CUDF returns the same result as Pandas and Spark.
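
As a small illustration of the expected behavior, here is a sketch that uses Python's standard-library `csv` module. That choice is an assumption made for illustration; it is neither the pandas parser nor the cuDF one. With default settings it treats a quote character that appears after the start of an unquoted field as literal text, so the row containing abc"" still splits into eight fields and the two rows after it are still read.

```python
import csv
import io

# Three rows copied from test.csv above: the row with the unquoted abc""
# field and the two rows that follow it.
data = (
    '164083193213,abc"","",Jerez Fra.,11401,Cadiz,Spain,ES\n'
    '198732413077,3p Grove Rochdale road,BAW,Fulifax,HX4 trW,"",Israel,GB\n'
    '227433927227,95 novem blvd,"",RAW VILLAGE,3173,XYZ,Australia,IL\n'
)

# Default dialect: comma delimiter, double-quote as the quote character.
rows = list(csv.reader(io.StringIO(data)))

for row in rows:
    print(row)
```

Checking `len(rows)` and `rows[0][1]` is a quick way to confirm that no rows were dropped and that both quote characters survived.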

@revans2 revans2 added bug Something isn't working Needs Triage Need team to review and classify labels Oct 19, 2022
@revans2 revans2 added the Spark Functionality that helps Spark RAPIDS label Oct 19, 2022
@sameerz sameerz changed the title [BUG] CSV reader cannot handle unquoted quote characer appearing in a field [BUG] CSV reader cannot handle unquoted quote character appearing in a field Oct 19, 2022
@revans2
Contributor Author

revans2 commented Oct 20, 2022

I was able to make the test case a lot simpler and still see the same error.

1
2
abc""
4
5

This shows the same problems: only the first three lines come out, and the third entry is only abc", so it is missing the final " character.

@revans2
Contributor Author

revans2 commented Oct 20, 2022

Wow, even if I escape the quotes I still get the problem with the dropped lines. Escaping "fixes" the issue of the trailing quote being removed, but the escapes are not removed from the quotes.

1
2
"abc\"\""
4
5

But the output is

+-------+
|    _c0|
+-------+
|      1|
|      2|
|abc\"\"|
+-------+

when it should be

+-----+
|  _c0|
+-----+
|    1|
|    2|
|abc""|
|    4|
|    5|
+-----+

Oddly, things change if I remove the escapes and just keep the entire thing quoted.

1
2
"abc"""
4
5

It fixes the problem with dropping lines, but it does not fix the single entry.

+----+
| _c0|
+----+
|   1|
|   2|
|abc"|
|   4|
|   5|
+----+

versus the output from Spark:

+-----+
|  _c0|
+-----+
|    1|
|    2|
|abc""|
|    4|
|    5|
+-----+

For this one I am less sure that we have to match exactly what Spark is doing, because pandas matches CUDF in this case. pandas also handles escaped quotes differently, so just take these as info for now.
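
The two quoted variants above correspond to two different conventions for putting a quote inside a quoted field: doubling it, or prefixing it with an escape character. A minimal sketch of both follows, again using Python's standard-library `csv` module as a stand-in (an assumption for illustration; it says nothing about how Spark, pandas, or cuDF are implemented). The `parse_one` helper is hypothetical and exists only for this example.

```python
import csv


def parse_one(line, **dialect_options):
    """Parse a single CSV line and return its first field."""
    return next(csv.reader([line], **dialect_options))[0]


# Fully quoted field that ends with a doubled quote: "abc"""
doubled = parse_one('"abc"""')

# Backslash-escaped quotes inside a quoted field: "abc\"\""
# The csv module has no escape character by default, so it is set explicitly.
escaped = parse_one('"abc\\"\\""', escapechar="\\")

print(repr(doubled))
print(repr(escaped))
```

Under the doubling convention the pair `""` collapses to a single quote, which is why a reader that follows it returns one trailing quote for the fully quoted input.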

@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment cuIO cuIO issue and removed Needs Triage Need team to review and classify labels Oct 21, 2022
@GregoryKimball
Contributor

Thanks @revans2 for investigating this. I'm posting a Python reproducer:

>>> from io import StringIO
>>> import pandas as pd
>>> import cudf
>>> s = '1\n2\nabc""\n4\n5'
>>> pd.read_csv(StringIO(s), header=None)
       0
0      1
1      2
2  abc""
3      4
4      5
>>> cudf.read_csv(StringIO(s), header=None)
      0
0     1
1     2
2  abc"

@vuule vuule self-assigned this Oct 31, 2022
@vuule
Contributor

vuule commented Nov 1, 2022

I can't repro the issue from the comment above. I tried it in Python:
s = '1\n2\n"abc\"\""\n4\n'
But I'm getting the same output as with pandas (and it looks correct):

      1
0     2
1  abc"
2     4
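
One thing that may explain the difference, offered as an observation rather than a confirmed diagnosis: in a regular (non-raw) Python string literal, `\"` is simply a quote character, so the string above contains no backslashes. That makes it the same data as the fully quoted "abc""" case, not the backslash-escaped one. The sketch below shows the literal next to a version with the backslashes doubled so that they reach the data; the variable names are mine.

```python
# The literal from the comment above: \" inside '...' is just a quote character.
s_comment = '1\n2\n"abc\"\""\n4\n'

# The same literal with the backslashes doubled so they survive into the data.
s_backslashes = '1\n2\n"abc\\"\\""\n4\n'

# Third line of each input, exactly as a CSV reader would receive it.
print(s_comment.splitlines()[2])
print(s_backslashes.splitlines()[2])
```

The output above also shows 1 as the column label, which suggests the first line was read as a header rather than as data.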

@vuule
Contributor

vuule commented Aug 16, 2023

I did some scoping, and it seems this requires large changes to the way the reader finds row offsets. The current state machine has four states (represented by two bits); handling this case would require an additional state, and thus more bits. My main concern is the work involved in changing how the state machine packs and handles the states.
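
To make the row-offset discussion concrete, here is a toy sketch of a row-offset finder in which a quote only opens a quoted field at the start of a field and is literal anywhere else. The state names, the transitions, and the `row_offsets` and `bits_needed` functions are all assumptions of mine; they do not correspond to the states used in libcudf and are not a proposed fix. The second function only illustrates the packing arithmetic: four states fit in two bits, while five need three.

```python
from enum import Enum


class State(Enum):
    FIELD_START = 0      # at the beginning of a field
    UNQUOTED = 1         # inside an unquoted field; quotes here are literal
    QUOTED = 2           # inside a quoted field; newlines here do not end a row
    QUOTE_IN_QUOTED = 3  # just saw a quote while inside a quoted field


def row_offsets(data, delimiter=",", quote='"', newline="\n"):
    """Return the offset at which each row starts (toy implementation).

    A trailing newline yields a final offset equal to len(data).
    """
    offsets = [0]
    state = State.FIELD_START
    for i, ch in enumerate(data):
        if state is State.QUOTED:
            if ch == quote:
                state = State.QUOTE_IN_QUOTED
        elif state is State.QUOTE_IN_QUOTED and ch == quote:
            state = State.QUOTED          # doubled quote, still inside the field
        elif ch == newline:
            offsets.append(i + 1)         # the next row starts after the newline
            state = State.FIELD_START
        elif ch == delimiter:
            state = State.FIELD_START
        elif ch == quote and state is State.FIELD_START:
            state = State.QUOTED          # a quote opens a field only at its start
        else:
            state = State.UNQUOTED
    return offsets


def bits_needed(num_states):
    """Number of bits needed to encode num_states distinct states."""
    return max(1, (num_states - 1).bit_length())


print(row_offsets('1\n2\nabc""\n4\n5'))
print(bits_needed(4), bits_needed(5))
```

This sketch walks the input sequentially, one character at a time, which keeps the transitions easy to read but ignores the packing and parallelism concerns raised in the comment.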
