[BUG] CSV reader cannot handle unquoted quote character appearing in a field #11948

revans2 · 2022-10-19T20:45:58Z

Describe the bug
This comes directly from NVIDIA/spark-rapids#6435. If a CSV file has a field like abc"", the cudf CSV parser stops processing the rest of the data.

Steps/Code to reproduce bug
Create a file test.csv with the following data in it.

134324937434,#1991 N Grayhawk,"",Menlo Park,89025,AB,United States,US
208564744937,"63,trevion Way","",st Lothian,h7f4h8,"",United Kingdom,GB
132709376823,16 Oakland PARK RD,"",ring,l1w1e4,South,Canada,CA
224867848652,7 kingwell Court,"",United,s7jd9,South United,United Kingdom,GB
169636884295,30 cartuja Road,"",Halifax,L0R 9p2,ON,Canada,CA
859473321609,Street,"",Manchester,92220,OR,United States,US
141096112545,99 rue des,"",Australia,jsd9je,"",France,FR
160397658930,5 Rise,"",walligshngton,RY6 8LT,FORT,United Kingdom,GB
726367494002,1852 Townsend st,666,Wallsend,90382,CA,United States,US
187644735867,Bärbel-HAMPDEN-Ping 37,"",Miami,13355,"",Kingdom,ZZ
948475348324,155 sw City ct,Rochdale,Germany,30864,FL,Australia,QQ
164083193213,abc"","",Jerez Fra.,11401,Cadiz,Spain,ES
198732413077,3p Grove Rochdale road,BAW,Fulifax,HX4 trW,"",Israel,GB
227433927227,95 novem blvd,"",RAW VILLAGE,3173,XYZ,Australia,IL

Now try to read it using CUDF. The last two rows are skipped, and abc"" is read back missing its final ".

(From spark using the rapids plugin for apache spark)

+------------+--------------------+--------+-------------+-------+------------+--------------+---+
|         _c0|                 _c1|     _c2|          _c3|    _c4|         _c5|           _c6|_c7|
+------------+--------------------+--------+-------------+-------+------------+--------------+---+
|134324937434|    #1991 N Grayhawk|    null|   Menlo Park|  89025|          AB| United States| US|
|208564744937|      63,trevion Way|    null|   st Lothian| h7f4h8|        null|United Kingdom| GB|
|132709376823|  16 Oakland PARK RD|    null|         ring| l1w1e4|       South|        Canada| CA|
|224867848652|    7 kingwell Court|    null|       United|  s7jd9|South United|United Kingdom| GB|
|169636884295|     30 cartuja Road|    null|      Halifax|L0R 9p2|          ON|        Canada| CA|
|859473321609|              Street|    null|   Manchester|  92220|          OR| United States| US|
|141096112545|          99 rue des|    null|    Australia| jsd9je|        null|        France| FR|
|160397658930|              5 Rise|    null|walligshngton|RY6 8LT|        FORT|United Kingdom| GB|
|726367494002|    1852 Townsend st|     666|     Wallsend|  90382|          CA| United States| US|
|187644735867|Bärbel-HAMPDEN-Pi...|    null|        Miami|  13355|        null|       Kingdom| ZZ|
|948475348324|      155 sw City ct|Rochdale|      Germany|  30864|          FL|     Australia| QQ|
|164083193213|                abc"|    null|   Jerez Fra.|  11401|       Cadiz|         Spain| ES|
+------------+--------------------+--------+-------------+-------+------------+--------------+---+

Without the plugin I get back

+------------+--------------------+--------+-------------+-------+------------+--------------+---+
|         _c0|                 _c1|     _c2|          _c3|    _c4|         _c5|           _c6|_c7|
+------------+--------------------+--------+-------------+-------+------------+--------------+---+
|134324937434|    #1991 N Grayhawk|    null|   Menlo Park|  89025|          AB| United States| US|
|208564744937|      63,trevion Way|    null|   st Lothian| h7f4h8|        null|United Kingdom| GB|
|132709376823|  16 Oakland PARK RD|    null|         ring| l1w1e4|       South|        Canada| CA|
|224867848652|    7 kingwell Court|    null|       United|  s7jd9|South United|United Kingdom| GB|
|169636884295|     30 cartuja Road|    null|      Halifax|L0R 9p2|          ON|        Canada| CA|
|859473321609|              Street|    null|   Manchester|  92220|          OR| United States| US|
|141096112545|          99 rue des|    null|    Australia| jsd9je|        null|        France| FR|
|160397658930|              5 Rise|    null|walligshngton|RY6 8LT|        FORT|United Kingdom| GB|
|726367494002|    1852 Townsend st|     666|     Wallsend|  90382|          CA| United States| US|
|187644735867|Bärbel-HAMPDEN-Pi...|    null|        Miami|  13355|        null|       Kingdom| ZZ|
|948475348324|      155 sw City ct|Rochdale|      Germany|  30864|          FL|     Australia| QQ|
|164083193213|               abc""|    null|   Jerez Fra.|  11401|       Cadiz|         Spain| ES|
|198732413077|3p Grove Rochdale...|     BAW|      Fulifax|HX4 trW|        null|        Israel| GB|
|227433927227|       95 novem blvd|    null|  RAW VILLAGE|   3173|         XYZ|     Australia| IL|
+------------+--------------------+--------+-------------+-------+------------+--------------+---+

Which is also what I get back from pandas.

>>> pd.read_csv("./test.csv", header=None)
               0                       1         2              3        4             5               6   7
0   134324937434        #1991 N Grayhawk       NaN     Menlo Park    89025            AB   United States  US
1   208564744937          63,trevion Way       NaN     st Lothian   h7f4h8           NaN  United Kingdom  GB
2   132709376823      16 Oakland PARK RD       NaN           ring   l1w1e4         South          Canada  CA
3   224867848652        7 kingwell Court       NaN         United    s7jd9  South United  United Kingdom  GB
4   169636884295         30 cartuja Road       NaN        Halifax  L0R 9p2            ON          Canada  CA
5   859473321609                  Street       NaN     Manchester    92220            OR   United States  US
6   141096112545              99 rue des       NaN      Australia   jsd9je           NaN          France  FR
7   160397658930                  5 Rise       NaN  walligshngton  RY6 8LT          FORT  United Kingdom  GB
8   726367494002        1852 Townsend st       666       Wallsend    90382            CA   United States  US
9   187644735867  Bärbel-HAMPDEN-Ping 37       NaN          Miami    13355           NaN         Kingdom  ZZ
10  948475348324          155 sw City ct  Rochdale        Germany    30864            FL       Australia  QQ
11  164083193213                   abc""       NaN     Jerez Fra.    11401         Cadiz           Spain  ES
12  198732413077  3p Grove Rochdale road       BAW        Fulifax  HX4 trW           NaN          Israel  GB
13  227433927227           95 novem blvd       NaN    RAW VILLAGE     3173           XYZ       Australia  IL

Expected behavior
CUDF returns the same result as Pandas and Spark.
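
As a neutral point of comparison, the expected field values can be sketched with Python's standard-library csv module, which is neither cudf nor Spark. The snippet below is only an illustration: it feeds three rows copied from test.csv above through csv.reader with default settings, and the variable names are placeholders.

```python
import csv
import io

# Three rows copied from test.csv above, including the row with abc"".
sample = (
    '208564744937,"63,trevion Way","",st Lothian,h7f4h8,"",United Kingdom,GB\n'
    '164083193213,abc"","",Jerez Fra.,11401,Cadiz,Spain,ES\n'
    '198732413077,3p Grove Rochdale road,BAW,Fulifax,HX4 trW,"",Israel,GB\n'
)

# Default dialect: comma delimiter, '"' as quote character, doubled quotes
# inside quoted fields collapse to one, no escape character.
rows = list(csv.reader(io.StringIO(sample)))

for row in rows:
    print(row)
```

With these settings a quote only opens a quoted field when it is the first character of the field, so abc"" is kept verbatim and the row after it is still parsed.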


revans2 · 2022-10-20T13:26:46Z

I was able to make the test case a lot simpler and still see the same error.

1
2
abc""
4
5

This shows the same problems: only the first three lines come out, and the third entry is only abc"; it is missing the final ".

revans2 · 2022-10-20T13:45:47Z

Wow, even if I escape the quotes, the lines after it are still dropped. Escaping "fixes" the issue with the quote at the end being removed, but the escapes are not removed from the quotes.

1
2
"abc\"\""
4
5

But the output is

+-------+
|    _c0|
+-------+
|      1|
|      2|
|abc\"\"|
+-------+

when it should be

+-----+
|  _c0|
+-----+
|    1|
|    2|
|abc""|
|    4|
|    5|
+-----+

Oddly, if I remove the escapes and just keep the entire thing quoted:

1
2
"abc"""
4
5

It fixes the problem with dropping lines, but it does not fix the single entry.

+----+
| _c0|
+----+
|   1|
|   2|
|abc"|
|   4|
|   5|
+----+

vs from spark

+-----+
|  _c0|
+-----+
|    1|
|    2|
|abc""|
|    4|
|    5|
+-----+

For this one I am less sure that we have to match exactly what Spark is doing, because pandas matches CUDF in this case. pandas also does different things for escaped quotes, so just take these as info for now.
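
The three variants in this comment can be compared side by side with Python's standard-library csv module. This is a rough sketch for orientation only, not a statement about what cudf, pandas, or Spark do internally. The helper parse_one is hypothetical, and passing escapechar="\\" is an assumption that models a reader configured with backslash as its escape character.

```python
import csv

def parse_one(line, **fmt):
    """Parse a single-column CSV line and return its only field."""
    return next(csv.reader([line], **fmt))[0]

# Unquoted field containing quotes: the quotes are ordinary characters.
unquoted = parse_one('abc""')

# Fully quoted field: the doubled quote collapses to one literal quote.
doubled = parse_one('"abc"""')

# Quoted field with backslash-escaped quotes. The raw string keeps the
# backslashes; escapechar is an assumption made for this sketch.
escaped = parse_one(r'"abc\"\""', escapechar="\\")

print(unquoted, doubled, escaped)
```

Under these assumptions the escaped form yields abc"" and the fully quoted form yields abc", so the result for the same bytes depends on whether the reader is told to treat backslash as an escape.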

GregoryKimball · 2022-10-21T18:27:39Z

Thanks @revans2 for investigating this. I'm posting a python reproducer:

>>> s = '1\n2\nabc""\n4\n5'
>>> pd.read_csv(StringIO(s), header=None)
       0
0      1
1      2
2  abc""
3      4
4      5
>>> cudf.read_csv(StringIO(s), header=None)
      0
0     1
1     2
2  abc"

vuule · 2022-11-01T19:45:09Z

I can't repro the issue in the comment. Trying with Python:
s = '1\n2\n"abc\"\""\n4\n'
But I'm getting the same output as with Pandas (and it looks correct).
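
One possible explanation, offered as an assumption rather than something confirmed in this thread: in a regular Python string literal, \" is simply a quote character, so the string s above contains no backslashes and is the same input as the fully quoted "abc""" case. A raw string keeps the backslashes. The variable names below are placeholders.

```python
# Regular literal: each \" becomes a plain quote, so no backslash survives.
plain = '1\n2\n"abc\"\""\n4\n'

# Raw literal for the third line keeps the backslashes as written.
raw_line = r'"abc\"\""'
with_backslashes = '1\n2\n' + raw_line + '\n4\n'

print(plain.splitlines()[2])             # "abc"""
print(with_backslashes.splitlines()[2])  # "abc\"\""
```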

vuule · 2023-08-16T22:18:17Z

Did some scoping, and it seems this requires large changes in the way the reader finds row offsets. The current state machine has four states (represented by two bits); handling this would require an additional state, and thus more bits. My main concern is the work involved in changing how the state machine packs and handles the states.
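
To illustrate why telling a field-opening quote apart from a quote in the middle of an unquoted field takes explicit state, here is a toy row-offset finder in Python. It is only a sketch under assumptions: the state names, the row_offsets function, and its lenient handling of malformed input are hypothetical and do not reflect libcudf's actual state machine or its bit packing.

```python
from enum import Enum

class State(Enum):
    FIELD_START = 0      # at the first character of a field
    UNQUOTED = 1         # inside a field that did not start with a quote
    QUOTED = 2           # inside a quoted field
    QUOTE_IN_QUOTED = 3  # just saw a quote while inside a quoted field

def row_offsets(data, delimiter=",", quote='"'):
    """Return the character offset at which each row starts."""
    offsets = [0] if data else []
    state = State.FIELD_START
    for i, ch in enumerate(data):
        if state is State.QUOTED:
            # Newlines and delimiters are literal inside a quoted field.
            if ch == quote:
                state = State.QUOTE_IN_QUOTED
            continue
        if state is State.QUOTE_IN_QUOTED and ch == quote:
            state = State.QUOTED  # doubled quote is a literal quote
            continue
        if ch == "\n":
            if i + 1 < len(data):
                offsets.append(i + 1)
            state = State.FIELD_START
        elif ch == delimiter:
            state = State.FIELD_START
        elif ch == quote and state is State.FIELD_START:
            state = State.QUOTED  # a quote opens a field only at its start
        else:
            state = State.UNQUOTED  # includes quotes inside unquoted fields

    return offsets

print(row_offsets('1\n2\nabc""\n4\n5'))  # [0, 2, 4, 10, 12]
```

A reader that already spends its four two-bit states on other conditions would need extra bits to add this distinction, which is the packing concern raised above.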

revans2 added the bug (Something isn't working) and Needs Triage (Need team to review and classify) labels Oct 19, 2022

revans2 added the Spark (Functionality that helps Spark RAPIDS) label Oct 19, 2022

revans2 mentioned this issue Oct 19, 2022

[BUG] a simple query returns wrong results NVIDIA/spark-rapids#6435

Open

sameerz changed the title ~~[BUG] CSV reader cannot handle unquoted quote characer appearing in a field~~ [BUG] CSV reader cannot handle unquoted quote character appearing in a field Oct 19, 2022

GregoryKimball added this to the Modernize CSV reader milestone Oct 21, 2022

GregoryKimball added the 0 - Backlog (In queue waiting for assignment) and cuIO (cuIO issue) labels and removed the Needs Triage (Need team to review and classify) label Oct 21, 2022

vuule self-assigned this Oct 31, 2022

GregoryKimball added the libcudf (Affects libcudf (C++/CUDA) code.) label Apr 2, 2023

revans2 mentioned this issue Aug 16, 2023

[BUG] Reading CSV file with "" field causes rows to not be read NVIDIA/spark-rapids#8926

Open

GregoryKimball mentioned this issue Aug 18, 2023

[FEA] Modernize CSV reader and expand reader options #13916

Open
