
[BUG] a simple query returns wrong results #6435

Open · Tracked by #2063
ilaychen opened this issue Aug 28, 2022 · 8 comments
Labels: bug (Something isn't working) · cudf_dependency (depends on a new feature in cudf) · P0 (Must have for release)


ilaychen commented Aug 28, 2022

Describe the bug
I'm running a simple query on both spark-rapids and spark 3.1.3 (on GCP's Dataproc clusters), and I'm getting different results.
The query I'm running (on a ~15TB dataset) is:
spark.sql("SELECT cntry_code, COUNT(cntry_code) as c from locations_ GROUP BY cntry_code sort by c DESC")

The results for pure spark 3.1.3 are:

+--------------+----------+
|    cntry_code|         c|
+--------------+----------+
|            US| 108174267|
|            GB|  28301655|
|            DE|  21627123|
|            FR|  12282801|
|            CA|   8104623|
|            AU|   7106091|
|            IT|   6912796|
|            ES|   5609006|
+--------------+----------+

The results for spark-rapids are:

+--------------+----------+
|    cntry_code|         c|
+--------------+----------+
|            US| 108000877|
|            GB|  28256306|
|            DE|  21592682|
|            FR|  12262796|
|            CA|   8091397|
|            AU|   7094689|
|            IT|   6901487|
|            ES|   5599871|
+--------------+----------+

The data I'm reading is CSV. I'll try to figure out a way to share the dataset if that's important.

Steps/Code to reproduce bug
Start two Dataproc clusters: the first one as described here, the second one running pure Spark 3.1.3.
In both clusters run the same Spark SQL query, such as:
spark.sql("SELECT cntry_code, COUNT(cntry_code) as c from locations_ GROUP BY cntry_code sort by c DESC")
and check the results.

Expected behavior
The outputs should be exactly the same.

Environment details (please complete the following information)

  • Environment location: GCP
  • Spark configuration settings related to the issue
    for spark-rapids:
from pyspark import SparkConf

conf = SparkConf().setAppName("Locations")
conf.set('spark.rapids.sql.explain', 'ALL')
conf.set("spark.executor.instances", "2")
conf.set("spark.executor.cores", "1")
conf.set("spark.task.cpus", "1")
conf.set("spark.rapids.sql.concurrentGpuTasks", "1")
conf.set("spark.executor.memory", "16g")
conf.set("spark.rapids.memory.pinnedPool.size", "2G")
conf.set("spark.executor.memoryOverhead", "2G")
conf.set("spark.executor.extraJavaOptions", "-Dai.rapids.cudf.prefer-pinned=true")
conf.set("spark.locality.wait", "0s")
conf.set("spark.sql.files.maxPartitionBytes", "2G")
conf.set("spark.sql.broadcastTimeout", "3000")
conf.set("spark.executor.resource.gpu.amount", "1")
conf.set("spark.task.resource.gpu.amount", "0.142")
conf.set("spark.plugins", "com.nvidia.spark.SQLPlugin")
conf.set("spark.rapids.sql.hasNans", "false")
conf.set("spark.rapids.sql.regexp.enabled","true")
conf.set('spark.rapids.sql.variableFloatAgg.enabled', 'true')
conf.set('spark.rapids.sql.csv.read.double.enabled', 'true')
conf.set('spark.rapids.sql.exec.CollectLimitExec', 'true')
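
For completeness, the session is then built from this conf in the usual way (a minimal sketch; only the conf object comes from the report above):

from pyspark.sql import SparkSession

# Build the session from the conf above; the RAPIDS accelerator is
# activated by spark.plugins=com.nvidia.spark.SQLPlugin.
spark = SparkSession.builder.config(conf=conf).getOrCreate()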

for spark 3.1.3:

conf = SparkConf().setAppName("Locations")
conf.set("spark.executor.instances", "2")
conf.set("spark.executor.cores", "1")
conf.set("spark.task.cpus", "1")
conf.set("spark.executor.memory", "16g")
conf.set("spark.executor.memoryOverhead", "2G")
conf.set("spark.sql.files.maxPartitionBytes", "2G")
conf.set("spark.sql.broadcastTimeout", "3000")


ilaychen added the "? - Needs Triage" and "bug" labels Aug 28, 2022
ilaychen changed the title from "[BUG] a simple query has wrong results" to "[BUG] a simple query returns wrong results" Aug 28, 2022
ilaychen reopened this Aug 28, 2022

viadea commented Aug 30, 2022

@ilaychen Could you send the details to spark-rapids-support (spark-rapids-support@nvidia.com), such as the Spark RAPIDS version you are using and a sample dataset that can reproduce this issue?


ilaychen commented Sep 8, 2022

Hi, I found the case where this issue happens.
It seems like spark-rapids returns wrong results even for a simple df.count() after processing the sample data mentioned below.
This issue doesn't appear for very small files (fewer than 1k rows).

You can generate a CSV file of ~1k rows with this kind of data
(take the rows below and generate ~1k similar ones):

3453433564704482365,7622 S Handy Street,"",Silva,85468,NY,United States,US
1641493545665433335,241 Inlusive Road,Flat 30Trafalgar Court,Manchester,M16 8JW,"",United Kingdom,GB
1297115076542454494,2557 Latte Blvd,#402,Kansas City,64108,MO,United States,US
2119952246048784100,42 North Street,"",Sand-on-Sea,ZZ2 5HU,"",United Kingdom,GB
2186678475639058807,329 Johannah Way,"",Bridger,07507,NJ,United States,US
1730422379578793088,25 Yore Rd,"",MIAM,M9M 1W5,ON,Canada,CA

Read it within a spark-rapids session with:

from pyspark.sql.types import StructType, StructField, StringType

customSchema = StructType([
  StructField("id", StringType(), True),
  StructField("addr_1", StringType(), True),
  StructField("addr_2", StringType(), True),
  StructField("city", StringType(), True),
  StructField("zip", StringType(), True),
  StructField("state", StringType(), True),
  StructField("cntry", StringType(), True),
  StructField("cntry_code", StringType(), True)]
)
df = spark.read.csv(path, schema=customSchema)

and count the DataFrame; you'll see that the number of rows in the actual file is different from the number of rows in the DataFrame.
Another way to see that the processing has a bug is to try to read id 1730422379578793088; spark-rapids can't find it:
spark.sql("SELECT * from df_tmpView where cust_id = '1730422379578793088'").count()


viadea commented Sep 8, 2022

@ilaychen I duplicated your sample data to 2000+ CSV rows (without a header) and used the latest 22.10 snapshot jar to test it,
and it worked fine for me:

from pyspark.sql.types import *

customSchema = StructType([
  StructField("id", StringType(), True),
  StructField("addr_1", StringType(), True),
  StructField("addr_2", StringType(), True),
  StructField("city", StringType(), True),
  StructField("zip", StringType(), True),
  StructField("state", StringType(), True),
  StructField("cntry", StringType(), True),
  StructField("cntry_code", StringType(), True)]
)

path = "/home/xxx/data/xxx/samplecsv/"
df = spark.read.csv(path, schema=customSchema)
df.count()

2394

It matches the sample CSV file:

$  wc -l a.csv
2394 a.csv

Could you share the following with us by email (spark-rapids-support@nvidia.com)?

  1. Sample data (maybe 1k+ rows) that can reproduce this issue.
  2. The exact version of your Spark RAPIDS jar.
  3. The detailed Spark and Spark RAPIDS configs you are using, maybe the whole spark-defaults.conf.


viadea commented Oct 19, 2022

Thanks @ilaychen for sharing the sample data; I can reproduce the issue now.
The key to reproducing it is a value containing ""; parsing stops there.
For example, if one column value is:

abc""

GPU run:

>>> df.count()
12

CPU run:

>>> spark.conf.set("spark.rapids.sql.enabled","false")
>>> df.count()
14
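
Putting the pieces together, a self-contained repro along these lines should show the same discrepancy (the file path, column names, and row contents here are illustrative; the essential part is the embedded "" sequence):

from pyspark.sql.types import StructType, StructField, StringType

# Minimal two-column CSV; the "" inside row 2's second field is what
# trips up the row-separator logic on the GPU.
rows = ['1,plain value', '2,abc""', '3,another value']
path = "/tmp/quote_repro.csv"  # illustrative path
with open(path, "w") as f:
    f.write("\n".join(rows) + "\n")

schema = StructType([
    StructField("id", StringType(), True),
    StructField("val", StringType(), True),
])
df = spark.read.csv(path, schema=schema)

df.count()                                           # GPU plan: may undercount
spark.conf.set("spark.rapids.sql.enabled", "false")
df.count()                                           # CPU plan: matches wc -l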


ilaychen commented Oct 19, 2022

My pleasure! @viadea
Here is an example CSV file that produces this error:

134324937434,#1991 N Grayhawk,"",Menlo Park,89025,AB,United States,US
208564744937,"63,trevion Way","",st Lothian,h7f4h8,"",United Kingdom,GB
132709376823,16 Oakland PARK RD,"",ring,l1w1e4,South,Canada,CA
224867848652,7 kingwell Court,"",United,s7jd9,South United,United Kingdom,GB
169636884295,30 cartuja Road,"",Halifax,L0R 9p2,ON,Canada,CA
859473321609,Street,"",Manchester,92220,OR,United States,US
141096112545,99 rue des,"",Australia,jsd9je,"",France,FR
160397658930,5 Rise,"",walligshngton,RY6 8LT,FORT,United Kingdom,GB
726367494002,1852 Townsend st,666,Wallsend,90382,CA,United States,US
187644735867,Bärbel-HAMPDEN-Ping 37,"",Miami,13355,"",Kingdom,ZZ
948475348324,155 sw City ct,Rochdale,Germany,30864,FL,Australia,QQ
164083193213,abc"","",Jerez Fra.,11401,Cadiz,Spain,ES
198732413077,3p Grove Rochdale road,BAW,Fulifax,HX4 trW,"",Israel,GB
227433927227,95 novem blvd,"",RAW VILLAGE,3173,XYZ,Australia,IL

The schema that is mentioned above still applies 😃


revans2 commented Oct 19, 2022

I filed rapidsai/cudf#11948 against cuDF, since the bug is on their side. I'll also try to take a look at their code to see what I can come up with.

revans2 self-assigned this Oct 19, 2022
revans2 added the "P0 Must have for release" and "cudf_dependency" labels and removed the "? - Needs Triage" label Oct 19, 2022

revans2 commented Oct 20, 2022

It looks like there are two issues happening here. One is that cuDF returns the wrong number of rows: it gets confused when it sees the "" in the row-separator logic. The other is that cuDF does not support escape characters; instead it only supports the pandas default of doubling the quote character, i.e. abc"" => abc", because the "" escapes a single " character. I think cuDF might be able to fix the first one. The second one appears to be working as designed, at least until we can get them to add a new feature to the CSV parser.
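
For context, the doubled-quote convention described above is the same one Python's standard csv module (and pandas) uses by default; nothing here is spark-rapids specific:

import csv
import io

# In a quoted field, a doubled quote ("") stands for one literal " character.
line = '164083193213,"abc"""\n'
row = next(csv.reader(io.StringIO(line)))
print(row)  # ['164083193213', 'abc"']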


revans2 commented Aug 16, 2023

The cuDF issue rapidsai/cudf#11948 tracks the fix.
