-
Notifications
You must be signed in to change notification settings - Fork 697
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SEDONA-233] Incorrect results for several joins in a single stage #748
Conversation
@@ -0,0 +1,48 @@ | |||
package org.apache.sedona.core.joinJudgement; | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add Apache License header here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
@@ -0,0 +1,75 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add another test in Sedona Spark SQL to verify that this bug is eliminated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! Hopefully I have time to add it tomorrow
Added test in sedona-sql. Added missing license header.
Dear Yitao, Sedona R build started to fail since this PR. But the PR is not relevant to the R side, could you please take a look? Thanks, |
Hello Jia,
Apologies for the delayed reply! I'll take a look as soon as I can.
I don't have a lot of bandwidth at the moment though, mainly busy with
work, and then helping take care of my 6-month-old before and after work,
among other things... I wonder whether there are other folks such as the
current maintainer of sparklyr who might be able to help?
Also, while we are on the topic of sparklyr: TBH 2 things that always
weighed on my mind while working with sparklyr and sparklyr-related R
packages are (1) sparklyr uses java reflection to invoke JVM methods and it
is so easy to break abstraction with it if one is not careful, and then (2)
it might also run into subtle interop issues with Scala. I don't know
whether there is a great solution to any of those 2 potential issues at the
moment though.
Anyhow, I'll look into this issue as soon as I can.
Best regards,
Yitao
…On Tue, Jan 24, 2023 at 7:46 PM Jia Yu ***@***.***> wrote:
@yitao-li <https://github.com/yitao-li>
Dear Yitao, Sedona R build started to fail since this PR. But the PR is
not relevant to the R side, could you please take a look?
Thanks,
Jia
—
Reply to this email directly, view it on GitHub
<#748 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AADC6YVMKKKADUVDXQUDMGLWUAWQPANCNFSM6AAAAAAT62JUUU>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Did you read the Contributor Guide?
Is this PR related to a JIRA ticket?
[SEDONA-XXX] my subject
.What changes were proposed in this PR?
This patch changes how the deduplication gets it partition id. The previous method of getting it from TaskContext was unreliable. Now it uses mapPartitionsWithIndex. The documentation clearly states that is uses the original partition id. https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html#mapPartitionsWithIndex[U](f:(Int,Iterator[T])=%3EIterator[U],preservesPartitioning:Boolean)(implicitevidence$9:scala.reflect.ClassTag[U]):org.apache.spark.rdd.RDD[U]
Deduplication is refactored out of the join judgement into a separate DuplicatesFilter.
Deduplication code that is used in sedona-flink is moved to common.
How was this patch tested?
Unit test added
Did this PR include necessary documentation updates?