[SEDONA-233] Incorrect results for several joins in a single stage #748

umartin · 2023-01-18T09:09:17Z

Did you read the Contributor Guide?

Yes, I have read Contributor Rules and Contributor Development Guide

Is this PR related to a JIRA ticket?

Yes, the URL of the associated JIRA ticket is https://issues.apache.org/jira/browse/SEDONA-233. The PR name follows the format [SEDONA-XXX] my subject.

What changes were proposed in this PR?

This patch changes how the deduplication gets its partition id. The previous method of getting it from TaskContext was unreliable. Now it uses mapPartitionsWithIndex. The documentation clearly states that it uses the original partition id. https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html#mapPartitionsWithIndex[U](f:(Int,Iterator[T])=%3EIterator[U],preservesPartitioning:Boolean)(implicitevidence$9:scala.reflect.ClassTag[U]):org.apache.spark.rdd.RDD[U]

Deduplication is refactored out of the join judgement into a separate DuplicatesFilter.

Deduplication code that is used in sedona-flink is moved to common.
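
To illustrate why the source of the partition id matters, here is a minimal sketch in plain Java with no Spark or Sedona dependency. It assumes a reference-point style of deduplication: a matched pair that shows up in several partitions is kept only by the partition whose extent contains the lower-left corner of the intersection of the two envelopes, and the partition id is used to look up that extent. The names DedupSketch, Box, Pair, keep, joinWithIndex and joinWithFixedId are hypothetical and are not the classes added in this PR. joinWithIndex mimics mapPartitionsWithIndex by passing the index explicitly, while joinWithFixedId simulates an id that does not match the partition being processed.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: hypothetical stand-ins, not Sedona or Spark classes.
class DedupSketch {

    /** Axis-aligned box. */
    static final class Box {
        final double minX, minY, maxX, maxY;

        Box(double minX, double minY, double maxX, double maxY) {
            this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
        }

        // Half-open on the max edges so a point on a shared border belongs to one box only.
        boolean contains(double x, double y) {
            return x >= minX && x < maxX && y >= minY && y < maxY;
        }
    }

    /** A candidate join result, reduced to the envelopes of the two matched geometries. */
    static final class Pair {
        final Box left, right;

        Pair(Box left, Box right) { this.left = left; this.right = right; }
    }

    /** Assumed rule: keep the pair only if its reference point lies in this partition's extent. */
    static boolean keep(Pair pair, Box partitionExtent) {
        // Reference point: lower-left corner of the intersection of the two envelopes.
        double refX = Math.max(pair.left.minX, pair.right.minX);
        double refY = Math.max(pair.left.minY, pair.right.minY);
        return partitionExtent.contains(refX, refY);
    }

    /** Mimics mapPartitionsWithIndex: each partition is filtered with its own index. */
    static List<Pair> joinWithIndex(List<List<Pair>> partitions, List<Box> extents) {
        List<Pair> out = new ArrayList<>();
        for (int index = 0; index < partitions.size(); index++) {
            Box extent = extents.get(index);
            for (Pair pair : partitions.get(index)) {
                if (keep(pair, extent)) {
                    out.add(pair);
                }
            }
        }
        return out;
    }

    /** Simulates an unreliable id: every partition is filtered with the same reported id. */
    static List<Pair> joinWithFixedId(List<List<Pair>> partitions, List<Box> extents, int reportedId) {
        List<Pair> out = new ArrayList<>();
        Box extent = extents.get(reportedId);
        for (List<Pair> partition : partitions) {
            for (Pair pair : partition) {
                if (keep(pair, extent)) {
                    out.add(pair);
                }
            }
        }
        return out;
    }
}
```

In this sketch, a pair whose envelopes straddle the border between two partitions is emitted once by joinWithIndex. With a wrong id it is emitted either twice or not at all, depending on which extent is looked up.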

How was this patch tested?

Unit test added

Did this PR include necessary documentation updates?

No, this PR does not affect any public API so no need to change the docs.

jiayuasu · 2023-01-19T04:19:38Z

core/src/main/java/org/apache/sedona/core/joinJudgement/DuplicatesFilter.java

@@ -0,0 +1,48 @@
+package org.apache.sedona.core.joinJudgement;
+


Please add Apache License header here.

jiayuasu · 2023-01-19T04:20:56Z

core/src/test/java/org/apache/sedona/core/spatialOperator/JoinQueryDeduplicationTest.java

@@ -0,0 +1,75 @@
+/*


Could you please add another test in Sedona Spark SQL to verify that this bug is eliminated?

Sure! Hopefully I have time to add it tomorrow

Added test in sedona-sql. Added missing license header.

jiayuasu · 2023-01-24T19:46:03Z

@yitao-li

Dear Yitao, the Sedona R build started failing after this PR, but the PR is not related to the R side. Could you please take a look?

Thanks,
Jia

yitao-li · 2023-01-26T19:43:57Z

Hello Jia,

Apologies for the delayed reply! I'll take a look as soon as I can. I don't have a lot of bandwidth at the moment though, mainly busy with work, and then helping take care of my 6-month-old before and after work, among other things... I wonder whether there are other folks such as the current maintainer of sparklyr who might be able to help?

Also, while we are on the topic of sparklyr: TBH 2 things that always weighed on my mind while working with sparklyr and sparklyr-related R packages are (1) sparklyr uses Java reflection to invoke JVM methods and it is so easy to break abstraction with it if one is not careful, and then (2) it might also run into subtle interop issues with Scala. I don't know whether there is a great solution to either of those 2 potential issues at the moment though.

Anyhow, I'll look into this issue as soon as I can.

Best regards,
Yitao


[SEDONA-233] Incorrect results for several joins in a single stage

ff74757

jiayuasu requested changes Jan 19, 2023


jiayuasu added the bug label Jan 19, 2023

jiayuasu added this to the sedona-1.4.0 milestone Jan 19, 2023

jiayuasu added the sedona-sql label Jan 19, 2023

[SEDONA-233] Incorrect results for several joins in a single stage

a7176e6

Added test in sedona-sql. Added missing license header.

jiayuasu approved these changes Jan 20, 2023


jiayuasu added the resolved label Jan 20, 2023

jiayuasu merged commit 43e1d79 into apache:master Jan 20, 2023

Kontinuation pushed a commit to Kontinuation/sedona that referenced this pull request Oct 11, 2024

[SEDONA-233] Incorrect results for several joins in a single stage (a…

6d6dbc4

…pache#748)
