
[SPARK-49569][CONNECT][SQL] Add shims to support SparkContext and RDD #48065

Closed · wants to merge 14 commits into master from hvanhovell/SPARK-49569

Conversation

hvanhovell (Contributor)

What changes were proposed in this pull request?

This PR does two things:

  • It adds shims for SparkContext and RDD. These live in a separate module, which is a compile-time dependency for sql/api and a regular dependency for connector/connect/client/jvm. We remove this dependency in catalyst and connect-server because those should use the actual implementations (see the sketch after this list).
  • It adds the RDD-based (and the one SparkContext-based) methods to the shared Scala API. For Connect, these methods throw an UnsupportedOperationException.
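
A minimal sketch of what such shim classes might look like, assuming they are pure compile-time stubs that the Classic modules replace with the real spark-core classes (the bodies and the single-file package blocks below are illustrative, not the actual module contents):

package org.apache.spark {
  // Stub SparkContext: exists only so shared signatures in sql/api can reference the type.
  class SparkContext private[spark] ()
}

package org.apache.spark.rdd {
  // Stub RDD: deliberately defines no operations, so RDD-based logic compiled against the
  // shim fails at compile time rather than misbehaving at runtime.
  abstract class RDD[T]
}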

Why are the changes needed?

We are creating a shared Scala interface for Classic and Connect.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests. I will add a couple on the connect side.

Was this patch authored or co-authored using generative AI tooling?

No.

@@ -52,4 +52,7 @@ package object sql {
    f(builder)
    column(builder.build())
  }

  private[sql] def throwRddNotSupportedException(): Nothing =
    throw new UnsupportedOperationException("RDDs are not supported in Spark Connect.")
@hvanhovell (Contributor Author) commented on Sep 10, 2024:

Make this use the error framework.
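
For context, Spark's error framework raises user-facing errors through named error conditions rather than hard-coded messages. A rough sketch of what that follow-up could look like; the condition name "UNSUPPORTED_CONNECT_FEATURE.RDD" is a placeholder, and the actual condition and constructor chosen may differ:

import org.apache.spark.SparkUnsupportedOperationException

// Sketch only: route the failure through an error condition so it carries a stable
// error class and SQLSTATE instead of a raw message string.
private[sql] def throwRddNotSupportedException(): Nothing =
  throw new SparkUnsupportedOperationException(
    "UNSUPPORTED_CONNECT_FEATURE.RDD", Map.empty[String, String])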

    client.hijackServerSideSessionIdForTesting(suffix)
  }

  /** @inheritdoc */
  override def sparkContext: SparkContext =
@hvanhovell (Contributor Author) commented:

Make this use the error framework.
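
The diff context above is cut off after the `=`; presumably the Connect-side override fails fast, since a Connect session has no driver-side SparkContext. A sketch of the likely shape (the exact exception and message are assumptions, and per the comment above would eventually go through the error framework):

/** @inheritdoc */
override def sparkContext: SparkContext =
  // Placeholder body: Spark Connect sessions have no SparkContext to return.
  throw new UnsupportedOperationException("SparkContext is not supported in Spark Connect.")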

@hvanhovell (Contributor Author):

cc @grundprinzip @HyukjinKwon

@dongjoon-hyun (Member) left a comment:

Could you make CIs happy, @hvanhovell ?

[error] /home/runner/work/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:83:18: value aggregate is not a member of org.apache.spark.rdd.RDD[Array[String]]
[error]         tokenRDD.aggregate(startType)(inferRowType, mergeRowTypes)

@hvanhovell (Contributor Author):

@dongjoon-hyun yeah, working on it. SBT does not seem to respect Maven exclusions...

@dongjoon-hyun (Member):

Could you rebase once more, @hvanhovell ?

@hvanhovell (Contributor Author):

@dongjoon-hyun It is a bit more complicated: while SBT compile seems to respect Maven exclusions, test:compile and package do not. Investigating. I'd personally not hold up preview2 for this.
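
For reference, an explicit SBT-side exclusion (independent of whatever the Maven POM declares) might look roughly like this; the coordinates and version below are illustrative, not the actual build change:

// Keep the compile-time shims off the classpath of a module that needs the real SparkContext/RDD.
libraryDependencies += ("org.apache.spark" %% "spark-sql-api" % "4.0.0-SNAPSHOT")
  .exclude("org.apache.spark", "spark-connect-shims_2.13")

// Or drop it from the whole project's classpath:
excludeDependencies += ExclusionRule("org.apache.spark", "spark-connect-shims_2.13")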

@dongjoon-hyun (Member):

Ack! Thank you for sharing the status, @hvanhovell .

@dongjoon-hyun (Member):

Could you re-trigger the failed CIs?

[info] ClientE2ETestSuite:
[info] org.apache.spark.sql.ClientE2ETestSuite *** ABORTED *** (2 minutes, 48 seconds)
[info]   The code passed to eventually never returned normally. Attempted 1 times over 2.805218503466667 minutes. (RemoteSparkSession.scala:197)
[info]   org.scalatest.exceptions.TestFailedDueToTimeoutException:

@hvanhovell (Contributor Author):

Merging to master.

asfgit closed this in 80ae411 on Oct 9, 2024
@LuciferYang (Contributor):

This PR has caused the Maven daily test build to fail: xxx.

scaladoc error: fatal error: object scala in compiler mirror not found.
Error:  Failed to execute goal net.alchim31.maven:scala-maven-plugin:4.9.1:doc-jar (attach-scaladocs) on project spark-connect-shims_2.13: MavenReportException: Error while creating archive: wrap: Process exited with an error: 1 (Exit value: 1) -> [Help 1]
Error:  
Error:  To see the full stack trace of the errors, re-run Maven with the -e switch.
Error:  Re-run Maven using the -X switch to enable full debug logging.
Error:  
Error:  For more information about the errors and possible solutions, please read the following articles:
Error:  [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
Error:  
Error:  After correcting the problems, you can resume the build with the command
Error:    mvn <args> -rf :spark-connect-shims_2.13
Error: Process completed with exit code 1.

I am trying to fix it at: #48399

himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024

Closes apache#48065 from hvanhovell/SPARK-49569.

Authored-by: Herman van Hovell <herman@databricks.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>