
[SPARK-49569][CONNECT][SQL] Add shims to support SparkContext and RDD #48065

Closed · wants to merge 14 commits into master from hvanhovell/SPARK-49569

Conversation

hvanhovell (Contributor)

What changes were proposed in this pull request?

This PR does two things:

  • It adds shims for SparkContext and RDD. These live in a separate module, which is a compile-time dependency for sql/api and a regular dependency for connector/connect/client/jvm. We remove this dependency in catalyst and connect-server because those should use the actual implementations (see the sketch after this list).
  • It adds the RDD-based (and the one SparkContext-based) methods to the shared Scala API. For Connect, these methods throw an UnsupportedOperationException.
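
A minimal sketch of what such shim classes might look like, assuming they are pure compile-time stubs that the Classic modules replace with the real spark-core classes (the bodies and the single-file package blocks below are illustrative, not the actual module contents):

package org.apache.spark {
  // Stub SparkContext: exists only so shared signatures in sql/api can reference the type.
  class SparkContext private[spark] ()
}

package org.apache.spark.rdd {
  // Stub RDD: deliberately defines no operations, so RDD-based logic compiled against the
  // shim fails at compile time rather than misbehaving at runtime.
  abstract class RDD[T]
}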

Why are the changes needed?

We are creating a shared Scala interface for Classic and Connect.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests. I will add a couple on the connect side.

Was this patch authored or co-authored using generative AI tooling?

No.

@@ -52,4 +52,7 @@ package object sql {
    f(builder)
    column(builder.build())
  }

  private[sql] def throwRddNotSupportedException(): Nothing =
    throw new UnsupportedOperationException("RDDs are not supported in Spark Connect.")
@hvanhovell (Contributor Author) commented on Sep 10, 2024:

Make this use the error framework.
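
For context, Spark's error framework raises user-facing errors through named error conditions rather than hard-coded messages. A rough sketch of what that follow-up could look like; the condition name "UNSUPPORTED_CONNECT_FEATURE.RDD" is a placeholder, and the actual condition and constructor chosen may differ:

import org.apache.spark.SparkUnsupportedOperationException

// Sketch only: route the failure through an error condition so it carries a stable
// error class and SQLSTATE instead of a raw message string.
private[sql] def throwRddNotSupportedException(): Nothing =
  throw new SparkUnsupportedOperationException(
    "UNSUPPORTED_CONNECT_FEATURE.RDD", Map.empty[String, String])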

    client.hijackServerSideSessionIdForTesting(suffix)
  }

  /** @inheritdoc */
  override def sparkContext: SparkContext =
@hvanhovell (Contributor Author) commented:

Make this use the error framework.
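
The diff context above is cut off after the `=`; presumably the Connect-side override fails fast, since a Connect session has no driver-side SparkContext. A sketch of the likely shape (the exact exception and message are assumptions, and per the comment above would eventually go through the error framework):

/** @inheritdoc */
override def sparkContext: SparkContext =
  // Placeholder body: Spark Connect sessions have no SparkContext to return.
  throw new UnsupportedOperationException("SparkContext is not supported in Spark Connect.")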

@hvanhovell (Contributor Author):

cc @grundprinzip @HyukjinKwon

@dongjoon-hyun (Member) left a comment:

Could you make CIs happy, @hvanhovell ?

[error] /home/runner/work/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:83:18: value aggregate is not a member of org.apache.spark.rdd.RDD[Array[String]]
[error]         tokenRDD.aggregate(startType)(inferRowType, mergeRowTypes)

@hvanhovell (Contributor Author):

@dongjoon-hyun yeah, working on it. SBT does not seem to respect Maven exclusions...

@dongjoon-hyun (Member):

Could you rebase once more, @hvanhovell ?

@hvanhovell (Contributor Author):

@dongjoon-hyun It is a bit more complicated: while SBT compile seems to respect Maven exclusions, test:compile and package do not. Investigating. I'd personally not hold up preview2 for this.
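
For reference, an explicit SBT-side exclusion (independent of whatever the Maven POM declares) might look roughly like this; the coordinates and version below are illustrative, not the actual build change:

// Keep the compile-time shims off the classpath of a module that needs the real SparkContext/RDD.
libraryDependencies += ("org.apache.spark" %% "spark-sql-api" % "4.0.0-SNAPSHOT")
  .exclude("org.apache.spark", "spark-connect-shims_2.13")

// Or drop it from the whole project's classpath:
excludeDependencies += ExclusionRule("org.apache.spark", "spark-connect-shims_2.13")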

@dongjoon-hyun (Member):

Ack! Thank you for sharing the status, @hvanhovell .

@dongjoon-hyun (Member):

Could you re-trigger the failed CIs?

[info] ClientE2ETestSuite:
[info] org.apache.spark.sql.ClientE2ETestSuite *** ABORTED *** (2 minutes, 48 seconds)
[info]   The code passed to eventually never returned normally. Attempted 1 times over 2.805218503466667 minutes. (RemoteSparkSession.scala:197)
[info]   org.scalatest.exceptions.TestFailedDueToTimeoutException:

@hvanhovell (Contributor Author):

Merging to master.

asfgit closed this in 80ae411 on Oct 9, 2024
@LuciferYang (Contributor):

This PR has caused the Maven daily test build to fail: xxx.

scaladoc error: fatal error: object scala in compiler mirror not found.
Error:  Failed to execute goal net.alchim31.maven:scala-maven-plugin:4.9.1:doc-jar (attach-scaladocs) on project spark-connect-shims_2.13: MavenReportException: Error while creating archive: wrap: Process exited with an error: 1 (Exit value: 1) -> [Help 1]
Error:  
Error:  To see the full stack trace of the errors, re-run Maven with the -e switch.
Error:  Re-run Maven using the -X switch to enable full debug logging.
Error:  
Error:  For more information about the errors and possible solutions, please read the following articles:
Error:  [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
Error:  
Error:  After correcting the problems, you can resume the build with the command
Error:    mvn <args> -rf :spark-connect-shims_2.13
Error: Process completed with exit code 1.

I am trying to fix it at: #48399

himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024

Closes apache#48065 from hvanhovell/SPARK-49569.

Authored-by: Herman van Hovell <herman@databricks.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>