
Migrate to Spark 2.0 #221

Closed · 9 commits

Conversation

liancheng
Contributor

This PR migrates spark-redshift to Spark 2.0.

It is marked as WIP because it depends on databricks/spark-avro#137.

```diff
@@ -44,9 +44,9 @@ object SparkRedshiftBuild extends Build {
   organization := "com.databricks",
   scalaVersion := "2.10.5",
   crossScalaVersions := Seq("2.10.5", "2.11.7"),
-  sparkVersion := "1.4.1",
+  sparkVersion := "2.0.0-SNAPSHOT",
```


We should probably change the version in the title to "Spark 2.0.0-SNAPSHOT".

@clockfly

The Travis file hasn't been updated, so the build fails now.
https://github.com/databricks/spark-redshift/blob/master/.travis.yml

@clockfly

Should we use a new SNAPSHOT version, considering 0.6.1-SNAPSHOT was tagged 6 months ago and this PR introduces an incompatible change? Probably 0.6.2-SNAPSHOT?

@clockfly

Should we also update README.md? This introduces an incompatible change, and the README has a section like:

This library requires Apache Spark 1.4+ and Amazon Redshift 1.0.963+.

@liancheng
Contributor Author

@clockfly Thanks for the review. I haven't updated the build configuration since this PR is still blocked by databricks/spark-avro#137, and the Travis build won't pass anyway. Will update it later.

@codecov-io

codecov-io commented Jul 6, 2016

Current coverage is 89.05%

Merging #221 into master will decrease coverage by 0.37%

@@             master       #221   diff @@
==========================================
  Files            13         13          
  Lines           681        667    -14   
  Methods         596        582    -14   
  Messages          0          0          
  Branches         85         85          
==========================================
- Hits            609        594    -15   
- Misses           72         73     +1   
  Partials          0          0          

Powered by Codecov. Last updated by a1749cb...ed33941

@liancheng
Contributor Author

liancheng commented Jul 6, 2016

Updated version information and dependencies. Now we are using the newly released spark-avro 3.0.0-preview2. This is ready for review now.
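For reference, a minimal sketch of the corresponding sbt dependency (assuming the standard `com.databricks` group ID that spark-avro publishes under):

```scala
// Hedged sketch: depend on the released spark-avro preview instead of a local snapshot
// (goes in build.sbt or the project's Build definition).
libraryDependencies += "com.databricks" %% "spark-avro" % "3.0.0-preview2"
```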

liancheng changed the title from "[WIP] Migrate to Spark 2.0" to "Migrate to Spark 2.0" on Jul 6, 2016
@liancheng
Contributor Author

@JoshRosen Would you mind helping to review this one? Thanks!

@JoshRosen
Contributor

In order to trigger the end-to-end integration tests, you need to push this code into a branch in the databricks/spark-redshift repo (since Travis won't decrypt credentials for pull-request jobs). I'm going to push a copy of this to a temporary branch to see if the tests pass there.

This looks good to me, but I'm wondering if we're planning to make a new release for Spark 1.x in the near term. If we do, then it's going to mean that we'll have to make a separate 1.x branch and merge outstanding PRs twice, once into the master branch and once into the 1.x maintenance branch. That's not a huge deal, though.

```diff
@@ -22,22 +22,22 @@ This library is more suited to ETL than interactive queries, since large amounts

 ## Installation

-This library requires Apache Spark 1.4+ and Amazon Redshift 1.0.963+.
+This library requires Apache Spark 2.0+ and Amazon Redshift 1.0.963+.
```
Contributor

Given that this is going to be the README when people browse to the repository on GitHub, we should probably have a pointer to a branch containing the Spark 1.x version of the code or a table listing the releases for the Spark 1.x line.

Contributor Author

Should I do this after making the branch for Spark 1.x?

Contributor

Created a new branch at https://github.com/databricks/spark-redshift/tree/branch-1.x for holding the 1.x code.

@JoshRosen
Contributor

I'm going to try to help review and merge the other outstanding PRs right now so that we can hopefully make a new 1.x release.

@JoshRosen
Contributor

Status update: I've merged a number of open PRs and am on track to wrap up #157, after which I'll cut a release. If this causes merge conflicts, I'll take care of rebasing.

@liancheng
Contributor Author

@JoshRosen Thanks for the review!


You may use this library in your applications with the following dependency information:

**Scala 2.10**
```
groupId: com.databricks
artifactId: spark-redshift_2.10
-version: 0.6.0
+version: 0.7.0
```
Contributor

I'd change these versions to 2.0.0-SNAPSHOT. Also, make sure to update version.sbt in the root of this repo.
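As a rough illustration, the suggested bump in version.sbt might look like this (a sketch assuming the conventional ThisBuild-scoped version key):

```scala
// version.sbt (sketch): track the 2.0.0 development line instead of 0.7.0.
version in ThisBuild := "2.0.0-SNAPSHOT"
```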

@yhuai

yhuai commented Jul 13, 2016

@liancheng Let's also update README. Thanks!

@JoshRosen
Contributor

Since Spark 2.0+ only supports Hadoop 2.x, we can remove the reflection hack at

val method = context.getClass.getMethod("getConfiguration")

Similarly, we can remove the reflection at

// we need to rely on. This class was relocated in Spark 1.5.0, so we need to use reflection
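For illustration, a minimal sketch of what the first hack (the reflective `getConfiguration` lookup) could become once only Hadoop 2.x is targeted; the wrapping method name here is hypothetical:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.TaskAttemptContext

// Sketch: with Hadoop 2.x only, TaskAttemptContext.getConfiguration can be
// called directly instead of being looked up reflectively via getMethod.
def hadoopConfiguration(context: TaskAttemptContext): Configuration =
  context.getConfiguration
```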

@liancheng liancheng force-pushed the spark-2.0-migration branch from 4cefa43 to cf7d394 Compare July 14, 2016 15:03
@liancheng
Contributor Author

@JoshRosen @yhuai Thanks for the review! Comments addressed. Shall we also remove deprecated methods like those defined in RedshiftContext? I can do it in another PR.

@JoshRosen
Contributor

I pushed this PR's changes to a branch in the databricks/spark-redshift repository so that the integration tests could run and it looks like they're all failing due to changes in TestHive: https://travis-ci.org/databricks/spark-redshift/jobs/144786105

e.g.:

```
[info] Exception encountered when attempting to run a suite with class name: com.databricks.spark.redshift.DecimalIntegrationSuite *** ABORTED ***
[info]   java.lang.NullPointerException:
[info]   at org.apache.spark.sql.hive.test.TestHiveSparkSession.getHiveFile(TestHive.scala:183)
[info]   at org.apache.spark.sql.hive.test.TestHiveSparkSession.<init>(TestHive.scala:214)
[info]   at org.apache.spark.sql.hive.test.TestHiveSparkSession.<init>(TestHive.scala:122)
[info]   at org.apache.spark.sql.hive.test.TestHiveContext.<init>(TestHive.scala:77)
[info]   at com.databricks.spark.redshift.IntegrationSuiteBase$class.beforeEach(IntegrationSuiteBase.scala:122)
[info]   at com.databricks.spark.redshift.DecimalIntegrationSuite.beforeEach(DecimalIntegrationSuite.scala:25)
[info]   at org.scalatest.BeforeAndAfterEach$class.beforeEach(BeforeAndAfterEach.scala:154)
[info]   at com.databricks.spark.redshift.DecimalIntegrationSuite.beforeEach(DecimalIntegrationSuite.scala:25)
[info]   at org.scalatest.BeforeAndAfterEach$class.beforeEach(BeforeAndAfterEach.scala:173)
[info]   at com.databricks.spark.redshift.DecimalIntegrationSuite.beforeEach(DecimalIntegrationSuite.scala:25)
[info]   at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:253)
[info]   at com.databricks.spark.redshift.DecimalIntegrationSuite.runTest(DecimalIntegrationSuite.scala:25)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
```

@JoshRosen
Contributor

Let's remove that deprecated stuff in a separate PR, by the way.

@liancheng
Contributor Author

The NPE was thrown while loading test tables in the TestHiveContext constructor because of missing test data files. PR apache/spark#14005 added a flag to disable loading test tables, but that change landed after 2.0.0-RC1 and is only included in 2.0.1-SNAPSHOT. Thus I updated the Spark version to 2.0.1-SNAPSHOT so that we can use this flag to avoid the NPE.
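A minimal sketch of how the integration suites might construct the context with that flag, assuming the flag is the `loadTestTables` constructor parameter added by apache/spark#14005:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.test.TestHiveContext

// Sketch: skip loading Hive test tables so the missing test data files no
// longer trigger an NPE in the TestHiveSparkSession constructor.
def makeTestHiveContext(sc: SparkContext): TestHiveContext =
  new TestHiveContext(sc, loadTestTables = false)
```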

@JoshRosen
Contributor

@liancheng, can we use 2.0 RC4?

Also, it looks like a filter pushdown test is failing because the assertion assumed that the physical plan would push all filters down to Redshift, while it appears that "NOT NULL" isn't being pushed:

```
RedshiftRelation implements Spark 1.6+'s unhandledFilters API *** FAILED ***
[info]   Filter should have been eliminated; plan is:
[info]   Filter isnotnull(testbool#814)
[info]   +- Scan RedshiftRelation("PUBLIC"."test_table_438264183817572114") [testbool#814] PushedFilters: [IsNotNull(testbool), EqualTo(testbool,true)] (RedshiftIntegrationSuite.scala:289)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
[info]   at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
[info]   at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
[info]   at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
[info]   at com.databricks.spark.redshift.RedshiftIntegrationSuite$$anonfun$10$$anonfun$apply$mcV$sp$4.apply(RedshiftIntegrationSuite.scala:289)
[info]   at com.databricks.spark.redshift.RedshiftIntegrationSuite$$anonfun$10$$anonfun$apply$mcV$sp$4.apply(RedshiftIntegrationSuite.scala:288)
[info]   at scala.Option.foreach(Option.scala:236)
```

physicalPlan.collectFirst { case f: execution.Filter => f }.foreach { filter =>

To fix this, I think we can add a pushdown for IsNotNull in

case GreaterThanOrEqual(attr, value) => buildComparison(attr, value, ">=")
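As a rough sketch, the extra pushdown cases could look like this (the attribute-quoting convention and helper placement are assumptions about this codebase, not existing code):

```scala
import org.apache.spark.sql.sources.{Filter, IsNotNull, IsNull}

// Sketch: handle nullability filters alongside the existing comparison cases
// so that IsNotNull is reported as handled and the residual Filter disappears.
def buildNullabilityFilter(filter: Filter): Option[String] = filter match {
  case IsNotNull(attr) => Some(s""""$attr" IS NOT NULL""")
  case IsNull(attr)    => Some(s""""$attr" IS NULL""")
  case _               => None
}
```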

@liancheng
Contributor Author

@JoshRosen I tried using RCs before, but RCs are not releases, so no artifacts are published to the central Maven repo. That's why I had to stick with a snapshot version. A preview release would also work, but the latest preview release is too old. I'm fixing the filter push-down issue.
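For context, consuming a 2.0.1-SNAPSHOT build typically requires a snapshot resolver in the sbt build definition, roughly like this (a sketch; the Apache snapshots URL is the usual one, not taken from this PR):

```scala
// Sketch: snapshot artifacts live in the Apache snapshots repository,
// not in Maven Central, so the build needs an extra resolver.
resolvers += "Apache Snapshots" at "https://repository.apache.org/content/repositories/snapshots/"
```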

@JoshRosen
Contributor

LGTM. As long as we build against something that's binary-compatible with the final 2.0.0 release, I think we'll be fine, especially since we'll publish a non-preview release of spark-redshift as soon as the final 2.0.0 vote passes.

@JoshRosen JoshRosen closed this in 78ecc36 Jul 17, 2016
@JoshRosen JoshRosen added this to the 2.0.0 milestone Jul 17, 2016
@liancheng
Contributor Author

@JoshRosen Thanks!

JoshRosen pushed a commit that referenced this pull request Jul 18, 2016
This PR branch is based on #221 and contains its changes. For easier review, please refer to liancheng#1.

This PR cleans up all the deprecated APIs in 1.x:

- `Parameters.overwrite`

  Users should use `SaveMode` instead.

- `SchemaParser`

Removed since the only API method that used this class has been removed.

- `RedshiftDataFrame.saveAsRedshiftTable(parameters: Map[String, String])`

  Users should use `DataFrameWriter` methods instead.

- `RedshiftContext.redshiftTable(parameters: Map[String, String])`

  Users should use `DataFrameReader` methods instead.

- `RedshiftContext.redshiftFile(path: String, schema: String)`

  Removed since we'd like to stop using string schema parsing. The following API method is added to replace it:

  ```scala
  def redshiftFile(path: String, schema: StructType): DataFrame
  ```

Fixes #66.

Author: Cheng Lian <lian@databricks.com>

Closes #239 from liancheng/delete-deprecated-api.
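For reference, a sketch of the DataFrameReader/DataFrameWriter style that replaces the removed helpers (option keys follow the `url`/`dbtable`/`tempdir` parameters documented in the README; the connection values are placeholders):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("redshift-example").getOrCreate()

// Read a Redshift table (replaces RedshiftContext.redshiftTable).
val df = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://host:5439/db?user=USER&password=PASS")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-bucket/tmp")
  .load()

// Write it back (replaces RedshiftDataFrame.saveAsRedshiftTable; SaveMode
// replaces the removed Parameters.overwrite flag).
df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://host:5439/db?user=USER&password=PASS")
  .option("dbtable", "my_table_copy")
  .option("tempdir", "s3n://my-bucket/tmp")
  .mode(SaveMode.Overwrite)
  .save()
```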