
Migrate to Spark 2.0 #221

Closed · 9 commits

Conversation

liancheng
Contributor

This PR migrates spark-redshift to Spark 2.0.

It is marked as WIP because it depends on databricks/spark-avro#137.

```diff
@@ -44,9 +44,9 @@ object SparkRedshiftBuild extends Build {
   organization := "com.databricks",
   scalaVersion := "2.10.5",
   crossScalaVersions := Seq("2.10.5", "2.11.7"),
-  sparkVersion := "1.4.1",
+  sparkVersion := "2.0.0-SNAPSHOT",
```


We should probably change the version in the title to "Spark 2.0.0-SNAPSHOT".

@clockfly

The Travis file hasn't been updated, so the build fails now.
https://github.com/databricks/spark-redshift/blob/master/.travis.yml

@clockfly

Should we use a new SNAPSHOT version, considering 0.6.1-SNAPSHOT was tagged 6 months ago and this PR introduces an incompatible change? Probably 0.6.2-SNAPSHOT?

@clockfly

Should we also update README.md? This introduces an incompatible change, and the README has a section like:

This library requires Apache Spark 1.4+ and Amazon Redshift 1.0.963+.

@liancheng
Contributor Author

@clockfly Thanks for the review. I haven't updated the build configuration since this PR is still blocked by databricks/spark-avro#137, and the Travis build won't pass anyway. Will update it later.

@codecov-io

codecov-io commented Jul 6, 2016

Current coverage is 89.05%

Merging #221 into master will decrease coverage by 0.37%

@@             master       #221   diff @@
==========================================
  Files            13         13          
  Lines           681        667    -14   
  Methods         596        582    -14   
  Messages          0          0          
  Branches         85         85          
==========================================
- Hits            609        594    -15   
- Misses           72         73     +1   
  Partials          0          0          

Powered by Codecov. Last updated by a1749cb...ed33941

@liancheng
Contributor Author

liancheng commented Jul 6, 2016

Updated version information and dependencies. Now we are using the newly released spark-avro 3.0.0-preview2. This is ready for review now.
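For reference, a minimal sketch of the corresponding sbt dependency (assuming the standard `com.databricks` group ID that spark-avro publishes under):

```scala
// Hedged sketch: depend on the released spark-avro preview instead of a local snapshot
// (goes in build.sbt or the project's Build definition).
libraryDependencies += "com.databricks" %% "spark-avro" % "3.0.0-preview2"
```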

liancheng changed the title from "[WIP] Migrate to Spark 2.0" to "Migrate to Spark 2.0" on Jul 6, 2016
@liancheng
Contributor Author

@JoshRosen Would you mind helping to review this one? Thanks!

@JoshRosen
Contributor

In order to trigger the end-to-end integration tests, you need to push this code into a branch in the databricks/spark-redshift repo (since Travis won't decrypt credentials for pull-request jobs). I'm going to push a copy of this to a temporary branch to see if the tests pass there.

This looks good to me, but I'm wondering if we're planning to make a new release for Spark 1.x in the near term. If we do, then it's going to mean that we'll have to make a separate 1.x branch and merge outstanding PRs twice, once into the master branch and once into the 1.x maintenance branch. That's not a huge deal, though.

```diff
@@ -22,22 +22,22 @@ This library is more suited to ETL than interactive queries, since large amounts

 ## Installation

-This library requires Apache Spark 1.4+ and Amazon Redshift 1.0.963+.
+This library requires Apache Spark 2.0+ and Amazon Redshift 1.0.963+.
```
Contributor

Given that this is going to be the README when people browse to the repository on GitHub, we should probably have a pointer to a branch containing the Spark 1.x version of the code or a table listing the releases for the Spark 1.x line.

Contributor Author

Should I do this after making the branch for Spark 1.x?

Contributor

Created a new branch at https://github.com/databricks/spark-redshift/tree/branch-1.x for holding the 1.x code.

@JoshRosen
Contributor

I'm going to try to help review and merge the other outstanding PRs right now so that we can hopefully make a new 1.x release.

@JoshRosen
Contributor

Status update: I've merged a number of open PRs and am on track to wrap up #157, after which I'll cut a release. If this causes merge conflicts, I'll take care of rebasing.

@liancheng
Contributor Author

@JoshRosen Thanks for the review!


You may use this library in your applications with the following dependency information:

**Scala 2.10**
```
groupId: com.databricks
artifactId: spark-redshift_2.10
-version: 0.6.0
+version: 0.7.0
```
Contributor

I'd change these versions to 2.0.0-SNAPSHOT. Also, make sure to update version.sbt in the root of this repo.
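As a rough illustration, the suggested bump in version.sbt might look like this (a sketch assuming the conventional ThisBuild-scoped version key):

```scala
// version.sbt (sketch): track the 2.0.0 development line instead of 0.7.0.
version in ThisBuild := "2.0.0-SNAPSHOT"
```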

@yhuai

yhuai commented Jul 13, 2016

@liancheng Let's also update README. Thanks!

@JoshRosen
Contributor

Since Spark 2.0+ only supports Hadoop 2.x, we can remove the reflection hack at

val method = context.getClass.getMethod("getConfiguration")

Similarly, we can remove the reflection at

// we need to rely on. This class was relocated in Spark 1.5.0, so we need to use reflection
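For illustration, a minimal sketch of what the first hack (the reflective `getConfiguration` lookup) could become once only Hadoop 2.x is targeted; the wrapping method name here is hypothetical:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.TaskAttemptContext

// Sketch: with Hadoop 2.x only, TaskAttemptContext.getConfiguration can be
// called directly instead of being looked up reflectively via getMethod.
def hadoopConfiguration(context: TaskAttemptContext): Configuration =
  context.getConfiguration
```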

@liancheng liancheng force-pushed the spark-2.0-migration branch from 4cefa43 to cf7d394 Compare July 14, 2016 15:03
@liancheng
Contributor Author

@JoshRosen @yhuai Thanks for the review! Comments addressed. Shall we also remove deprecated methods like those defined in RedshiftContext? I can do it in another PR.

@JoshRosen
Contributor

I pushed this PR's changes to a branch in the databricks/spark-redshift repository so that the integration tests could run and it looks like they're all failing due to changes in TestHive: https://travis-ci.org/databricks/spark-redshift/jobs/144786105

e.g.:

```
[info] Exception encountered when attempting to run a suite with class name: com.databricks.spark.redshift.DecimalIntegrationSuite *** ABORTED ***
[info]   java.lang.NullPointerException:
[info]   at org.apache.spark.sql.hive.test.TestHiveSparkSession.getHiveFile(TestHive.scala:183)
[info]   at org.apache.spark.sql.hive.test.TestHiveSparkSession.<init>(TestHive.scala:214)
[info]   at org.apache.spark.sql.hive.test.TestHiveSparkSession.<init>(TestHive.scala:122)
[info]   at org.apache.spark.sql.hive.test.TestHiveContext.<init>(TestHive.scala:77)
[info]   at com.databricks.spark.redshift.IntegrationSuiteBase$class.beforeEach(IntegrationSuiteBase.scala:122)
[info]   at com.databricks.spark.redshift.DecimalIntegrationSuite.beforeEach(DecimalIntegrationSuite.scala:25)
[info]   at org.scalatest.BeforeAndAfterEach$class.beforeEach(BeforeAndAfterEach.scala:154)
[info]   at com.databricks.spark.redshift.DecimalIntegrationSuite.beforeEach(DecimalIntegrationSuite.scala:25)
[info]   at org.scalatest.BeforeAndAfterEach$class.beforeEach(BeforeAndAfterEach.scala:173)
[info]   at com.databricks.spark.redshift.DecimalIntegrationSuite.beforeEach(DecimalIntegrationSuite.scala:25)
[info]   at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:253)
[info]   at com.databricks.spark.redshift.DecimalIntegrationSuite.runTest(DecimalIntegrationSuite.scala:25)
[info]   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
```

@JoshRosen
Contributor

Let's remove that deprecated stuff in a separate PR, by the way.

@liancheng
Contributor Author

The NPE was thrown while loading test tables in the TestHiveContext constructor because of missing test data files. PR apache/spark#14005 added a flag to disable loading test tables, but that change landed after 2.0.0-RC1 and is only included in 2.0.1-SNAPSHOT. Thus I updated the Spark version to 2.0.1-SNAPSHOT so that we can use this flag to avoid the NPE.
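A minimal sketch of how the integration suites might construct the context with that flag, assuming the flag is the `loadTestTables` constructor parameter added by apache/spark#14005:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.test.TestHiveContext

// Sketch: skip loading Hive test tables so the missing test data files no
// longer trigger an NPE in the TestHiveSparkSession constructor.
def makeTestHiveContext(sc: SparkContext): TestHiveContext =
  new TestHiveContext(sc, loadTestTables = false)
```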

@JoshRosen
Contributor

@liancheng, can we use 2.0 RC4?

Also, it looks like a filter pushdown test is failing because the assertion assumed that the physical plan would push all filters down to Redshift, while it appears that "NOT NULL" isn't being pushed:

```
RedshiftRelation implements Spark 1.6+'s unhandledFilters API *** FAILED ***
[info]   Filter should have been eliminated; plan is:
[info]   Filter isnotnull(testbool#814)
[info]   +- Scan RedshiftRelation("PUBLIC"."test_table_438264183817572114") [testbool#814] PushedFilters: [IsNotNull(testbool), EqualTo(testbool,true)] (RedshiftIntegrationSuite.scala:289)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
[info]   at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
[info]   at org.scalatest.Assertions$class.fail(Assertions.scala:1328)
[info]   at org.scalatest.FunSuite.fail(FunSuite.scala:1555)
[info]   at com.databricks.spark.redshift.RedshiftIntegrationSuite$$anonfun$10$$anonfun$apply$mcV$sp$4.apply(RedshiftIntegrationSuite.scala:289)
[info]   at com.databricks.spark.redshift.RedshiftIntegrationSuite$$anonfun$10$$anonfun$apply$mcV$sp$4.apply(RedshiftIntegrationSuite.scala:288)
[info]   at scala.Option.foreach(Option.scala:236)
```

physicalPlan.collectFirst { case f: execution.Filter => f }.foreach { filter =>

To fix this, I think we can add a pushdown for IsNotNull in

case GreaterThanOrEqual(attr, value) => buildComparison(attr, value, ">=")
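As a rough sketch, the extra pushdown cases could look like this (the attribute-quoting convention and helper placement are assumptions about this codebase, not existing code):

```scala
import org.apache.spark.sql.sources.{Filter, IsNotNull, IsNull}

// Sketch: handle nullability filters alongside the existing comparison cases
// so that IsNotNull is reported as handled and the residual Filter disappears.
def buildNullabilityFilter(filter: Filter): Option[String] = filter match {
  case IsNotNull(attr) => Some(s""""$attr" IS NOT NULL""")
  case IsNull(attr)    => Some(s""""$attr" IS NULL""")
  case _               => None
}
```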

@liancheng
Contributor Author

@JoshRosen I tried using RCs before, but RCs are not releases, so no artifacts are published to the central Maven repo. That's why I had to stick with a snapshot version. A preview release would also work, but the latest preview release is too old. I'm fixing the filter push-down issue.
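For context, consuming a 2.0.1-SNAPSHOT build typically requires a snapshot resolver in the sbt build definition, roughly like this (a sketch; the Apache snapshots URL is the usual one, not taken from this PR):

```scala
// Sketch: snapshot artifacts live in the Apache snapshots repository,
// not in Maven Central, so the build needs an extra resolver.
resolvers += "Apache Snapshots" at "https://repository.apache.org/content/repositories/snapshots/"
```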

@JoshRosen
Contributor

LGTM. As long as we build against something that's binary-compatible with the final 2.0.0 release, I think we'll be fine, especially since we'll publish a non-preview release of spark-redshift as soon as the final 2.0.0 vote passes.

@JoshRosen JoshRosen closed this in 78ecc36 Jul 17, 2016
@JoshRosen JoshRosen added this to the 2.0.0 milestone Jul 17, 2016
@liancheng
Contributor Author

@JoshRosen Thanks!

JoshRosen pushed a commit that referenced this pull request Jul 18, 2016
This PR branch is based on #221 and contains its changes. For easier review, please refer to liancheng#1.

This PR cleans up all the deprecated APIs in 1.x:

- `Parameters.overwrite`

  Users should use `SaveMode` instead.

- `SchemaParser`

Removed since the only API method that used this class has been removed.

- `RedshiftDataFrame.saveAsRedshiftTable(parameters: Map[String, String])`

  Users should use `DataFrameWriter` methods instead.

- `RedshiftContext.redshiftTable(parameters: Map[String, String])`

  Users should use `DataFrameReader` methods instead.

- `RedshiftContext.redshiftFile(path: String, schema: String)`

  Removed since we'd like to stop using string schema parsing. The following API method is added to replace it:

  ```scala
  def redshiftFile(path: String, schema: StructType): DataFrame
  ```

Fixes #66.

Author: Cheng Lian <lian@databricks.com>

Closes #239 from liancheng/delete-deprecated-api.
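For reference, a sketch of the DataFrameReader/DataFrameWriter style that replaces the removed helpers (option keys follow the `url`/`dbtable`/`tempdir` parameters documented in the README; the connection values are placeholders):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("redshift-example").getOrCreate()

// Read a Redshift table (replaces RedshiftContext.redshiftTable).
val df = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://host:5439/db?user=USER&password=PASS")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-bucket/tmp")
  .load()

// Write it back (replaces RedshiftDataFrame.saveAsRedshiftTable; SaveMode
// replaces the removed Parameters.overwrite flag).
df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://host:5439/db?user=USER&password=PASS")
  .option("dbtable", "my_table_copy")
  .option("tempdir", "s3n://my-bucket/tmp")
  .mode(SaveMode.Overwrite)
  .save()
```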