Migrate to Spark 2.0 #221
Conversation
```diff
@@ -44,9 +44,9 @@ object SparkRedshiftBuild extends Build {
     organization := "com.databricks",
     scalaVersion := "2.10.5",
     crossScalaVersions := Seq("2.10.5", "2.11.7"),
-    sparkVersion := "1.4.1",
+    sparkVersion := "2.0.0-SNAPSHOT",
```
Probably we should change the version in the title to "Spark 2.0.0-SNAPSHOT".

The Travis file is not updated, so the build currently fails.
Should we use a new SNAPSHOT version, considering that 0.6.1-SNAPSHOT was marked six months ago and this PR introduces an incompatible change? Probably 0.6.2-SNAPSHOT?
Should we also update README.md? This PR introduces an incompatible change, and the README has a section stating the required Spark version.
@clockfly Thanks for the review. I haven't updated the build stuff since it's still blocked by databricks/spark-avro#137, and the Travis build won't pass anyway. Will update them later.
Current coverage is 89.05%

```
@@           master    #221   diff @@
=====================================
  Files          13      13
  Lines         681     667    -14
  Methods       596     582    -14
  Messages        0       0
  Branches       85      85
=====================================
- Hits          609     594    -15
- Misses         72      73     +1
  Partials        0       0
```
Updated version information and dependencies. We are now using the newly released spark-avro 3.0.0-preview2. This is ready for review.
@JoshRosen Would you mind helping to review this one? Thanks!
In order to trigger the end-to-end integration tests, you need to push this code into a branch in the main databricks/spark-redshift repository.

This looks good to me, but I'm wondering if we're planning to make a new release for Spark 1.x in the near term. If we do, then it's going to mean that we'll have to make a separate 1.x branch and merge outstanding PRs twice, once into the master branch and once into the 1.x maintenance branch. That's not a huge deal, though.
```diff
@@ -22,22 +22,22 @@ This library is more suited to ETL than interactive queries, since large amounts
 
 ## Installation
 
-This library requires Apache Spark 1.4+ and Amazon Redshift 1.0.963+.
+This library requires Apache Spark 2.0+ and Amazon Redshift 1.0.963+.
```
Given that this is going to be the README when people browse to the repository on GitHub, we should probably have a pointer to a branch containing the Spark 1.x version of the code or a table listing the releases for the Spark 1.x line.
Should I do this after making the branch for Spark 1.x?
Created a new branch at https://github.com/databricks/spark-redshift/tree/branch-1.x for holding the 1.x code.

I'm going to try to help review and merge the other outstanding PRs right now so that we can hopefully make a new 1.x release.
Status update: I've merged a number of open PRs and am on track to wrap up #157, after which I'll cut a release. If this causes merge conflicts, I'll take care of rebasing.
@JoshRosen Thanks for the review!
````diff
 You may use this library in your applications with the following dependency information:
 
 **Scala 2.10**
 ```
 groupId: com.databricks
 artifactId: spark-redshift_2.10
-version: 0.6.0
+version: 0.7.0
````
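For reference, the same coordinate expressed as an sbt dependency would look roughly like this (a sketch; the 0.7.0 version is the one shown in the diff above):

```scala
// build.sbt: sbt equivalent of the Maven-style coordinate above.
// The %% operator appends the Scala binary version (e.g. _2.10) to the artifact ID.
libraryDependencies += "com.databricks" %% "spark-redshift" % "0.7.0"
```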
I'd change these versions to 2.0.0-SNAPSHOT. Also, make sure to update version.sbt in the root of this repo.
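A minimal sketch of what that update might look like, assuming version.sbt follows the common single-setting sbt convention:

```scala
// version.sbt (repo root): move the project version onto the Spark 2.0 line.
version in ThisBuild := "2.0.0-SNAPSHOT"
```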
@liancheng Let's also update the README. Thanks!
Since Spark 2.0+ only supports Hadoop 2.x, we can remove the reflection hack at src/main/scala/com/databricks/spark/redshift/RedshiftInputFormat.scala, line 106 (as of cba5eee).

Similarly, we can remove the reflection at src/main/scala/com/databricks/spark/redshift/RedshiftJDBCWrapper.scala, line 97 (as of 5bc5fab).
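To illustrate the kind of simplification being suggested, here is an illustrative sketch, not the actual spark-redshift code: a common pattern for supporting both Hadoop 1.x and 2.x was to call TaskAttemptContext.getConfiguration reflectively, because TaskAttemptContext changed from a class to an interface between the two lines. With Hadoop 1.x support dropped, the direct call suffices:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.TaskAttemptContext

// Before: reflective lookup that links against both Hadoop 1.x and 2.x,
// since the method lives on a class in 1.x and on an interface in 2.x.
def getConfigurationViaReflection(context: TaskAttemptContext): Configuration = {
  val getConfigurationMethod = context.getClass.getMethod("getConfiguration")
  getConfigurationMethod.invoke(context).asInstanceOf[Configuration]
}

// After: with Hadoop 2.x guaranteed, the direct call compiles and links fine.
def getConfiguration(context: TaskAttemptContext): Configuration =
  context.getConfiguration
```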
Force-pushed from 4cefa43 to cf7d394.
@JoshRosen @yhuai Thanks for the review! Comments addressed. Shall we also remove deprecated methods, like those defined in the old 1.x API?
I pushed this PR's changes to a branch in the main repository in order to run the end-to-end integration tests.
Let's remove that deprecated stuff in a separate PR, by the way.
The NPE was thrown when loading test tables in the integration test suite.
@liancheng, can we use 2.0 RC4? Also, it looks like a filter pushdown test is failing because the assert assumed that the physical plan would push all filters to Redshift, while it appears that "NOT NULL" isn't being pushed:

src/it/scala/com/databricks/spark/redshift/RedshiftIntegrationSuite.scala, line 288 (as of a1749cb)

To fix this, I think we can add a pushdown for IsNotNull in the filter pushdown code.
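A rough sketch of what such a pushdown case could look like, assuming a helper that translates Spark data source Filters into SQL WHERE fragments (the object, function names, and surrounding cases here are illustrative, not the actual spark-redshift implementation):

```scala
import org.apache.spark.sql.sources._

object FilterPushdownSketch {
  // Illustrative translation of Spark data source filters into Redshift SQL
  // predicates. The IsNotNull / IsNull cases are the ones under discussion.
  def buildFilterExpression(filter: Filter): Option[String] = filter match {
    case EqualTo(attr, value)     => Some(s""""$attr" = ${compileValue(value)}""")
    case GreaterThan(attr, value) => Some(s""""$attr" > ${compileValue(value)}""")
    case LessThan(attr, value)    => Some(s""""$attr" < ${compileValue(value)}""")
    case IsNotNull(attr)          => Some(s""""$attr" IS NOT NULL""")
    case IsNull(attr)             => Some(s""""$attr" IS NULL""")
    case _                        => None // unsupported filters stay in Spark
  }

  // Hypothetical helper rendering a Scala value as a SQL literal.
  private def compileValue(value: Any): String = value match {
    case s: String => s"'${s.replace("'", "''")}'"
    case other     => other.toString
  }
}
```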
@JoshRosen I tried to use RCs before, but RCs are not releases, and no artifacts are published to the central Maven repo. That's why I had to stick to a snapshot release. A preview release would also work, but the latest preview release is too old. I'm fixing the filter push-down issue.
LGTM. As long as we build against something that's binary-compatible with the final 2.0.0 release, I think we'll be fine, especially since we'll publish a non-preview release afterward.
@JoshRosen Thanks!
This PR branch is based on #221 and contains its changes. For easier review, please refer to liancheng#1.

This PR cleans up all the deprecated APIs in 1.x:

- `Parameters.overwrite`: users should use `SaveMode` instead.
- `SchemaParser`: removed, since the only API method that uses this class is removed.
- `RedshiftDataFrame.saveAsRedshiftTable(parameters: Map[String, String])`: users should use `DataFrameWriter` methods instead (see the sketch after this message).
- `RedshiftContext.redshiftTable(parameters: Map[String, String])`: users should use `DataFrameReader` methods instead.
- `RedshiftContext.redshiftFile(path: String, schema: String)`: removed, since we'd like to stop using string schema parsing. The following API method is added to replace it:

```scala
def redshiftFile(path: String, schema: StructType): DataFrame
```

Fixes #66.

Author: Cheng Lian <lian@databricks.com>

Closes #239 from liancheng/delete-deprecated-api.
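To make the migration path concrete, here is a sketch of the DataFrameWriter replacement for the removed saveAsRedshiftTable call. The url/dbtable/tempdir option names are the library's standard options; the connection details and input DataFrame are hypothetical placeholders:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("redshift-writer-example").getOrCreate()
val df = spark.read.json("/path/to/input.json") // hypothetical input data

// Instead of the removed RedshiftDataFrame.saveAsRedshiftTable(parameters),
// go through the standard DataFrameWriter API:
df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://examplecluster:5439/dev?user=user&password=pass")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://example-bucket/tmp")
  .mode(SaveMode.Overwrite) // replaces the removed Parameters.overwrite flag
  .save()
```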
This PR migrates spark-redshift to Spark 2.0.
It is marked as WIP because it depends on databricks/spark-avro#137.