Simplify use of transactions when performing overwrites and when creating new tables #157
Conversation
@@ -48,6 +47,8 @@ import org.apache.spark.sql.types._
* non-empty. After the write operation completes, we use this to construct a list of non-empty
* Avro partition files.
*
* - Using JDBC, start a new tra…
TODO: this entire block comment needs to be updated.
I'd love feedback on whether the tests here are adequate; my gut says that we need a test which counts the number of times that …
}
log.info(s"Loading new Redshift data to: $table")
doRedshiftLoad(conn, data, params, creds, manifestUrl)
conn.commit()
Does this need to handle InterruptedException? i.e. can this block for a really long time?
It has been my experience that virtually any Redshift command, `COMMIT` included, can block for several minutes under certain circumstances. I think the most likely cause is that the cluster has WLM parameters configured to put the connected client on a limited pool of some type, such that all commands will be queued when all slots are taken by other queries.
I have a feeling, though, that this means that if you plan to send a `ROLLBACK` command in response to the interruption, it will also block for many minutes...
I'm not sure that it's safe for us to wrap this in order to catch InterruptedException, since I don't think that it's safe to call `.rollback()` while the other thread is in the middle of executing `commit()`. Therefore, I'm going to leave this as-is for now and will revisit later if this turns out to be a problem in practice.
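For illustration, here is a rough sketch (not part of this patch; the names are hypothetical) of what a timeout-bounded commit might look like. It mainly shows why the approach was rejected: even with a timeout, there is no safe way to abandon or roll back a `commit()` that may still be executing on another thread.

```scala
import java.sql.Connection
import java.util.concurrent.{Executors, TimeUnit, TimeoutException}

// Hypothetical sketch, not part of this patch: run the blocking commit() on a separate
// thread and wait with a timeout. The problem is that once the timeout fires, the
// connection is in an unknown state and calling rollback() concurrently is not known
// to be safe, which is exactly why the plain conn.commit() call is left as-is.
object CommitTimeoutSketch {
  def commitWithTimeout(conn: Connection, timeoutSeconds: Long): Unit = {
    val executor = Executors.newSingleThreadExecutor()
    try {
      val future = executor.submit(new Runnable {
        override def run(): Unit = conn.commit()
      })
      try {
        future.get(timeoutSeconds, TimeUnit.SECONDS)
      } catch {
        case _: TimeoutException =>
          // Cancelling does not make the in-flight commit safe to abandon.
          future.cancel(true)
          throw new IllegalStateException(
            s"commit() did not finish within $timeoutSeconds seconds")
      }
    } finally {
      executor.shutdownNow()
    }
  }
}
```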
Ping @jaley, do you have any cycles to help review this? I think that you might be a good person to look at this because you have a lot of the context RE: the original implementation of this functionality.
Hey @JoshRosen, sure. Sorry for being slow to notice this - been playing catch-up after a long vacation. I'll take a look in the next day or so, if that's cool with you?
Sure, that's fine.
} catch {
  case NonFatal(e) =>
    try {
      conn.rollback()
Might a log.info or log.warn be good here, to inform anyone testing their application code that the load failed and what they're now waiting for is a rollback to finish?
Done.
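A minimal sketch of the resulting pattern, assuming slf4j-style logging and illustrative names rather than the exact patch contents:

```scala
import java.sql.Connection
import org.slf4j.LoggerFactory
import scala.util.control.NonFatal

// Sketch only: log a warning before rolling back so that anyone watching the logs knows
// the load failed and that the pause they may now observe is the rollback finishing.
object RollbackLoggingSketch {
  private val log = LoggerFactory.getLogger(getClass)

  def runAndCommit(conn: Connection)(load: => Unit): Unit = {
    try {
      load
      conn.commit()
    } catch {
      case NonFatal(e) =>
        try {
          log.warn("Exception thrown during Redshift load; rolling back transaction", e)
          conn.rollback()
        } catch {
          case NonFatal(rollbackError) =>
            log.error("Exception while rolling back transaction", rollbackError)
        }
        throw e
    }
  }
}
```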
Do we currently run the integration test with just the Redshift JDBC driver? Now that we're using the driver to manage transactions, I wonder if we might start to see different behaviour between the Redshift and Postgres drivers. Amazon's recommendation is to use their official driver, but I think we've previously been lucky and users could safely make use of either for the small set of commands we rely on. This change effectively exercises different code paths in the driver, so it might be worth making sure that if anything no longer works with the Postgres driver, we're at least able to communicate that clearly to users and explicitly state that it's not supported any more. (Hopefully it's still fine anyway!)
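For context, the driver choice only shows up in the JDBC URL (and in which driver jar is on the classpath). A hedged sketch using the documented data source options, with hypothetical table and tempdir values:

```scala
import org.apache.spark.sql.DataFrame

// Sketch only: whichever driver is used, the write goes through the same data source,
// so both paths exercise the transaction handling introduced in this patch.
object DriverChoiceSketch {
  def write(df: DataFrame, jdbcUrl: String): Unit = {
    // jdbcUrl is e.g. "jdbc:redshift://host:5439/db?user=...&password=..."
    //            or   "jdbc:postgresql://host:5439/db?user=...&password=..."
    df.write
      .format("com.databricks.spark.redshift")
      .option("url", jdbcUrl)
      .option("dbtable", "my_table")
      .option("tempdir", "s3n://example-bucket/tmp/")
      .mode("overwrite")
      .save()
  }
}
```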
The change looks good to me. I wonder if perhaps the … I also wonder if it really communicates the fact that if you set this to false, you save disk space but you run the risk of losing your existing data? If it doesn't, well, that was always the case; this change just caused me to think about it some more!
}
mockRedshift.verifyThatConnectionsWereClosed()
mockRedshift.verifyThatCommitWasNotCalled()
Should this also verify that rollback was called?
Done.
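For illustration, an equivalent check written directly against Mockito (the project's tests actually go through their own MockRedshift helper, so this is only a sketch):

```scala
import java.sql.Connection
import org.mockito.Mockito.{mock, never, verify}

// Sketch only: after exercising a failing write path, we expect rollback() to have
// been invoked and commit() never to have been invoked on the mocked connection.
object RollbackVerificationSketch {
  def verifyRollbackOnly(conn: Connection): Unit = {
    verify(conn).rollback()
    verify(conn, never()).commit()
  }

  def example(): Unit = {
    val conn = mock(classOf[Connection])
    // ... exercise the failing write path against `conn` here ...
    verifyRollbackOnly(conn)
  }
}
```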
@JoshRosen, what's your plan for merging this PR? I've now observed a number of situations where Spark thinks a COPY operation has failed (due to a connection failure) yet it succeeds in Redshift, leading to duplicated data when using …
Current coverage is 89.42%

@@            master     #157   diff @@
=======================================
  Files           13       13
  Lines          685      681     -4
  Methods        604      596     -8
  Messages         0        0
  Branches        81       85     +4
=======================================
- Hits           616      609     -7
- Misses          69       72     +3
  Partials         0        0
I think that we should just deprecate this parameter. If a user wants to avoid the performance penalty of using a transaction for overwrites then I don't think it's unreasonable to suggest that they drop the old table themselves.
Yep, this is the old behavior. It looks like you documented this in one of the original commits that added this functionality: https://github.com/databricks/spark-redshift/blame/v0.5.0/README.md#L280
I'm going to go ahead and merge this now and will address any minor comments in follow-ups. I'll give this a test with both drivers and then will begin work on packaging a new release.
This patch aims to significantly simplify our use of transactions when performing overwrites and when creating new tables; it addresses a few shortcomings in the existing code: previously, we didn't issue the `CREATE TABLE` and `COPY` commands as part of the same transaction when creating new tables, and when performing overwrites we created a temporary table outside of a transaction, opening the possibility for that table to be leaked in rare corner cases.

To address this, this patch refactors the main write logic so that it uses a JDBC connection with auto-commit disabled. Explicitly issuing `commit()` lets us significantly simplify the code and reduces duplication between the branches for handling different SaveModes.

One important change: previously, `usestagingtable` would literally create a separate staging table, but now we simply delete the existing table and re-create it in a transaction; this frees us of the burden of having to find a unique name for the staging table. This should be safe due to Redshift's serializable snapshot isolation semantics. When `usestagingtable` is disabled, this now means that we simply issue an extra `commit()` after the original table's deletion so that the original table's resources are freed (which lets us write the new rows without having to maintain both the old and new tables at the same time).

/cc @jaley, who wrote the original staging-table-handling code; it would be good if you could confirm whether the changes here will undermine the performance benefits / semantics of the old `usestagingtable=false` path.

Fixes #153 and #154.
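As a rough sketch of the transaction pattern described above (illustrative identifiers and SQL only, not the actual spark-redshift internals):

```scala
import java.sql.{Connection, DriverManager}
import scala.util.control.NonFatal

// Sketch: disable auto-commit, perform the drop/create/COPY as a single unit of work,
// then commit explicitly, rolling back on any failure so nothing is leaked.
object OverwriteTransactionSketch {
  def overwrite(jdbcUrl: String, table: String, copySql: String): Unit = {
    val conn: Connection = DriverManager.getConnection(jdbcUrl)
    try {
      conn.setAutoCommit(false)
      val stmt = conn.createStatement()
      try {
        stmt.executeUpdate(s"DROP TABLE IF EXISTS $table")
        stmt.executeUpdate(s"CREATE TABLE $table (id INTEGER, name VARCHAR(64))")
        stmt.executeUpdate(copySql) // the COPY ... FROM 's3://...' command built elsewhere
      } finally {
        stmt.close()
      }
      conn.commit()
    } catch {
      case NonFatal(e) =>
        try conn.rollback() catch { case NonFatal(_) => () } // best-effort rollback
        throw e
    } finally {
      conn.close()
    }
  }
}
```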