
Consider using CSV when writing data back into Redshift #73

Closed

JoshRosen opened this issue Sep 4, 2015 · 13 comments

Comments

@JoshRosen
Contributor

According to some benchmarks published at http://www.overfitted.com/blog/?p=367, Redshift appears to be significantly faster when loading data from CSV than when loading from Avro. We should benchmark this ourselves and consider whether it makes sense to automatically pick the CSV format depending on which data types are being used.
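As a rough illustration of the "pick the format based on the data types" idea, here is a minimal, hypothetical sketch; the helper name and the set of "CSV-safe" types are assumptions, not anything this library currently implements:

```python
from pyspark.sql.types import (BooleanType, DateType, DoubleType, FloatType,
                               IntegerType, LongType, StringType, TimestampType)

# Types that round-trip cleanly through delimited text; anything else
# (nested structs, arrays, maps, binary) would fall back to Avro.
CSV_SAFE_TYPES = (BooleanType, DateType, DoubleType, FloatType,
                  IntegerType, LongType, StringType, TimestampType)

def choose_temp_format(df):
    """Return 'CSV' if every column is a simple scalar type, else 'AVRO'."""
    if all(isinstance(f.dataType, CSV_SAFE_TYPES) for f in df.schema.fields):
        return "CSV"
    return "AVRO"
```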

@JoshRosen
Contributor Author

@jaley, do you remember why you chose to use Avro as the format for writing to Redshift, as opposed to a format like CSV or JSON? Was the concern primarily file size?

@jaley
Contributor

jaley commented Sep 24, 2015

Not so much file size, as S3 storage is so cheap and unlikely to cause problems. It was more to do with:

  • Past experience using delimited files in Redshift has caused us a lot of pain. In theory it should all just work if you escape the data properly, but something always screws it up unexpectedly: certain bad characters throw the parser state completely out of whack and it starts rejecting rows, etc.
  • Performance should be much better than it actually is. Outside of Redshift, parsing Avro is obviously much quicker than parsing delimited text files, especially if the Avro files are compressed with Snappy or similar. I don't know what makes it so slow in Redshift right now, but hopefully Amazon can address it in a future update.

With hindsight, it's not such a clear win, but I'd recommend a few things for the checklist before we switch spark-redshift over by default.

  • Thorough integration tests that generate fairly diverse text data using a CSV library. We need to find a robust configuration for both the CSV library and the Redshift COPY parameters to make sure anything can get through (something like the sketch after this list).
  • Performance benchmarks that cover the full ETL use-case, including writing the data to S3, probably using compression too.
  • Confirmation from AWS support that Avro is unlikely to overtake CSV in terms of performance in Redshift. It would be annoying to finish all this work just to have a future update take some of the motivation away.
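
As a hedged illustration of the kind of data such integration tests would need to round-trip, here is a minimal sketch; the table name, bucket path, and COPY statement are placeholders, not anything this project ships:

```python
import csv
import io

# Rows containing the characters that typically break naive delimited loads:
# embedded delimiters, quotes, newlines, backslashes, and NULL-like values.
tricky_rows = [
    ["plain", "text"],
    ["comma, inside", 'quote " inside'],
    ["newline\ninside", "backslash \\ inside"],
    ["", None],  # empty string vs. NULL needs an explicit convention (e.g. NULL AS)
]

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL, doublequote=True)
for row in tricky_rows:
    writer.writerow(["" if v is None else v for v in row])
print(buf.getvalue())

# The matching load would then rely on Redshift's CSV parser instead of
# DELIMITER/ESCAPE handling, e.g. (illustrative placeholders only):
#   COPY test_table FROM 's3://my-bucket/prefix/'
#   IAM_ROLE 'arn:aws:iam::123456789012:role/my-role'
#   CSV;
```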

@JoshRosen
Contributor Author

@jaley, thanks for the clarification; all of those are excellent points.

I'm going to put this on hold for now and revisit if enough users complain about loading performance.

@JoshRosen
Contributor Author

Closing this as "Won't do" for now. I don't think that we have the engineering resources to maintain and test two separate write paths right now.

@Bennyelg

Bennyelg commented Jul 8, 2016

Hey, I understand @jaley's points, but I think they miss the main purpose.
CSV is the most common file format, and the one most widely supported by other databases, both OLAP and OLTP.
Adding a feature like this will ease the pain of the many users who currently implement it through awkward workarounds.
The performance of Redshift's Avro COPY is worse than anything I have seen in years.

I think this minor feature should be doable, since 20% of the effort will gain us (the users) 80% of the result.

@JoshRosen
Contributor Author

Yeah, I think it might make sense to prioritize this now that several users have asked for it. It's going to be a little trickier to support this in the 1.x line if we choose to use spark-csv, since I think we'd risk inheriting bugs from older spark-csv versions. It'd be easiest to support this for Spark 2.0+, since there we could rely on Spark's built-in CSV data source.

Given that we don't need support for reading CSVs, though, it might be simpler to just add a CSV write path which is independent of spark-csv and which uses escaping options specific to what Redshift supports; this would make it easy to add the feature to the 1.x line without introducing new dependencies (roughly along the lines of the sketch below).
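
A minimal sketch of what such Redshift-specific escaping could look like, assuming the doubled-quote convention that Redshift's COPY ... CSV parser accepts; the function names are illustrative, not the write path the library actually implements:

```python
# Hypothetical sketch of a spark-csv-free CSV write path: quote every field and
# double any embedded quote characters, which matches Redshift's default CSV
# quoting rules. NULL handling would still need a convention on the COPY side.
def escape_field(value):
    if value is None:
        return ""  # unquoted empty field; map to NULL via COPY's NULL AS/EMPTYASNULL
    return '"' + str(value).replace('"', '""') + '"'

def to_csv_line(row):
    return ",".join(escape_field(v) for v in row)

print(to_csv_line(["a,b", 'say "hi"', None, 42]))
# "a,b","say ""hi""",,"42"
```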

@emlyn
Contributor

emlyn commented Aug 10, 2016

What's the status on this now? Should it be reopened?

@JoshRosen
Contributor Author

Yeah, I think we should re-open this and target it for the Spark 2.0-compatible versions of this library.

@timchan-lumoslabs

When will this get worked on? My 14.7 GB load took about 2 hours.

@JoshRosen
Contributor Author

For folks following this issue, note that I've submitted a PR to merge this as an experimental feature: #288

JoshRosen added a commit that referenced this issue Oct 25, 2016
This patch adds new options to allow CSV to be used as the intermediate data format when writing data to Redshift. This can offer large performance benefits because Redshift's Avro reader can be very slow. This patch is based on #165 by emlyn and incorporates changes from me in order to add documentation, make the new option case-insensitive, improve some error messages, and add tests.

Using CSV for writes also allows us to write to tables whose column names are unsupported by Avro, so #84 is partially addressed by this patch.

As a hedge, I've marked this feature as "Experimental" and I'll remove that label after it's been tested in the wild a bit more.

Fixes #73.

Author: Josh Rosen <joshrosen@databricks.com>
Author: Josh Rosen <rosenville@gmail.com>
Author: Emlyn Corrin <Emlyn.Corrin@microsoft.com>
Author: Emlyn Corrin <emlyn@swiftkey.com>

Closes #288 from JoshRosen/use-csv-for-writes.
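
For reference, enabling the new CSV temp format from PySpark should look roughly like the following; the url, tempDir, iam, and table name are placeholders, and the option name is case-insensitive per the patch description:

```python
# Placeholders: url, tempDir, and iam must be defined for your environment.
(df.write
   .format("com.databricks.spark.redshift")
   .option("url", url)
   .option("dbtable", "my_table")
   .option("tempdir", tempDir)
   .option("tempformat", "CSV")  # new experimental option from #288
   .option("aws_iam_role", iam)
   .mode("overwrite")
   .save())
```
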
@JoshRosen
Contributor Author

Fixed by #288, which will be included in the next preview release.

@vishooo

vishooo commented Jan 3, 2019

Hi, I am trying to use the code below:
dataFrame.write.format("com.databricks.spark.redshift").option("url", url).option("tempdir", tempDir).option("tempFormat", "CSV").option("dbtable", "test_csv1").option("aws_iam_role", iam).mode('overwrite').save()

But it is still storing data in Avro format in the S3 temp location. I am using the spark-redshift 2.11 connector.

@vishooo

vishooo commented Jan 3, 2019

I tried to run the command below to save the data to the S3 temp location in CSV format, but it still stores the data in Avro format. Can you please tell me where it went wrong?
I am using PySpark 2.3.2 and the spark-redshift 2.11 connector.
dataFrame.write.format("com.databricks.spark.redshift").option("url", url).option("tempdir", tempDir).option("tempFormat", "CSV").option("dbtable","test_csv1").option("aws_iam_role", iam).mode('overwrite').save()

munk pushed a commit to ActionIQ-OSS/spark-redshift that referenced this issue May 4, 2021