
Consider using CSV when writing data back into Redshift #73

Closed

JoshRosen opened this issue Sep 4, 2015 · 13 comments

Comments

@JoshRosen
Contributor

According to some benchmarks published at http://www.overfitted.com/blog/?p=367, Redshift appears to be significantly faster when loading data from CSV than when loading from Avro. We should benchmark this ourselves and consider whether it makes sense to automatically pick the CSV format depending on which data types are being used.
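As a rough illustration of the "pick the format based on the data types" idea, here is a minimal, hypothetical sketch; the helper name and the set of "CSV-safe" types are assumptions, not anything this library currently implements:

```python
from pyspark.sql.types import (BooleanType, DateType, DoubleType, FloatType,
                               IntegerType, LongType, StringType, TimestampType)

# Types that round-trip cleanly through delimited text; anything else
# (nested structs, arrays, maps, binary) would fall back to Avro.
CSV_SAFE_TYPES = (BooleanType, DateType, DoubleType, FloatType,
                  IntegerType, LongType, StringType, TimestampType)

def choose_temp_format(df):
    """Return 'CSV' if every column is a simple scalar type, else 'AVRO'."""
    if all(isinstance(f.dataType, CSV_SAFE_TYPES) for f in df.schema.fields):
        return "CSV"
    return "AVRO"
```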

@JoshRosen
Contributor Author

@jaley, do you remember why you chose to use Avro as the format for writing to Redshift, as opposed to a format like CSV or JSON? Was the concern primarily file size?

@jaley
Contributor

jaley commented Sep 24, 2015

Not so much file size, as S3 storage is so cheap and unlikely to cause problems. It was more to do with:

  • Past experience using delimited files in Redshift has caused us a lot of pain. In theory it should all just work if you escape the data properly, but something always screws it up unexpectedly: certain bad characters throw the parser state completely out of whack and it starts rejecting rows, etc.
  • Performance should be much better than it actually is. Outside of Redshift, parsing Avro is obviously much quicker than parsing delimited text files, especially if the Avro files are compressed with Snappy or similar. I don't know what makes it so slow in Redshift right now, but hopefully Amazon can address it in a future update.

With hindsight, it's not such a clear win, but I'd recommend a few things for the checklist before we switch spark-redshift over by default.

  • Thorough integration tests that generate fairly diverse text data using a CSV library. We need to find a robust configuration for both the CSV library and the Redshift COPY parameters to make sure anything can get through (something like the sketch after this list).
  • Performance benchmarks that cover the full ETL use-case, including writing the data to S3, probably using compression too.
  • Confirmation from AWS support that Avro is unlikely to overtake CSV in terms of performance in Redshift. It would be annoying to finish all this work just to have a future update take some of the motivation away.
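
As a hedged illustration of the kind of data such integration tests would need to round-trip, here is a minimal sketch; the table name, bucket path, and COPY statement are placeholders, not anything this project ships:

```python
import csv
import io

# Rows containing the characters that typically break naive delimited loads:
# embedded delimiters, quotes, newlines, backslashes, and NULL-like values.
tricky_rows = [
    ["plain", "text"],
    ["comma, inside", 'quote " inside'],
    ["newline\ninside", "backslash \\ inside"],
    ["", None],  # empty string vs. NULL needs an explicit convention (e.g. NULL AS)
]

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL, doublequote=True)
for row in tricky_rows:
    writer.writerow(["" if v is None else v for v in row])
print(buf.getvalue())

# The matching load would then rely on Redshift's CSV parser instead of
# DELIMITER/ESCAPE handling, e.g. (illustrative placeholders only):
#   COPY test_table FROM 's3://my-bucket/prefix/'
#   IAM_ROLE 'arn:aws:iam::123456789012:role/my-role'
#   CSV;
```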

@JoshRosen
Contributor Author

@jaley, thanks for the clarification; all of those are excellent points.

I'm going to put this on hold for now and revisit if enough users complain about loading performance.

@JoshRosen
Contributor Author

Closing this as "Won't do" for now. I don't think that we have the engineering resources to maintain and test two separate write paths right now.

@Bennyelg

Bennyelg commented Jul 8, 2016

Hey, I understand @jaley's points, but I think they miss the main purpose.
CSV is the most common file format, and the one most widely supported by other databases, both OLAP and OLTP.
Adding a feature like this will ease the pain of the many users who currently implement it through awkward workarounds.
The performance of Redshift's Avro COPY is worse than anything I have seen in years.

I think this minor feature should be doable, since 20% of the effort will gain us (the users) 80% of the result.

@JoshRosen
Contributor Author

Yeah, I think it might make sense to prioritize this now that several users have asked for it. It's going to be a little trickier to support this in the 1.x line if we choose to use spark-csv, since I think we'd risk inheriting bugs from older spark-csv versions. It'd be easiest to support this for Spark 2.0+, since there we could rely on Spark's built-in CSV data source.

Given that we don't need support for reading CSVs, though, it might be simpler to just add a CSV write path which is independent of spark-csv and which uses escaping options specific to what Redshift supports; this would make it easy to add the feature to the 1.x line without introducing new dependencies (roughly along the lines of the sketch below).
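
A minimal sketch of what such Redshift-specific escaping could look like, assuming the doubled-quote convention that Redshift's COPY ... CSV parser accepts; the function names are illustrative, not the write path the library actually implements:

```python
# Hypothetical sketch of a spark-csv-free CSV write path: quote every field and
# double any embedded quote characters, which matches Redshift's default CSV
# quoting rules. NULL handling would still need a convention on the COPY side.
def escape_field(value):
    if value is None:
        return ""  # unquoted empty field; map to NULL via COPY's NULL AS/EMPTYASNULL
    return '"' + str(value).replace('"', '""') + '"'

def to_csv_line(row):
    return ",".join(escape_field(v) for v in row)

print(to_csv_line(["a,b", 'say "hi"', None, 42]))
# "a,b","say ""hi""",,"42"
```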

@emlyn
Contributor

emlyn commented Aug 10, 2016

What's the status on this now? Should it be reopened?

@JoshRosen
Contributor Author

Yeah, I think we should re-open this and target it for the Spark 2.0-compatible versions of this library.

@timchan-lumoslabs

When will this get worked on? My 14.7 GB load took about 2 hours.

@JoshRosen
Contributor Author

For folks following this issue, note that I've submitted a PR to merge this as an experimental feature: #288

JoshRosen added a commit that referenced this issue Oct 25, 2016
This patch adds new options to allow CSV to be used as the intermediate data format when writing data to Redshift. This can offer large performance benefits because Redshift's Avro reader can be very slow. This patch is based on #165 by emlyn and incorporates changes from me in order to add documentation, make the new option case-insensitive, improve some error messages, and add tests.

Using CSV for writes also allows us to write to tables whose column names are unsupported by Avro, so #84 is partially addressed by this patch.

As a hedge, I've marked this feature as "Experimental" and I'll remove that label after it's been tested in the wild a bit more.

Fixes #73.

Author: Josh Rosen <joshrosen@databricks.com>
Author: Josh Rosen <rosenville@gmail.com>
Author: Emlyn Corrin <Emlyn.Corrin@microsoft.com>
Author: Emlyn Corrin <emlyn@swiftkey.com>

Closes #288 from JoshRosen/use-csv-for-writes.
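
For reference, enabling the new CSV temp format from PySpark should look roughly like the following; the url, tempDir, iam, and table name are placeholders, and the option name is case-insensitive per the patch description:

```python
# Placeholders: url, tempDir, and iam must be defined for your environment.
(df.write
   .format("com.databricks.spark.redshift")
   .option("url", url)
   .option("dbtable", "my_table")
   .option("tempdir", tempDir)
   .option("tempformat", "CSV")  # new experimental option from #288
   .option("aws_iam_role", iam)
   .mode("overwrite")
   .save())
```
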
@JoshRosen
Contributor Author

Fixed by #288, which will be included in the next preview release.

@vishooo

vishooo commented Jan 3, 2019

Hi, I am trying to use the code below:
dataFrame.write.format("com.databricks.spark.redshift").option("url", url).option("tempdir", tempDir).option("tempFormat", "CSV").option("dbtable", "test_csv1").option("aws_iam_role", iam).mode('overwrite').save()

But it is still storing data in Avro format in the S3 temp location. I am using the spark-redshift 2.11 connector.

@vishooo

vishooo commented Jan 3, 2019

I tried to run the command below to save the data to the S3 temp location in CSV format, but it still stores the data in Avro format. Can you please tell me where it went wrong?
I am using PySpark 2.3.2 and the spark-redshift 2.11 connector.
dataFrame.write.format("com.databricks.spark.redshift").option("url", url).option("tempdir", tempDir).option("tempFormat", "CSV").option("dbtable","test_csv1").option("aws_iam_role", iam).mode('overwrite').save()

munk pushed a commit to ActionIQ-OSS/spark-redshift that referenced this issue May 4, 2021