Consider using CSV when writing data back into Redshift #73
@jaley, do you remember why you chose to use Avro as the format for writing to Redshift, as opposed to a format like CSV or JSON? Was the concern primarily file size?
Not so much file size, as S3 storage is so cheap and unlikely to cause problems. It was more to do with:
With hindsight, it's not such a clear win, but I'd recommend a few things for the checklist before we switch.
@jaley, thanks for the clarification; all of those are excellent points. I'm going to put this on hold for now and revisit it if enough users complain about loading performance.
Closing this as "Won't do" for now. I don't think that we have the engineering resources to maintain and test two separate write paths right now.
Hey, I understand @jaley's points, but I think that misses the main purpose. This minor feature should be doable, since 20% of the effort will gain us (users) 80% of the result.
Yeah, I think it might make sense to prioritize this now that several users have asked for it. It's going to be a little trickier to support this in the 1.x line if we choose to use spark-csv. Given that we don't need support for reading CSVs, though, it might be simpler to just add a CSV write path which is independent of spark-csv (a rough sketch follows below).
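A standalone CSV write path could be quite small. Here is a minimal sketch of the idea, purely hypothetical and not the library's actual implementation: serialize each row to a delimited line and write plain text files to the temp directory, with no spark-csv dependency.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical sketch of a CSV write path with no spark-csv dependency:
// serialize each Row to a quoted, comma-delimited line and write plain
// text files to the temp directory. Quoting is deliberately simplified
// (NULLs become empty fields, delimiter is fixed).
def writeAsCsv(df: DataFrame, tempDir: String): Unit = {
  df.rdd
    .map { row =>
      (0 until row.length).map { i =>
        val v = row.get(i)
        if (v == null) "" else "\"" + v.toString.replace("\"", "\"\"") + "\""
      }.mkString(",")
    }
    .saveAsTextFile(tempDir)
}
```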
What's the status on this now? Should it be reopened?
Yeah, I think we should re-open this and target it for the Spark 2.0-compatible versions of this library.
When will this get worked on? My 14.7 GB load took about 2 hours.
For folks following this issue, note that I've submitted a PR to merge this as an experimental feature: #288
This patch adds new options to allow CSV to be used as the intermediate data format when writing data to Redshift. This can offer large performance benefits because Redshift's Avro reader can be very slow.

This patch is based on #165 by emlyn and incorporates changes from me in order to add documentation, make the new option case-insensitive, improve some error messages, and add tests.

Using CSV for writes also allows us to write to tables whose column names are unsupported by Avro, so #84 is partially addressed by this patch.

As a hedge, I've marked this feature as "Experimental" and I'll remove that label after it's been tested in the wild a bit more.

Fixes #73.

Author: Josh Rosen <joshrosen@databricks.com>
Author: Josh Rosen <rosenville@gmail.com>
Author: Emlyn Corrin <Emlyn.Corrin@microsoft.com>
Author: Emlyn Corrin <emlyn@swiftkey.com>

Closes #288 from JoshRosen/use-csv-for-writes.
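For reference, the option this PR describes is `tempformat`, which accepts `AVRO` (the default), `CSV`, or `CSV GZIP`. A minimal usage sketch, assuming an existing DataFrame `df`; the JDBC URL, table name, and S3 path are placeholders:

```scala
// Opt in to the experimental CSV write path; tempformat defaults to AVRO.
// Connection details below are placeholders, not real credentials.
df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://examplecluster:5439/dev?user=u&password=p")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-bucket/tmp/")
  .option("tempformat", "CSV") // or "CSV GZIP"; the value is case-insensitive
  .mode("error")
  .save()
```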
Fixed by #288, which will be included in the next preview release.
Hi, I am trying to use the code below, but it is still storing data in Avro format in the S3 temp location. I am using the spark-redshift 2.11 connector.
I tried to run the command below to save data to the S3 temp location in CSV format, but it still stores the data in Avro format. Can you please tell me where this went wrong?
According to some benchmarks published at http://www.overfitted.com/blog/?p=367, it seems that Redshift is significantly faster when loading data from CSV than when loading from Avro. We should benchmark this ourselves and consider whether it makes sense to automatically pick the CSV format depending on which data types are being used (one possible heuristic is sketched below).
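If the library were to pick the format automatically, one simple approach would be a schema-driven heuristic. The sketch below is purely hypothetical, not part of the library: fall back to Avro whenever the schema contains types that do not round-trip cleanly through CSV.

```scala
import org.apache.spark.sql.types._

// Hypothetical heuristic (not implemented in the library): use CSV unless the
// schema contains types that are awkward to represent in flat delimited text.
def chooseTempFormat(schema: StructType): String = {
  val csvUnfriendly = schema.fields.exists { f =>
    f.dataType match {
      case BinaryType | _: ArrayType | _: MapType | _: StructType => true
      case _                                                      => false
    }
  }
  if (csvUnfriendly) "AVRO" else "CSV"
}
```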