[#781] Add support for Spark DataFrameWriter maxRecordsPerFile option #1017 (Closed)
noslowerdna wants to merge 7 commits into delta-io:master from noslowerdna:data-frame-writer-options
Conversation
…options Signed-off-by: Andrew Olson <noslowerdna@gmail.com>
noslowerdna changed the title from "Add support for Spark DataFrameWriter maxRecordsPerFile and timeZone …" to "[#781] Add support for Spark DataFrameWriter maxRecordsPerFile and timeZone…" on Mar 23, 2022
zsxwing reviewed on Mar 23, 2022: core/src/main/scala/org/apache/spark/sql/delta/files/TransactionalWrite.scala (comment now outdated and resolved)
Signed-off-by: Andrew Olson <noslowerdna@gmail.com>
Signed-off-by: Andrew Olson <noslowerdna@gmail.com>
noslowerdna changed the title from "[#781] Add support for Spark DataFrameWriter maxRecordsPerFile and timeZone…" to "[#781] Add support for Spark DataFrameWriter maxRecordsPerFile option" on Mar 24, 2022
scottsand-db approved these changes on Mar 24, 2022
friendly ping @zsxwing for review
zsxwing reviewed on Mar 30, 2022: core/src/test/scala/org/apache/spark/sql/delta/DeltaOptionSuite.scala (comment now outdated and resolved)
Signed-off-by: Andrew Olson <noslowerdna@gmail.com>
…-options # Conflicts: # core/src/test/scala/org/apache/spark/sql/delta/DeltaOptionSuite.scala
Signed-off-by: Andrew Olson <noslowerdna@gmail.com>
zsxwing approved these changes on Apr 4, 2022
Thanks for the contribution. LGTM! By the way, I edited the PR description to explain what the final change is.
jbguerraz pushed a commit to jbguerraz/delta that referenced this pull request on Jul 6, 2022:
Today, parquet supports the [maxRecordsPerFile](apache/spark#16204) option to limit the maximum number of records written per file, so that users can control the parquet file size and avoid humongous files. For example:

```
spark.range(100)
  .write
  .format("parquet")
  .option("maxRecordsPerFile", 5)
  .save(path)
```

The above code generates 20 parquet files, each containing 5 rows. This is missing in Delta. This PR adds support for Delta by passing the `maxRecordsPerFile` option from Delta to ParquetFileFormat.

Note: today both Delta and parquet support the SQL conf `spark.sql.files.maxRecordsPerFile` to control the file size. This PR just adds the `DataFrameWriter` option support to mimic the parquet format behavior.

Fixes delta-io#781
Closes delta-io#1017

Co-authored-by: Andrew Olson <aolson1@cerner.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
GitOrigin-RevId: 02af2c40457fe0acc76a31687e4fd6c47f3f2944
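The arithmetic behind the option can be sketched outside Spark. The helper below is purely illustrative (it is not Delta's or Spark's writer code): it chunks a list of records so that no output "file" holds more than `max_records_per_file` rows, reproducing the 100-rows-with-limit-5 → 20-files example from the description.

```python
def split_into_files(records, max_records_per_file):
    """Chunk `records` so each output chunk ('file') holds at most
    max_records_per_file rows. Illustrative only; the real writer
    rolls to a new parquet file as it streams rows."""
    if max_records_per_file <= 0:
        raise ValueError("maxRecordsPerFile must be positive")
    return [
        records[i:i + max_records_per_file]
        for i in range(0, len(records), max_records_per_file)
    ]

files = split_into_files(list(range(100)), 5)
print(len(files))                        # 20 files
print(all(len(f) == 5 for f in files))   # each holds 5 rows
```

Note that the last file may hold fewer rows when the record count is not an exact multiple of the limit, which matches the option's "at most" semantics.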