
[#781] Add support for Spark DataFrameWriter maxRecordsPerFile option #1017

Closed

Conversation

@noslowerdna (Contributor) commented Mar 22, 2022

Today, Parquet supports the [maxRecordsPerFile](apache/spark#16204) option to limit the maximum number of records written per file, so that users can control Parquet file sizes and avoid humongous files. For example,

```
spark.range(100)
  .write
  .format("parquet")
  .option("maxRecordsPerFile", 5)
  .save(path)
```

The above code generates 20 Parquet files, each containing 5 rows.
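
One way to confirm the split (a sketch, not part of the PR; assumes `path` is a local directory and `spark` is an active `SparkSession`):

```
// Hypothetical sanity check: count the part files and the total row count.
val parts = new java.io.File(path).listFiles
  .filter(_.getName.endsWith(".parquet"))
assert(parts.length == 20)                      // 100 rows / 5 records per file
assert(spark.read.parquet(path).count() == 100) // no rows lost
```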

This option is missing in Delta. This PR adds support for it by passing the `maxRecordsPerFile` option from Delta through to `ParquetFileFormat`.
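
With this change, the same per-write option works for Delta. A minimal sketch, mirroring the Parquet example above (assumes the same `spark` session and a target `path`):

```
spark.range(100)
  .write
  .format("delta")
  .option("maxRecordsPerFile", 5)
  .save(path)
```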

Note: today both Delta and Parquet support the SQL conf `spark.sql.files.maxRecordsPerFile` to control the file size. This PR only adds the `DataFrameWriter` option support, mimicking the Parquet format's behavior.
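
For comparison, a minimal sketch of the existing SQL-conf route (same assumptions as above). Unlike the writer option, the conf applies to every subsequent write in the session:

```
// Session-wide limit; affects all subsequent writes (Parquet and Delta alike).
spark.conf.set("spark.sql.files.maxRecordsPerFile", 5L)
spark.range(100).write.format("delta").save(path)
```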

Fixes #781

…options

Signed-off-by: Andrew Olson <noslowerdna@gmail.com>
@noslowerdna noslowerdna changed the title Add support for Spark DataFrameWriter maxRecordsPerFile and timeZone … [#781] Add support for Spark DataFrameWriter maxRecordsPerFile and timeZone… Mar 23, 2022
@scottsand-db scottsand-db self-requested a review March 23, 2022 19:24
Andrew Olson added 2 commits March 24, 2022 16:18
Signed-off-by: Andrew Olson <noslowerdna@gmail.com>
Signed-off-by: Andrew Olson <noslowerdna@gmail.com>
@noslowerdna noslowerdna changed the title [#781] Add support for Spark DataFrameWriter maxRecordsPerFile and timeZone… [#781] Add support for Spark DataFrameWriter maxRecordsPerFile option Mar 24, 2022
@scottsand-db (Collaborator)

friendly ping @zsxwing for review

Andrew Olson added 3 commits March 30, 2022 15:40
Signed-off-by: Andrew Olson <noslowerdna@gmail.com>
…-options

# Conflicts:
#	core/src/test/scala/org/apache/spark/sql/delta/DeltaOptionSuite.scala
Signed-off-by: Andrew Olson <noslowerdna@gmail.com>
@noslowerdna noslowerdna requested a review from zsxwing April 4, 2022 18:05
@zsxwing (Member) left a comment


Thanks for the contribution. LGTM! By the way, I edited the PR description to explain what the final change is.

@vkorukanti vkorukanti closed this in 3fe6f7a Apr 8, 2022
jbguerraz pushed a commit to jbguerraz/delta that referenced this pull request Jul 6, 2022
Today, Parquet supports the [maxRecordsPerFile](apache/spark#16204) option to limit the maximum number of records written per file, so that users can control Parquet file sizes and avoid humongous files. For example,

```
spark.range(100)
  .write
  .format("parquet")
  .option("maxRecordsPerFile", 5)
  .save(path)
```

The above code generates 20 Parquet files, each containing 5 rows.

This option is missing in Delta. This PR adds support for it by passing the `maxRecordsPerFile` option from Delta through to `ParquetFileFormat`.

Note: today both Delta and Parquet support the SQL conf `spark.sql.files.maxRecordsPerFile` to control the file size. This PR only adds the `DataFrameWriter` option support, mimicking the Parquet format's behavior.

Fixes delta-io#781

Closes delta-io#1017

Co-authored-by: Andrew Olson <aolson1@cerner.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>

GitOrigin-RevId: 02af2c40457fe0acc76a31687e4fd6c47f3f2944