
[#781] Add support for Spark DataFrameWriter maxRecordsPerFile option #1017

Closed

Conversation

@noslowerdna (Contributor) commented Mar 22, 2022

Today, Parquet supports the [maxRecordsPerFile](apache/spark#16204) option to limit the maximum number of records written per file, so that users can control Parquet file sizes and avoid humongous files. For example,

```
spark.range(100)
  .write
  .format("parquet")
  .option("maxRecordsPerFile", 5)
  .save(path)
```

The above code generates 20 Parquet files, each containing 5 rows.
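
One way to confirm the split (a sketch, not part of the PR; assumes `path` is a local directory and `spark` is an active `SparkSession`):

```
// Hypothetical sanity check: count the part files and the total row count.
val parts = new java.io.File(path).listFiles
  .filter(_.getName.endsWith(".parquet"))
assert(parts.length == 20)                      // 100 rows / 5 records per file
assert(spark.read.parquet(path).count() == 100) // no rows lost
```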

This option is missing in Delta. This PR adds support for it by passing the `maxRecordsPerFile` option from Delta through to `ParquetFileFormat`.
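
With this change, the same per-write option works for Delta. A minimal sketch, mirroring the Parquet example above (assumes the same `spark` session and a target `path`):

```
spark.range(100)
  .write
  .format("delta")
  .option("maxRecordsPerFile", 5)
  .save(path)
```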

Note: today both Delta and Parquet support the SQL conf `spark.sql.files.maxRecordsPerFile` to control the file size. This PR only adds the `DataFrameWriter` option support, mimicking the Parquet format's behavior.
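
For comparison, a minimal sketch of the existing SQL-conf route (same assumptions as above). Unlike the writer option, the conf applies to every subsequent write in the session:

```
// Session-wide limit; affects all subsequent writes (Parquet and Delta alike).
spark.conf.set("spark.sql.files.maxRecordsPerFile", 5L)
spark.range(100).write.format("delta").save(path)
```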

Fixes #781

…options

Signed-off-by: Andrew Olson <noslowerdna@gmail.com>
@noslowerdna noslowerdna changed the title Add support for Spark DataFrameWriter maxRecordsPerFile and timeZone … [#781] Add support for Spark DataFrameWriter maxRecordsPerFile and timeZone… Mar 23, 2022
@scottsand-db scottsand-db self-requested a review March 23, 2022 19:24
Andrew Olson added 2 commits March 24, 2022 16:18
Signed-off-by: Andrew Olson <noslowerdna@gmail.com>
Signed-off-by: Andrew Olson <noslowerdna@gmail.com>
@noslowerdna noslowerdna changed the title [#781] Add support for Spark DataFrameWriter maxRecordsPerFile and timeZone… [#781] Add support for Spark DataFrameWriter maxRecordsPerFile option Mar 24, 2022
@scottsand-db (Collaborator)

friendly ping @zsxwing for review

Andrew Olson added 3 commits March 30, 2022 15:40
Signed-off-by: Andrew Olson <noslowerdna@gmail.com>
…-options

# Conflicts:
#	core/src/test/scala/org/apache/spark/sql/delta/DeltaOptionSuite.scala
Signed-off-by: Andrew Olson <noslowerdna@gmail.com>
@noslowerdna noslowerdna requested a review from zsxwing April 4, 2022 18:05
@zsxwing (Member) left a comment


Thanks for the contribution. LGTM! By the way, I edited the PR description to explain what the final change is.

@vkorukanti vkorukanti closed this in 3fe6f7a Apr 8, 2022
jbguerraz pushed a commit to jbguerraz/delta that referenced this pull request Jul 6, 2022
Today, Parquet supports the [maxRecordsPerFile](apache/spark#16204) option to limit the maximum number of records written per file, so that users can control Parquet file sizes and avoid humongous files. For example,

```
spark.range(100)
  .write
  .format("parquet")
  .option("maxRecordsPerFile", 5)
  .save(path)
```

The above code generates 20 Parquet files, each containing 5 rows.

This option is missing in Delta. This PR adds support for it by passing the `maxRecordsPerFile` option from Delta through to `ParquetFileFormat`.

Note: today both Delta and Parquet support the SQL conf `spark.sql.files.maxRecordsPerFile` to control the file size. This PR only adds the `DataFrameWriter` option support, mimicking the Parquet format's behavior.

Fixes delta-io#781

Closes delta-io#1017

Co-authored-by: Andrew Olson <aolson1@cerner.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>

GitOrigin-RevId: 02af2c40457fe0acc76a31687e4fd6c47f3f2944