
Lacking support for DataFrameWriter's "maxRecordsPerFile" option? #781

Closed
noslowerdna opened this issue Sep 17, 2021 · 2 comments
Labels: acknowledged (This issue has been read and acknowledged by Delta admins), bug (Something isn't working)

Comments

@noslowerdna (Contributor)

It doesn't appear that the "maxRecordsPerFile" DataFrameWriter option, e.g. df.write.option("maxRecordsPerFile", 10000), is supported when using the "delta" format. However, the behavior can still be achieved by setting the spark.sql.files.maxRecordsPerFile configuration property in the SparkConf.
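For reference, a minimal sketch of that workaround in Spark Scala (the 10000 threshold, `df`, and the output path are illustrative placeholders, not from the issue):

```
// Workaround: set the SQL conf at the session level; it applies to all
// subsequent writes, including Delta, unlike the per-write option.
spark.conf.set("spark.sql.files.maxRecordsPerFile", "10000")

df.write
  .format("delta")
  .save("/tmp/delta/events")
```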

If verified and not considered a bug, could this be a simple enhancement to implement?

@scottsand-db added the acknowledged and bug labels on Oct 7, 2021
@scottsand-db (Collaborator)

Thanks for bringing this to our attention. We will look into this.

@zsxwing (Member) commented Oct 8, 2021

This is related to #652

jbguerraz pushed a commit to jbguerraz/delta that referenced this issue Jul 6, 2022
Today, parquet supports the [maxRecordsPerFile](apache/spark#16204) option to limit the max number of records written per file so that users can control the parquet file size to avoid humongous files. For example,

```
spark.range(100)
  .write
  .format("parquet")
  .option("maxRecordsPerFile", 5)
  .save(path)
```

The above code will generate 20 parquet files and each one contains 5 rows.

This is missing in Delta. This PR adds support for it in Delta by passing the `maxRecordsPerFile` option through to ParquetFileFormat.

Note: today both Delta and parquet support the SQL conf `spark.sql.files.maxRecordsPerFile` to control the file size. This PR is just adding the `DataFrameWriter` option support to mimic the parquet format behavior.
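With this change, the same option should also work for the Delta format; a minimal sketch mirroring the parquet example above (`path` is a placeholder):

```
// After this PR, the per-write option is forwarded to ParquetFileFormat,
// so each Delta data file written here contains at most 5 rows.
spark.range(100)
  .write
  .format("delta")
  .option("maxRecordsPerFile", 5)
  .save(path)
```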

Fixes delta-io#781

Closes delta-io#1017

Co-authored-by: Andrew Olson <aolson1@cerner.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>

GitOrigin-RevId: 02af2c40457fe0acc76a31687e4fd6c47f3f2944